tarfile module is too generous about what is considered a tar file
The Python standard library's
tarfile module has a
function, tarfile.is_tarfile(), that tells you whether or not some file is a tar file, or
at least is a tar file that the module can read. As is not too silly
in Python, it operates by attempting to open the file with
tarfile.open(); if open() succeeds, clearly this is a good tar file.
Unfortunately, through what is perhaps a bug, this fails to report any errors on various sorts of things that are not actually tar files. On a Unix system, the very easiest and simplest reproduction of this problem is:
>>> import tarfile
>>> tarfile.open("/dev/zero", "r:")
This raises no exception and gives you back a TarFile object that will report that you have an empty tar file.
(If you leave off the 'r:', this hangs, ultimately because the
lzma module will happily read forever from a stream of zero bytes.
Unless you tell it otherwise, the tarfile module normally tries a
sequence of decompressors on your potential tar file, including
lzma.)
One specific form of thing that will cause this issue is any nominal
'tar file' that starts with 512 zero bytes (after any decompression
is applied). Since this applies to /dev/zero, we have our handy and
obviously incorrect reproduction case. There may be other initial
512-byte blocks that will cause this; I have not investigated the
code deeply, partly because it is tangled.
I suspect that this is a bug in the TarFile.next function, which
looks like it is missing an 'elif self.offset == 0:' clause (see
the block of code starting around here). But
whether or not this issue is a bug and will be fixed in a future
version of Python 3, it is very widespread in existing versions of
Python that are out there in the field, and so any code that cares
about this (which we have some of) needs to
cope with it.
My current hack workaround is to check whether or not the members
list on the returned TarFile object is empty. This is not a documented
attribute, but it's unlikely to change and it works today (and feels
slightly less sleazy than some of the alternatives).
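As a sketch of the workaround (the function name is mine, and I'm using the documented getmembers() here instead of the undocumented members attribute, which amounts to the same check):

```python
import tarfile

def is_really_tarfile(path):
    """Treat a 'tar file' with no members at all (such as a stream
    of zero bytes) as not being a tar file in the first place."""
    try:
        with tarfile.open(path, "r:") as tf:
            # A genuine tar file has at least one member.
            return len(tf.getmembers()) > 0
    except tarfile.TarError:
        return False
```

This also returns False for files that tarfile.open() rejects outright, so it works whether or not a given Python version has this bug.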
(For reasons beyond the scope of this entry, I have decided to slog through the effort of finding how to submit Python bug reports, unearthing my login from the last time I threw a bug report into their issue tracker, and filing a version of this as issue 36596.)
Going from a bound instance method to its class instance in Python
In response to yesterday's entry on how I feel callable classes are better than closures, a commentator suggested:
If you need something callable, why not use a bound method? They have a reference to the parent too.
This raises a question: how easy and reliable is it to go from a bound method on an instance to the instance itself?
In both Python 2 and Python 3, a bound method is an instance of a
special type (how this happens is described in my entry on how
functions become bound methods). Although
the Python 3 documentation is not explicit about it, this type is
what is described in the "Instance methods" section of the Python
3 data model.
This description of the (bound) method type officially documents
the __self__ attribute, which is a reference to the original
instance that the bound method is derived from. So the answer is
that given an object
x that is passed to you as a bound method,
you can recover the actual instance as
x.__self__ and then
inspect it from there.
(In Python 2.7, there is also the
im_self attribute, which
contains the same information.)
If you want your code to check whether it has a bound method, you can
compare against types.MethodType. This name for the type can also be
used to look at its help(), although that really won't tell you much;
you're better off reading the "Instance methods" section of the data
model.
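Putting the pieces together in a small sketch (the class here is my own example):

```python
import types

class Counter:
    """A tiny class so we have a bound method to inspect."""
    def __init__(self):
        self.n = 0
    def bump(self):
        self.n += 1
        return self.n

c = Counter()
bound = c.bump

# Checking that what we were handed is a bound method:
assert isinstance(bound, types.MethodType)
# Recovering the instance that the method is bound to:
assert bound.__self__ is c
# The underlying plain function is also available:
assert bound.__func__ is Counter.bump
```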
I'm not sure how I feel about relying on this. On the one hand, it
is officially documented and it works the same in Python 3 and
Python 2 (ignoring Python 2's
im_self and the possibility of
unbound methods on Python 2). On the other hand, this is a dunder
attribute, and using those generally feels somewhat like I'm peeking
into implementation details. I don't know if the Python developers
consider this a stable API or something that very definitely isn't
guaranteed over the long term.
(If nothing else, now I know a little bit more about Python than I did before I decided to look this up. I was actually expecting the answer to be more obscure than it turned out to be.)
Callable class instances versus closures in Python
At first, like every operator overload, this seems like a nifty idea. And then, like most operator overload cases, we need to ask: why? Why is this better than a named method?
I wholeheartedly agree with this, and in the beginning I agreed
with the whole article. But then I began thinking about my usage
of __call__ and something that the article advocated as a
replacement, and found that I partially disagree with it. To quote
the article:
If something really is nothing more than a function call with some extra arguments, then either a closure or a partial would be appropriate.
(By 'partial', the article means the use of functools.partial
to construct a partially applied function.)
My view is that if you have to provide something that's callable,
a callable class is better than a closure because it's more
amenable to inspection. A class instance is a clear thing; you
can easily see what it is, what it's doing, and inspect the state
of instances (especially if you remember to give your class a
__repr__). You can
even easily give them (and their methods) docstrings, so that
help() provides helpful information about them.
None of this is true of closures (unless you go well out of your way) and only a bit of it is true of partially applied functions. Even if you go out of your way to provide a docstring for your closure function, the whole assemblage is basically an opaque blob. A partially applied function is somewhat better because the resulting object exposes some information, but it's still not as open and transparent as an object.
This becomes especially important if your callable thing is going to be called repeatedly and hold internal state. It's far easier to make this internal state visible, potentially modifiable, and above all debuggable if you're using an object than if you try to wrap all of this up inside a function (or a closure) that manipulates its internal variables. Python objects are designed to be transparent (at least by default), as peculiar as this sounds in general.
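To illustrate the difference in inspectability, here's a toy callable class and its closure equivalent (all the names are mine):

```python
import functools
import operator

class Adder:
    """Add a fixed increment; the state is visible via attributes."""
    def __init__(self, n):
        self.n = n
    def __call__(self, x):
        return x + self.n
    def __repr__(self):
        return "Adder(n=%r)" % self.n

def make_adder(n):
    # The closure equivalent: same behavior, buried state.
    def add(x):
        return x + n
    return add

a = Adder(10)
c = make_adder(10)
assert a(5) == c(5) == 15

# The instance announces what it is; the closure's repr is just an
# opaque '<function make_adder.<locals>.add at 0x...>'.
assert repr(a) == "Adder(n=10)"
# You can dig the closure's state out, but only awkwardly:
assert c.__closure__[0].cell_contents == 10

# A partially applied function sits in between: it exposes some state.
p = functools.partial(operator.add, 10)
assert p.func is operator.add and p.args == (10,)
assert p(5) == 15
```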
(After all, one of the usual stated purposes of objects is to encapsulate things away from the outside world.)
Callable classes are unquestionably more verbose than closures, partially applied functions, or even lambdas, and sometimes this is annoying. But I think you should use them for anything that is not trivial by itself, and maybe even for small things depending on how long the resulting callable entities are going to live and how far away they are going to propagate in your program. The result is likely to be more maintainable and more debuggable.
PS: This somewhat biases me toward providing things with the entire
instance and using
__call__ over providing a method on the
instance. If you're trying to debug something, it's harder to go
from a method to inspecting the instance it comes from. Providing
just a method is probably okay if the use is 'close' to the class
definition (eg, in the same file or the same module), because then
you can look back and forth easily. Providing the full instance is
what I'd do if I was passing the callable thing around to another
module or returning it as part of my public API.
Using default function arguments to avoid creating a class
Recently I was writing some Python code to print out Prometheus metrics about whether or not we could log in to an IMAP server. As an end to end test, this is something that can fail for a wide assortment of reasons; we can fail to connect to the IMAP server, experience a TLS error during TLS session negotiation, have the server's TLS certificate fail to validate, there could be an IMAP protocol problem, or the server could reject our login attempt. If we fail, we would like to know why for diagnostic purposes (especially, some sorts of failures are more important than others in this test). In the Prometheus world, this is traditionally done by emitting a separate metric for every different thing that can fail.
In my code, the metrics are all prepared by a single function that gets called at various points. It looks something like this:
def logingauges(host, ok, ...):
    [...]

def logincheck(host, user, pw):
    try:
        c = ssl.create_default_context()
        m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
    except ssl.CertificateError:
        return logingauges(host, 0, ...)
    except [...]:
        [...]

    try:
        r = m.login(user, pw)
        [...]
    except imaplib.IMAP4.error:
        return logingauges(host, 0, ...)
    except [...]:
        [...]

    # success, finally.
    return logingauges(host, 1, ...)
When I first started writing this code, I only distinguished a
couple of different reasons that we could fail, so I passed the state
of those reasons directly as additional parameters to logingauges().
As the number of failure reasons rose, this got both unwieldy and
annoying, partly because adding a new failure reason required going
through all existing calls to logingauges() to add a new parameter
to each of them.
So I gave up. I turned all of the failure reasons into keyword arguments that defaulted to 0:
def logingauges(host, ok, connerr=0, loginerr=0,
                certerr=0, sslerr=0, imaperr=0):
    [...]
Now to call
logingauges() on failure I only needed to supply an
argument for the specific failure:
return logingauges(host, 0, sslerr=1)
Adding a new failure reason became much more localized; I only had
to add a new gauge metric to
logingauges(), with a new keyword
argument, and then call it from the right place.
This strikes me as pretty much a hack. The proper way is probably
to create a class to hold all of this status information as attributes
on instances, create an instance of it at the start of logincheck(),
manipulate the attributes as appropriate, and return the instance
when done. The class can even have a
to_gauges() function that
generates all of the actual metrics from its current values.
(In Python 3.7, I would use a dataclass, but this has to run on Ubuntu 18.04 with Python 3.6.7, so it needs to be a boring old class.)
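A sketch of what that status class might look like (all the names here are my invention, not the real program's):

```python
class LoginStatus:
    """One attribute per failure reason, all defaulting to 0."""
    GAUGES = ("ok", "connerr", "loginerr", "certerr", "sslerr", "imaperr")

    def __init__(self, host):
        self.host = host
        for name in self.GAUGES:
            setattr(self, name, 0)

    def to_gauges(self):
        # The real version would emit Prometheus metrics; this one
        # just produces (name, value) pairs for illustration.
        return [(name, getattr(self, name)) for name in self.GAUGES]

st = LoginStatus("imap.example.com")
st.sslerr = 1
assert ("sslerr", 1) in st.to_gauges()
assert ("ok", 0) in st.to_gauges()
```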
However, not only do I already have the version that uses default function arguments, but the class based version would require a bunch more code and bureaucracy for what is basically a simple situation in a small program. I like doing things the right way, but I'm not sure I like it that much. As it stands, the default function arguments approach is pleasantly minimal and low overhead.
(Or maybe this is considered an appropriate use of default function
arguments in Python these days. Arguments with default values are
often used to set default initial values for instance attributes,
and that is kind of what I'm doing here. One version of the class
based approach could actually look the same; instead of calling a
function, I'd return a just-created instance of my status class.)
(This is only somewhat similar to using default function arguments to merge several APIs together. Here it would be a real stretch to say that there are multiple APIs, one for each failure reason.)
The cliffs in the way of adding tests to our Django web app
Back in August of last year, I wrote that it was time for me to start adding tests to our Django web app. Since then, the number of tests I have added is zero, and in fact the amount of work that I have done on our Django web app's code is also essentially zero (partly because it hasn't needed any modifications). Part of the reason for that is that adding tests feels like make-work, even though I know perfectly well that it's not really, but another part of it is that I'm staring at two reasonably substantial cliffs in my way.
Put simply, in order to add tests that I actually want to keep, I need to learn how to write Django tests and then I need to figure out what we want to test in our Django web app (and how). Learning how to write tests means reading through the Django documentation on this, both the quick tutorial and the real documentation. Unfortunately I think that I need to read all of the documentation before I start writing any tests, and possibly even plan to throw away the first round of tests as a learning experience. Testing a Django app is not as simple as testing standalone code; there is a test database you need to construct, an internal HTTP client so that you can write end to end tests, and so on. This is complicated by the fact that by now I've forgotten a lot of my general Django knowledge and I know it, so to some extent I'm going to have to re-learn Django (and re-learn our web app's code too).
(It's possible that I can find some quick-start tests I can write more or less in isolation. There are probably some stand-alone functions that I can poke at, and perhaps even stand-alone model behavior that doesn't depend on the database having a set of interlinked base data.)
Once I sort of know how to write Django tests, I need to figure out what tests to write and how much of them. There are two general answers here that I already know; we need tests that will let us eventually move to Python 3 with some confidence that the app won't blow up, and I'd like tests that will do at least basic checks that everything is fine when we move from Django version to Django version. Tests for a Python 3 migration should probably concentrate on the points where data moves in and out of our app, following the same model I used when I thought about DWiki's Python 3 Unicode issues. Django version upgrade tests should probably start by focusing on end to end testing (eg, 'can we submit a new account request through the mock HTTP client and have it show up').
All of this adds up to a significant amount of time and work to invest before we start to see real benefits from it. As a result I've kept putting it off and finding higher priority work to do (or at least more interesting work). And I'm pretty sure I need to find a substantial chunk of time in order to get anywhere with this. To put it one way, the Django testing documentation is not something that I want to try to understand in fifteen minute blocks.
PS: It turns out that our app actually has one tiny little test that I must have added years ago as a first step. It's actually surprisingly heartening to find it there and still passing.
(As before, I'm writing this partly to push myself toward doing it. We now have less than a year to the nominal end of Python 2, which is not much time with everything going on.)
Sidebar: Our database testing issue
My impression is that a decent amount of Django apps can be tested with basically empty databases, perhaps putting in a few objects. Our app doesn't work that way; its operation sits on top of a bunch of interlinked data on things like who can sponsor accounts, how those accounts should be created, and so on. Without that data, the app does nothing (in fact it will probably fail spectacularly, since it assumes that various queries will always return some data). That means we need an entire set of at least minimal data in our test database in order to test anything much. So I need to learn all about that up front, more or less right away.
How to handle Unicode character decoding errors depends on your goals
In a comment on my entry mulling over DWiki's Python 3 Unicode issues and what I plan to do about them, Sean A. asked a very good question about how I'm planning to handle errors when decoding things from theoretical UTF-8 input:
Out of curiosity, why use backslashreplace instead of surrogateescape? (I ask because it seems to me that surrogateescape also loses no information, is guaranteed to work with any binary input, and is designed for reading unknown encodings.)
Oh. And is trivial to convert back into the original binary data.
The reason I think I want Python's 'backslashreplace' error handling instead of 'surrogateescape' is that my ultimate goal is not to reproduce the input (in all its binary glory) in my output, but to produce valid UTF-8 output (for HTML, Atom syndication feeds, and so on) even if some of the input isn't valid.
(Another option is to abort processing if the input isn't valid, which is not what I want. It would be the most conservative and safe choice in some situations.)
Given that I'm going to produce valid UTF-8 no matter what, the choice
comes down to what generates more useful results for the person
reading what was invalid UTF-8 input. You can certainly do this
with 'surrogateescape' by just encoding to straight UTF-8 using the
'surrogatepass' handler, but the resulting directly encoded surrogate
characters are not going to show up as anything useful and might produce
outright errors from some things (and possibly be misinterpreted in
some contexts).
(With 'surrogateescape', bad characters are encoded to U+DC80 to U+DCFF, which is the 'low' part of the Unicode surrogates range. As Wikipedia notes, 'isolated surrogate code points have no general interpretation', and certainly they don't have a distinct visual representation.)
Out of all of Python's available codecs error handlers that
can be used when decoding from UTF-8 to Unicode, 'backslashreplace'
is the one that preserves the most information in a visually clear
manner while still allowing you to easily produce valid UTF-8 output
that everyone is going to accept. The 'replace' handler has the
drawback of making all invalid characters look the same and so
leaves you with no clues as to what they look like in the input,
while 'ignore' just tosses them away entirely, leaving everyone
oblivious to the fact that bad characters were there in the first
place.
(In some situations this makes 'ignore' the right choice, because
you may not want to give people any marker that something is wrong;
such a marker might only confuse them about something they can't
do anything about. But since I'm going to be looking at the rendered
HTML and so on myself, I want to have at least a chance to know
that DWiki is seeing bad input. And 'replace' has the advantage
that it's visible but is less peculiar and noisy than 'backslashreplace';
you might use it when you want some visual marker present that
things are a bit off, but don't want to dump a bucket of weird
backslashes on people.)
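A quick comparison of the handlers on a short burst of invalid UTF-8 (decoding with 'backslashreplace' needs Python 3.5 or later):

```python
raw = b"valid \xc3\xa9 then bad \xff bytes"

# 'backslashreplace' keeps a visible, greppable representation:
assert raw.decode("utf-8", "backslashreplace") == "valid \u00e9 then bad \\xff bytes"
# 'replace' flattens every bad byte into the same U+FFFD marker:
assert raw.decode("utf-8", "replace") == "valid \u00e9 then bad \ufffd bytes"
# 'ignore' silently drops the bad bytes entirely:
assert raw.decode("utf-8", "ignore") == "valid \u00e9 then bad  bytes"

# 'surrogateescape' round-trips the original bytes, but the result
# cannot be encoded to strict UTF-8:
s = raw.decode("utf-8", "surrogateescape")
assert s.encode("utf-8", "surrogateescape") == raw
rejected = False
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    rejected = True  # the lone surrogate U+DCFF is refused
assert rejected
```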
PS: This does mean that my choice here is a bit focused on what's useful for me. For me, having some representation of the actual bad characters visible in what I see gives me some idea of what to look for in the page source and what I'm going to have to fix. For other people, it's probably more going to be noise.
Two annoyances I have with Python's imaplib module
As I mentioned yesterday, I recently wrote
some code that uses the
imaplib module. In the process
of doing this, I wound up experiencing some annoyances, one of them
a traditional one and one a new one that I've only come to appreciate
recently.
The traditional annoyance is that the
imaplib module doesn't wrap
errors from other modules that it uses. This leaves you with at
least two problems. The first is that you get to try to catch a
bunch of exception classes to handle errors:
try:
    c = ssl.create_default_context()
    m = imaplib.IMAP4_SSL(host=host, ssl_context=c)
    [...]
except (imaplib.IMAP4.error, ssl.SSLError, OSError) as e:
    [...]
The second is that, well, I'm not sure I'm actually catching all
of the errors that calling the
imaplib module can raise. The
module doesn't document them, and so this list is merely the ones
that I've been able to provoke in testing. This is the fundamental
flaw of not wrapping exceptions that I wrote about many years ago; by not wrapping exceptions, you make what
modules you call an implicit part of your API. Then you usually
don't document it.
I award the imaplib module bonus points for having its error exception
class accessed via an attribute on another class. I'm sure there's
a historical reason for this, but I really wish it had been cleaned
up as part of the Python 3 migration. In the current Python 3 code,
these exception classes are actually literally classes inside the
IMAP4 class:
class IMAP4:
    [...]
    class error(Exception): pass
    class abort(error): pass
    class readonly(abort): pass
    [...]
The other annoyance is that the imaplib module doesn't implement
any sort of timeouts, either on individual operations or on a whole
sequence of them. If you aren't prepared to wait for potentially
very long amounts of time (if the IMAP server has something go wrong
with it), you need to add some sort of timeout yourself through
means outside of imaplib, either something like a SIGALRM handler
or through manipulating the underlying socket to set timeouts on
it (although I've read that this causes problems, and anyway you're
normally going to be trying to work through SSL as well). For my
own program I opted to go the SIGALRM route, but I have the advantage
that the only thing I'm doing is IMAP. A more sophisticated program
might not want to blow itself up with a SIGALRM just because the
IMAP side of things was too slow.
Timeouts aren't something that I used to think about when I wrote
programs that were mostly run interactively and did only one thing,
where the timeout is most sensibly imposed by the user hitting
Ctrl-C to kill the entire program. Automated testing programs and
other, similar things care a lot about timeouts, because they don't
want to hang if something goes wrong with the server. And in fact
it is possible to cause
imaplib to hang for quite a long time in
a very simple way:
m = imaplib.IMAP4_SSL(host=host, port=443)
You don't even need something that actually responds and gets as far as establishing a TLS session; it's enough for the TCP connection to be accepted. This is reasonably dangerous, because 'accept the connection and then hang' is more or less the expected behavior for a system under sufficiently high load (accepting the connection is handled in the kernel, and then the system is too loaded for the IMAP server to run).
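A sketch of the SIGALRM approach (the function names and timeout policy are my own; this assumes a Unix system and a single-threaded, single-purpose program):

```python
import imaplib
import signal

class IMAPTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise IMAPTimeout("IMAP operation took too long")

def check_login(host, user, pw, timeout=30):
    # SIGALRM is Unix-only and assumes the whole program can afford
    # to be interrupted anywhere, which is fine for a simple checker.
    old = signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(timeout)
    try:
        m = imaplib.IMAP4_SSL(host=host)
        m.login(user, pw)
        m.logout()
        return True
    except (imaplib.IMAP4.error, OSError, IMAPTimeout):
        # ssl.SSLError is a subclass of OSError in Python 3.
        return False
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)
```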
Overall I've wound up feeling that the
imaplib module is okay for
simple, straightforward uses but it's not really a solid base for
anything more. Sure, you can probably use it, but you're also
probably going to be patching things and working around issues.
For us, using
imaplib and papering over these issues is the easiest
way forward, but if I wanted to do more I'd probably look for a third
party module (or think about switching languages).
A few notes on using SSL in Python 3 client programs
I was recently writing a Python program to check whether a test account could log into our IMAP servers and to time how long it took (as part of our new Prometheus monitoring). I used Python because it's one of our standard languages and because it includes the imaplib module, which did all of the hard work for me. As is my usual habit, I read as little of the detailed module documentation as possible and used brute force, which means that my first code looked kind of like this:
try:
    m = imaplib.IMAP4_SSL(host=host)
    m.login(user, pw)
    m.logout()
except ...:
    [...]
When I tried out this code, I discovered that it was perfectly
willing to connect to our IMAP servers using the wrong host name.
At one level this is sort of okay (we're verifying that the IMAP
TLS certificates are good through other checks), but at another
it's wrong. So I went and read the module documentation with a bit
more care, where it pointed me to the ssl module's "Security
considerations" section, which
told me that in modern Python, you want to supply an SSL context and you
should normally get that context from ssl.create_default_context().
The default SSL context is good for a client connecting to a server.
It does certificate verification, including hostname verification,
and has officially reasonable defaults, some of which you can see
by looking at the ctx.options of a created context, and also at
ctx.get_ciphers() (although the latter is rather verbose). Based on
the module documentation, Python 3 is not entirely relying on the
defaults of the underlying TLS library. However, the underlying TLS
library (and its version) affects what module features are available;
you need OpenSSL 1.1.0g or later to get the minimum_version and
maximum_version attributes on SSL contexts.
It's good that people who care can carefully select ciphers, TLS versions, and so on, but it's better that this seems to have good defaults (especially if we want to move away from the server dictating cipher order). I considered explicitly disabling TLSv1 in my checker, but decided that I didn't care enough to tune the settings here (and especially to keep them tuned). Note that explicitly setting a minimum version is a dangerous operation over the long term, because it means that someday you're lowering the minimum version instead of raising it.
(Today, for example, you might set the minimum version to TLS v1.2
and increase your security over the defaults. Then in five years,
the default version could change to TLS v1.3 and now your unchanged
code is worse than the defaults. Fortunately the TLS version constants
do compare properly so far, so you can write code that uses max()
over them to do it more or less right.)
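A sketch of that pattern, assuming Python 3.7 or later for ssl.TLSVersion and SSLContext.minimum_version:

```python
import ssl

ctx = ssl.create_default_context()
# The defaults already verify certificates and hostnames:
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname

# Only ever raise the minimum TLS version, never lower it; max()
# works because the TLSVersion constants compare properly (so far):
ctx.minimum_version = max(ctx.minimum_version, ssl.TLSVersion.TLSv1_2)
assert ctx.minimum_version >= ssl.TLSVersion.TLSv1_2
```

If the defaults ever move past TLS 1.2, this code keeps them instead of silently dragging the minimum back down.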
Python 2.7 also has SSL contexts and ssl.create_default_context(),
starting in 2.7.9. However, use of SSL contexts is less widespread
than it is in Python 3 (for instance the Python 2
imaplib doesn't seem to
support them), so I think it's clear you want to use Python 3 here
if you have a choice.
(It seems a little bit odd to still be thinking about Python 2 now that it's less than a year to it being officially unsupported by the Python developers, but it's not going away any time soon and there are probably people writing new code in it.)
I have somewhat mixed feelings about Python 3's
socket module errors
Many years ago I wrote about some things that irritated me about
Python's socket module. One of my
complaints was that many instances of
socket.error should actually have been
OSError instead of a separate type, because that's
what they really were. Today I was reading AdamW’s Debugging
Adventures: Python 3 Porting 201
(via), where I discovered in a passing
mention that in Python 3,
socket.error is a deprecated alias of OSError.
(Well, from Python 3.3 onwards, due to PEP 3151.)
On the one hand, this is a change that I cautiously approve of.
Many socket errors are just operating system errors, especially on
Unix. On the other hand, in some ways this makes socket.herror
and socket.gaierror feel worse. Both of these violate the rule
of leaving IOError and OSError alone, because
they are subclasses of OSError that do not have authentic errno
values and are not quite genuine OS errors in the same way (they
are errors from the C library, but they don't come from errno).
They do have
strerror fields, which is something, but
then I think all subclasses of
OSError do these days.
Somewhat to my surprise, when I looked at the Python 2 socket
module I discovered that
socket.error is now a subclass of
IOError (since Python
2.6, which in practice means 'on any system with Python 2 that you
actually want to use'). Python 2 also has the same issue where
socket.herror and socket.gaierror are subclasses of IOError
but are not real operating system errors.
Unfortunately for my feelings about leaving OSError alone, the
current situation in the
socket module is probably the best
pragmatic tradeoff. Since the module has high level interfaces that
can fail in multiple ways that result in different types of errors,
in practice people want to be able to just catch one overall error
and be done with it, which means that socket.gaierror (and company)
needs to be a subclass of socket.error. When you combine this
with socket.error really being some form of
OSError, you arrive
at the current state of affairs.
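The current state of affairs is easy to see interactively:

```python
import socket

# In Python 3, socket.error is literally OSError:
assert socket.error is OSError
# and the resolver errors are subclasses of it (without real
# errno values behind them):
assert issubclass(socket.gaierror, OSError)
assert issubclass(socket.herror, OSError)

# So a single except clause catches OS-level errors and resolver
# errors alike:
try:
    socket.getaddrinfo("nonexistent.invalid", 80)
except OSError:
    pass  # a socket.gaierror lands here too
```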
I've decided that I don't have a strong opinion on socket.error
changing from being a subclass of IOError/OSError to being an alias
for it. I can imagine Python code that might want to use a try at
a high level, call both socket functions and other OS functions
within that high-level try, and distinguish between the two sources
of errors, which is now impossible in Python 3, but I'm not sure
that this is a desirable pattern. I don't think I have anything like
this in my own Python code, but it's something that I should keep an
eye out for as I convert things over to Python 3.
(I do have some Python 2 code that catches both socket.error and
EnvironmentError, but fortunately it treats them the same.)
Thinking about DWiki's Python 3 Unicode issues
DWiki (the code behind this blog) is currently Python 2, and it has to move to Python 3 someday, even if I'm in no hurry to make that move. The end of 2018, with only a year of official Python 2 support remaining, seems like a good time to take stock of what I expect to be the biggest aspect of that move, which is character set and Unicode issues (this is also the big issue I ignored when I got DWiki tentatively running under Python 3 a few years ago).
The current Python 2 version of DWiki basically ignores encoding issues. It allows you to specify the character set the HTML will say, but it pretty much treats everything as bytes and makes no attempts to validate that your content is actually valid in the character set you've claimed. This is not viable in Python 3 for various reasons, including that it's not how the Python 3 version of WSGI works (as covered in PEP 3333). Considering Unicode issues for a Python 3 version of DWiki means thinking about everywhere that DWiki reads and writes data from, and deciding what encoding that data is in (and then properly inserting error checks to handle when that data is not actually properly encoded).
The primary source of text data for DWiki is the text of pages and
comments. Here in 2018, the only sensible encoding for these is
UTF-8, and I should probably just hardcode that assumption into
reading them from the filesystem (and writing comments out to the
filesystem). Relying on Python's system encoding setting, whatever
it is, seems not like a good idea, and I don't think this should
be settable in DWiki's configuration file. UTF-8 also has the
advantage for writing things out that it's a universal encoder; you
can encode any Unicode
str to UTF-8, which isn't true of all encodings (ASCII being the
obvious example).
Another source of text data is the names of files and directories in the directory hierarchy that DWiki serves content from; these will generally appear in links and various other places. Again, I think the only sensible decision in 2018 is to declare that all filenames have to be UTF-8 and undefined things happen if they aren't. DWiki will do its best to do something sensible, but it can only do so much. Since these names propagate through to links and so on, I will have to make sure that UTF-8 in links is properly encoded.
(In general, I probably want to use the 'backslashreplace' error
handling option when decoding to Unicode, because that's the option
that both produces correct results and preserves as much information
as possible. Since this introduces extra backslashes, I'll have to
make sure they're all handled properly.)
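For the page-reading side, the decoding I have in mind looks something like this (a hypothetical helper, not DWiki's actual code):

```python
def read_page(path):
    # Hard-code UTF-8 and keep any bad bytes visible as backslash
    # escapes instead of blowing up partway through rendering.
    with open(path, "rb") as f:
        return f.read().decode("utf-8", "backslashreplace")
```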
For HTML output, once again the only sensible encoding is UTF-8. I'll take out the current configuration file option and just hard-code it, so the internal Unicode HTML content that's produced by rendering DWikiText to HTML will be encoded to UTF-8 bytestrings. I'll have to make sure that I consistently calculate my ETag values from the same version of the content, probably the bytestring version (the current code calculates the ETag hash very late in the process).
DWiki interacts with the HTTP world through WSGI, although it's all
my own WSGI implementation in a normal setup. PEP 3333 clarifies
WSGI for Python 3, and it specifies two sides of things here; what
types are used where, and
some information on header encoding. For
output, generally my header values will be in ISO-8859-1; however,
for some redirections, the
Location: header might include UTF-8
derived from filenames, and I'll need to encode it properly. Handling
incoming HTTP headers and bodies is going to be more annoying and
perhaps more challenging; people and programs may well send me
incorrectly formed headers that aren't properly encoded, and for
POST requests (for example, for comments) there may be various
encodings in use and also the possibility that the data is not
correctly encoded (eg it claims to be UTF-8 but doesn't decode
properly). In theory I might be able to force people to use UTF-8
on comment submissions, and probably most
browsers would accept that.
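For the Location: header case, the encoding step might look like this (a hypothetical sketch; the URL prefix is made up):

```python
from urllib.parse import quote

def location_for(filename):
    # quote() encodes the str to UTF-8 and then percent-encodes it,
    # so the header value stays plain ASCII, which is safe under
    # PEP 3333's Latin-1 rule for header values.
    return "/dwiki/" + quote(filename, safe="/")

assert location_for("caf\u00e9") == "/dwiki/caf%C3%A9"
```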
Since I don't actually know what happens in the wild here, probably a sensible first pass Python 3 implementation should log and reject with an HTTP error any comment submission that is not in UTF-8, or any HTTP request with headers that don't properly decode. If I see any significant quantity of them that appears legitimate, I can add code that tries to handle the situation.
(Possibly I should start by adding code to the current Python 2
version of DWiki that looks for this situation and logs information
about it. That would give me a year or two of data at a minimum.
I should also add an
accept-charset attribute to the current comment form.)
DWiki has on-disk caches of data created with Python's pickle module. I'll have to make sure that the code reads and writes these objects using bytestrings and in binary mode, without trying to encode or decode it (in my current code, I read and write the pickled data myself, not through the pickle module).
The current DWiki code does some escaping of bad characters in text, because at one point control characters kept creeping in and blowing up my Atom feeds. This escaping should stay in a Python 3 Unicode world, where it will become more correct and reliable (currently it really operates on bytes, which has various issues).
Since in real life most things are properly encoded and even mostly
ASCII, mistakes in all of this might lurk undetected for some time.
To deal with this, I should set up two torture test environments
for DWiki, one where there is UTF-8 everywhere I can think of
(including in file and directory names) and one where there is
incorrectly encoded UTF-8 everywhere I can think of (or things just
not encoded as UTF-8, but instead Latin-1 or something). Running
DWiki against both of these would smoke out many problems and areas
I've missed. I should also put together some HTTP tests with badly
encoded headers and comment
POST bodies and so on, although I'm
not sure what tools are available to create deliberately incorrect
HTTP requests like that.
All of this is clearly going to be a long term project and I've probably missed some areas, but at least I'm starting to think about it a bit. Also, I now have some preliminary steps I can take while DWiki is still a Python 2 program (although whether I'll get around to them is another question, as it always is these days with work on DWiki's code).
PS: Rereading my old entry has also reminded me that there's DWiki's logging messages as well. I'll just declare those to be UTF-8 and be done with it, since I can turn any Unicode into UTF-8. The rest of the log file may or may not be UTF-8, but I really don't care. Fortunately DWiki doesn't use syslog (although I've already wrestled with that issue).
Sidebar: DWiki's rendering templates and static file serving
DWiki has an entire home-grown template system that's used as part of the processing model. These templates should be declared to be UTF-8 and loaded as such, with it being a fatal internal error if they fail to decode properly.
DWiki can also be configured to serve static files. In Python 3, these static files should be loaded uninterpreted as (binary mode) bytestrings and served back out that way, especially since they can be used for things like images (which are binary data to start with). Unfortunately this is going to require some code changes in DWiki's storage layer, because right now these static files are loaded from disk with the same code that is also used to load DWikiText pages, which have to be decoded to Unicode as they're loaded.