2016-07-11
Why Python can't have a full equivalent of Go's gofmt
I mentioned in passing here that people
are working on Python equivalents of Go's gofmt and since then
I've played around a bit with yapf,
which was the most well developed one that I could find. Playing
around with yapf (and thinking more about how to deal with my
Python autoindent problem) brought home
a realization, which is that Python fundamentally can't have a full,
true equivalent of gofmt.
In Go, you can be totally sloppy in your pre-gofmt code; basically
anything goes. Specifically, you don't need to indent your Go code
in any particular way or even at all. If you're banging out a quick
modification to some Go code, you can just stuff it in with either
completely contradictory indentation or no indentation at all. More
broadly, you can easily survive an editor with malfunctioning
auto-indentation for Go code. Sure, you code will look ugly before
you gofmt it, but it'll still work.
With Python, you can't be this free and casual. Since indentation
is semantically meaningful, you must get the indentation correct
right from the start; you can't leave it out or be inconsistent. A
Python equivalent of gofmt can change the indentation level you
use (and change some aspects of indentation style), but it can't
add indentation for you in the way that gofmt does. This means
that malfunctioning editor auto-indent is quite a bit more damaging
(as is not having it at all); since indentation is not optional,
you must correct or add it by hand, all the time. In Python, either
you or your editor are forced to be less sloppy than you can be in
Go.
(Sure, Go requires that you put in the { and } to denote block
start and end, but those are easy and fast compared to getting
indentation correct.)
Of course, you can start out with minimal, fast to create indentation;
Python will let you do one or two space indents if you really want.
But once you run yapf on your initial code, in many cases you're
going to be stuck matching it for code changes. Python will tolerate
a certain amount of indentation style mismatches, but not too much
(Python 3 is less relaxed here than Python 2). Also, I'm confident
that I don't know just how much sloppiness one can get away with
here, so in practice I think most people are going to be matching
the existing indentation even if they don't strictly have to. I
know that I will be.
I hadn't thought about this asymmetry before my editor of choice started not getting my Python auto-indentation quite right, but it's now rather more on my mind.
2016-07-03
An irritating little bug in the latest GNU Emacs Python autoindent code
I really like having smart autoindent in my editor when writing code, Python code included. When it works, autoindent does exactly what I would do by hand, does it easier, and in the process shows me errors in my code (if the autoindent is 'wrong', it is a signal that something earlier is off). But the flipside of this is that when autoindent goes wrong it can be a screaming irritation, as I flip from working with my editor to actively fighting it.
Unfortunately the latest official version of GNU Emacs has such an
issue in its Python autoindent code, under conditions that are
probably rare. To see the bug, set Emacs up with python-indent-offset
set to 8 and indent-tabs-mode set to t, and then enter:
def abc(): if d in e: pass # Hit return here:
If you put your cursor on the end of the comment and hit return,
autoindent doesn't add any indentation at all. It should add one
level of indentation. Also, once you have this code in a .py file
you don't need to set anything in Emacs; Emacs will auto-guess that
the indent offset is 8 and then the mere presence of tabs will cause
things to explode. This makes this issue especially annoying and/or
hazardous.
Some people will say that this serves me right for still using tabs for indentation in my Python code. I'm aware that there's been a general movement in the Python community to indent all Python code with only spaces, regardless of how much you indent it by, but for various reasons I have strongly resisted this. One of them is that I edit Python code in multiple editors, not all of them ones with smart autoindentation, and space-based indenting is painful in an editor that doesn't do it for you. Well, at least using generous indents with manual spaces is painful, and I'm not likely to give that up any time soon.
(I like generous indents in code. Small indent levels make everything feel crammed together and it's less obvious if something is misindented when everything is closer. Of course Python's many levels of nesting doesn't necessarily make this easy; by the time I'm writing an inner function in a method in a class, I'm starting to run out of horizontal space.)
PS: I suspect that I'm going to have to give up my 'indent with tabs' habits some day, probably along with my 8-space indents. The modern Python standard seems to be 4-space indent with spaces and there's a certain amount to be said for the value of uniformity.
(People are apparently working on Python equivalents of Go's gofmt,
eg yapf. This doesn't entirely
make my issues go away, but at least it would give me some tools
to more or less automatically convert existing code over so that I
don't have to deal with a mismash of old and new formatting in
different files or projects.)
2016-06-30
Some advantages of using argparse to handle arguments as well as options
I started using Python long enough ago that there was only the getopt module, which was okay because that's what I was used to from C and other Unix languages (shell, Perl, etc), and then evolved for a bit through optparse; I only started using argparse relatively recently. As a result of all of this background, I'm used to thinking of 'argument handling' as only processing command line switches and their arguments for you, and giving you back basically a list of the remaining arguments, which is your responsibility to check how many there are, parse, and so on.
Despite being very accustomed to working this way, I'm starting to abandon it when using argparse. Part of this is what I discovered the first time I used argparse, namely that it's the lazy way. But I've now used argparse a second time and I'm feeling that there are real advantages to letting it handle as many positional arguments as possible in as specific a way as possible.
For instance, suppose that you're writing a Python program that takes exactly five positional arguments. The lazy way to handle this is simply:
parser.add_argument("args", metavar="ARGS", nargs=5)
If you take exactly five arguments, they probably mean different things. So the better way is to add them separately:
parser.add_argument("eximid", metavar="EXIMID")
parser.add_argument("ctype", metavar="CONTENT-TYPE")
parser.add_argument("cdisp", metavar="CONTENT-DISPOSITION")
parser.add_argument("mname", metavar="MIMENAME")
parser.add_argument("file", metavar="FILE")
Obviously this gives you easy separate access to each argument in
your program, but the really nice thing this does is that it adds
some useful descriptive context to your program's usage message.
If you choose the metavar values well, your usage message will
strongly hint to what needs to be supplied as each argument. But
we can do better, because argparse is perfectly happy to let you
attach help to positional arguments as well as to switches (and it
will then print it out again in the usage message, all nicely
formatted and so on).
You can do the same thing by hand, of course; there's nothing preventing you from writing the same documentation with manual argument parsing and printing it out appropriately (although argparse does do a good job of formatting it). But it feels easier with argparse and it feels more natural, because argparse lets me put everything to do with a positional argument in one spot; I can name the internal variable, specify its visible short name, and then add help, all at once. If nothing else, this is likely to keep all of these things in sync with each other.
(And I'm not going to underestimate the importance of automatic good formatting, because that removes a point of friction in writing the help message for a given positional argument.)
The result of all of this is that using argparse for positional arguments in my latest program has effortlessly given me not just a check for having the right number of positional arguments but a bunch of useful help text as well. Since I frequently don't touch programs for a year or two, I foresee this being a useful memory jog for future me.
In summary, if I can get argparse to handle my positional arguments in future Python programs, I'm going to let it. I've become convinced that it's not just the lazy way, it's the better way.
(This is where some Python people may laugh at me for having taken so long to start using argparse. In my vague defense, we still have some machines without Python 2.7.)
2016-06-27
Today's lesson on the value of commenting your configuration settings
We have a relatively long-standing Django web application that was first written for Django 1.2 and hadn't been substantially revised since. Earlier this year I did a major rework in order to fully update it for Django 1.9; not just to be compatible, but to be (re)structured into the way that Django now wants apps to be set up.
Part of Django's app structure is a settings.py file that contains,
well, all sorts of configuration settings for your system; you
normally get your initial version of this by having Django create
it for you. What Django wants you to have in this file and how it's
been structured has varied over Django versions, so if you have a
five year old app its settings.py file can look nothing like what
Django would now create. Since I was doing a drastic restructuring
anyways, I decided to deal with this issue the simple way. I'd have
Django write out a new stock settings.py file for me, as if I was
starting a project from scratch, and then I would recreate all of
the settings changes we needed. In the process I would drop any
settings that were now quietly obsolete and unnecessary.
(Since the settings file is just ordinary Python, it's easy to wind up setting 'configuration options' that no longer exist. Nothing complains that you have some extra variables defined, and in fact you're perfectly free to define your own settings that are used only by your app, so Django can't even tell.)
In the process of this, I managed to drop (well, omit copying) the
ADMINS setting that makes Django send us email if there's an
uncaught exception (see Django's error
reporting documentation).
I didn't spot this when we deployed the new updated version of the
application (I'm not sure I even remembered this feature). I only
discovered the omission when Derek's question here
sent me looking at our configuration file to find out just what
we'd set and, well, I discovered that our current version didn't
have anything. Oops, as they say.
Looking back at our old settings.py, I'm pretty certain that I omitted
ADMINS simply because it didn't have any comments around it to tell
me what it did or why it was there. Without a comment, it looked like
something that old versions of Django set up but new versions didn't
need (and so didn't put into their stock settings.py). Clearly if I'd
checked what if anything ADMINS meant in Django 1.9 I'd have spotted
my error, but, well, people take shortcuts, myself included.
(Django does have global documentation for settings, but there is no global alphabetical index of settings so you can easily see what is and isn't a Django setting. Nor are settings grouped into namespaces to make it clear what they theoretically affect.)
This is yet another instance of some context being obvious to me
at the time I did something but it being very inobvious to me much
later. I'm sure that when I put the ADMINS setting into the initial
settings.py I knew exactly what I was doing and why, and it was
so obvious I didn't think it needed anything. Well, it's five years
later and all of that detail fell out of my brain and here I am,
re-learning this lesson yet again.
(When I put the ADMINS setting and related bits and pieces back
into our settings.py, you can bet that I added a hopefully clear
comment too. Technically it's not in settings.py, but that's
a topic for another blog entry.)
2016-06-25
What Python 3 versions Django supports, and when this changes
I was idly skimming the in-progress release notes for Django 1.10 when one of the small sections that I usually ignore jumped out at me instead:
Like Django 1.9, Django 1.10 requires Python 2.7, 3.4, or 3.5. [...]
Since I've recently been thinking about running Django on Python 3, the supported Python 3 versions caught my eye. More exactly, that it was a short list. This made me wonder what Django versions will support what Python 3 versions, and for how long.
In mid 2015, the Django project published a roadmap and said:
We will support a Python version up to and including the first Django LTS release whose security support ends after security support for that version of Python ends. For example, Python 3.3 security support ends September 2017 and Django 1.8 LTS security support ends April 2018. Therefore Django 1.8 is the last version to support Python 3.3.
So we need to look at both the Django release schedule and the Python release and support schedule. On the Django side, Django's next LTS release is '1.11 LTS', scheduled to release in April 2017 and be supported through April 2020 (and it's expected to be the last version supporting Python 2.7, since official Python 2.7 security support ends in 2020). After that is Django 2.2 in April 2019, supported through April 2022. On the Python side, the Python team appears to be doing 3.x releases roughly every 18 months (see eg PEP 494 on Python 3.6's release schedule) and giving them security support for five years after their initial release. If this is right, Python 3.4 will be supported through March 2019 and Python 3.5 through September 2020; 3.6 is expected in December 2016 (supported through December 2021) and thus 3.7 in roughly May of 2018 (supported through May 2023).
Putting all of this together, I get an expected result of:
- Python 3.4 will be supported through Django 1.11; Django 2.0
(nominally due December 2017) will drop support for it.
- Python 3.5 and 3.6 will probably be supported through Django 2.2.
Django 1.11 will almost certainly be the first release to support
Python 3.6.
- Python 3.7's exact Django support range is up in the air since at this point I'm projecting both Python and Django release schedules rather far into the misty future.
Ubuntu 14.04 LTS has Python 3.4 and Ubuntu 16.04 LTS has 3.5. Both will be supported long enough to run into the maximum likely Django version that supports their Python 3 version, although only more or less at the end of each Ubuntu LTS's lifespan.
(I'm going to have to mull over what this means for Python 3 migration plans for our Django app. Probably a real Python 3 migration attempt is closer than I thought it would be.)
2016-06-22
I need to cultivate some new coding habits for Python 3 ready code
We have a Django application, and because of various aspects of Django (such as Django handling much of the heavy lifting), I expect that it's our most Python 3 ready chunk of code. Since I was working on it today anyways, I took a brief run at seeing if it would at least print help messages if I ran it under Python 3. As it turns out, making the attempt shows me that I need to cultivate some new coding habits in order to routinely write code that will be ready for Python 3.
What I stumbled over today is that I still like to write except
clauses in the old way:
try: .... except SomeErr, err: ....
The new way to write except clauses is the less ambiguous 'except
SomeErr as err:'. Python 3 only supports the new style.
Despite writing at least some of the code in our Django application
relatively recently, I still wrote it using the old style for
except. Of course this means I need to change it all. I'm pretty
certain that writing except clauses this way is not something
that I think about; it's just a habit of how I write Python, developed
from years of writing Python before 'except CLS as ERR' existed
or at least was usable by me.
What I take away from this is that I'm going to need to make new
Python coding habits, or more exactly go through the more difficult
exercise of overwriting old habits with new ones. I'm sure to be
irrationally annoyed at some of the necessary changes, especially
turning 'print' statements into function calls.
(If I was serious about this, what I should do is force myself to write only in Python 3 for a while. Unfortunately that's not very likely.)
The good news is that I checked some code I wrote recently, and I
seem to have deliberately used the new style except clauses in
it. Now if I can remember to keep doing that, I might be in okay
shape.
(Having thought about it, what would be handy is a Python linter that complains about such unnecessary Python 3 incompatibilities. Then I'd at least have a chance of catching my issues here right away.)
PS: Modernizing an old code base is a related issue, of course. I need to modernize both code and habits to be ready for Python 3 in both current and future code.
Sidebar: The timing and rationality of using old-style except
New style except was introduced in Python 2.6, which dates back
to 2008. However, the new version of Python didn't propagate into
things like Linux distributions immediately; it took two years to
get it into an Ubuntu LTS release, for example (in 10.04). Looking
back at various records, it seems that the initial version of our
Django application was deployed on an
Ubuntu 8.04 machine that would have had only Python 2.5. In fact I may have written the first version
of all of the substantial code in the application while we were
still using 8.04 as the host machine and so new-style except would
have been unavailable to us.
This is of course no longer the case. Although not everything we
run today has Python 2.7 available (cf),
it all has at least Python 2.6. So I should be writing all my code
with new style except clauses and probably some other modernizations.
Moving from Python 2 to Python 3 calls for a code inventory
I was all set to write a blog entry breaking down what sort of Python code we had, how much it was exposed to security issues and other threats, and how much work it would probably be to migrate it to or towards Python 3. I even thought it was going to be a relatively short and simple entry. Then, as I was writing things down, I kept remembering more and more bits of Python code we're using in different contexts, and I'm pretty sure I'm still forgetting some.
So, here's my broad moral for today: if you have Python code, and you're either thinking of migrating at least some of it to Python 3 or considering whether you can ignore the alleged risks of continuing to use Python 2, your first step is (or should be) to get a code inventory. Expect this to probably take a while; you don't want just the big obvious applications, you also care about the little things in the corners.
Perhaps we're unusual, but we don't have our Python code in one or more big applications, where it's easy and obvious to look at things. Instead, we have all sorts of things written in Python, everything from a modest Django application through system management subsystems to little command line things (and not so little ones). These have accumulated over what is almost a decade by now, and if they work quietly we basically forget about them (and most of them do). It's clearly going to take me some work to find them all, categorize them, and probably in some cases discover that they're now unnecessary.
Having written this, I don't know if I'm actually going to do such an inventory any time soon. The problem is that the work is a boring slog and the issue is not particularly urgent, even if we accept a '2020 or earlier' deadline on Python 2 support. Worse, if I do an inventory now and then do nothing with it, it's probably going to get out of date (wasting the work). I'd still like to know, though, if only for my own peace of mind.
2016-05-21
Please stop the Python 2 security scaremongering
Let's start with Aaron Meurer's Moving Away from Python 2 in which I read, in passing:
- Python 2.7 support ends in 2020. That means all updates, including security updates. For all intents and purposes, Python 2.7 becomes an insecure language to use at that point in time.
There is no nice way to put it: this is security scaremongering.
It is security scaremongering for three good reasons. First, by 2020 Python 2.7 is very likely to be an extremely stable piece of code that has already been picked over heavily for security issues. Even today Python 2.7 security issues are fairly rare, and we still have four more years for people to apply steadily improving analysis and fuzzing tools to Python 2.7 to find anything left. As such, the practical odds that people will find any significant security issues in Python 2.7 after it stops being supported seems fairly low.
Second, it is not as if Python 2.7 will be unsupported in 2020. Oh, sure, the main Python team will not support it, but there are plenty of OS vendors (especially Linux vendors) that either do have or likely will have supported OS versions with officially supported Python 2.7 versions. These vendors themselves are going to fix any security issues found in 2.7. As 2020 approaches, it's very likely that you'll be using a vendor version of 2.7 and so be covered by their security teams. If you're building 2.7 yourself, well, you can copy their work.
(By the way, this means that a bunch of security teams have a good motive to fuzz and attack Python 2.7 now, while the Python core team will still fix any problems they find.)
Finally, a potentially significant amount of Python code is not even running in a security sensitive setting in the first place. If your Python code is processing trusted input in a trusted environment, any potential security issues in Python 2.7 are basically irrelevant. Not all Python code is running websites, to put it one way.
To imply that using Python 2.7 after support ends in 2020 will immediately endanger people is scaremongering. The reality is that it's extremely likely that Python 2.7 after 2020 will be just as secure and stable as it was before 2020, and it's very likely that any issues found after 2020 will be promptly fixed by OS vendors.
(A much more likely security issue with Python 2.7 even before 2020 is framework, library, and package authors abandoning all support for 2.7 versions of their code. If Django is no longer getting security fixes on 2.7, it doesn't really matter that the CPython interpreter itself is still secure.)
By the way, I'm entirely neglecting alternate Python implementations here. These have historically targeted Python 2, not Python 3, and their status of supporting Python 3 (only) is often what you could call 'uncertain'. It seems entirely possible that, say, PyPy might wind up supporting Python 2.7.x well after the main CPython team drops support for it, and of course PyPy would likely fix any security issues that were uncovered in their implementation.
Sidebar: Vendor support periods and Python 2.7
In already released Linux distributions, Ubuntu 16.04 LTS has just been released with Python 2.7.11; it will be supported for five years, until April 2021 or so. Red Hat Enterprise Linux 7 (and CentOS 7) has Python 2.7 and will be supported until midway through 2024 (cf).
(Which version of Python 2.7 RHEL 7 has is sort of up in the air. It is officially '2.7.5', but it has additional RHEL patches and RHEL does backport security fixes as needed and so on.)
In future releases, it seems pretty likely that Ubuntu will release 18.04 LTS in April 2018, it will come with a fully supported Python 2.7, and be supported for five years, through 2023. Red Hat will probably release a new version of RHEL before 2020, will likely include Python 2.7, and if so will be supporting it for ten years from the release, which will take practical 2.7 support well into the late 2020s.
2016-05-20
Some notes on abusing the pexpect Python module
What you are theoretically supposed to use pexpect for is to have your program automatically interact with interactive programs. When they produce certain sorts of output, you recognize it and take action; when you see prompts, you can automatically answer them. Pexpect is often used this way to automate things that expect to be operated manually by a real person. This is not what I'm using pexpect for. What I'm using it for is to start a program in what it thinks is an interactive environment, capture its output if all goes well, and if things go wrong allow a human operator to step in and interact with the program (all the while still capturing the output). This means that I'm ignoring almost all of pexpect's functionality and abusing parts of the rest in ways that it was probably not designed for.
Before I start, I need to throw in a disclaimer. There are multiple versions of pexpect out there; my impression is that development stalled for a while and then picked up recently. As I write this, the pexpect documentation talks about 4.0.1, but what I've used is no later than 3.1. Pexpect 4 may fix some of the issues I'm going to grumble about.
Supposing that my case is what you want to do, you start out by spawning a command:
child = pexpect.spawn(YOURCOMMAND, args=args, timeout=None)
It's important to set a timeout of None as the starting timeout.
If you want to have a timeout at all, for example to detect that
the remote end has gone silent, you want to control it on a call
by call basis.
Now you want to collect output from the child command:
res = []
while not child.closed and child.isalive():
try:
r = child.read_nonblocking(size = 16*1024, timeout=YOURTIMEOUT)
res.append(r)
except pexpect.EOF:
# expected, just stop
break
except pexpect.TIMEOUT:
# do whatever you want to recover
return recover_child(child, res)
You might as well set size to large here. Although the documentation
doesn't tell you this, it is just the maximum amount of data your
read can ever return; it doesn't block until that much data is
available. My principle is 'if the command generates a lot of output,
let's read it in big blocks'.
We're not done once pexpect has raised an EOF. We need to do some cleanup to make sure that the child's exit status is available:
# Some of this is probably superstition
if not child.closed and child.isalive():
child.wait()
return (res, child.status)
Pexpect 3.1's documentation is not entirely clear on what you have
to check when in order to see if the child is alive or not. Note
that .isalive() has the (useful) side effect of harvesting the
child's exit status if the child is not alive. It's helpfully not
valid to call .wait() on a dead child, at least in 3.1, so you
have to check carefully first.
As pexpect documents, it splits the actual OS process exit status
into child.exitstatus and child.signalstatus (and various things
return one or the other). The whole status is available as
child.status, but you may find one or the other variant more
useful (for example if you're really only interested in 'did the
command exit with status 0 or did something go boom').
Allowing the user to interact with the child is somewhat more
involved. Fundamentally we call child.interact() repeatedly,
but there is a bunch of things that you need to do around this.
def talkto(child):
# Set up to log interactive output
res = []
def save_output(data):
if data: res.append(data)
return data
while not child.closed and child.isalive():
try:
child.interact(output_filter=save_output)
except OSError as e:
# Usually an EOF from the command.
# Complain somehow.
break
# If the child is alive here, the user has
# typed a ^] to escape from interact().
# What happens next is up to you.
Yes, you read that right. Uniquely, pexpect's child.interact()
does not raise pexpect.EOF on EOF from the child; instead it
generally passes through an underlying OSError that it got (my
notes don't say what that OSError usually is). In general, if you
get an OSError here you have to assume that the session is dead,
although pexpect doesn't necessarily know it yet.
Usefully, child.interact() sets things up so that control characters
and so on that the user types are normally passed through directly
to the child process instead of affecting your Python program. This
means that under normal circumstances, if you type eg ^C your Python
code won't get hit with a SIGINT; it'll go through to the child
program and the child program will do whatever it does in reaction.
What you do if the user chooses to use ^[ to exit from child.interact()
is up to you. Note that you can allow them to resume the interaction;
just go back through your loop to call child.interact() again.
If you allow the user to abandon the child and exit your talkto()
function (you probably want to), you need to do some more cleanup
of the child:
# after interact() returns, try to # read anything left over, then close the child. try: r = child.read_nonblocking(size=128*1024, timeout=0) res.append(r) except (pexpect.EOF, pexpect.TIMEOUT, OSError): pass child.close(force=True)
Calling read_nonblocking with_timeout=0_ means what you think
it does; it's a non-blocking read of whatever (final) data is
available right now, with no waiting for anything more to come in
from the child.
At least in pexpect 3.1, you basically should call child.close()
with force=True or you will get a pexpect error if the child stays
alive, which it may. Setting force winds up hitting the child
with a SIGKILL if nothing else seems to work, which is relatively
sure.
(Although the documentation doesn't mention it, if the child is
alive it always gets sent SIGHUP and then SIGINT first. Well,
this happens in older versions of pexpect; the 4.0.1 code is a bit
different and I haven't dug through it.)
Possibly there is a better Python module for this sort of interaction in general. If so, it is too late for me; I've already written all of this code and I hope to not have to touch it again before we have to port it to Python 3 (if ever).
(My impression is that you should try to use pexpect 4 if you can, as the code has been overhauled and the documentation at least somewhat improved.)
2016-04-13
How I'm trying to do durable disk writes here on Wandering Thoughts
A while back, Wandering Thoughts lost a just posted comment when the host went down abruptly and unexpectedly. Fortunately it was my comment and I was able to recover it, but the whole thing made me realize that I should be doing better about durable writes in DWiki. Up until that point, my code for writing comments to disk had basically been ignoring the issue; I opened a new file, I wrote things, I closed the file, I assumed it was all good. Well, no, you can't really make that assumption.
(Many normal blogs and wikis store things in a database, which worries about doing durable disk writes for you. I am crazy or perverse, so DWiki only uses the filesystem. Each comment is a little file in a particular directory.)
You can do quite a lot of worrying about disk durability on Unix if you try hard (database people have horror stories about how things go wrong). I decided not to worry that much about it, but I did want to do a reasonably competent job. The rules for this on Unix are very complicated and somewhat system dependent, so I am not going to try to cover them; I'm just going to cover my situation. I am writing new files infrequently and I'm willing for this to be a bit slower than it otherwise would be.
Omitting error checking and assuming that you have a fp file and
a dirname variable already, what you want in my situation looks
like this:
fp.flush() os.fsync(fp.fileno()) dfd = os.open(dirname, os.O_RDONLY) os.fsync(dfd) os.close(dfd)
The first chunk of code flushes the data I've written to the file out to disk. However, for newly created files this is not enough; a filesystem is allowed to not immediately write to disk the fact that the new file is in the directory, and some will do just that. So to be sure we must poke the system so it 'flushes' the directory to disk, ie makes sure that our new file is properly recorded as being present.
(My actual code is already using raw os module writes to write
the file, so I can just do 'os.fsync(fd)'. But you probably have
a file object, since that's the common case.)
Some sample code you can find online will use os.O_DIRECTORY
when opening the directory here. This isn't necessary but doesn't
do any harm on most systems as far as I can tell; however, note
that there are some uncommon Unix systems that don't even have an
os.O_DIRECTORY.
(Although I haven't read through the whole thing and thus can't vouch for it, you may find Reliable file updates with Python to be useful for a much larger discussion of various file update patterns and how to make them reliable. The overall summary here is that reliable file durability on Unix is a minefield. I'm not even sure I have it right, and I try to stay on top of this stuff.)
Sidebar: Why you don't want to use os.sync()
If you're using Python 3.3+, you may be tempted to reach for the
giant hammer of os.sync() here; after all, the sync() system
call flushes everything to disk and you can't get more sure than
that. There are two problems with doing this. First, a full sync()
may take a significant length of time, up in the tens of seconds
or longer. All you need is for something else to have written but
not flushed a bunch of data and the disks to be busy (perhaps your
system has backups running) and then boom, your sync() is delaying
your program for a very long time.
Second, a full sync() may cause you to trip over unrelated disk
problems. If the server you're running on has a completely different
filesystem that may be having some disk IO problems, well, your
sync() is going to be waiting for the overall system to flush out
data to that filesystem even though you don't care about it in the
least. I'm unusually sensitive to this issue because I work in an
environment with a lot of NFS mounted filesystems that come from
an number of different servers,
and that means a lot of ways that just one or a few filesystems can
be more than a little slow at the moment.