Wandering Thoughts archives

2015-08-21

What's going on with a Python assignment puzzle

Via @chneukirchen, I ran across this tweet:

Armin just came up with this puzzle, how well do you know obscure Python details? What's a after this statement?:
(a, b) = a[b] = {}, 5

This is best run interactively for maximum head-scratching. I had to run it in an interpreter myself and then think for a while, because there are several interesting Python things going on here.

Let's start by removing the middle assignment. That gives us:

(a, b) = {}, 5

This is Python's multiple variable assignment ('x, y = 10, 20') written to make the sequence nature of the variable names explicit (which is why the Python tutorial calls this 'sequence unpacking'). Writing the list of variables as an explicit tuple (or list) is optional but is something even I've done sometimes, although I think writing it this way has fallen out of favour. Thus it's equivalent to:

t = ({}, 5)
(a, b) = t

The next trick is that (somewhat to my surprise) when you're assigning to several targets at once (as in 'x = y = 10') and doing sequence unpacking for one of those targets, Python doesn't require you to do sequence unpacking for every target. The following is valid:

(a, b) = x = t

Here a and b become the individual elements of the tuple t while x is the whole tuple. I suppose this is a useful trick to remember if you sometimes want both the tuple and its elements for different purposes.
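A quick interpreter check of this:

>>> (x, y) = pair = (1, 2)
>>> pair, x, y
((1, 2), 1, 2)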

The next trick happening is that Python explicitly handles repeated variable assignment (sometimes called 'chained assignment' or 'serial assignment') in left to right order; the right hand side is evaluated just once, then the leftmost assignment is done first, the next leftmost second, and so on. Here we only have two assignments, so the entire statement is equivalent to the much more verbose form:

t = ({}, 5)
(a, b) = t
a[b] = t

(When you do this outside of a function, the first (leftmost) assignment also creates a and b as names, which means that the second (right) assignment then has them available to use and doesn't get a 'name is not defined' error.)

The final 'trick' is due to what variables mean in Python, which creates the recursion in a[b]'s value. The tuple t that winds up assigned to a[b] contains a reference to the same dictionary that a refers to, which means that the tuple contains a dictionary that contains the tuple again, and it's recursion all the way down.
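You can see the recursion directly in the interpreter, because CPython prints '...' when a container winds up containing itself:

>>> (a, b) = a[b] = {}, 5
>>> a
{5: ({...}, 5)}
>>> a[b][0] is a
True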

(When you combine Python's name binding behavior with serial assignment like this, you can wind up with fun bugs.)

AssignmentPuzzleUnpacked written at 01:57:27

2015-08-20

Using abstract namespace Unix domain sockets and SO_PEERCRED in Python

Linux has a special version of Unix domain sockets where the socket address is not a socket file in the filesystem but instead in an abstract namespace. It's possible to use them from Python without particular problems, including checking permissions with SO_PEERCRED, but it's not completely obvious how.

(For general information on using Unix domain sockets from Python, see UnixDomainSockets.)

With a normal Unix domain socket, the address you give is the path to a socket file. Per the Linux unix(7) manpage, an abstract socket address is simply your abstract name with a 0 byte on the front. This is trivial in Python and works exactly as you'd hope:

import socket

sname = "demo"       # your abstract name; shown here with a hypothetical value
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.bind("\0" + sname) # the leading 0 byte selects the abstract namespace
s.listen(10)
# or s.connect("\0" + sname) to talk to a server
....

This works in both Python 2 and Python 3. Somewhat to my surprise, Python 3 converts the Unicode null 'byte' codepoint to a 0 byte without complaints. How Python 3 converts any non-ASCII in sname to bytes depends on your locale, as usual, which means that under some circumstances you may need to do explicit conversion to bytes and handle conversion errors. You can call .bind() or .connect() with a bytes address instead of a Unicode one.
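For instance, a minimal sketch of doing the conversion explicitly (assuming UTF-8 is the encoding you want):

bname = ("\0" + sname).encode("utf-8")
s.bind(bname)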

Sockets in the abstract namespace have no permissions, unlike regular Unix domain sockets (which are protected by file and/or directory permissions). If you want to add a permissions system, you can obtain the UID, GID, and PID of the other end with SO_PEERCRED like so:

import struct

# struct ucred on Linux is three C ints, in the order pid, uid, gid
SO_PEERCRED = getattr(socket, "SO_PEERCRED", 17)  # 17 on mainstream Linux
creds = s.getsockopt(socket.SOL_SOCKET, SO_PEERCRED, struct.calcsize("3i"))
pid, uid, gid = struct.unpack("3i", creds)

This comes from a 2011 Stackoverflow answer, more or less (I have added my own little modifications to it).

The situation with the definition for SO_PEERCRED turns out to be a little bit complicated. The Python 3 socket module has had a definition for it for some time (it looks like since 2011 or so). Most versions of Python 2.x don't have a SO_PEERCRED constant defined in the socket module; the exception is the Fedora version of Python, which apparently has had this patched in for a very long time now. In addition, the '17' here is only correct on mainstream Linux architectures; some oddball ones like MIPS have other values. You may have to check in Python 3 or compile a little C program to get the correct value. Yes, this is irritating and you can see why the Fedora people patched Python (and why it got added to Python 3).
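If you have Python 3 handy on the machine in question, checking the correct value is a one-liner:

$ python3 -c 'import socket; print(socket.SO_PEERCRED)'
17

(You get 17 here because this was run on a mainstream Linux architecture.)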

As you might suspect, SO_PEERCRED can be used by either end of a Unix domain socket connection (and it works on any Unix domain socket, not just ones in the abstract namespace). It's merely most useful for a server to find out what the client is, since clients usually trust servers.

(Trusting the server may or may not be wise when you're dealing with Unix domain sockets in the abstract namespace, since anyone can grab any name in it. For my purposes I don't really care; my use is a petty little hack on my own personal machine and it doesn't involve anything sensitive.)

AbstractUnixSocketsAndPeercred written at 01:19:05

2015-07-15

Eating memory in Python the easy way

As a system administrator, every so often I need to put a machine under the stress of having a lot of its memory used. Sometimes this is for testing how things respond to this before it happens during live usage; sometimes this is because putting a system under memory stress can cause it to do important things it doesn't otherwise do (such as reclaim extra memory). The traditional way to do this is with a memory eater program, something that just allocates a controlled amount of memory and then (usually) puts actual data in it.

(If you merely allocate memory but don't use it, many systems don't consider themselves to be under memory stress. Generally you have to make them use up actual RAM.)

In the old days, memory eater programs tended to be one-off things written in C; you'd malloc() some amount of memory then carefully write data into it to force the system to give you RAM. People who needed this regularly might keep around a somewhat more general program for it. As it turns out, these days I don't need to go to all of that work because interactive Python will do just fine:

$ /usr/bin/amd64/python2.6
[...]
>>> GB = 1024*1024*1024
>>> a = "a" * (10 * GB)

Voila, 10 GB eaten. Doing this interactively gives me great flexibility; for instance, I can easily eat memory in smaller chunks, say 1 GB at a time, so that I have more control over exactly when the system gets pushed hard (instead of perhaps throwing it well out of its comfort zone all at once).
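The chunked version is just as easy; a list keeps every chunk alive:

>>> eaten = []
>>> eaten.append("a" * GB)
>>> eaten.append("a" * GB)

Every .append() eats another GB, whenever I'm ready to push harder.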

There are some minor quibbles you can make here; for example, I'm not using exactly 10 GB of memory, since Python has some small overhead for objects and so on. And you probably want to specifically use bytestrings in Python 3, not the default Unicode strings.
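In Python 3 that just means a bytes literal, so each element is one byte:

>>> a = b"a" * (10 * GB)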

In practice I don't care about the quibbles because this is close enough for me and it's really convenient (and flexible), far more so than writing a C program or re-finding the last one I wrote for this.

(If CPython allocates much additional internal memory to create this 10 GB string, it's not enough to be visible on the scale of GBytes of RAM usage. I tried a smaller test and didn't see more than perhaps a megabyte or two of surprising memory usage, but in general if you need really fine control over memory eating you're not going to want to use Python for it.)

PS: It makes me unreasonably happy to be able to use Python interactively for things like this, especially when they're things I might have had to write a C program for in the past. It's just so neat to be able to just type this stuff out on the fly, whether it's eating memory or testing UDP behavior.

EatingMemory written at 23:01:16

2015-07-06

My Django form field validation and cleanup pain

Our Django based account request system has quite a number of (HTTP) forms that all reflect and manipulate the same underlying model data. Because these are different forms (and some of them are in complex dynamic situations), they of course all have different form (Python) classes. Some of you may already be seeing my problem here: a certain number of these fields need to be cleaned up and validated.

There are two problems here. The first is that in Django, form field validation and cleanup is attached to the form, not to the field. If you have five different forms all using the same field, that means five different forms need a clean_<field> method, even if all of these methods call the same code. The state of the code right now is actually that most of the forms do not have field cleanup and validation, because they were restricted to administrative users and I either overlooked the issue originally or was lazy. The second problem is that some of the forms want to generate somewhat different validation error messages for certain validation failures.

(Specifically, normal outside people submitting account requests need different error messages than staff who are basically doing data entry.)

I was going to say that Django doesn't support clean_<field> functions on models, but that's kind of incorrect. You don't get individual field methods, but you can clean and adjust data in an overall model clean() method. This deals with some but not all of my issues (eg the different error messages problem still remains) and it creates new ones; I'm essentially enforcing user interface restrictions at the database layer. I'm also not sure how happy Django will be if I do database lookups in a model clean() method.

(My view layer is actually deliberately more restrictive than the model layer right now. To put it one way, the view layer is concerned with saving people from errors while the model layer is concerned with hard data integrity constraints.)

All of this suggests that what I need to do is pull out most of the current .clean_<field> validation code into standalone functions and then find some easy way to add them to forms while handling error messages. Probably I will experiment with mixin classes that just have appropriate .clean_<field> methods.
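As a sketch of what I have in mind, with entirely hypothetical field names and validation logic:

from django import forms

def clean_login_value(value):
    # shared, form-independent validation for a 'login' field
    if not value.isalnum():
        raise forms.ValidationError("login names must be alphanumeric")
    return value

class LoginCleanMixin(object):
    def clean_login(self):
        return clean_login_value(self.cleaned_data["login"])

class StaffRequestForm(LoginCleanMixin, forms.Form):
    login = forms.CharField(max_length=32)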

(Now I wonder what happens if you have a .clean_<field> method for a field not defined in your form. Probably nothing good can come of that in the long run even if it works today in current Django.)

On the whole I wish that Django allowed you to attach form field cleanup logic directly to the field and then easily reuse field definitions between forms (right now this probably causes issues). Overall validation is part of the form, but individual field validation and cleanup really feels like it belongs to the field instead, and validation is already partly a field responsibility.

(Possibly you actually can hijack form field validators in order to also do cleanup, but if so I can't find any documentation on it so I'm not going to touch it.)

DjangoFormCleanupPain written at 00:25:41

2015-06-17

Exploring the irritating thing about Python's .join()

Let's start out with the tweets:

@Twirrim:
list_object.join(',')

AttributeError: 'list' object has no attribute 'join'
*facepalm*

','.join(list_object)

@thatcks: It's quite irritating that you can't ask lists to join themselves w/ a string, you have to ask a string to join a list with itself.

Python has some warts here and there. Not necessarily big warts, but warts that make you give it a sideways look and wonder what people were thinking. One of them is how you do the common operation of turning a sequence of strings into a single string, with the individual strings separated by some common string like ','. As we see here, a lot of people expect this to be a list operation; you ask the list 'turn yourself into a string with the following separator character'. But that's not how Python does it; instead it's a string operation where you do the odd thing of asking the separator string to assemble a list around itself. This is at least odd and some people find it bizarre. Arguably the logic is completely backwards.

There are two reasons Python wound up here. The first is that back in the old days there was no .join() method on strings and this was just implemented as a function in the string module, string.join(). This makes perfect sense as a place to put this operation, as it's a string-making operation. But when Python did its great method-ization of various module functions, it of course made most of the string module functions into methods on the string type, so we wound up with the current <str>.join(). Since then it's become Python orthodoxy to invoke list to string joining as 'sep.join(lst)' instead of 'string.join(lst, sep)'.
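To make the shift concrete (in Python 2, since string.join() no longer exists in Python 3):

import string
string.join(["a", "b", "c"], ",")   # the old way: 'a,b,c'
",".join(["a", "b", "c"])           # the orthodox way: 'a,b,c'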

The other reason can be illuminated by noting that if Python did it the other way around you wouldn't have just list.join(), you'd also have to have tuple.join() and in fact a .join() method on every sequence compatible type or even iterators. Anything that you wanted to join together into a string this way would have to implement a .join(), which would be a lot of types even in the standard library. And because of how both CPython and Python are structured, a lot of this would involve re-implementation and duplication of identical or nearly identical code. If you have to have .join() as a method on something, putting it on the few separator types means that you have far less code duplication and that any new sequence type automatically supports doing this in the correct orthodox way.

(I'm sure that people would write iterator or sequence types that didn't have a .join() method if it was possible to do so, because sooner or later people leave out every method they don't think they're going to use.)

Given the limitations of Python, I'll reluctantly concede that the current .join() approach is the better alternative. I don't think you can even get away with having just string.join() and no string .join() method (however much an irrational bit of me would like to throw the baby out with the bathwater here). Even ignoring people's irritation with having to do 'import string' just to get access to string.join(), there would be some CPython implementation challenges.

Sidebar: The implementation challenges

String joining is a sufficiently frequent operation that you want it to be efficient. Doing it efficiently requires doing it in C so that you can do tricks like pre-compute the length of the final string, allocate all of the memory once, and then memcpy() all of the pieces into place. However, you also have both byte strings and Unicode strings, and each needs its own specialized C level string joining implementation (especially as modern Unicode strings have a complex internal storage structure).

The existing string module is actually a Python level module. So how do you go from an in-Python string.join() function to specific C code for byte strings or Unicode strings, depending on what you're joining? The best mechanism CPython has for this is actually 'a method on the C level class that the Python code can call', at which point you're back to strings having a .join() method under some name. And once you have the method under some name, you might as well expose it to Python programmers and call it .join(), ie you're back to the current situation.
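In fact this is just what the Python 2 string module winds up doing; its join() is essentially a thin wrapper that delegates to the separator's method:

def join(words, sep = ' '):
    # sep's own .join() dispatches to the right C implementation
    # for whichever string type sep is
    return sep.join(words)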

I may not entirely like .join() in its current form, but I have to admit that it's an impeccably logically assembled setup where everything is basically the simplest and best choice I can see.

JoinDesignDecisions written at 02:08:42

2015-06-07

You won't get people off Python 2 by making their lives worse

This is one of those times when I'm just going to quote someone, but hey, it's Guido van Rossum (via):

However this talk of "wasting our time with Python 2" needs to stop, and if you think that making Python 2 less attractive will encourage people to migrate to Python 3, think again. [...]

What he said, with all the emphasis you can imagine.

If the Python developers really think that, it's rather sad. Of course they wouldn't be the first people to believe that; the trick has a long history in computing, even if it often backfires.

I also personally think that it is stupid at this point in Python 3's life cycle. By now, there are probably two major classes of people who are still using Python 2: the people who are waiting for dependencies to get ported and the people who have decided that it is not a worthwhile expenditure of their time to port their code. Deliberately screwing these people does nothing to get them to move to Python 3. To the extent that they are aware that they are getting deliberately screwed by Python developers, it is more likely to encourage them to port their code to something else, anything else.

(Probably there is a third class of people, namely people who wrote some Python a while back and haven't touched it since because it works and what's this Python 3 thing and why should they care? These people are ignoring the whole mess, but in practice they are probably lost to Python 3 for good; you might as well consider them 'will never port'.)

In short: harming the remaining Python 2 users will not get them to migrate to Python 3 any faster than they already are, it just pisses them off. They are not migrating because it is impossible (at least currently) or too hard or too risky or the like.

(I could blather about what Python 3 'should' do to push for more migration, but it doesn't matter on several levels and anyways, I would be speaking from an uninformed and purely personal position. But in general, if the rate of Python 3 migration is not pleasing the Python developers, I prescribe a mirror.)

Sidebar: Why I say that people who are ignoring this are probably lost

I'm sure that there's a bunch of people out in the world who haven't heard about the Python 2 to Python 3 commotion; they have some Python 2 code, it works, they don't care about anything else. Due to being out of the loop, the first time they're likely to come into contact with this issue is when Python 2 isn't there on some new system and their old code immediately stops working.

(This can be either through /usr/bin/python disappearing or through it becoming Python 3.)

At this point, I think the most likely reaction of these people will be to discard their now-ancient (Python 2) system. If what it does is still needed, they'll probably rewrite from scratch using whatever is their current language and environment (which is not Python 3, because remember, they're out of the loop). If porting to Python 3 is easy they might do that instead, but I suspect it's not; they're going to basically be dealing with legacy code.

We're very close to being in this boat ourselves at work. While we have some Python code and not all of it was written by me, I think I'm the only person who's really following the Python 2 vs Python 3 issue. In my absence our Python code would run until it fell over and couldn't be easily patched, and then my co-workers might well pick another language they like better (whatever that would be at the time).

Python2NoBeatings written at 01:52:51

2015-05-24

A mod_wsgi problem with serving both HTTP and HTTPS from the same WSGI app

This is kind of a warning story. It may not be true any more (I believe that I ran into this back in 2013, probably with a 3.x version of mod_wsgi), but it's probably representative of the kind of things that you can run into with Python web apps in an environment that mixes HTTP and HTTPS.

Once upon a time I tried converting my personal site from lighttpd plus a CGI based lashup for DWiki to Apache plus mod_wsgi serving DWiki as a WSGI application. At the time I had not yet made the decision to push all (or almost all) of my traffic from HTTP to HTTPS; instead I decided to serve both HTTP and HTTPS alongside each other. The WSGI configuration I set up for this was what I felt was pretty straightforward. Outside of any particular virtual host stanza, I defined a single WSGI daemon process for my application and said to put everything in it:

WSGIDaemonProcess cspace2 user=... processes=15 threads=1 maximum-requests=500 ...
WSGIProcessGroup cspace2

Then in each of the HTTP and HTTPS versions of the site I defined appropriate Apache stuff to invoke my application in the already defined WSGI daemon process. This was exactly the same in both sites, because the URLs and everything were the same:

WSGIScriptAlias /space ..../cspace2.wsgi
<Directory ...>
   WSGIApplicationGroup cspace2
   ...

(Yes, this is what is by now old syntax and may have been old even back at the time; today you'd specify the process group and/or the application group in the WSGIScriptAlias directive.)
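My understanding is that the modern equivalent would be roughly the following (a sketch, not a configuration I've tested):

WSGIScriptAlias /space ..../cspace2.wsgi process-group=cspace2 application-group=cspace2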

This all worked and I was happy. Well, I was happy for a while. Then I noticed that sometimes my HTTPS site was serving pages that had HTTP URLs in links and vice versa. In fact, what was happening was that some of the time the application was being accessed over HTTPS but thought it was using HTTP, and sometimes it was the other way around. I didn't go deep into diagnosis because other factors intervened, but my operating hypothesis was that when a new process was forked off and handled its first request, it latched whichever of HTTP or HTTPS that request had come in through and used it for all of the remaining requests it handled.

(This may have been related to my mistake about how a WSGI app is supposed to find out about HTTP versus HTTPS.)

This taught me a valuable lesson about mixing WSGI daemon processes and so on across different contexts, which is that I probably don't want to do that. It's tempting, because it reduces the number of total WSGI related processes that are rattling around my systems, but even apart from Unix UID issues it's clear that mod_wsgi has a certain amount of mixture across theoretically separate contexts. Even if this is a now-fixed mod_wsgi issue, well, where there's one issue there can be more. As I've found out myself, keeping things carefully separate is hard work and is prone to accidental slipups.

(It's also possible that this is a mod_wsgi configuration mistake on my part, which I can believe; I'm not entirely sure I understand the distinction between 'process group' and 'application group', for example. The possibility of such configuration mistakes is another reason to keep things as separate as possible in the future.)

ModWsgiDualSchemaProblem written at 01:04:59

2015-05-23

The right way for your WSGI app to know if it's using HTTPS

Suppose that you have a WSGI application that's running under Apache, either directly as a CGI-BIN through some lashup or perhaps through an (old) version of mod_wsgi (such as Django on an Ubuntu 12.04 host, which has mod_wsgi version 3.3). Suppose that you want to know if you're being invoked via a HTTPS URL, either for security purposes or for your own internal reasons (for example, you might need separate page caches for HTTP versus HTTPS requests). What is the correct way to do this?

If you're me, for a long time you do the obvious thing; you look at the HTTPS environment variable that your WSGI application inherits from Apache (or the web server of your choice, if you're running things under an alternative). If it has the value 'on' (or sometimes '1'), you've got a HTTPS connection; if it doesn't exist or has some other value, you don't.

As I learned recently by reading some mod_wsgi release notes, this is in practice wrong (and probably wrong even in theory). What I should be doing is checking wsgi.url_scheme from the (WSGI) environment to see if it was "https" or "http". Newer versions of mod_wsgi explicitly strip the HTTPS environment variable and anyways, as the WSGI PEP makes clear, including a HTTPS environment variable was always a 'maybe' thing.
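The check itself is trivial; a minimal sketch:

def is_https(environ):
    # wsgi.url_scheme is a required key in the WSGI environment
    return environ.get("wsgi.url_scheme") == "https"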

(You can argue that mod_wsgi is violating the spirit of the 'should' in the PEP here, but I'm sure it has its reasons for this particular change.)

Not using wsgi.url_scheme was always kind of conveniently lazy; I was pretending that WSGI was still basically a CGI-BIN environment when it's not really. I always should have been preferring the wsgi.* environment variables where they were available, and wsgi.url_scheme has always been there. But I change habits slowly when nothing smacks me over the nose about them.

(This may have been part of a mod_wsgi issue I ran into at one point, but that's another entry.)

WSGIandCheckingHTTPS written at 00:35:54

2015-04-26

My complicated feelings on abandoning old but good code

Yesterday I wrote about some twelve year old Python code I have that's still running unmodified, and commented both that it's a bit sad that Python 2 has changed so little that this code doesn't even emit warnings and that this is because Python has moved on to Python 3. This raises the obvious question: am I going to move portnanny (this code) to Python 3? My current answer is that I'm not planning to, because portnanny is basically end of life and abandoned.

I don't consider the program end of life because it's bad code or code that I would do differently if I was rewriting it today. It's EOL for a simpler reason, namely that I don't really have any use for what it does any more. This particular code was first written to be in front of an NNTP server for Usenet and then actually mostly used to be the SMTP frontend of a peculiar MTA. I haven't had any NNTP servers for a long time now and the MTA that portnanny sits in front of is obsolete and should really be replaced (the MTA lingers on only because it's still simpler to leave it alone). If and when portnanny or the MTA break, it probably won't be worth fixing them; instead that will be my cue to replace the whole system with a modern one that works better.

All of this makes me sad, partly because portnanny handles what used to be an interesting problem but mostly because I think that portnanny is actually some of the best Python code I've written. It's certainly the best tested Python code I've written; nothing else comes close. When I wrote it, I tried hard to do a good job in both structure and implementation and to follow the mantras of test driven development for Python, and I'll probably never again write Python code that is as well done. Turning my back on the best Python code I may ever write isn't entirely a great feeling and there's certainly part of me that doesn't want to, however silly that is.

(It's theoretically possible I'll write Python code this good in the future, but something significant would have to change in my job or my life in order to make writing high quality Python an important part of it.)

There's a part of me that now wants to move portnanny to Python 3 just because. But I've never been able to get really enthused about programming projects without a clear use and need, and this would be just such a 'no clear use' make work project. Maybe I'll do it anyways someday, but the odds are not good.

(Part of the reason that the odds are not good is that I think the world has moved on from using tcpwrappers-like user level filtering for access control, especially when it's embodied in external programs. So not only do I not really have a use for portnanny any more, I don't think anyone else would even if they knew about it.)

AbandoningOldGoodCode written at 01:50:03

2015-04-25

I'm still running a twelve year old Python program

I've been thinking for a while about the interesting and perhaps somewhat remarkable fact that I'm still running a twelve year old Python program (although not as intensely as I used to be). When I say twelve year old, I don't mean that the program was first written twelve years ago and is still running; I mean that the code hasn't been touched since 2004. For bonus points, it uses a compiled module and the source of the module hasn't been changed since 2004 either (although I have periodically rebuilt it, including moving it from 32-bit to 64-bit).

(I was going to say that the tests all still pass, but it turns out that I burned a now-obsolete IP address into one of them. Oops. Other than that, all the tests still pass.)

While the code uses old idioms (it entirely uses the two argument form of raise, for example), none of them are so old and questionable that Python 2 emits deprecation warnings for them. I'm actually a little bit surprised by that; even back in 2004 I was probably writing old fashioned code. Apparently it's still not too old fashioned.
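For illustration, the two argument form of raise versus its modern spelling:

raise ValueError, "no good"     # old two argument raise, Python 2 only
raise ValueError("no good")     # the modern equivalent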

Some of the long life of this old code can be attributed to the fact that Python 2 has been moving slowly. In 2004 I was writing against some version of Python 2.3, and the Python 2 line stopped really changing no later than Python 2.7 in 2010. Really, I doubt anyone was in a mood to deprecate very much in Python 2 after 2007 or so, and maybe earlier (I don't know when serious Python 3 development started; 3.0 was released at the end of 2008).

(With that said, Python 2.6 did deprecate some things, and there were changes in Python 2.4 and 2.5 as well.)

My impression is that this is a reasonably surprising lifespan for an unchanged set of source code, especially in an interpreted language. Even in a compiled language like C, I'd expect to have to update some function prototypes and stuff (never mind a move from 32 bits to 64 bits). While it's certainly been convenient for me in that I haven't had to pay any attention to this program and it just worked and worked even as I upgraded my system, I find myself a little bit sad that Python 2 has moved so slowly that twelve years later my code doesn't even get a single deprecation warning.

(The flip side argument is that my code would get plenty of warnings and explosions if I ran it on Python 3. In this view the language of Python as a whole has definitely moved on and I have just chosen to stay behind in a frozen pocket of oldness that was left for people like me, people who had old code they didn't want to bother touching.)

PS: It turns out that the Github repo is somewhat ahead of the state of the code I have running on my system. Apparently I did some updates when I set up the Github repo without actually updating the installed version. Such are the hazards of having any number of copies of code directories.

TwelveYearOldPythonProgram written at 01:15:28

