Wandering Thoughts archives

2015-12-06

The (peculiar) freedom of having a slow language

Back in my entry on why speeding up (C)Python matters, I said in an aside that there was a peculiar freedom in having a slow language. Today I'm going to explain what I meant in that cryptic aside.

The peculiar freedom of having a slow language is the mirror image of the peculiar constraint of having a fast language, which is that in a fast language there is usually a (social) pressure to write fast code. Maybe not the very fastest code that you could write (that would be premature optimization), but at least code that is not glaringly under-performant. When the language provides a fast way to do what your code needs to do, you're supposed to use it. Usually this means using a 'narrow' feature, one that is not particularly more powerful than you need.

In a slow language like (C)Python, you are free of this constraint. You don't have to feel guilty about using an 'expensive' feature or operation to deal with a small problem instead of carefully writing some narrow, efficient code. The classic example of this is various sorts of simple parsing. In many languages, using a regular expression to do most parsing is vastly indulgent because it's comparatively slow, even if it leads to simple and short code; there is great social pressure to write hand-rolled character inspection code and the like. In CPython you can use regexps here without any guilt; not only are they comparatively fast, they're probably faster than hand-written code that does it the hard way.
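
As a concrete (and entirely made-up) illustration, here is the sort of small parsing job where a Python programmer reaches for a regexp without a second thought; the names here are hypothetical, not from any real program:

```python
import re

# Pull "key = value" settings out of a configuration line.
# A regexp handles the whitespace variations in one readable pattern.
_setting = re.compile(r"^\s*(\w+)\s*=\s*(\S+)\s*$")

def parse_setting(line):
    m = _setting.match(line)
    return m.groups() if m else None

assert parse_setting("  retries = 3 ") == ("retries", "3")
assert parse_setting("not a setting") is None
```

The hand-rolled equivalent would scan characters, skip whitespace, and accumulate the key and value by hand; in CPython the regexp version is both shorter and probably faster.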

The result of this is that in CPython I solve a lot of problems with simple brute force using builtins, regular expressions, and other broad powerful features, while in languages like Go I wind up writing more complicated, more verbose code that is more narrow and more efficient because it only does what's strictly necessary.

(I came to really be aware of this after recently writing some Go code to turn newlines into CR NL sequences as I was writing output to the network. In Python this is a one-liner piece of code; in Go, the 'right' Go-y way involves a carefully efficient hand-rolled loop, even though you could theoretically do it in exactly the same way that Python does.)
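
For illustration, here is that one-liner as a sketch (the function wrapper and its name are just for this example):

```python
# Turn bare newlines into CR NL sequences for network output.
# In Python this is a single str.replace() call.
def crlf(data):
    return data.replace("\n", "\r\n")

assert crlf("a\nb\n") == "a\r\nb\r\n"
```

The Go-idiomatic version instead walks the byte slice, writing runs between newlines and inserting `\r` as it goes, precisely to avoid the allocation that the broad `replace`-style approach implies.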

SlowLanguageFreedom written at 02:07:37; Add Comment

2015-11-25

Why speeding up (C)Python matters

The simple answer to why speeding up Python matters is that speed creates freedom.

Right now there are unquestionably a lot of things that you can't do in (C)Python because they're too slow. Some of these are large macro things, like certain sorts of programs; others are more micro things, like using complicated data structures or writing in Python versions of things that are currently implemented in C. Speeding up Python creates the straightforward freedom to do more and more of these things.

In turn, this creates another freedom, the freedom to write code without having to ask yourself 'is this going to be too slow because of Python?'. Again, this is both at the macro level of what problems you tackle in Python and the micro level of how you attack problems. With speed comes the freedom to write code in whatever way is natural, to use whatever data structure is right for the problem, and so on. You don't have to contort your code to optimize for what your Python environment makes fast (in CPython, C level modules; in PyPy, whatever the JIT can do a good job recognizing and translating).

There is a whole universe of Python code that reasonably experienced CPython programmers know you simply don't write because it would be too slow, create too much garbage, and so on. Speed creates the freedom to use more and more of that universe too, in addition to the Python universe that we already program in. This freedom matters even if none of the programs we write are ever CPU-bound to any particular degree.

(Admittedly there's a peculiar kind of freedom that comes from having a slow language, but that's another entry.)

(This is not particularly novel and I've orbited the core of this thought in earlier entries. I just feel like writing it down explicitly today after writing about why I care about how fast CPython is, partly because a rejoinder to that entry is 'why do we care if CPython isn't really fast?'.)

WhySpeedMatters written at 01:26:52; Add Comment

2015-11-23

Why I care about how fast CPython is

Python is not really a fast language. People who use Python (me included) more or less accept this (or else it would be foolish to write things in the language), but still we'd often like our code to be faster. For a while the usual answer to this has been that you should look into PyPy in order to speed up Python. This is okay as far as it goes, and PyPy certainly can speed up some things, but there are reasons to still care about CPython's speed and wish for it to be faster.

The simple way of putting the big reason is to say that CPython is the universal solvent of Python. To take one example, Apache's mod_wsgi uses CPython; if you want to use it to deploy a WSGI application in a relatively simple, hands-off way, you're stuck with however fast CPython is. Another way that CPython is a universal solvent is that CPython is more or less everywhere; most Unix systems have a /usr/bin/python, for example, and it's going to be some version of CPython. Finally, CPython is what most people develop with, test against, write deployment documentation for, and so on; this is both an issue of whether a package will work at all elsewhere and an issue of whether it does things that defeat much of PyPy's speedups.

Thus, speeding up CPython speeds up 'all' Python in a way that improvements to PyPy seem unlikely to. Maybe in the future PyPy will be so pervasive (and so much a drop in replacement for CPython) that this changes, but that doesn't seem likely to happen any time soon (especially since PyPy doesn't yet support Python 3 and that's where the Python world is moving).

(Some people will say that speeding up things like Django web apps is unimportant, because web apps mostly don't use CPU anyways but instead wait for IO and so on. I disagree with this view on Python performance in general, but even specifically for Django web apps it's useful for your app to use less CPU so that more of it is free for other things, and that is what 'going fast' translates to.)

CPythonSpeedMatters written at 00:35:35; Add Comment

2015-11-22

I should find some good small packages for templates and forms for CGIs

I recently wrote about how we shouldn't be making up new (web) templating packages but instead should just use existing ones. This leads me to admit that I have a little problem here with Python, in that I don't currently have one of these that I'd use. Well, sort of.

If I were to write another relatively substantial web application, I'd unquestionably use Django (and thus I'd use Django's templating system and form handling). It's what I know, it works, and we already have one Django web app so adding a second is relatively low overhead. But the problem with Django (and other big systems) is that their deployment story is kind of heavyweight and a pain in the ass. We have a number of small, simple things that are easiest to handle as just CGIs, and we'll probably get more.

(This may be over-estimating the hassles of deploying a second mod_wsgi based thing when we already have one, but wrangling Django kind of makes me grumpy in the first place.)

This means that I should really find an existing Python templating system that is right-sized for use in CGIs, meaning that it is not too big itself, does not have big dependencies, starts up relatively fast, and ideally can take its templates from embedded strings in the program instead of loading them from the filesystem. I haven't previously looked in this area (partly because Django met what I thought of as all of my needs here), so I'm not familiar with what's available.

(For simple CGI-like things, embedding the templates in the CGI's Python code makes for easier deployments, by which I mean that the whole CGI will be just one file and I'll copy it around.)
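
As a baseline that already meets most of these requirements (no dependencies, instant startup, templates as embedded strings), there is the standard library's string.Template; whether its bare substitution is enough is exactly the sort of thing I'd be evaluating richer packages against:

```python
from string import Template

# A page template embedded directly in the CGI's source as a string,
# so the whole CGI stays a single copyable file.
PAGE = Template("""<html><body>
<h1>$title</h1>
<p>$body</p>
</body></html>""")

html = PAGE.substitute(title="Hello", body="A small CGI page.")
assert "<h1>Hello</h1>" in html
```

The obvious limitation is that string.Template does plain substitution only, with no loops, conditionals, or automatic HTML escaping, which is where a real (but still small) templating package would earn its keep.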

In the same spirit of 'don't roll your own half-solution when other people have done better ones', I should also get an existing package to handle forms for my CGIs. This is likely to interact with the choice of templating system, since you often want to use your template system to automatically render forms in clever ways instead of basically embedding fixed HTML for them.

Probably I want to start my investigation with, say, the Python wiki's entry on web frameworks. I'm going to not mention any project names here for now, since what comes to mind is basically influenced by the project's PR (or lack of it), not any actual knowledge on my part.

PS: This walks back from my views on templates versus simple HTML. I still believe them, but I also believe that it's probably not worth fighting city hall in most cases. A full template system may be vast overkill for many things, but there's a great virtue in standardizing on a single solution that can be learned once and is then familiar to people.

(Templates also have the virtue that the HTML and the fixed text content are both obvious, so it's easy for people to find it and then make simple brute force modification because, for example, we've decided to change some names we use to refer to our systems in order to confuse users less. You don't even really need to know the templating system for this, you just have to be able to find the text.)

CGITemplatesAndForms written at 02:17:42; Add Comment

2015-11-18

Increasingly, I no longer solidly and fully know Python

There once was a time when I felt that I more or less solidly knew and understood Python. Oh, certainly there were dark corners that I wasn't aware of or that I knew little or nothing about, but as far as the broad language went I felt that I could say that I knew it. Whether or not this was a correct belief, it fed my confidence in both writing and reading Python code. Of course the wheels sort of started to come off this relatively early, but still I had that feeling.

For various reasons, it's clear to me that those days are more and more over. Python is an increasingly complicated language (never mind the standard library) and I have not been keeping up with its growth. With Python 3 I'm basically hopeless; I haven't been attempting to follow it at all, which means that there are whole large areas of new stuff that I have no idea about. But even in Python 2 I've fallen out of touch with new areas of the language and core areas of the standard library. Even when I know they're there, I don't know the detailed insides of how they work in the way that I have a relatively decent knowledge of, say, the iterator protocol.

(One example here is context managers and the "with" statement.)
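
The protocol behind "with" is small enough to sketch in a few lines; this is a toy class purely for illustration, not anything from a real program:

```python
# The context manager protocol is just two methods: "with" calls
# __enter__ on entry and __exit__ on exit, even when the body raises.
class Tracker(object):
    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.events.append("exit")
        return False  # returning False means exceptions propagate

t = Tracker()
with t:
    t.events.append("body")
assert t.events == ["enter", "body", "exit"]
```

Knowing that "with" is nothing more than this pairing (plus the contextlib conveniences built on top of it) is the sort of detailed-insides knowledge the entry is talking about.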

I'm pretty sure that this has been part of my cooling and increased ambivalence with Python. Python has advanced and I haven't kept up with it; there's a distance there now that I didn't feel before. Python 3 is an especially large case because it feels that lots has changed and learning all about it will be a large amount of work. Part of me wonders if maybe Python (at least Python 3) is now simply too large for me to really know in the way that I used to.

You might ask if actually knowing a language this way is even important. My answer is that it is for me, because knowing a language is part of how I convince myself that I'm not just writing Fortran in it (to borrow the aphorism that you can write Fortran in any language). The less I know a language, the less I'm probably writing reasonably idiomatic code in it and the more I'm writing bad code from some other language. This especially matters for me and Python 3, because if I'm going to write Python 3 code, I want to really write in Python 3; otherwise, what's the point?

(I don't have any answers here, in part because of a circular dependency issue between my enthusiasm for writing stuff in Python and my enthusiasm for coming back up to full speed on Python (2 or 3).)

MissingFullKnowledge written at 01:47:50; Add Comment

2015-10-22

CPython's trust of bytecode is not a security problem

Yesterday I wrote about how CPython trusts bytecode so much that you can use it to read or write arbitrary memory. In comments, Ewen McNeil had a typical reaction to this:

It appears this means if you can get arbitrary Python execution (eg, unwisely trusting YAML, XML, pickle, etc...), then you can probably get arbitrary memory read/write in the Python process, which is a fairly short step away from arbitrary assembly code execution.

This is true, but it also misunderstands the security situation of Python bytecode. Even without this issue, it is game over in general if an attacker can load arbitrary bytecode into your Python process. The obvious weakness is that ctypes is part of the standard library these days and it can also be used to give you this level of access to memory, without any need to corrupt the bytecode interpreter. But even without ctypes an attacker has plenty of options to achieve binary code execution. They can transfer a binary, write it out, and then execute it. They can transfer a native code Python module (in .so form), manipulate the Python load path, and then import it (which gets them native code execution inside the Python process itself). They can run other existing vulnerable binaries on your system and exploit their bugs. And so on.
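
To make the ctypes point concrete, here is a harmless sketch of the raw read/write access it hands you; it pokes at a buffer we own, but the same calls work on any address in the process:

```python
import ctypes

# ctypes operates on raw addresses: given an address, it will read
# or write the bytes there, no bytecode tricks required.
buf = ctypes.create_string_buffer(b"hello")
addr = ctypes.addressof(buf)

# Read arbitrary memory at an address...
assert ctypes.string_at(addr, 5) == b"hello"

# ...and write it, too.
ctypes.memset(addr, ord("H"), 1)
assert buf.value == b"Hello"
```

Since this is ordinary, supported standard library functionality, corrupting the bytecode interpreter to get memory access is redundant for an attacker who can already run Python code.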

You can certainly try to stop this by creating a Python environment that blocks access to the Python features necessary for this. The problem is that there have proven to be many features that can be exploited to help here and many paths through Python to reach them. The runtime environment of Python is a complex, intertangled thing, and all attackers need is one crack that lets them bootstrap a reference to, say, the os module. And there are a lot of potential cracks.

(Python used to have a restricted execution module, but it was disabled in Python 2.3 because it had basically unfixable holes.)

The simple truth is Python is not a safe execution environment for untrusted code. The only important thing about bytecode being able to read and write arbitrary memory all by itself is that it shows how impossible the job of securing CPython is. Even if you managed to reliably cut off all access to modules and code that could be used to escape your sandbox at the Python level, you would have to audit and fix the innards of the bytecode interpreter itself to be safe.

This is why I say that this trust of bytecode is not a security problem; it doesn't really make the situation any worse than it already is. It's just an amusingly baroque alternate path to a security issue that is already there in general.

BytecodeIsTrustedII written at 00:12:41; Add Comment

2015-10-21

Python bytecode is quite heavily trusted by CPython

I've written before that Python bytecode is not secure, and at the time I said:

[...] I wouldn't be surprised if hand-generating crazy instruction sequences could do things like crash CPython (in fact, I'm pretty confident that doing this is relatively trivial) and lead to arbitrary code execution. [...]

It turns out that I was exactly correct here, and it's actually been both found and demonstrated. Start with this tweet:

Python devs will hate you for it! One weird trick to directly access python's memory from the interpreter: [gist]

There's a brief explanation and then you can read the details of how CPython bytecode can be used to read and write arbitrary memory.

As that article notes, this is not a bug or at least not something the Python developers consider a bug. And for what it's worth, I agree with them. The CPython bytecode interpreter deliberately chooses to gain some extra speed by omitting checks that are only necessary if either something has gone terribly wrong with bytecode generation or you are loading malicious bytecode. LOAD_CONST is a hot path in a very important optimization and there are undoubtedly any number of other issues lurking in the undergrowth here; closing this hole would probably not make loading untrusted CPython bytecode materially safer and it probably would exact a slowdown.
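
You can see LOAD_CONST in ordinary code with the dis module; this sketch just inspects bytecode (it doesn't demonstrate the exploit), but it shows how the instruction refers to constants by bare index into co_consts, which is what the interpreter takes on trust:

```python
import dis

def f():
    x = 42
    return x

# LOAD_CONST's argument is a raw index into f.__code__.co_consts;
# the hot interpreter loop uses it without bounds checking.
ops = [ins.opname for ins in dis.get_instructions(f)]
assert "LOAD_CONST" in ops
assert 42 in f.__code__.co_consts
```

The attack described in the linked article boils down to feeding LOAD_CONST an out-of-range index so that it fetches an attacker-controlled "object" from elsewhere in memory.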

(At a start, if you're even going to consider doing that it's clear that you need to at least audit the CPython bytecode interpreter to try to find other issues. You probably also want a pre-loading bytecode validation pass, too.)

One corollary of this is that bytecode rewriting is potentially dangerous (even if you have good intentions). A sufficiently badly rewritten bytecode sequence may not merely malfunction at the Python level, it's possible that it could crash or corrupt the CPython interpreter.

(On the other hand, if you're rewriting bytecode and running the result in production you probably really need whatever your rewriting enables. Test thoroughly, but if you've got to rewrite bytecode, well, you've got to. At least CPython gives you the freedom if you absolutely need it.)

BytecodeIsTrusted written at 02:08:07; Add Comment

2015-09-26

I've decided I'll port DWiki to Python 3, but not any time soon

At this point I have only one significant Python 2 program that I care deeply about and that is DWiki, the pile of code that underlies Wandering Thoughts. What to do about DWiki in light of Python 3 has been something that has been worrying and annoying me for some time, because doing a real port (as opposed to a quick bring-up experiment) is going to involve a lot of wrestling with old code and Unicode conversion issues. Recently I've come around to a conclusion about what I plan to do about the whole issue (perhaps an obvious one).

In the end neither grimly staying on Python 2 forever nor rewriting DWiki in something else (eg in Go) is a viable plan, which leaves me with the default: sooner or later I'll port DWiki to Python 3. However I don't expect to do this any time soon, for two reasons. The first is that Python 3 itself is still being developed and in fact the Python landscape as a whole is actively evolving. As a result I'd rather defer a port until things have quieted down and gotten clearer in a few years (who knows, perhaps I'll want to explicitly revise DWiki to be PyPy-friendly by then). As far as I'm concerned the time to port to Python 3 is when it's gotten boring, because then I can port once and not worry about spending the next few years keeping up with exciting improvements that I'd like to revise my code to take advantage of.

The second reason is more pragmatic but is related to the rapid rate of change in Python 3, and it is that the systems I want to run DWiki on are inevitably going to be behind the times on Python 3 versions. Right now, the rapid rate of improvements in Python 3 means that being behind the times leaves you actively missing out on desirable things. In a few years hopefully that will be less so and a Python 3 version that was frozen a year or three ago will not be so much less attractive than a current version. This too is part of Python 3 slowing down and becoming boring.

(If you are saying 'who freezes Python 3 versions at something a few years old?', you haven't looked at long term support Linux distributions or considered how long people will run eg older FreeBSD versions. There is a long and slow pipeline from the latest Python 3 release to when it appears in OS distributions that many people are using, as I've covered before.)

I don't have any particular timeline on DWiki's Python 3 port except that I don't intend or expect to do this within, oh, the next three years. Probably I'll start looking at this seriously about the time the Python developers start clearing their throats and trying to once again persuade everyone that 2.7 support will be dropped soon, this time for sure. A clear slowdown in Python 3 development plus OS distros catching up to current versions might push that to sooner, but probably not much sooner.

Hopefully thinking through all of this and writing it down means that I can stop worrying about DWiki's future every so often. I may not be doing anything about it, but at least I now have a reasonable plan (and I've kind of made my peace with the idea of going through all the effort to get a production quality version of DWiki running under Python 3 (and yes, the amount of effort it's going to take still irritates me and probably always will)).

(Although every so often I toy with the idea of a from-scratch rewrite of DWiki in Go that addresses various things I'd do differently this time around, the reality is that DWiki's creation took place in unusual circumstances that I'm unlikely to repeat any time soon.)

DWikiPython3Someday written at 00:57:00; Add Comment

2015-09-14

Tweaking code when I'm faced with the urge to replace it entirely

One of the core parts of DWiki (the program behind all of what you're reading) is the code that turns DWikiText into HTML (and in many ways it is the most important component, since it's what ultimately renders all my content and all the comments). I spent a significant chunk of today tweaking a test version in an attempt to improve the conversion process for a number of corner cases that I care about and would like to make better.

There are two problems with this. The first is that one of the consequences of having what is now a long-running blog is that I have a lot of content written in my wikitext dialect and pretty much all of it had better keep coming out just the same. I have no interest in trying to go through all of my entries to revise them for some bright new wikitext idea; backwards compatibility is quite important, warts and all. This is unfortunate in practice because I made some mistakes in my wikitext dialect way back when. It's sometimes possible to do little things around the corner of these mistakes, but that creates hacks and special rules and special magic code.

The other problem is that DWiki the program is also old and by now rather tangled, especially in the DWikiText renderer. Part of this tangle is just history, part of it is that it has been heavily optimized for speed, and part of it is that I made some fundamental structural mistakes in the beginning that have been carried through ever since then (one of them is not parsing to an AST, but it's not the only one). Faced with a complex set of code that I don't work with regularly, I descend to tweaking as carefully as I can rather than doing anything deeper, which of course builds up the accumulated layers of hacks in the code and makes it harder to do anything except more tweaks and hacks around the edges.

(Python makes it surprisingly easy to do this sort of tweaking for various reasons, including its support for optional function arguments.)
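
A hypothetical sketch of what I mean: a defaulted argument lets you bolt a tweak onto an existing function without touching any of its current callers (the function and its behavior here are invented for illustration):

```python
# Existing callers keep their old behavior; only the new call site
# that needs the tweak passes the extra argument.
def render_para(text, collapse_ws=True):
    if collapse_ws:
        text = " ".join(text.split())
    return "<p>%s</p>" % text

assert render_para("a  b") == "<p>a b</p>"              # old callers unchanged
assert render_para("a  b", collapse_ws=False) == "<p>a  b</p>"
```

This is exactly why such tweaks accumulate so easily: each one is locally cheap and safe, and the cost only shows up later as a thicket of flags.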

At the same time, the more I worked on this code today the more clearly I saw how I wanted a modern version of the code to work. The more I have to stick in hacks and make tweaks, the more I also want to raze the whole complicated mess to the ground and redo it from scratch with a much better restructured version (or at least what I think would be such a thing, in my current state of not having written it and thus not having been faced by any messy gaps in my current grand vision).

(Any grand rewrite immediately starts to run into Python 3 thoughts, which lead to other thoughts I'm not going to try to cover here.)

I don't have any answers and I'm not even sure I'm going to deploy the tweaked version (I have a history of this sort of indecision), although I probably will now that I've written this up. But at least I had a reasonably enjoyable time fiddling around in the depths of DWiki once again and perhaps a bit more impetus towards doing some significant cleanups someday.

(While in the past I've lamented that I don't have a test suite for DWiki, I do actually have one for this sort of change; I can render all of this thing in both the old and new versions of the code and see what's different. This can be an interesting (re)learning experience, but that's another entry.)

TweakingVersusReplacement written at 01:37:09; Add Comment

2015-08-21

What surprised me about the Python assignment puzzle

Yesterday I wrote about a Python assignment puzzle and how it worked, but I forgot to write about what was surprising about it for me. The original puzzle is:

(a, b) = a[b] = {}, 5

The head-scratching bit for me was the middle, including the whole question of 'how does this even work'. So the real surprise here for me is that in serial assignments, Python processes the assignments left to right.
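
You can verify the left-to-right order directly from the puzzle itself: the right-hand side is evaluated once, the first target binds a and b, and only then does the second target use them, leaving a self-referential dict:

```python
# The tuple ({}, 5) is evaluated once; then (a, b) is assigned first
# (a = the dict, b = 5), and only then does a[b] = ... run, storing
# that same tuple into the dict under key 5.
(a, b) = a[b] = {}, 5

assert b == 5
assert a[5][0] is a   # the dict now contains (a reference to) itself
assert a[5][1] == 5
```

If Python assigned right to left instead, the a[b] target would fail with a NameError, since a and b wouldn't exist yet.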

The reason this was a big surprise is due to what was my broad mental model of serial assignment, which comes from C. In C, assignment is an expression that yields the value assigned (ie the value of 'a = 2' is 2). So in C and languages like it, serial assignment is a series of assignment expressions that happen right to left; you start with the actual expression producing a value, do the rightmost assignment (which yields the value again), and ripple leftwards. So a serial assignment groups like this:

a = (b = (c = (d = <expression>)))

Python doesn't work this way, of course; assignment is not an expression and doesn't produce a value. But I was still thinking of serial assignment as proceeding right to left by natural default and was surprised to learn that Python has chosen to do it in the other order. There's nothing wrong with this and it's perfectly sensible; it's just a decision that was exactly opposite from what I had in my mind.

(Looking back, I assumed in this entry that Python's serial assignment order was right to left without bothering to look it up.)

How did my misapprehension linger for so long? Well, partly it's that I don't use serial assignment very much in Python; in fact, I don't think anyone does much of it and I have the vague impression that it's not considered good style. But it's also that it's quite rare for the assignment order to actually matter, so you may not discover a mistaken belief about it for a very long time. This puzzle is a deliberately perverse exercise where it very much does matter, as the leftmost assignment actively sets up the variables that the next assignment then uses.

AssignmentPuzzleSurprise written at 21:58:55; Add Comment


