Wandering Thoughts archives

2013-01-18

More on my favorite way of marking continued lines

A commentator on my first entry on this both correctly noted that I had mis-attributed the RFC that originated this scheme (I learned it from RFC 822, but it was originally invented in RFC 724) and had some reactions to my idea, which means that I need to clarify it and add some additional comments. They wrote:

Comment-folding whitespace? Please, no. No. :( Comment-folding whitespace is the bane of people handling email.

In a sense, I entirely agree with this comment. Implementing full RFC 724/RFC 822 style parsing in your language is not what you want to do because it's too complex and perverse (mail headers have some crazy rules). But I was unclear in my original entry, especially about comments.

In a 'leading whitespace continues the logical line' environment, my usual approach to comments is that they occupy whole physical lines (ie you cannot have a line that is part-content and part-comment) and are silently removed in low-level parsing. As an example:

# this is a comment
abc # this is not
this is
   # a comment
   some text

this is some text

The last two things result in the same logical line ('this is some text') because the (indented) comment line is removed as part of assembling logical lines. There are many equally good variants on comment handling (eg disallow them in continued lines); I just find it convenient to be able to write comments for parts of anything that gets long enough to be split over multiple physical lines.

(As implied by the reassembled line, my approach is to replace all of the trailing whitespace, the newline, and the leading whitespace with a single space.)
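
To make that joining rule concrete, here is a minimal sketch of it in C. The function name and the buffer handling are my own invention for illustration, not from any particular implementation:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Append the continuation line 'cont' to the logical line in 'buf';
   trailing whitespace on 'buf', the newline, and leading whitespace
   on 'cont' all collapse into a single space. */
static void fold_line(char *buf, size_t bufsize, const char *cont)
{
    size_t len = strlen(buf);

    /* drop trailing whitespace (and the newline) from what we have */
    while (len > 0 && isspace((unsigned char)buf[len - 1]))
        buf[--len] = '\0';

    /* skip the continuation line's leading whitespace */
    while (isspace((unsigned char)*cont))
        cont++;

    /* join with exactly one space */
    snprintf(buf + len, bufsize - len, " %s", cont);

    /* drop the continuation line's own trailing whitespace too */
    len = strlen(buf);
    while (len > 0 && isspace((unsigned char)buf[len - 1]))
        buf[--len] = '\0';
}

Folding '   some text' into a buffer holding 'this is' produces 'this is some text', matching the reassembled line above.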

As implied by how I prefer to handle comments, this is all designed for simple situations, for configuration files and small DSLs with grammars that are as simple as possible (often simply 'space separated words' with some meaning layered on top). It's my strong belief that all of these languages already want to avoid language features that might make this sort of line continuation a problem (although I'm not sure what they would be). Yes, people can break logical lines up in perverse ways with this, but they can do that with any line continuation scheme (and you still want a line continuation scheme).

(As I have found out the hard way repeatedly, line continuations are something you almost always want to have, much like comments.)

If you're doing this as part of a real lexer and tokenizer, you will have to decide what happens with a single token that gets split over multiple physical lines, such as:

a = "some
     text"

Because I do this before any tokenization gets its hands on the result, my answer is 'what you see is what you get', ie the language tokenizer and parser gets handed 'a = "some text"' and may do with it whatever it wishes. This is not necessarily suitable for sophisticated languages which may sometimes want to retain newlines and leading whitespace as actual elements of eg strings, but as I said this is a design for simple languages.

FavoriteLineContinuationII written at 02:13:22

2013-01-16

My favorite way of marking continued lines

One of the things you often want when designing configuration files and little domain specific languages is some way of splitting a single long logical line into several physical ones. In other words you want some way of marking line continuations. Over the years people have come up with a huge assortment of ways to do this; you can have a language with explicit terminators and just ignore newlines, you can put backslashes at the end of incomplete lines, and so on.

(Some languages have several different ways of continuing lines, depending on the specific context. Try not to do this in yours if you have a choice.)

As it happens I have a favorite way of doing this and I think it's the best way. It is the 'RFC 822' method (so named because it's how mail headers are handled), where a logical line is continued by indented physical lines. Here is an example:

This is a single
     logical line
     once everything is
     reassembled
This is a new logical line

The drawback of this approach is that it becomes harder to make indentation significant in your language. I'd argue that this is not an important drawback for configuration files or small DSLs, since you should generally avoid significant indentation in them anyway because it makes your parser (much) harder to write.

The advantage of this approach, to me, is that it results in continued lines looking right, or at least looking obvious. It's a very common formatting convention to indent continued lines anyway (even or especially when not required by the language), and making the indentation significant means that you can't wind up with indented lines that aren't actually continuations (because, for example, you accidentally left out a \ at the end of the previous line; I've done this more than once in things like Makefiles).

Sidebar: parsing lines in this approach

I believe that the simplest way to parse the resulting language is in a two level process. At the first level you read physical lines, strip blank lines and comments, fold multiple physical lines into a single logical line, and deliver that line to the second level. The second level then parses your actual language. This requires a little bit of care in your first level and you'll need a little pushback stack for lines (since you're going to over-read by one physical line when reading a logical line and the physical line won't always be something you can just discard).
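
Here is a hedged sketch of that first level in C; the function names, the fixed-size buffers, and the policy choices marked in the comments are mine, not from any real parser:

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static char pushback[1024];        /* the one-line pushback slot */
static bool have_pushback = false;

/* Get the next physical line, honoring the pushback slot. */
static bool next_physical(FILE *fp, char *buf, size_t n)
{
    if (have_pushback) {
        have_pushback = false;
        snprintf(buf, n, "%s", pushback);
        return true;
    }
    return fgets(buf, n, fp) != NULL;
}

/* Trim trailing whitespace (including the newline) in place. */
static void rtrim(char *s)
{
    size_t i = strlen(s);
    while (i > 0 && isspace((unsigned char)s[i - 1]))
        s[--i] = '\0';
}

/* Assemble the next logical line. Blank lines and whole-line comments
   vanish here; indented physical lines are folded into the logical
   line with a single space. Returns false at end of input. */
static bool next_logical(FILE *fp, char *out, size_t n)
{
    char phys[1024];

    /* Find the start of a logical line: non-blank, non-comment, not
       indented. (A stray indented line here is arguably an error;
       this sketch just skips it.) */
    for (;;) {
        if (!next_physical(fp, phys, sizeof(phys)))
            return false;
        rtrim(phys);
        if (phys[0] != '\0' && phys[0] != '#' &&
            !isspace((unsigned char)phys[0]))
            break;
    }
    snprintf(out, n, "%s", phys);

    /* Fold in continuation lines until we over-read a line that is
       not one; that line goes into the pushback slot for next time. */
    while (next_physical(fp, phys, sizeof(phys))) {
        if (!isspace((unsigned char)phys[0])) {
            rtrim(phys);
            if (phys[0] == '\0' || phys[0] == '#')
                continue;          /* blank line or comment: stripped */
            snprintf(pushback, sizeof(pushback), "%s", phys);
            have_pushback = true;  /* start of the next logical line */
            break;
        }
        char *p = phys;
        while (isspace((unsigned char)*p))
            p++;
        rtrim(p);
        if (*p == '\0' || *p == '#')
            continue;              /* indented comment (or blank) */
        /* trailing whitespace, the newline, and leading whitespace
           all become a single space */
        strncat(out, " ", n - strlen(out) - 1);
        strncat(out, p, n - strlen(out) - 1);
    }
    return true;
}

Fed the example above, next_logical() hands back 'This is a single logical line once everything is reassembled' and then 'This is a new logical line'. Note one policy decision buried in here: blank lines are simply stripped, so a logical line can continue across them; a real design might instead end the logical line at a blank line.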

This is not quite a traditional lexer/parser split because your first level doesn't attempt to break up the logical lines into their components, but I try to avoid writing any sort of actual lexer for configuration files and small DSLs. If your situation is complex enough for a real lexer you probably want to handle the entire process in the lexer.

FavoriteLineContinuation written at 22:53:28

2013-01-11

A thought about static linking and popularity

This all started with my entry that touched on packaging with Go. Go is attractive partly because what you get at the end is a basically self-contained program with no dependencies (unlike in, say, interpreted languages). One part of this is that Go static-links everything.

The major advantage of static linking is no runtime dependencies. The disadvantages of static linking are that updates to packages (in the Go sense, ie a component or a library) are more of a pain, that the accumulated collection of everything using a package takes up more disk space, and that a bunch of programs running at the same time that use the same package can't share its memory (each gets its own copy).

As I was thinking about this, it occurred to me that the more popular your static-linked environment is the more that the drawbacks of static linking are a problem. If you only have a few programs written in Go, it's not a particularly big issue to rebuild and reinstall things if (or when) a bug or security issue comes up, and you wouldn't save much disk space or memory anyways. But imagine if Go becomes a popular language for writing programs in, to the point where a typical Linux system has a couple of hundred static-linked Go programs; all of these drawbacks would be much larger. A bugfix to a core package would result in a huge update, for example, because many programs use it and would have to be rebuilt.

(I suspect that the authors of Go don't really care about it being popular in this way and don't expect anything like it to happen any time soon, so I doubt that this issue is something that even crosses their radar.)

Another way to put this is that most of the time, static versus dynamic linking is a system engineering issue instead of a language or programming one. System administrators and system builders are the people who care about the choice, while programmers mostly don't (except as a neat technical challenge).

(There are uses for being able to dynamically load code, even just at startup, but many programs don't need anything like that. Even when they do, these days things like embeddable JIT'd 'interpreted' languages may offer alternate solutions.)

StaticLinkingAndPopularity written at 00:20:40

2013-01-03

An alternate pattern for polymorphism in C

As I mentioned in yesterday's entry, CPython (the C-based main implementation of Python) uses an interesting variant on struct-at-start based polymorphism. To put it simply, it uses #defines instead of a struct. This probably sounds odd, so let me show you the slightly simplified CPython 2.7.x code:

#define PyObject_HEAD           \
   Py_ssize_t ob_refcnt;        \
   struct _typeobject *ob_type;

#define PyObject_VAR_HEAD       \
   PyObject_HEAD                \
   Py_ssize_t ob_size;

typedef struct _object {
    PyObject_HEAD
} PyObject;

typedef struct {
    PyObject_VAR_HEAD
} PyVarObject;

/* A typical actual Python object */
typedef struct {
    PyObject_VAR_HEAD
    int ob_exports;
    Py_ssize_t ob_alloc;
    char *ob_bytes;
} PyByteArrayObject;

(This is taken from Include/object.h in the CPython source.)

The #defines are used to construct generic 'object' structs (the typedef'd PyObject and PyVarObject) for use in appropriate code, but actual Python objects use the #defines directly instead of having the object structs embedded in them. Things are cast back and forth as necessary; in practice (and I believe perhaps in ANSI C theory) it's guaranteed that the actual memory layouts of the start of a PyByteArrayObject and a PyVarObject are the same.

There are a number of advantages to this #define-based approach. The one that's visible here is that references to these polymorphic fields in actual structs do not require levels and levels of indirection through names that exist merely as containers. If p is a pointer to a PyByteArrayObject, you can directly refer to p->ob_refcnt instead of having to refer to p->b.a.ob_refcnt, where b and a are arbitrary names assigned to the PyVarObject and PyObject structs embedded in the PyByteArrayObject. This goes well with CPP macros to manipulate the various fields (actual functions, even inline ones, would require some actual casting). In particular it means that a CPP macro to manipulate ob_refcnt doesn't have to care whether it's handed an object based on PyObject or one based on PyVarObject; with explicit structs, the former case would need p->a.ob_refcnt while the latter would need p->b.a.ob_refcnt.
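
You can see this at work in CPython's actual reference counting macros. Slightly simplified (the real versions in Include/object.h also carry optional refcount-debugging hooks), they look like this; because ob_refcnt is a direct member of every object struct, the same macros work unchanged on anything declared with PyObject_HEAD or PyObject_VAR_HEAD:

/* slightly simplified; the real macros have debug hooks too */
#define Py_INCREF(op) ((op)->ob_refcnt++)

#define Py_DECREF(op)                       \
    if (--(op)->ob_refcnt != 0)             \
        ;                                   \
    else                                    \
        _Py_Dealloc((PyObject *)(op))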

(Some C compilers allow anonymous structs if the members are unique and this is now standardized in C11.)

CPolymorphicPatternsII written at 01:37:03

2013-01-02

Some patterns for polymorphism in C

As I've written about before, C programmers tend to (re)invent certain parts of OO programming on their own because it's the natural, easiest way to write code. In that entry I mentioned that one thing C programs tend to have is a polymorphic object system. As it happens, I've seen several different ways of doing this in C (and I'm sure there are others; C programmers are inventive).

In theory the simplest way of doing polymorphism in C is to start all of your structs with a common set of members; then you can dereference a pointer to any struct to get at these without caring what particular sort of struct you have. In practice, almost everyone does this by having a core struct type with all of those members, because this gives your code a convenient base name for this set of common members.

(You can do without this base name but then you have to pick some actual struct type for the type of your pointer and the whole point of the polymorphism is that you don't care just what concrete type the pointer you have is.)

Once you have a struct with all your common members in it (call it an object struct), a question comes up: where does it go in your actual struct? I've seen three answers.

  • the object struct goes at the start of your larger struct, and you simply cast the pointer between the two types as needed. This leads naturally to a style where you have several levels of more or less common object structs nested inside each other like Matryoshka dolls, each adding a few more fields that each layer of polymorphism needs.

  • the object struct goes at some constant offset inside your larger structs. Going back and forth between a pointer to the object struct (when you need polymorphism) and a pointer to your actual struct requires some casting magic (generally wrapped up in CPP macros; there's a sketch of the usual trick just after this list) and is somewhat more annoying than before.

    (I think this style is the least common.)

  • the object struct has a pointer to your overall struct and goes anywhere you like in your larger struct (and may not be at a constant position in various different structs); you simply initialize the pointer appropriately when you're setting up an instance of the overall struct. This costs you an extra pointer field but frees you from various issues.
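
For the constant-offset approach, the casting magic is usually some variant of the classic offsetof() trick. A hedged sketch with invented names (this shows the general pattern, not any specific program's macros):

#include <stddef.h>

struct object {
    int refcount;
};

struct widget {
    char name[32];
    struct object obj;    /* not at the start, but at a fixed offset */
    int width;
};

/* recover the containing struct widget from a struct object pointer;
   this is the classic 'container_of' pattern */
#define WIDGET_OF(op) \
    ((struct widget *)((char *)(op) - offsetof(struct widget, obj)))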

The first approach is by far the most common one that I've seen. It's the one I've generally used in my code when I needed this kind of thing; the other two approaches tend to be for more esoteric situations where for some reason you can't put (this) object struct at the start of your overall struct.
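
As a concrete illustration of that first approach, here's a minimal sketch; the struct and function names are invented for the example:

#include <stdio.h>

/* the common object struct; every real struct starts with one */
struct object {
    const char *type_name;
    int refcount;
};

/* two concrete structs, each with the object struct at the start */
struct circle {
    struct object obj;
    double radius;
};

struct label {
    struct object obj;
    char text[64];
};

/* polymorphic code takes a struct object * and neither knows nor
   cares which concrete struct it really has */
static void describe(struct object *o)
{
    printf("a %s with refcount %d\n", o->type_name, o->refcount);
}

int main(void)
{
    struct circle c = { { "circle", 1 }, 2.0 };
    struct label  l = { { "label", 1 }, "hi there" };

    describe((struct object *)&c);
    describe((struct object *)&l);
    return 0;
}

The casts are safe because C guarantees that a pointer to a struct, suitably converted, points to its first member.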

There is an interesting variation on the first approach that kind of sidesteps having an actual object struct at the start of your real structs, but explaining it (and talking about why you'd want to do it) requires quoting enough code from CPython that I'm going to make it a separate entry.

CPolymorphicPatterns written at 01:58:48

