An example sort that needs a comparison function

January 2, 2012

In reaction to my entry on Python 3 dropping the comparison function for sorting, some people may feel that a sorting order that is neither simple field-based nor based on a computed 'distance' (the two cases easily handled by a key function) is unrealistic. As it happens I can give you a great example of a sort order that cannot be handled in any other way: software package versions on Linux systems.

For simplicity (and because I know RPM best), I'm going to talk about RPM-based version numbers. RPM version numbers have three components, an epoch, a version, and a release, and ordering is based on comparing each successive component in turn. The epoch is a simple numeric comparison (higher epochs are more recent), but both the version and release can have sub-components and each sub-component must be compared piecewise using a relatively complex comparison for each piece (they can be all digits, letters, or mixed letters and digits). Something with extra sub-components is more recent than something without it, so version 1.6.1 is more recent than version 1.6. A full package version can look like '1:2.4.6-4.fc16.cks.0'; '1:' denotes the epoch, the version is '2.4.6', and the release is '4.fc16.cks.0'.

(Most RPM packages have an epoch of '1' '0', which is conventionally omitted when reporting package versions.)

In the presence of potential letter-based subcomponents and the complex comparison rules, you can't compare these version numbers using simple field-based rules, not even if you split sub-components up into tuples and then compare a tuple-of-tuples (it's possible if all sub-components are simple numbers). Nor can you compute some sort of single numerical 'distance' value for a particular version number, especially since version numbers are sort of like the rational numbers in that you can always add an essentially unlimited number of additional versions between any two apparently adjacent versions. The only real operation you have is a pure comparison, where you answer the question 'is X a higher version than Y', and this comparison requires relatively intricate code.

(Having said that, DanielMartin showed a nice way to transform things so that a key-function based sort can be used for a comparison function sort in comments on the earlier entry.)


Comments on this page:

From 72.14.228.1 at 2012-01-02 11:24:40:

In fact this example doesn't need a comparison function. Try this for the key function:

import re
import itertools

def keyf(ver):
  return map(lambda c: tuple(itertools.imap(
          lambda x,y,z: (int(y or 0), z or ''), 
          *[iter(re.split('([0-9]+)|([a-zA-Z]+)', c or ''))]*3)),
      re.match('(?:([0-9]+):)?([^-]*)(.*)', ver).groups())

Or clean that up if company's coming over. I'll admit that this version of the key function isn't terribly readable. Maybe this is more to your liking:

import re
import itertools

def keyf(ver):
  # Get epoch, version, release
  components = re.match('(?:([0-9]+):)?([^-]*)(.*)', ver).groups()
  k = []
  for c in components:
    component = c or ''    # treat None as ''
    verbits_text = re.split('([0-9]+)|([a-zA-Z]+)', component)
    # we now have a list of:
    # (punct., number or None, alpha or None, punct., num or None, 
    #  alpha or None, punct., etc.)
    # This is the idiom for "n at a time" from itertools docs:
    verbits = itertools.izip(*[iter(verbits_text)]*3)
    # Now turn each bit into a tuple (number, alpha)
    component_key = ((int(y or 0), z or '') for (x,y,z) in verbits)
    k.append(tuple(component_key))
  return k

-- DanielMartin

From 65.172.155.228 at 2012-01-04 14:37:38:

Actually it does, because if you aren't calling rpm.labelCompare() you are "doing it wrong"(tm).

Also, as a minor thing:

(Most RPM packages have an epoch of '1', which is conventionally omitted when reporting package versions.)

...here '1' should be '0', although "most" is > 50% not > 95% ... as you might hope.

By cks at 2012-01-05 11:07:49:

I think we're both wrong about the default RPM epoch; it appears that the default epoch is no epoch at all (printed by RPM as '(none)'). Almost all of the packages on machines running Fedora 15, Fedora 16, and RHEL 5 have a '(none)' epoch. After that, an epoch of '1' is the most common value.

I agree with you about rpm.labelCompare() for real code.

Written on 02 January 2012.
« Why CA-based SSL is not likely to be replaced any time soon
The drawback of full disk encryption »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Jan 2 01:49:41 2012
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.