Messing with Python's sort function

There are many interesting things about python's sort function like how its adapts depending on the input, how its stable in-place, how its in other languages and how it could give the wrong result. This post is about none of those.

What and why

Instead I'm going to try to mess with Python's implementation of Timsort and try to make it give the wrong answer through no fault of its own. And I will fail.

The reason for doing this is that shortly after releasing my undoable library to the world, I found a bug in my replace function that emptied the list on an error. Here's that function.

def replace(self, newlist):
    oldlist = self[:]
    self.skiplog += 1
    del self[:]
    self.extend(newlist)
    self.skiplog -= 1
    self.callback(("replace", oldlist), ("replace", newlist))

Its supposed to replace a list with the content of another one while avoiding a reassignment of everything pointing to that list. But if the new "list" was actually an integer, del self[:] ran but self.extend(newlist) gives an error. (I wrapped the extend part in a try-except block as a fix but that's not what I want to talk about here.)

This got me thinking of Python's sort function. Since not only does it have many steps, each one of which could give an error, but its also in-place. If an error happens along the way, there's no copy of the original list to revert back to! Maybe there's an easier way to handle or prevent these errors? Let's look for the answer in Python guaranttee consistency.

Result

Let's see what comparisons are made first.

>>> l = [4, 2, 29, 1, 6, (1,)]
>>> def log_cmp(x, y):
...     print "comparing", x, y
...     return cmp(x, y)
...
>>> l.sort(cmp=log_cmp)
comparing 2 4
comparing 29 2
comparing 29 4
comparing 1 4
comparing 1 2
comparing 6 4
comparing 6 29
comparing (1,) 4
comparing (1,) 29
>>> l
[1, 2, 4, 6, 29, (1,)]

Ok, I'll try to trip it up now by adding an error when (1,) is compared since that only happens near the end.

>>> def error_cmp(x, y):
...     print "comparing", x, y
...     if (1,) in [x, y]:
...         raise Exception("No-go sort")
...     return cmp(x, y)
... 
>>> l = [4, 2, 29, 1, 6, (1,)]
>>> l.sort(cmp=error_cmp)
comparing 2 4
comparing 29 2
comparing 29 4
comparing 1 4
comparing 1 2
comparing 6 4
comparing 6 29
comparing (1,) 4
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in error_cmp
Exception: No-go sort
>>> l
[1, 2, 4, 6, 29, (1,)]

Looks like it doesn't rollback on an error. In fact, the comment above listsort in listobject.c in the source code says

2037    /* An adaptive, stable, natural mergesort. See listsort.txt.
2038    * Returns Py_None on success, NULL on error. Even in case of error, the
2039    * list will be some permutation of its input state (nothing is lost or
2040    * duplicated).

Hmm, what if we tried to change the list midway then?

>>> def add_cmp(x, y):
...     print "comparing", x, y, l
...     l.append(100)
...     return cmp(x, y)
... 
>>> l = [4, 2, 29, 1, 6, (1,)]
>>> l.sort(cmp=add_cmp)
comparing 2 4 []
comparing 29 2 [100]
comparing 29 4 [100, 100]
comparing 1 4 [100, 100, 100]
comparing 1 2 [100, 100, 100, 100]
comparing 6 4 [100, 100, 100, 100, 100]
comparing 6 29 [100, 100, 100, 100, 100, 100]
comparing (1,) 4 [100, 100, 100, 100, 100, 100, 100]
comparing (1,) 29 [100, 100, 100, 100, 100, 100, 100, 100]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: list modified during sort
>>> l
[1, 2, 4, 6, 29, (1,)]

So the variable l temporarily loses its value while being sorted? This needs one more test.

>>> def undo_cmp(x, y):
...     print "comparing", x, y, l
...     l.append(100)
...     if (x, y) == ((1,), 29):
...         del l[:]
...         print "new value", l
...     return cmp(x, y)
... 
>>> l = [4, 2, 29, 1, 6, (1,)]
>>> l.sort(cmp=undo_cmp)
comparing 2 4 []
comparing 29 2 [100]
comparing 29 4 [100, 100]
comparing 1 4 [100, 100, 100]
comparing 1 2 [100, 100, 100, 100]
comparing 6 4 [100, 100, 100, 100, 100]
comparing 6 29 [100, 100, 100, 100, 100, 100]
comparing (1,) 4 [100, 100, 100, 100, 100, 100, 100]
comparing (1,) 29 [100, 100, 100, 100, 100, 100, 100, 100]
new value []
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: list modified during sort

Well, that was still caught. I suppose we could have just looked at the source and seen

2091    self->allocated = -1; /* any operation will reset it to >= 0 */
...
2180    if (self->allocated != -1 && result != NULL) {
2181    /* The user mucked with the list during the sort,
2182    * and we don't already have another error to report.
2183    */
2184    PyErr_SetString(PyExc_ValueError, "list modified during sort");
2185    result = NULL;

Now if only I tested my own stuff this much. Or find people I could trade program-testing with.

Unfortunately, the answer to my initial question of how to handle these cases seems to be "add checks for special cases". Maybe the complexity of the consistency/atomicity I want forces this kind of verbosity?

Blog index