Performance difference on insert-sort in C and python

I was curious about the performance of insert-sort using C and python but the results I've got just make me think if I've done something wrong. I suspected that C would be faster, but not that much.

I've profiled both codes and the insert-sort function is the place where the time is most spent.

Here is the C function:

insert_sort (vec_t * vec)
    int j;
    for (j = 1 ; j < vec->n ; j++){
        int key = vec->v[j];
        int i = j - 1;
        while (i >= 0 && vec->v[i] > key){
            vec->v[i+1] = vec->v[i];
        vec->v[i+1] = key;

Here is the python function:

def insert_sort (ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j-1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
        ln[i+1] = key

The test was made with 10000 integers, each one randomly generated between 0 and 10000.

The results for the time spent in each function was:

  • C time: 0.13 seconds
  • python time: 8.104 seconds

Am I doing something wrong here? Like I said, I expected to see the C code being faster, but not that faster.

I don't want to use built-in functions or whatsoever. I would like to implement the algorithm. Is there a pythonic way to doing things that I could use in the insert-sort?

13.10.2009 17:04:50
The obvious answer is that the pythonic way is to use the built-ins :). Actually, though, I don't think there is any problem with the code. C is compiled, Python is interpreted, and few things are faster than a C for loop. It is not having to do nearly the work that Python is, since Python is reference counting, garbage collecting, acquiring/releasing the GIL, and a dozen other things while it's executing your code.
Jeff Ober 13.10.2009 17:12:13
Python code runs on a Virtual Machine (PVM). So in most cases plain Python code is at least an order of magnitude slower than the corresponding C code.
Nick Dandoulakis 13.10.2009 17:22:12
Python as an interpreted language is just roughly 60 times slower.. Thats in the ballpark of what I would expect. The timing looks fine.
Nils Pipenbrinck 13.10.2009 17:23:46
The power of Python is in it's built-in functions/types (Python still is, more or less, a glue language). Try writing an algorithm that extensively uses a hash table and you will have pretty hard time making it faster in C.
Lukáš Lalinský 13.10.2009 17:26:59
Please provide more details on your testing method. Sorting 10,000 random integers takes ~6 msec on my machine. Using python's built-in sort takes ~400 usec. Timed from command line using
foosion 13.10.2009 19:40:17

Python is a dynamic language and the standard implementation uses an interpreter to evaluate code. This means that where the compiled C code can escape with a single machine instruction, for instance assigning to vec->v[i+1], Python's interpreter has to look up the sequence variable from the local scope, look up its class, find the item setting method on the class, call that method. Similarly for the compare, addition. Not to mention that executing almost every bytecode results in an indirect branch mispredict in the CPU that causes a pipeline bubble.

This is the sort of code that would benefit a lot from JIT compilation to native code and runtime type specialization, like unladen-swallow and PyPy are starting to do.

Otherwise the code is pretty much pythonic in the sense that if one needs to implement insertion sort, this is how one would do it in Python. It's also very unpythonic because you should use the very efficient built-in sort.

13.10.2009 17:20:10

What's wrong with:

13.10.2009 17:12:09
That uses a tim-sort. The goal is obviously to implement/learn about the algorithm, not to use the result.
Jeff Ober 13.10.2009 17:13:34
The goal is obviously to implement/learn about the algorithm, not to use the result. – Jeff Ober Exactly.
didi 13.10.2009 17:16:12
It's not obvious at all. Python is a framework build around lots of native-coded functions. Trying to re-implement basic stuff like like in Python, and then comparing it with the C is ridiculous.
alex tingle 13.10.2009 17:17:26
He specifically stated he didn't want to use built-in functions. I wasn't going to vote you down, but -1 for calling his experiment ridiculous.
rlbond 13.10.2009 17:21:23
@rlbond - I disagree. Trying to compare inefficient algorithms in C to inefficient algorithms in Python (or any other dynamic language) is going to yield results like this. That's the nature of dynamic languages.
Chris Lutz 13.10.2009 17:23:16

What method did you use to measure the time?
Doing this sort of thing, I find python is at least 30 times slower than C
The C compiler may be able to use some optimisations that Python doesn't even attempt

If might be interesting to try psyco, that type of code is well suited to it.

building on Alex's answer, I tried cython. In his case cython turns the for loop and everything into pure C, so now I can compare C, python and psyco

now i have this

import psyco
import random
li = [random.randrange(10000) for _ in xrange(10000)]

def insort (ln):
    for j in range(1, len(ln)):
        key = ln[j]
        i = j-1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
        ln[i+1] = key


import pyximport; pyximport.install()
import pyxinsort

def pyx_setup():

def pyx_insort():

and this pyxinsort.pyx

cdef int ln[10000]

def insort(li):
    cdef int i,j,key
    for j in range(1, len(li)):
        key = ln[j]
        i = j-1
        while i >= 0 and ln[i] > key:
            ln[i+1] = ln[i]
        ln[i+1] = key

def setup(li):
    cdef int i
    for i in range(1, len(li)):

The code for insort is virtually identical. li is passed in for its length. ln is the array that is sorted and is prepopulated by setup, so I can isolate building the list from the sort


$ python2.5 -mtimeit -s'import inso' 'list('
10000 loops, best of 3: 84.5 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.insort(list('
10 loops, best of 3: 21.9 sec per loop


$ python2.5 -mtimeit -s'import inso' 'list('
10000 loops, best of 3: 85.6 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.insort(list('
10 loops, best of 3: 578 msec per loop

cython ( this is running exactly the same algorithm converted to C and compiled )

$ python2.5 -mtimeit -s'import inso' 'inso.pyx_setup()'
10000 loops, best of 3: 141 usec per loop
$ python2.5 -mtimeit -s'import inso' 'inso.pyx_setup();inso.pyx_insort()'
10 loops, best of 3: 46.6 msec per loop

cython beats psyco by a factor of 16 and Python by a factor of 470!

For completeness, i've included the corresponding piece of C code generated by cython

  for (__pyx_v_j = 1; __pyx_v_j < __pyx_1; __pyx_v_j+=1) {
    __pyx_v_key = (__pyx_v_9pyxinsort_ln[__pyx_v_j]);
    __pyx_v_i = (__pyx_v_j - 1);
    while (1) {
      __pyx_2 = (__pyx_v_i >= 0);
      if (__pyx_2) {
        __pyx_2 = ((__pyx_v_9pyxinsort_ln[__pyx_v_i]) > __pyx_v_key);
      if (!__pyx_2) break;
      (__pyx_v_9pyxinsort_ln[(__pyx_v_i + 1)]) = (__pyx_v_9pyxinsort_ln[__pyx_v_i]);
      __pyx_v_i -= 1;
    (__pyx_v_9pyxinsort_ln[(__pyx_v_i + 1)]) = __pyx_v_key;
14.10.2009 21:30:06

So, here are some lessons you should take away from this:

  • Interpreted Python is on the slow side. Don't try to write your own FFT, MPEG encoder, etc. in Python.

  • Even slow interpreted Python is probably fast enough for small problems. An 8 second run time is not horrible, and it would take you much longer to write and debug the C than the Python, so if you are writing something to run once, Python wins.

  • For speed in Python, try to rely on built-in features and C modules. Let someone else's C code do the heavy lifting. I worked on an embedded device where we did our work in Python; despite the slow embedded processor, the performance was decent, because C library modules were doing most of the work.

For fun and education, please repeat your Python test, this time using the built-in .sort() method on a list; it probably won't be quite as fast as the C, but it will be close. (Although for really large data sets, it will beat the C, because insertion sort sucks. If you rewrote the C to use the C library qsort() function, that would be the speed champ.)

A common Python design "pattern" is: first, write your app in Python. If it is fast enough, stop; you are done. Second, try to rewrite to improve speed; see if there is a C module you can use, for example. If it is still not fast enough, consider writing your own C module; or, write a C program, using the Python prototype code as the basis for your design.

13.10.2009 20:50:26
Insertion sort can be done O(nlogn) but this one is O(n^2) so at 10000 i expect timsort to beat it
John La Rooy 13.10.2009 21:09:39
I said for "really large data sets" that Python would win, because Timsort is awesome; but for only 10,000 items, I expected the startup time to dominate. However, I just tossed off a quick test on my computer, and found that my trivial Python program to allocate a list of 10,000 random integers and sort it with .sort() is in fact faster than my trivial C program to initialize an array of 10,000 integers and sort it with the above insertion sort. (I had to modify the insertion sort to work in bare C... I'm not sure what that "vec" type is. It doesn't look like std::vector<int>...)
steveha 13.10.2009 23:17:53
It looks like vec is just the array v bundled with it's length n
John La Rooy 14.10.2009 21:24:43
Sure, but I just wanted to toss off a quick test, so I just converted it to boring C. I didn't want to try to write a vec-compatible class to make it work, or convert it to std::vector<int>, just for a quick test.
steveha 14.10.2009 21:47:20

My first thought was that the laptop I have at hand right now, a Macbook Pro, must be comparable to but slightly better than your machine -- I don't have enough of your surrounding code to try your C example (what's a vec_t, etc, etc), but running the Python you coded gives me:

$ python -mtimeit -s'import inso' 'inso.insort('
10 loops, best of 3: 7.21 msec per loop

vs your 8.1 seconds. That's with you code put in, preceded by:

import random
li = [random.randrange(10000) for _ in xrange(10000)]

array doesn't help -- actually slows things down a bit. Then I installed psyco, the Python JIT helper (x86-only, 32-bit only), further added:

import psyco

and got:

$ python -mtimeit -s'import inso' 'inso.insort('
10 loops, best of 3: 207 usec per loop

so a speedup of about 7.21 / 0.000207 = 34830 times -- vs the 8.04 / 0.13 = 62 times that surprised you so much;-).

Of course, the problem is that after the first time, the list is already sorted, so insort becomes must faster. You didn't give us enough of the surrounding test harness to know exactly what you measured. A more realisting example (where the actual list isn't touched so it stays disordered, only a copy is sorted...), without psyco:

$ python -mtimeit -s'import inso' 'inso.insort(list('
10 loops, best of 3: 13.8 sec per loop

Oops -- so your machine's WAY faster than a Macbook Pro (remembers, core don't count: we're using only one here;-) -- wow... or else, you're mismeasuring. Anyway, WITH psyco:

$ python -mtimeit -s'import inso' 'inso.insort(list('
10 loops, best of 3: 456 msec per loop

So psyco's speedup is only 13.8 / 0.456, 30 times -- about half as much as the 60+ times you get with pure-C coding. IOW, you'd expect python + psyco to be twice as slow as pure C. That's a more realistic and typical assessment.

If you we writing reasonably high-level code, psyco's speedup of it would degrade from (say) 30 times down to much less -- but so would C's advantage over Python. For example,

$ python -mtimeit -s'import inso' 'sorted('
100 loops, best of 3: 8.72 msec per loop

without psyco (in this case, psyco actually -- marginally -- slows down the execution;-), so that's another factor of 52 over psyco, 1582 overall over non-psyco insort.

But, when for some reason or other you have to write extremely low-level algorithms in python, rather than using the wealth of support from the builtins and stdlib, psyco can help reduce the pain.

Another point is, when you benchmark, please post ALL code so others can see exactly what you're doing (and possibly spot gotchas) -- your "scaffolding" is as tricky and likely to hide traps, as the code you think you're measuring!-)

14.10.2009 03:13:19
Can you run timeit with just the list copy and subtract the result per loop from the list copy+insort? I realise the copy is quick compared to the sort of course.
John La Rooy 14.10.2009 04:21:25
@gnibbler, 179 usec -- in the noise compared to most of the other measurements, not just "quick";-)
Alex Martelli 14.10.2009 04:47:53
Alex I have run with cython and psyco and saw a factor of 16 difference. Does cython run on your mac?…
John La Rooy 14.10.2009 05:31:51
@gnibbler, sure, Cython runs great on the mac, and it's a very useful language, but it's NOT Python -- it's an "extended subset" (isn't everything an extended subset of everything else?-), which is what the OP was asking about!-)
Alex Martelli 14.10.2009 05:41:25
Alex, sorry I wasn't clearer. My intent was to make sure the inside of the function in cython is compiled to pure C, so the cython gives a way to benchmark the C in the same framework you used for the python. Python is taking ~19sec on my notebook, and C is over 400 times faster according to this experiment. I was just wondering if you could confirm the result for cython. 400 times seems a lot
John La Rooy 14.10.2009 07:24:20