Thursday, April 15, 2010

dbapiext.py - making dealing with SQL easier in Python

Sorry there was no post last week - I’ve been a little busy organising PyCon Australia.

This week I’d like to share dbapiext.py by Martin Blais with you. It provides really nice interface for making dealing with SQL easier in Python.

Introduction

In the Python world we have a standard specification of how database interfaces should look and we call it PEP 294, or more commonly DB-API 2.0. One of the things about DB-API 2.0 is that it’s a little loose in some areas - the stated goal of the PEP is “to encourage similarity between the Python modules that are used to access databases” (my emphasis) rather than to dictate a precise API.

In practical terms what this means is that you can’t just swap DB-API 2.0 modules like DCOracle2 for cx_Oracle, even for simple operations and even though they both run on top of Oracle. The core reason for this is that they use different parameter specifications. For example, a query in DCOracle2 might look like:

cursor.execute('SELECT * FROM customers WHERE name=:1', 'Richard Jones')

the same query in cx_Oracle might look like:

cursor.execute('SELECT * FROM customers WHERE name=:name', name='Richard Jones')

Of course one solution is to not use parameters and format the arguments directly into the SQL. But that’s just inviting disaster.

The dbapiext module attempts to normalise these interfaces and add some new convenience features. At its simplest it allows one to translate the above two statemens into a single form:

execute_f(cursor, 'SELECT * FROM customers WHERE name=%S', 'Richard Jones')

Where the “%S” indicates that the value being passed should be escaped and quoted as appropriate. If you don’t wish for the value to be treated as such you may use “%s” which formats the value directly into the SQL. This will work on top of either backend, given the following definition of execute_f:

from dbapiext import execute_f
import functools
if using_cx_Oracle:
    execute_f = functools.partial(execute_f, paramstyle='named')
else:
    execute_f = functools.partial(execute_f, paramstyle='numeric')

Other Capabilities

The query parsing and argument handling is quite flexible. You can mix positional arguments and keyword arguments - and refer to the keyword arguments by name in the SQL, especially if the underlying database connection implementation doesn’t offer the facility:

execute_f(cursor, '''SELECT * FROM account WHERE
     account.active = %S AND (
     account.number = %(number)S OR
     account.number = (
         SELECT account FROM mobile_account WHERE
         mobile_account.msisdn = %(number)S
     ))''', activated, number=number)

In this situation the “%S” SQL parameter will use the fixed argument “activated” and the “%(number)S” SQL parameter will use the keyword argument “number”. Pretty cool.

dbapiext doesn’t stop there - it introduces a bunch of other really neat extensions. How many times has your code included something like this?

# given some variable list of data to update
data = dict(name='Richard', size='medium', alignment='neutral')

# figure the SQL and values argument
columns = ', '.join('%s=:1'%k for k in data)
sql = 'UPDATE person set %s'%columns
values = [data[k] for k in data]

# update the information in a table
cursor.execute(sql, *values)

With dbapiext you can do the much more pythonic:

# given some variable list of data to update
data = dict(name='Richard', size='medium', alignment='neutral')
execute_f(cursor, 'UPDATE person %S', data)

The “%S” argument here renders the dictionary in the form suitable for the UPDATE statement. If your SQL was a SELECT instead you may use “%A” which joins the dictionary items with “AND” instead:

# given some variable list of data to match
data = dict(name='Richard', size='medium', alignment='neutral')
execute_f(cursor, 'SELECT * FROM person WHERE %A', data)

Note that “data” could also be a list of pairs instead of a dictionary. How convenient is that!

Thursday, April 1, 2010

cython - easier optimisations than writing C

For this Bit Of Cheese I thought I’d present a bit of an example using cython since I just learned how to use it myself for PyWeek.

For the unaware, cython is a neat little extension to Python that makes it much easier to write C-optimised modules for your Python code, including using 3rd-party C libraries.

For my PyWeek entry I need to generate a lot of cubic b├ęzier curves. I also wanted to experiment with mutating them - which basically means re-generating them many times over. The core curve generation code therefore had to be as fast as I could possibly make it.

This code looks something like this in Python:

def generate(p1x, p1y, p2x, p2y, cp1x, cp1y, cp2x,
         cp2y, step):
    '''Given the two points p1 and p2 and their control
     points cp1 and cp2 generate a cubic bezier curve with
     steps of "step".

    Return the list of (x, y) points on the curve.
    '''
    l = []
    # generate the cubic bezier points
    x1 = cp1x * 3; x2 = cp2x * 3
    y1 = cp1y * 3; y2 = cp2y * 3
    t = 0
    while t <= (1 + step):
        a = t; a2 = a**2; a3 = a**3
        b = 1 - t; b2 = b**2; b3 = b**3
        px = p1x*b3 + x1*b2*a + x2*b*a2 + p2x*a3
        py = p1y*b3 + y1*b2*a + y2*b*a2 + p2y*a3
        l.append((px, py))
        t += step
    return l

def speed_test():
    generate(0, 0, 5, 0, 0, 1, 5, -1, .0001)

For generating a curve with quite small steps on my MacBook the result is about 17.5 milliseconds per curve:

$ python -m timeit -s 'import curve' 'curve.speed_test()'
100 loops, best of 3: 17.5 msec per loop

The first thing I tried is simply copying my curve.py to _curve.pyx (the .pyx denoting it being a cython module) and adding some code to the end of the original curve.py to use the cython version:

try:
    import pyximport; pyximport.install()
    import _curve
    generate = _curve.generate
except Exception, e:
    print '_curve not available:', e
    pass

This is only one way to generate cython modules. The other is to use code in your setup.py file – but for this project I don’t actually have one. The pyximport module will compile the “pyrex” format cython files with the .pyx suffix on import. It detects changes to the original file and recompiles, so it’s incredibly convenient!

Re-running the test after that change I found the result was surprisingly similar:

$ python -m timeit -s 'import curve' 'curve.speed_test()'
100 loops, best of 3: 17.5 msec per loop

Hmm.

OK, well, cython has a bunch of hints I can give it like function arguments. These are just C types that I can add to the Python function signature.

Let’s try those to start with. Note that Python floats are actually C doubles:

def generate(double p1x, double p1y, double p2x,
        double p2y, double cp1x, double cp1y,
        double cp2x, double cp2y, double step):

This saves me a couple of seconds:

$ python -m timeit -s 'import curve' 'curve.speed_test()'
100 loops, best of 3: 15.5 msec per loop

So let’s declare all of my variable types (except the Python list):

cdef double x1, x2, y1, y2, t
cdef double a, a2, a3, b, b2, b3, px, py

This change bought me the improvement I was after - the function is now way faster than the original Python version:

$ python -m timeit -s 'import curve' 'curve.speed_test()'
100 loops, best of 3: 2.08 msec per loop

Cool eh?

The final code in _curve.pyx looks like this:

def generate(double p1x, double p1y, double p2x,
        double p2y, double cp1x, double cp1y,
        double cp2x, double cp2y, double step): 
    '''Given the two points p1 and p2 and their control 
    points cp1 and cp2 generate a cubic bezier curve 
    with steps of "step".

    Return the list of (x, y) points on the curve. 
    ''' 
    cdef double x1, x2, y1, y2, t 
    cdef double a, a2, a3, b, b2, b3, px, py 
    l = [] 
    # generate the cubic bezier points 
    x1 = cp1x * 3; x2 = cp2x * 3 
    y1 = cp1y * 3; y2 = cp2y * 3 
    t = 0 
    while t <= (1 + step): 
        a = t; a2 = a**2; a3 = a**3 
        b = 1 - t; b2 = b**2; b3 = b**3 
        px = p1x*b3 + x1*b2*a + x2*b*a2 + p2x*a3 
        py = p1y*b3 + y1*b2*a + y2*b*a2 + p2y*a3 
        l.append((px, py)) 
        t += step 
    return l

The important thing to note here is that the bulk of the function is untouched, pure Python. cython does all the smarts now it knows enough about the types of the variables. Clever cython!

OK, now the downside. It took me a lot of trial-and-error and banging my head against the cython docs to even do something as simple as that.

One of my early mistakes was to declare the function "cdef" like I saw in all the examples. This results in a function that's not actually exported to Python from the module though. It needs to remain a standard "def" (or, if it's to be used in both C and Python it may be a "cpdef"'ed function). This stumble cost me quite a large amount of time.

Eventually I found most of the useful information in a section marked “Old Cython Users Guide” which, though broken in places, still contains the most useful information for a newbie like me.

ps. apologies for the horizontally smooshed code - I've only just found a wider template. Future posts won't be quite as restricted.

Thursday, March 25, 2010

Regular expression ... expressions!

To kick off this blog I think I'll start with something a little wacky. Krister Hedfors has created a package called inrex which implements a bunch of regular expression "operators" ("inrex" is short for "infix regular expressions"). Here's how regular expressions are normally handled in Python:
>>> import re
>>> match = re.match(r'(\w+) (\d+)', 'asd 123')
>>> if match is not None:
...    print 'word is', match.group(1)
...    print 'digit is', match.group(2)
... 
word is asd
digit is 123
>>> match = re.match(r'(?P<word>\w+) (?P<digit>\d+)', 'asd 123')
>>> if match is not None:
...    print 'word is', match.group('word')
...    print 'digit is', match.group('digit')
... 
word is asd
digit is 123
>>> re.findall(r'\d+', 'asd 123 qwe 456')
['123', '456']
>>> re.split(r'\d+', 'asd 123 qwe 456')
['asd ', ' qwe ', '']
>>> re.split(r'\d+', 'asd 123 qwe 456', maxsplit=1)
['asd ', ' qwe 456']
Note that we need to have a statement to obtain the match object and a second statement to examine it. Pretty standard Python, but a little annoying sometimes. Here's how the same results are achieved in inrex:
>>> from inrex import  match, search, split, findall, finditer
>>> 
>>> if 'asd 123' |match| r'(\w+) (\d+)':
...   print 'word is', match[1]
...   print 'digit is', match[2]
... 
word is asd
digit is 123
>>> if 'asd 123' |match| r'(?P<word>\w+) (?P<digit>\d+)':
...   print 'word is', match['word']
...   print 'digit is', match['digit']
... 
word is asd
digit is 123
>>> 'asd 123 qwe 456' |findall| r'\d+'
['123', '456']
>>> 'asd 123 qwe 456' |split| r'\d+'
['asd ', ' qwe ', '']
>>> 'asd 123 qwe 456' |split(maxsplit=1)| r'\d+'
['asd ', ' qwe 456']
Working with the match object is clearly much easier. There's a limitation that it'll only work for an immediate result; unlike the standard re.match the inrex match object is a singleton, and thus you can only work with one result at a time. For simple cases (the most common) a singleton match object would suffice.