Thursday, March 25, 2010

Regular expression ... expressions!

To kick off this blog I think I'll start with something a little wacky. Krister Hedfors has created a package called inrex which implements a bunch of regular expression "operators" ("inrex" is short for "infix regular expressions"). Here's how regular expressions are normally handled in Python:
>>> import re
>>> match = re.match(r'(\w+) (\d+)', 'asd 123')
>>> if match is not None:
...    print 'word is', match.group(1)
...    print 'digit is', match.group(2)
... 
word is asd
digit is 123
>>> match = re.match(r'(?P<word>\w+) (?P<digit>\d+)', 'asd 123')
>>> if match is not None:
...    print 'word is', match.group('word')
...    print 'digit is', match.group('digit')
... 
word is asd
digit is 123
>>> re.findall(r'\d+', 'asd 123 qwe 456')
['123', '456']
>>> re.split(r'\d+', 'asd 123 qwe 456')
['asd ', ' qwe ', '']
>>> re.split(r'\d+', 'asd 123 qwe 456', maxsplit=1)
['asd ', ' qwe 456']
Note that we need one statement to obtain the match object and a second statement to examine it. That's pretty standard Python, but a little annoying sometimes. Here's how the same results are achieved with inrex:
>>> from inrex import match, search, split, findall, finditer
>>> 
>>> if 'asd 123' |match| r'(\w+) (\d+)':
...   print 'word is', match[1]
...   print 'digit is', match[2]
... 
word is asd
digit is 123
>>> if 'asd 123' |match| r'(?P<word>\w+) (?P<digit>\d+)':
...   print 'word is', match['word']
...   print 'digit is', match['digit']
... 
word is asd
digit is 123
>>> 'asd 123 qwe 456' |findall| r'\d+'
['123', '456']
>>> 'asd 123 qwe 456' |split| r'\d+'
['asd ', ' qwe ', '']
>>> 'asd 123 qwe 456' |split(maxsplit=1)| r'\d+'
['asd ', ' qwe 456']
Working with the match object is clearly much easier. There is a limitation, though: it only works for an immediate result. Unlike re.match, which returns a new match object on every call, the inrex match object is a singleton, so you can only work with one result at a time. For simple cases (which are the most common), a singleton match object suffices.
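If you're wondering how syntax like 'asd 123' |match| r'...' can work at all, here is one way such an infix "operator" could be built in plain Python. This is a minimal sketch, not inrex's actual code: the names Infix and _Bound are hypothetical, and the only assumption is the usual trick of overloading | through __ror__ and __or__. It also shows why a singleton operator object can only remember one result at a time.

```python
import re

class _Bound(object):
    """Hypothetical helper: an operator that already holds its left operand."""
    def __init__(self, op, left):
        self.op = op
        self.left = left

    def __or__(self, right):
        # Second half of `left |op| right`: run the function, remember the result.
        self.op.last = self.op.func(self.left, right)
        return self.op.last

class Infix(object):
    """Hypothetical helper: makes `left |op| right` call func(left, right)."""
    def __init__(self, func):
        self.func = func
        self.last = None  # the single most recent result, shared by all uses

    def __ror__(self, left):
        # First half of `left |op| right`: str has no `|`, so Python asks us.
        return _Bound(self, left)

    def __getitem__(self, key):
        # Forward op[1] / op['name'] to the most recent match object.
        # Raises AttributeError if the most recent match failed (last is None).
        return self.last.group(key)

# An illustrative stand-in for inrex's match, built on re.match.
match = Infix(lambda text, pattern: re.match(pattern, text))

first = 'asd 123' |match| r'(\w+) (\d+)'
word_before = match[1]   # 'asd'
second = 'qwe 456' |match| r'(\w+) (\d+)'
word_after = match[1]    # 'qwe': the shared state was overwritten
saved = first.group(1)   # 'asd': the returned match object itself is untouched
```

Because | is left-associative, Python evaluates ('asd 123' | match) first and then applies | to the pattern. In this sketch the operator still returns the ordinary match object, so you can keep it in a variable if you need to hold on to more than one result.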