Thursday, March 25, 2010

Regular expression ... expressions!

To kick off this blog I think I'll start with something a little wacky. Krister Hedfors has created a package called inrex, which implements a bunch of regular expression "operators" ("inrex" is short for "infix regular expressions"). Here's how regular expressions are normally handled in Python:
>>> import re
>>> match = re.match(r'(\w+) (\d+)', 'asd 123')
>>> if match is not None:
...    print 'word is', match.group(1)
...    print 'digit is', match.group(2)
... 
word is asd
digit is 123
>>> match = re.match(r'(?P<word>\w+) (?P<digit>\d+)', 'asd 123')
>>> if match is not None:
...    print 'word is', match.group('word')
...    print 'digit is', match.group('digit')
... 
word is asd
digit is 123
>>> re.findall(r'\d+', 'asd 123 qwe 456')
['123', '456']
>>> re.split(r'\d+', 'asd 123 qwe 456')
['asd ', ' qwe ', '']
>>> re.split(r'\d+', 'asd 123 qwe 456', maxsplit=1)
['asd ', ' qwe 456']
Note that we need one statement to obtain the match object and a second statement to examine it. That's pretty standard Python, but a little annoying sometimes. Here's how the same results are achieved with inrex:
>>> from inrex import match, search, split, findall, finditer
>>> 
>>> if 'asd 123' |match| r'(\w+) (\d+)':
...   print 'word is', match[1]
...   print 'digit is', match[2]
... 
word is asd
digit is 123
>>> if 'asd 123' |match| r'(?P<word>\w+) (?P<digit>\d+)':
...   print 'word is', match['word']
...   print 'digit is', match['digit']
... 
word is asd
digit is 123
>>> 'asd 123 qwe 456' |findall| r'\d+'
['123', '456']
>>> 'asd 123 qwe 456' |split| r'\d+'
['asd ', ' qwe ', '']
>>> 'asd 123 qwe 456' |split(maxsplit=1)| r'\d+'
['asd ', ' qwe 456']
Working with the match object is clearly much easier. There is a limitation, though: it only works for an immediate result. Unlike the object returned by the standard re.match, the inrex match object is a singleton, so you can only work with one result at a time. For simple cases (the most common ones), a singleton match object is enough.
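
To see where the singleton limitation comes from, here is a minimal sketch of how an infix operator like |match| could be built from Python's __ror__ and __or__ special methods. This is an assumption about the general technique, not inrex's actual source; the class name InfixMatch and its attributes are made up for illustration.

```python
import re


class InfixMatch(object):
    """Hypothetical infix wrapper around re.match; not inrex's real code."""

    def __init__(self):
        self._subject = None   # left operand captured by __ror__
        self._last = None      # most recent match object (shared state)

    def __ror__(self, subject):
        # Handles `subject | self`: remember the string, hand back self.
        self._subject = subject
        return self

    def __or__(self, pattern):
        # Handles `self | pattern`: run the match and keep the result.
        self._last = re.match(pattern, self._subject)
        return self._last is not None

    def __getitem__(self, key):
        # match[1] or match['name'] delegates to the stored match object.
        return self._last.group(key)


# One shared instance, mirroring the singleton behaviour described above.
match = InfixMatch()

if 'asd 123' |match| r'(\w+) (\d+)':
    first_word = match[1]

# A second match overwrites the shared state; the earlier result is gone.
'qwe 456' |match| r'(\w+) (\d+)'
second_word = match[1]
```

Because every |match| expression writes to the same object, the second match replaces the first, which is why only the most recent result is available at any time.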

9 comments:

  1. I've not seen the syntax for defining new infix operators before -- I assume it's using the recipe from http://code.activestate.com/recipes/384122/?

  2. Is it thread-safe?

  3. Wow, awesome. Thanks for the tip!

  4. From what I can see it's definitely not thread-safe.

  5. You can’t define arbitrary infix operators in Python, but if you make a custom class, you can define how +, -, *, |, >>, ~, etc. work for that class using the __magic__ and __rmagic__ methods.

  6. Just do a dir() on an int to learn more.

  7. @Marius and @Richard

    Thread safety could be added by using threading.local for the result.

  8. I was inspired by the overloaded operators here, and made a library to make such definitions easier. Find it at http://42017203.blogspot.com/2010/03/minioperators-adding-new-operators-to.html.

  9. I wrote the inrex module. Please check out the brand new 'sqldict' module too!

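
Picking up the threading.local suggestion from the comments, here is a rough sketch of what per-thread state could look like, assuming an infix matcher built on the __ror__ and __or__ special methods. The ThreadLocalMatch class is hypothetical and is not how inrex is implemented; it only illustrates the idea.

```python
import re
import threading


class ThreadLocalMatch(object):
    """Hypothetical infix matcher that keeps its state per thread."""

    def __init__(self):
        # Attributes set on a threading.local are visible only to the
        # thread that set them.
        self._state = threading.local()

    def __ror__(self, subject):
        # Handles `subject | self`: store the string for this thread.
        self._state.subject = subject
        return self

    def __or__(self, pattern):
        # Handles `self | pattern`: store the result for this thread.
        self._state.last = re.match(pattern, self._state.subject)
        return self._state.last is not None

    def __getitem__(self, key):
        return self._state.last.group(key)


match = ThreadLocalMatch()
results = {}


def worker(name, text):
    # Each thread reads back only the match it made itself.
    if text |match| r'(\w+) (\d+)':
        results[name] = match[1]


threads = [
    threading.Thread(target=worker, args=('a', 'asd 123')),
    threading.Thread(target=worker, args=('b', 'qwe 456')),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The object is still a singleton within each thread, so the one-result-at-a-time limitation from the post remains; the sketch only keeps threads from overwriting each other's results.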