22 April 2011

Introduction to regular expressions

The technology that has had the most impact on my workflow is definitely "regular expressions".

I discovered them at the end of the '90s, when I was working on converting a database output into a set of about 6000 static HTML pages. At the time, the editor of choice on the Mac was BBEdit from Bare Bones Software, but its free "lite" version, "BBEdit Lite", was also immensely popular. BBEdit Lite has now been replaced by TextWrangler and, just like its predecessor, TextWrangler can be used without paying a user license fee*.


  1. What are regular expressions?


    Regular expressions are a "search" function on steroids. Regular expressions were created to find patterns in strings. They can find simple patterns like the word "pattern" in this text, or more complex patterns like "a string that starts with 'pa', followed by a letter that's repeated twice, followed by any three characters that are not 'space' or '@' or '^' and followed by a space".

    This document uses its first two paragraphs (the paragraphs in italics, above) as a testing ground. Paste them into your favorite text editor that supports regular expressions (I use TextWrangler for all the descriptions, so you might want to use it too), then open the search window, check the "grep" box at the bottom and search for:

    re[^ ]*

    You should see colors appearing while you type the search terms.

    What that expression means is:
    r followed by e followed by a group of characters that are not a space, or by nothing.

    Hit Next and see what you get, then hit Cmd+G and see what you get. If you start from the top of the paragraph you should have 8 "matches".
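
    If you don't have TextWrangler at hand, the same search can be sketched in Python with its standard "re" module. The sample sentence below is a made-up stand-in, not the test paragraph, so the number of matches differs from the 8 mentioned above.

```python
import re

# Made-up sample text (an assumption, not the original test paragraph).
sample = "Regular expressions are really great for repeated searches."

# r, then e, then zero or more characters that are not a space.
matches = re.findall(r"re[^ ]*", sample)
print(matches)       # ['ressions', 're', 'really', 'reat', 'repeated']
print(len(matches))  # 5
```

    Note that the search is case sensitive here: the capital R of "Regular" does not count as an r.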


  2. Normal characters


    Most characters represent themselves in regular expressions (regex), like a "normal" search.

     → r means r and e means e, " " means a space. In the same sequence. No magic here.


  3. Special effects


    Some characters have special effects:

     → [ starts a group of characters
     → ] ends that group
     → ^ means "not" when it comes right after the [ that starts a group
     → * means "zero or more of what just came"
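
    Each of these special characters can be tried on its own. Here is a small sketch in Python; the sample string "tree top" is just an illustration.

```python
import re

text = "tree top"  # made-up sample string

# [et] is a group: it matches one character that is either e or t.
in_group = re.findall(r"[et]", text)
# [^et] is the same group negated by ^: one character that is neither e nor t.
not_in_group = re.findall(r"[^et]", text)
# e* means zero or more e, so re* grabs r plus as many e as follow it.
r_then_es = re.findall(r"re*", text)

print(in_group)      # ['t', 'e', 'e', 't']
print(not_in_group)  # ['r', ' ', 'o', 'p']
print(r_then_es)     # ['ree']
```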

    So, our simple regular expression means:

    "look for any string that has an r followed by an e followed by zero or more characters that are not a space."

    Now, what if you need to find characters like ^, [, ] or *?


  4. Cancelling special effects


    When you want to find characters that have a special effect without "triggering" that special effect, you put a "\" in front of them:

     → \* means the character *
     → \[ means the character [


    And since the character "\" has the special effect of removing the special effect of a character that has a special effect... then:

     → \\ means the character \

    etc.

    By the way, the character . has the special effect of matching "any one character" so if you're looking for a period, then you really want to look for the \. string...

    Examples:


    The regular expression ". " (. followed by space) will match any one character followed by a space. There are 78 strings that match this pattern in the paragraph.

    The regular expression "\. " (\ followed by . followed by space) will match any period followed by a space. There are only 2 strings that match this pattern in the paragraph.

    The regular expression "re*." (re followed by * followed by .) will match any string that is composed of an r, followed by zero or more e's, followed by any one character. There are 22 matches in the paragraph. Verify that you understand them all.

    The regular expression ".e\*\." (. followed by e followed by \ followed by * followed by \ followed by .) will match the 4-character string ee*. that you find at the end of the paragraph.
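
    The difference between the escaped and unescaped versions is easy to check in Python. The sample string is a short made-up one, so the counts are not the ones given above for the test paragraph.

```python
import re

sample = "Stop. Go on. The end ee*."  # made-up sample string

# Unescaped dot: any one character followed by a space.
any_then_space = re.findall(r". ", sample)
# Escaped dot: a real period followed by a space.
period_then_space = re.findall(r"\. ", sample)
# Any character, then e, then a real * and a real period.
literal_star = re.findall(r".e\*\.", sample)

print(any_then_space)     # ['. ', 'o ', '. ', 'e ', 'd ']
print(period_then_space)  # ['. ', '. ']
print(literal_star)       # ['ee*.']
```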


  5. Triggering special effects


    Some characters work the other way round: by themselves they do not have a special effect but if you stick the \ character before them, then their special effect is triggered.

     → t means t but \t means tabulation
     → r means r but \r means line break (specifically "carriage return")
     → s means s but \s means all sorts of white space, which includes spaces, tabulations, line breaks etc.

    If the character does not have a special effect then using \ has no effect.

     → i means i and \i too means i

    Such sequences (\ followed by a character) are usually called "escape sequences".
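
    Here is a small Python sketch of these escape sequences, run on a made-up string. It also shows one place where editors and languages differ: Python refuses an unknown letter escape such as \i instead of treating it as a plain i.

```python
import re

# Made-up sample: contains a tab, a CR+LF line break and a space.
line = "name\tvalue\r\nnext line"

tabs = re.findall(r"\t", line)      # \t finds the tabulation
returns = re.findall(r"\r", line)   # \r finds the carriage return
words = re.split(r"\s+", line)      # \s+ cuts on any run of white space

print(tabs)     # ['\t']
print(returns)  # ['\r']
print(words)    # ['name', 'value', 'next', 'line']

# Python is stricter than the rule above: \i is rejected as a bad escape.
try:
    re.compile(r"\i")
    unknown_escape_ok = True
except re.error:
    unknown_escape_ok = False
print(unknown_escape_ok)  # False
```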


  6. Remembering matches


    If you want to "memorize" a match, for later use in the expression or in the "replace" field, then you put the corresponding expressions between parentheses:

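     → (re)[^ ]+ will produce much the same matches as above (+ means "one or more of what just came", so unlike * it needs at least one character after re), but will memorize the re part and not the rest.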

     → re([^ ]+) will produce the same matches as above, but will not memorize the re part and instead will memorize the rest.

     → (re)([^ ]+) will produce the same matches as above and will memorize the 2 parts separately.
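
    In Python, the memorized parts can be inspected one by one, which makes it easy to see what each pair of parentheses caught. The phrase "a regular word" is only an example.

```python
import re

# Two groups: (re) is group 1, ([^ ]+) is group 2.
m = re.search(r"(re)([^ ]+)", "a regular word")

whole = m.group(0)   # the whole match
first = m.group(1)   # what (re) memorized
second = m.group(2)  # what ([^ ]+) memorized

print(whole)   # regular
print(first)   # re
print(second)  # gular
```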


  7. Using memorized matches


    Now that the matches are remembered, you can use them. Use \1 to refer to the first memorized string, \2 to refer to the second memorized string etc...

     → (e)\1\*\. will match the "ee*." string that you find at the end of the text.

     → search for (re)([^ ]+) and put \2\1 in the Replace: field:

    (re) is the first group
    ([^ ]+) is the second group

    \2\1 will thus put the second group before the first group.

    The term "regular" matches the pattern: (re) matches re and ([^ ]+) matches gular. The replaced string will thus be "gularre".

     → search for (re)([^ ]+) and put \1\1_\[\2\] in the Replace: field:

    (re) is the first group
    ([^ ]+) is the second group

    \1\1_\[\2\] will put 2 instances of the first group, then an underbar, then [, then the second group, then ].

    In the case of "regular", we'd have the following replacement string:
    rere_[gular]
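
    The same replacements can be sketched in Python with re.sub. One assumption to flag: in Python's replacement string the brackets are written without a backslash, because Python would copy that backslash into the result.

```python
import re

pattern = r"(re)([^ ]+)"

# \2\1 puts the second group before the first one.
swapped = re.sub(pattern, r"\2\1", "regular")
# Two copies of group 1, an underscore, then group 2 between brackets.
doubled = re.sub(pattern, r"\1\1_[\2]", "regular")
# \1 inside the search expression: an e followed by the same e, then *.
repeated = re.search(r"(e)\1\*\.", "see*.").group(0)

print(swapped)   # gularre
print(doubled)   # rere_[gular]
print(repeated)  # ee*.
```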


  8. That's only the beginning...


    What you need to check next is which other characters have special effects. If you use TextWrangler, it is all in the user manual, page 133, Chapter 8 (Searching with Grep), or you can open the Help with Cmd+? and you'll find a relevant link right away.

    TextWrangler's regex is pretty standard, so once you're used to it there, you can use it in other editors too. If what works in TextWrangler does not work there, check the idiosyncrasies of the editor you use.

    Now, take a real-world document and try to transform it by using a few regular expressions. A typical use case for a translator would be to convert a TMX file into a two-column tab-separated data set, or the opposite: to convert a two-column tab-separated data set into a TMX file. If you manage to do that, you've created your first alignment-based TMX converter!
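
    As a starting point for that exercise, here is a rough sketch in Python using only what this document covers. It rests on several assumptions: the data is English and French, the language codes are "en" and "fr", the text contains nothing that needs XML escaping, and only the <tu> elements are built (the TMX header and the surrounding <body> are left out).

```python
import re

# Made-up two-column data: source text, a tab, target text.
rows = "Hello\tBonjour\nThank you\tMerci"

# Group 1 is the text before the tab, group 2 the text after it.
tu = (r'<tu><tuv xml:lang="en"><seg>\1</seg></tuv>'
      r'<tuv xml:lang="fr"><seg>\2</seg></tuv></tu>')
tmx_units = re.sub(r"([^\t\n]+)\t([^\t\n]+)", tu, rows)
print(tmx_units)

# The opposite direction: memorize both <seg> contents, join them with a tab.
back = re.sub(r'<tu><tuv xml:lang="en"><seg>([^<]*)</seg></tuv>'
              r'<tuv xml:lang="fr"><seg>([^<]*)</seg></tuv></tu>',
              r"\1\t\2", tmx_units)
print(back == rows)  # True
```

    A real TMX file has more attributes and more white space than this, so the expressions will need adjusting to the file you actually have.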


* I try to use or discuss free software when possible because I think that is the way to go. People who want to use a free text editor on the Mac can use Aquamacs. It comes with all the goodness of Emacs (including the same regular expressions) and looks and feels a lot like a "normal" Mac text editor.