NAME

re_graph.pl - Graph regular expression

SYNOPSIS

re_graph.pl [-d] [-o out_file] [-x min-x] [-y min-y] regular-expression [string]

DESCRIPTION

The re_graph.pl program graphs regular expressions. The guts of the regular expression engine is a simple state machine. The various states and operations in the regular expression parser can be displayed using a surprisingly simple diagram.

A few notes on what you are looking at:

The nodes Start and Stop denote the beginning and end of the regular expression.

The solid squares denote atoms. Lines indicate the next state. When a line splits, the state machine will take the top line first. If it's path is blocked it will backup and take the next lower line. This is repeated until it finds a path to the end or all paths are exhausted.

Brown boxes indicate a grouping operation, i.e. ().

Green boxes indicate a zero with test. The state machine will perform the test inside the box before moving ahead.

For more information, see the tutorial below.

OPTIONS

-d: Turn on debugging. The debugging output is printed on the console as regular expressions are compiled.
-o out_file: Specify the output file for the images. Default = re_graph_%02d.png. If only a regular expresion is specified, the output will be written to the given file.; If the system is used in graph / execution mode, a series of files will be written using the printf style file name specified.
-x min-x: Specify the minimum size of the image in pixels in the X direction.
-y min-y: Specify the minimum size of the image in pixels in the Y direction.

TUTORIAL

This tutorial shows what happens when a set of sample regular expressions are graphed. This set of regular expressions closely follows the Chapter 4 of ``Perl for C Programmers'' by Steve Oualline.

The set of regular expressions used for this tutorial is:

    test
    ^Test
    ^ *#
    ^[ \t]*#
    ^\s*#
    ([^#]*)(#.*)
    a|b
    ^(([^#"]*|".*")*)(#.*)
    ^((?:[^#"]*|".*?")*)(#.*)
    ^((?:[^#"]*|".*?(?<!\\)")*)(#.*)

Let's take a look at the graphs produced by these expressions.

/test/

This is a very simple expression. It matches ``test'' anywhere on the line. If you look at the graph of this expression, it consists of three nodes ``Start'', ``EXACT <test>'' and ``END''.

The ``Start'' node indicates the start of the regular expression.

The ``EXACT <test>'' node tells the engine that the text must match the text ``test'' exactly.

The ``END'' node indicates the end of the regular expression. If you reach the ``END'' node, a successful match was found.

Flow is a straight line from ``Start'', through the ``EXACT'' check, to end.

/^Test/

A new item was added with this expression, an anchor. It's named BOL (Beginning of line) and shows up as an additional node.

/^ *#/

Now we start having fun. This expression matches anything that consists of a start of line (^), a bunch of spaces ( *), and a sharp (#).

The way the state machine works it that it starts at ``Start'' and works it's way through the nodes. You'll notice that between ``BOL'' and ``EXACT < >'' there's a fork in the road.

The state machine will take the top branch if possible. So if the next character is a space, the system will take the top branch and match the ``EXACT < >'' node. If not, the bottom branch is taken and we wind up at the ``EXACT <#>'' node.

If there's no path to the ``END'', then we don't have a match.

/^[ \t]*#/

This is the same as the previous example, except the space was replaced by a character set. We call the set ``space and tab''. The system translates this into ``\11\40''. It's the same thing, suitable obfuscated for computer work.

/^\s*#/

Again, the middle as been replace by another token. In this case it's the SPACE token which matches any whitespace.

/([^#]*)(#.*)/

This expression introduces us to the grouping operators. They show as the big brown boxes.

The other change is that we use the expression [^#], which matches anything except a hash mark. Perl changes this to a ``ANYOF'' clauses which matches all characters except the single one we don't want.

Note: This ANYOF node overflows the size of the box. This is a know bug.

The graphing program can show how the regular expression excution process. Let's see what happens when we run the command:

    perl re_graph.pl -o tut_06_%d.png '([^#]*)(#.*)' 'text # Comment'

The first image show a yellow arrow pointing to the first set of (), indicating that the system is about to go into $1.

The next yellow arrow points at the ANYOF operator indicating that the regular expression engine is about to look at the [^#] part of the expression.

In the next screen, the yellow arrow has moved to the box representing the second set of (). That means that the first part of matching process is done. The string text is highlighted yellowing, indicating that that much of the string has be matched so far.

Next the yellow arrow points to the ``#'' node. The string is highlighted up to the just before the ``#''. This tells us that engine is about to match the ``#'' in the string against the ``#'' in the regular expression.

The next few images (not shown) show the engine matching the rest of the string.

/a|b/

Now we introduce the concept of a selection of two different atoms. Note that the branch arrows are drawn smaller to make them stand out.

/^(([^#``]*|''.*``)*)(#.*)/

See the book for what this regular expression tries to match.

This expression adds nested grouping, and some additional stuff that we've seen before.

/^((?:[^#``]*|''.*?``)*)(#.*)/

This is like the previous example, except what was the $2 grouping has been replaced by the ``Group no $'' operator (?:...). Notice that the line around the second group has disappeared and what was $3 is now $2.

(In future versions of this graphing tool we will graph the invisible group operator. We just did figure out how to do it yet.)

Also notice the use of the ``*?'' operator. Remember when going through the nodes, when a branch is encountered, the system will try and take the top one first.

/^((?:[^#``]*|''.*?(?<!\\)``)*)(#.*)/

The grand finale. One new type of node has been introduced: (?<!\\). This is the negative look-behind. It's the red box on the screen. When the state machine sees this, it matches the text behind the current location marker against the indicated text and if it fails then a match against the next node is possible. (Boy does this not translate into English well.)

Basically the clause in question looks for a double quoted string (``xxx''), but will ignore a double quote it's escaped (``xxx\''yyy``).

BUGS / LIMITATIONS

This will not graph all the regular expressions. Some of the more advanced features of the engine are just not handled.

We currently ``graph'' the ``group, no $1'' (?:..) operator by displaying nothing. A box should be put around the expression.

The boxes drawn by the program are a fixed with not related to the size of the text they contain. Text can easily overflow the box.

The system is UNIX/Linux specific. This is caused by only one small section of code should anyone want to port this to a braindamaged operating system.

Better use of color can be made. Specifically all the nodes do not have to be green. Come to think of it they call don't have to be rectangles either.

Sometimes the lines connecting one section to another take some strange twists.

LICENSE

Licensed under the GPL. (See the end of the source file for a copy).

AUTHOR

Steve Oualline (oualline (at) www.oualline.com)