The Unix "GREP" Utility

CS520 - Introduction to Formal Models

Description of application and problem

We explored the Unix "grep" utility. Grep searches one or more text files for lines that match a given search pattern and prints each matching line.
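
For instance, a command of the following form prints every line of the named files that contains the string "error" (the file names here are made up for illustration):

    grep 'error' notes1.txt notes2.txt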

The use of regular expressions and finite automata

Grep uses regular expressions to specify the search pattern and a finite automaton to implement the search itself.

The first grep, written by Ken Thompson, used a nondeterministic finite automaton (see Hume [1]). In 1976, Al Aho implemented a grep that was more powerful in terms of search patterns and that used a deterministic finite automaton. He called the new utility egrep.

Egrep was faster than grep for simple patterns, but for more complex searches it lagged behind because of the time it took to build a complete finite automaton for the regular expression before the search could even begin. Since grep used a nondeterministic finite automaton, it spent less time building the state machine but more time evaluating strings with it. Aho later used a technique called cached lazy evaluation to improve egrep's performance: deterministic states are constructed only as the input actually reaches them. This took essentially zero set-up time and added just one additional test in the inner loop.
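
The following is a minimal sketch of the caching idea, not grep's or egrep's actual code. The NFA encoding (a dict of per-state transition tables, with None standing for epsilon) is an assumption made for illustration. Each deterministic state is a set of NFA states, built the first time the scan needs it and cached thereafter:

    def eps_closure(states, nfa):
        """All NFA states reachable from `states` by epsilon moves alone."""
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in nfa.get(s, {}).get(None, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return frozenset(seen)

    def lazy_search(nfa, start, accepting, text):
        """Report whether text contains a substring accepted by the NFA."""
        cache = {}                          # (state set, char) -> state set
        start_set = eps_closure({start}, nfa)
        state = start_set
        if state & accepting:
            return True
        for ch in text:
            key = (state, ch)
            if key not in cache:            # build this DFA transition once
                moved = set()
                for s in state:
                    moved.update(nfa.get(s, {}).get(ch, ()))
                # Folding the start set back in lets a match begin anywhere.
                cache[key] = eps_closure(moved, nfa) | start_set
            state = cache[key]
            if state & accepting:
                return True
        return False

    # NFA for the regular expression ab*: 0 --a--> 1, 1 --b--> 1.
    nfa = {0: {'a': {1}}, 1: {'b': {1}}}
    print(lazy_search(nfa, 0, {1}, "xxabbby"))   # True

Each (state set, character) transition is computed at most once, so a simple pattern pays almost nothing up front, while a complex one only builds the deterministic states that the input actually reaches.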

Example of regular expressions in application

The following example of a grep search looks through all the ".txt" files in the current directory for lines containing words built on the root "plant" (such as plant, planted, planting, etc.).

Specifically, the regular expression specifies strings that begin with a space, continue with the word "plant" followed by any number of letters in any combination, and end with a space or a period.
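
One way to write such a command (the bracket expressions are one possible rendering of "any letters" and "space or period") is:

    grep ' plant[a-zA-Z]*[ .]' *.txt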

Limitations and advantages of grep's regular expressions

Grep provides no way of indicating the empty string, so classes of regular expressions whose languages include the empty string may not be directly searchable.

An example of such an expression, in book notation, might be "(epsilon U a)b". Of course, a simple expression like this one could also be written as "b U ab", but for more complex expressions such a conversion might not be tractable.
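
As an aside, egrep's ? operator, which matches zero or one occurrence of the preceding item, can express this particular example directly (the file name is made up):

    egrep 'a?b' file.txt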

Grep and egrep do provide some advantages over standard regular expressions as well. For example, the UNIX man page for regexp describes how to specify the number of occurrences of a substring, with the count limited to 256. Strictly speaking, this adds no expressive power beyond the formal theoretical model, since a bounded repetition can always be expanded into an ordinary regular expression; what it does add is the ability to specify such patterns much more concisely.

Example:

This example returns any line in "file.txt" that contains 256 consecutive occurrences of the substring "000".
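
One plausible way to write the command, using the \{n\} interval notation from the regexp man page together with \( \) grouping, is:

    grep '\(000\)\{256\}' file.txt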

Faster string search algorithms and finite automata

Much faster string-searching algorithms than those currently employed in grep have been developed. Take, for example, the Boyer-Moore algorithm [2], which starts comparing at the end of the search pattern and works backwards. Though this technique has been shown to drastically improve string-search performance, Boyer and Moore's paper indicates that it "is unadvisable" to use the algorithm for non-explicit search patterns such as those given by regular expressions. This is partly because the search operates backwards and partly because the algorithm's speed comes from skipping portions of the input string that are as long as the search string itself. How long is a search string defined by a regular expression? A regular expression can represent strings of many different lengths, and herein lies the problem.
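
To make the skipping concrete, here is a sketch of that bad-character skip in the simplified form later published by Horspool [3]; the Python rendering is purely illustrative:

    def bmh_search(text, pattern):
        """Index of the first occurrence of pattern in text, or -1."""
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return 0 if m == 0 else -1
        # Skip table: distance from a character's last occurrence (final
        # position excluded) to the end of the pattern.  Characters absent
        # from the pattern allow a full shift of m.
        skip = {pattern[j]: m - 1 - j for j in range(m - 1)}
        i = m - 1                            # text index under pattern's end
        while i < n:
            j, k = m - 1, i
            while j >= 0 and text[k] == pattern[j]:   # compare right to left
                j -= 1
                k -= 1
            if j < 0:
                return k + 1                 # matched the whole pattern
            i += skip.get(text[i], m)        # slide past hopeless alignments
        return -1

    print(bmh_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))   # 17

The skip step is where whole stretches of the input go unexamined, and its payoff depends on knowing the pattern's fixed length m, which is precisely what a regular expression fails to supply.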

We saw in our last homework how a DFA can be converted into an NFA that accepts L^R (the reverse of L). This fact could be useful in implementing the Boyer-Moore algorithm with regular expressions as search patterns. It only remains to be seen how the search-string-length problem might be solved.
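
A minimal sketch of that reversal construction, assuming a DFA encoded as a {(state, character): next state} dict purely for illustration:

    def reverse_dfa(delta, start, accepting):
        """NFA for L^R: flip every edge and swap start/accept roles."""
        rev = {}
        for (s, ch), t in delta.items():
            rev.setdefault((t, ch), set()).add(s)   # reversed edge t -ch-> s
        # Old accepting states become the NFA's start states; the old
        # start state becomes the lone accepting state.
        return rev, set(accepting), {start}

    # DFA for ab* (start 0, accepting {1}); its reversal accepts b*a.
    delta = {(0, 'a'): 1, (1, 'b'): 1}
    print(reverse_dfa(delta, 0, {1}))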

Horspool [3] shows that search speed can be enhanced by comparing first the character in the search pattern that occurs least frequently in typical text. Standard search algorithms compare the first letter of the pattern first, and the Boyer-Moore algorithm starts with the last letter. For example, consider the word "EXAMPLE". When searching through a long text document, the standard approach looks for the first occurrence of "E" and then checks whether it is followed by "X", then "A", and so on. Horspool recommends starting with "X", since it occurs much less frequently in English than "E". When an "X" is found, it can then be checked whether it is preceded by an "E" and followed by an "A", and so on. This reduces the number of comparisons spent on the other letters of the search pattern.
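
The following sketch picks the pattern position holding the rarest character according to an assumed frequency ranking (a real implementation would measure frequencies from representative text), scans for that character, and verifies the rest of the pattern around each hit:

    RARITY = "ZQXJKVBPYGFWMUCLDRHSNIOATE"    # rarest first (assumed ranking)

    def rare_first_search(text, pattern):
        """Index of the first occurrence of pattern in text, or -1."""
        # Pick the pattern position holding the rarest character.
        pos = min(range(len(pattern)),
                  key=lambda j: RARITY.index(pattern[j])
                                if pattern[j] in RARITY else len(RARITY))
        rare = pattern[pos]
        i = text.find(rare, pos)            # each hit is a candidate alignment
        while i != -1:
            start = i - pos
            if text[start:start + len(pattern)] == pattern:
                return start                # verify the rest of the pattern
            i = text.find(rare, i + 1)
        return -1

    print(rare_first_search("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))   # 17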

It would seem very difficult to use a regular expression search pattern with a search algorithm based on Horspool's approach, since it would often mean starting somewhere in the expression other than the beginning. That would be like changing the DFA's start state to some internal state and then working towards the old start and accept states simultaneously. Perhaps it could be done; who knows?

Conclusion

In conclusion, we see that regular expressions are a powerful way to specify search patterns and that finite automata are useful in implementing the actual search. Unfortunately, many of the newer high-speed string-search algorithms do not seem well suited to use with regular expressions and finite automata. If we want high-speed searches with complex search patterns, we have some work to do.

References

1. A. Hume, 'Grep wars', 1988 Spring (London) EUUG Conference Proceedings, 237-245 (1988).

2. R.S. Boyer and J.S. Moore, 'A fast string searching algorithm', Comm. ACM, 20, (10), 762-772 (1977).

3. R.N. Horspool, 'Practical fast searching in strings', Software--Practice and Experience, 10, (6), 501-506 (1980).

4. Unix man pages for grep and regexp.