< Previous | Next >
November 2, 2006 1:41 AM CST by psilord in category Codecraft

Back When I Was Smart

So a good friend, Alan, wandered into my office today babbling about some analysis I had written about something or other in the late 90's or early 2000's. Knowing Alan for many years, I'm happy I got at least that much information out of him. After a few minutes of opening mental doors I thought long sealed by tequila and vomit, I stumbled in the dark over what the hell he had been talking about.

Primes.

Yes, those numbers which have driven people mad as they've ground their lives into meaningless dust trying to understand them. Those numbers which have broken people's souls and left them to rot in the hell of misery--thrice damned by the alienation of friends, family, and reason.

It was back in 2003 when another friend of mine named Mike Turner had asked a simple question about the Chomsky classification of a program he'd seen (which Alan had been referencing in my office):

On Sun, Feb 16, 2003 at 04:22:19PM -0800, Michael Turner wrote:
> Most people are familiar with the prime number 'regular' expression
> tester in perl.
> 
> perl -e 'print "PRIME" if (1 x shift) !~ /^(11+)\1+$/' #
>  (where '#' is any number to be tested)
>
> It can eaisly be proven that this is not a true regular language.
> My question is - what type of language is it?
> 
> The Chomsky hierarchy of languages (with ammedments) has several
> types of lanauages and various machines that recognize them:

etc. etc. etc.

I stared at that tiny perl program like a portrait of a person stares at the wall across from it. I was fascinated by it. I just couldn't believe that it would find primes. But test after test, it magically decided correctly for every number I gave it, well, until the perl interpreter segfaulted....

How in the hell did it work? I just had to know. Here is my analysis, which I've extensively edited from the original to explain some concepts better.

The Theory of Operation for:

perl -e 'print "PRIME" if (1 x shift) !~ /^(11+)\1+$/' #

Let's start off with a few substitutions for what happens when you run it with say, the number 15:

print "PRIME" if 111111111111111 ~! /^(11+)\1+$/;

So, each 1 on the left of the ~! represents exactly that, a one. The fact that there are fifteen of them implies a "1 group by 15 ones" grouping of those ones. This style of grouping will come back to help us out later.

What about the right hand side? The funny looking regex? It means to greedily match the (11+) against the input and then after you've produced the largest (11+) group, copy it next to itself (this is what the backreference does). In particular, pay attention to the + after the \1 backreference. This ensures that there will be "enough" groups of ones to potentially match the left hand side. Later in the examples, you'll see how this alters one of the factors being searched for the target number. For the number 15, this ends up starting at (111111111111111)(111111111111111). The left group is the greedy match of the first (11+) in the regex, and the right group is due to the \1 backreference.

Why (11+) and not something else? It is because the (11+) represents a factor of 2 or more (remember, the ones actually stand for ones, and any collection of ones in a ( ) group represents that actual integer). The number of (11+) groups, which starts at two, due to the backreference, may increase during the backing off of the greedy matcher via the \1 backreference, which will always add another group of the current size when it is determined that the greedy matcher had in fact backed off too far and now there isn't enough groups of ones to match the set of ones to the left of the ~!.

Specifically:

(11+)           represents this: (1 + 1 + [1 + 1 + ...]) >= 2
(11)(11)        represents this: (1 + 1) + (1 + 1) == 4 or (2 + 2) = 4
(1111)          represents this: (1 + 1 + 1 + 1) == 4
(111)(111)(111) == 3 + 3 + 3 = 9 == 3 * 3 = 9.

So you see that, say the number 15, can be represented like this:

(111111111111111)          1 group by 15 ones
(11111)(11111)(11111)      3 groups by 5 ones
(111)(111)(111)(111)(111)  5 groups by 3 ones

Now, a trick we must utilize:
2 * 2 = 4 means the same as 2 + 2 = 4. 

The trick is used for things like (111)(111)(111)(111)(111). This is not only 5 groups by 3 ones (5 * 3), but it is also 3 + 3 + 3 + 3 + 3 = 15. The latter form makes much more sense if you are representing each number in a group as a collection of ones.

A very important thing to realize is that in the regexp language the ( ) operator is merely a grouper so (111)(111) is equivalent to (111111). Of course, this isn't true if you are referencing a specific match, but in this case we are not.

Now some examples. In these examples, you'll see the initial state of the regexp for a particular input number, and then watch the backoff of the greedy + operator and the addition of the currently sized group via the backreference. As you get to see the algorithm play out, you'll gain a feel for how it detects primes. At each step, check to see if the pattern specified equals the pattern desired, if not, consider it false and either backtrack, or if finally there is nowhere to backtrack to, the whole thing becomes false:

Initial Number:        3 
Initial Number Group:  111

Regex Algorithm Start:

(111)(111) false, backtrack
(11)(11) false

Stop backtracking, each group can't get smaller than initially specified (11)!

Input Number             Final Groups
------------             ------------
(111)                    (11)(11)
 3                        4

They don't match so number is prime.

Factors Discovered: none

Initial Number:        4 
Initial Number Group:  1111

Regex Algorithm Start:

(1111)(1111) false, backtrack
(111)(111)   false, backtrack
(11)(11)     true

Stop backtracking, each group can't get smaller than initially specified (11)!

Input Number             Final Groups
------------             ------------
(1111)                   (11)(11)
 4                        4

They match so number is composite.

Factors Discovered: 2 * 2

Initial Number:        5 
Initial Number Group:  11111

Regex Algorithm Start:

(11111)(11111) false, backtrack
(1111)(1111)   false, backtrack
(111)(111)     false, backtrack and activate backreference
(11)(11)(11)   false

Stop backtracking, each group can't get smaller than initially specified (11)!

Input Number             Final Groups
------------             ------------
(11111)                  (11)(11)(11)
 5                        6

They don't match so number is prime.

Factors Discovered: none

Initial Number:        6 
Initial Number Group:  111111

Regex Algorithm Start:

(111111)(111111) false, backtrack
(11111)(11111)   false, backtrack
(1111)(1111)     false, backtrack
(111)(111)       true

Input Number             Final Groups
------------             ------------
(111111)                 (111)(111)
 6                        6

They match so the number is composite.

Factors Discovered: 2 * 3

Initial Number:        7
Initial Number Group:  1111111

Regex Algorithm Start:
Notice how the backtracking is really coming out of each group...

(1111111)(1111111) false, backtrack
(111111)(111111)   false, backtrack
(11111)(11111)     false, backtrack
(1111)(1111)       false, backtrack and activate backreference
(111)(111)(111)    false, backtrack and activate backreference
(11)(11)(11)(11)   false

Stop backtracking, each group can't get smaller than initially specified (11)!

Input Number             Final Groups
------------             ------------
(1111111)                (11)(11)(11)(11)
 7                        8

They don't match so the number is prime.

Factors Discovered: none

Initial Number:        8
Initial Number Group:  11111111

Regex Algorithm Start:

(11111111)(11111111) false, backtrack
(1111111)(1111111)   false, backtrack
(111111)(1111111)    false, backtrack
(11111)(11111)       false, backtrack
(1111)(1111)         true

Input Number             Final Groups
------------             ------------
(11111111)               (11)(11)(11)(11)
 8                        8

They match so the number is composite.

Factors Discovered: 2 * 4

What about a more interesting composite number that doesn't have 2 for any factor?

Initial Number:        9
Initial Number Group:  111111111

Regex Algorithm Start:

(111111111)(111111111) false, backtrack
(11111111)(11111111)   false, backtrack
(1111111)(1111111)     false, backtrack
(111111)(111111)       false, backtrack
(11111)(11111)         false, backtrack and activate backreference
(1111)(1111)(1111)     false, backtrack
(111)(111)(111)        true

Input Number             Final Groups
------------             ------------
(111111111)              (111)(111)(111)
 9                        9

They match so the number is composite.

Factors Discovered: 3 * 3

In pseudo-math that is closer to how people think, here is the explanation of number 9 with a different notation:

9 + 9 = 18 == 9? NO
8 + 8 = 16 == 9? NO
7 + 7 = 14 == 9? NO
6 + 6 = 12 == 9? NO
5 + 5 = 10 == 9? NO
oops 4 + 4 is less than 9, so add another 4 factor!
4 + 4 + 4 = 12 == 9? NO
3 + 3 + 3 = 9 == 9?  YES
3 groups * 3 ones in each group == 9? YES therefore COMPOSITE

And now here is a larger number for fun.

Initial Number:        15
Initial Number Group:  111111111111111

Regex Algorithm Start:

(111111111111111)(111111111111111) false, backtrack
(11111111111111)(11111111111111)   false, backtrack
(1111111111111)(1111111111111)     false, backtrack
(111111111111)(111111111111)       false, backtrack
(11111111111)(11111111111)         false, backtrack
(1111111111)(1111111111)           false, backtrack
(111111111)(111111111)             false, backtrack
(11111111)(11111111)               false, backtrack and activate backreference
(1111111)(1111111)(1111111)        false, backtrack
(111111)(111111)(111111)           false, backtrack
(11111)(11111)(11111)              true

Input Number             Final Groups
------------             ------------
(111111111111111)        (11111)(11111)(11111)
 15                       15

They match so the number is composite.

Factors Discovered: 3 * 5

So, it appears to me that the regexp deconstructs the number to see if it can be either constructed through addition of two identical numbers--n * 2 meaning 2 is a factor, therefore composite, or if that fails, then can it be constructed by a factorization of numbers where the number of ones in a group multiplied by the number of groups produces a factorization of the number. I predict this algorithm will always try to find the minimum number of factorizations required to determine compositness. And the minimum number of factors always found for composite number is two since one of the factors represents the groups, and the other the number of ones in a group. I also predict the smaller factor will be the number of groups.

So, if we used 30, then the first match found will be 15 * 2, and
never something like  2 * 3 * 5.

If we used 50, then the first match found will be 25 * 2, and never
something like  2 * 5 * 5.

Note: I've retconned this summary since the last time you read it. I'm some nobody with some dumb blog nobody reads so I don't feel bad about it. Though, thinking about it, I don't think I'd feel bad retconning this under any circumstances.

So what happens is that this algorithm basically divides the number by 2 at the first step, 3 at the first \1+ invocation, 4, 5, 6, 7, and so on, with the grouping of the ones being one of the factors and the number of ones in each group being the other factor. Since the algorithm really starts with 2*n due to the \1+ aspect of the regex, many steps are wasted performing backtracks even before the grouping system starts becomming mathematically effective.

End of Line.

< Previous | Next >