Evolving a better factorization algorithm

(See also my factor tables of "interesting" numbers")

Given to find a factor of N:

Suppose that there could exist some system of functions which when iterated would yield a sequence of values of Q such that a large percentage of Q's shared a factor with N. If such a system of functions could be found, and if values of Q for which 1 < GCD(N,Q) < N appear sufficiently often then this would constitute a useful method of factorization.

Such a hypothetical algorithm could be summarized as follows:

For each of some number of intermediate variables V[1], V[2], ... V[i] there would exist an initializing function, f[i](N) which would provide the starting value for each variable. The algorithm would thus begin:

V[1] = f[1](N)

V[2] = f[2](N)

...

V[i] = f[i](N)

For each of these variables there would also exist an iterated function which would be used to compute successive values of this variable. Thus, with each iteration the variable V[i] would be recomputed as a temporary variable X[i] = g[i](...), where the arguments passed to the function could be any or all of the intermediate variables.

Having computed new values of each of the intermediate variables, they are copied back into the variables V[i] and a final combining function is called to compute Q = c( v[1], V[2], ... V[i] ) to combine some or all of the intermediate variables, along with whatever constants are appropriate, into a new value for Q.

Finally, GCD( N, Q ) is computed to locate any common factor that might be present in Q.

The entire algorithm would then have this form:

V[1] = f[1](N)

V[2] = f[2](N)

...

V[i] = f[i](N)

repeat

     X[1] = g[1](...)
             X[2] = g[2](...)
             ...
             X[i] = g[i](...)

             for all i
                    V[i] = X[i]

             Q = c( V[1], V[2], ..., V[i] )

             if ( 1 < GCD( Q, N ) < N )
                        print factor and stop

until a factor is found

A trivial example (but one which sometimes yields very large factors in a very short time) would be:

V[1] = 3 // initializing functions
V[2] = 2

repeat

        X[1] = ( V[1] ^ V[2] ) mod N
        X[2] = V[2] + 1
        V[1] = X[1]
        V[2] = X[2]
        Q = V[1] - 1
        F = GCD( Q, N )
   if ( F is a factor, print it and stop )

until factor found

---------------------------------------------------

Given that there might exist such a system of functions which constituted a good general purpose factoring algorithm, how would one go about discovering such a set of functions?

The most obvious way would be to derive the required functions from some underlying theoretical foundation. The drawback with that approach is that such a derived system of functions could not be any better that the theory underlying it.

Another method would be to use genetic programming [ 1 ] to evolve a system of functions with the required properties. In principle such a genetically derived system of functions could not only perform well beyond our current level of theoretical understanding, but could provide us with the opportunity to gain new insights by analysing and understanding the functions that evolved.

A third method would be to generate random sets of functions in the hopes that one would be found which had the desired properties. In fact, this may be the only workable solution since the success of the genetic programming approach depends on there being a gradient in the solution space. If no such gradient exists then there is no way to approach the solution by a series of successively better approximations, and random trial and error is the only viable approach.

The only other method would to be exhaustively test every possible set of functions. This, of course, cannot be done since the number of possible functions is astronomical.

This leaves the genetic algorithm as the method of choice. What follows, then, is an examination of the specifics of implementing the above algorithm in a manner that makes such evolution possible.



Overview

--------

In keeping with the general principles of genetic programming described by Koza [ 1 & 2 ] each function will be represented by a tree of variable size and shape. A terminal node on the tree will represent the value of a variable, or any one of a number of constants. A branching node will represent some mathematical operation performed on the values of the subtrees of that operation. Each tree evaluates to a single value which is then stored in the intermediate variable associated with that tree. The tree which represents Q will be subjected to the GCD test and scored based on the number of iterations required to find a factor. Each system of functions will be given some number of test cases and the score will be the average number of iterations required to find a factor in each test case. A perfect score of 1 would indicate that every N was factored on the first iteration. That would constitute the mythical "instant factorization" method. The score for a given N divided by the size of N would, given enough samples, roughly describe a function that could be used to approximate the time to factor an N of some given size.

Program Architecture

---------------------

A given "program", or system of functions, will consist of some number of variables which will be fixed for any given genetic run. For each such variable there will exist two function trees. The first will be the initializing function, and the second will be the iterating function. In addition, there will be one single combining function for the entire system of functions.

Thus, a run in which the number of variables was set at 5 would have a total of 11 trees, while a system incorporating 10 variables would have 21 trees.

Each individual in the population will have its own unique set of trees.

The order of operation for each test case will be:

1. Apply the initializing function to each variable.

2. Apply the iteration function to each variable, holding

the result in a temporary variable until they are all

computed.

3. Apply the combining function using the new values for

all variables.

4. Try GCD(N,Q), and if no factor found, goto 2.

Test cases will be created by selecting two primes at random and setting N equal to their product.

Function Nodes

--------------

The function nodes in each tree will consist of one of the following functions:

Add: Evaluates to the sum of its two subtrees.

Sub: Evaluates to the difference of subtree 1 minus subtree 2.

Mlt: Evaluates to the product of the two subtrees, mod N (to prevent overflow)

Div: Evaluates to the integer quotient of subtree 1 divided by subtree 2.

If subtree two evaluates to zero, Div evaluates to zero.

Mod: Evaluates to the remainder of subtree 1 divided by subtree 2.

Pow: Evaluates to subtree 1 raised to the power of subtree 2. The result

is computed mod N.

Terminal Nodes

--------------

The terminal nodes in each tree will consist of one of the following values:

Some variable V[i]. There will be one such variable for each tree in the
system of functions.

Some small constant from the set { -3, -2, -1, +1, +2, +3 }

The value of the N being factored.

The integer square root of N.

The iteration counter i.