Prime Numbers, Diffie Hellman and RSA

Basics

RSA and Diffie Hellman are considered strong because it is thought very difficult or impossible to factor products of 2 very large prime numbers. "Very large" here means binary numbers with between 1000 and 8000 digits. A number with a thousand binary digits would have around 300 decimal digits when represented to base 10, because 210 = 1024 which is approximately 10 3. Alternatively, a 1000 digit binary number could be written as 250 hexadecimal characters. To multiply our 2 very large primes We will use the hardware binary integer multiplication instructions in our CPU combined with long multiplication, to multiply 2 1024 bit binary prime numbers together. We will need a programming library which can handle large integers, because these will overflow the normal sized 32 or 64 bit integers provided by the integer types in most programming languages. This isn't going to be as fast as multiplying 2 32 bit numbers together to make a 64 bit one, but it can be done quick enough.

The public and private RSA cryptographic keys use these 2 prime numbers together with some other numbers, chosen by convention or derived from these 2 primes. A prime number has no factors other than itself and 1. A factor F of number N divides N exactly such that the remainder you get on dividing N by F is zero, or N mod F = 0.

Factorisation

How do we factorise a number N ? One approach involves trying every number I smaller or equal to the square root of N, starting with 2, and seeing if N mod I = 0. We add one to I each time we try this, printing out any factors we find. We stop when I2 is greater than N. This approach wastes some time. We could speed the process up if we only tried dividing N by prime numbers starting with 2, and stopping at the first prime number greater than the square root of N. This is because e.g. if 2 and 3 are not factors of N, then multiples of 2 or 3 can't be either. So there is no point trying to factorise N with any non-primes, or compound numbers.

So if we have a list of primes to start with, this procedure can be made faster than if we have to check all the numbers we are going to try as factors to see if these are prime. This gets more difficult when we want to factorise very large numbers, because we won't have enough computer memory to store all primes greater than a few hundred billion.

How to make a million dollars

Find out a way to factorise the product of 2 randomly chosen primes, with each of these primes having about a thousand binary digits, so that the product is about 2000 bits long.

It sounds simple but no-one knows how to do this in an affordable amount of time on any likely computer or collection of computers.

The first person who solves this problem would, I imagine, easily be able to sell a book and public-speaking services for the amount suggested above, because the means used to secure e-commerce on the Internet depend upon this problem not being solved. Of course you might just happen to choose a random number and get very, very lucky and factorise an instance of such a product. But your chances of winning a lottery are astonishingly good compared to this, and you would only through unprecedented luck compromise one single cryptographic key, and not the entire RSA system.

Why mention the $1,000,000 ? The incentive to the person who could publicly show how to crack RSA at reasonable cost is great enough that we can assume that many very talented mathematicians are likely to have explored this problem in great depth, and failed to find and publish a solution. From this we can conclude either that whoever might conceivably have succeeded in this has somehow been given greater incentives to keep their discovery secret than what they would gain from publication, or that nobody has succeeded and so RSA remains unbroken.

Keeping something of this magnitude secret for a long time is very exceptional. One exception is the long period before knowledge of the World War 2 Enigma code break was publicly disclosed in the 1970ies so keeping an important development in cryptanalysis secret for a very long time is not entirely unknown. But the historical wartime circumstances surrounding the Enigma break were very different from the much more open environment in which most peacetime mathematical investigation occurs.

Another way to earn a million dollars would involve proving the Reimann Hypothesis and so winning prize on offer for this achievement from the Clay Mathematical Institute . By resolving the instances of primes to a computable function, the consequences of a Reimann Hypothesis proof might conceivably result in a way of cracking RSA. More on this below.

How come there are there so many primes ?

A simple and elegant proof that there are an infinite number of primes was known to the mathematician Euclid in 300 BC. Imagine that the number of primes were finite. Then there would be a highest prime, P. Given that we could then create a product of all the primes:

R = 2.3.5.7.11 ... .P 

( . here meaning multiply, and = meaning is equal to)

Then either R+1 is itself a prime, or it is a product of primes greater than P. How can we be sure ? We know that all of the primes between 2 and P are factors of R. If 2 divides into R, then the remainder you get when you divide 2 into R+1 is 1

( i.e. R mod 2 = 1)
. The same applies to all the prime factors of R, they must all give a remainder of 1 when divided into R+1. So R+1 can have no prime factors between 2 and P. A number that has no prime factors is itself a prime by definition.

But having a highest prime P means we known how we can either construct a higher prime R+1, or that if R+1 isn't prime this could only be because it has prime factors greater than those in the set of primes between 2 and P, in which case the number R+1 must be the product of primes higher than P. Either way, this proves that P can't be the highest prime. But this proof must still work whichever higher prime we choose to take the place of P. Therefore we have proved that there can be no highest prime, so that there are an infinite number of primes.

Frequency of primes

Wikipedia has an article on Prime numbers. This has a section on prime frequency which states:

Counting the number of prime numbers below a given number

Even though the total number of primes is infinite, one could still ask "Approximately how many primes are there below 100,000?", or "How likely is a random 20-digit number to be prime?".

The prime counting function π(x) is defined as the number of primes up to x. There are known algorithms to compute exact values of π(x) faster than it would be to compute each prime up to x. Values as large as π(1020) can be calculated quickly and accurately with modern computers. Thus, e.g., π(100000) = 9592, and π(1020) = 2,220,819,602,560,918,840.

For larger values of x, beyond the reach of modern equipment, the prime number theorem provides a good estimate: π(x) is approximately x/ln(x). Even better estimates are known.


Table of π(x), x / ln x, and Li(x)

(Source: http://en.wikipedia.org/wiki/Prime_number_theorem

The table compares exact values of π(x) to the two approximations x / ln x and Li(x). The last column, x / π(x), is the average prime gap below x.

x π(x) π(x) - x / ln x π(x) / (x / ln x) Li(x) - π(x) x / π(x)
10 4 -0.3 0.921 2.2 2.500
102 25 3.3 1.151 5.1 4.000
103 168 23 1.161 10 5.952
104 1,229 143 1.132 17 8.137
105 9,592 906 1.104 38 10.425
106 78,498 6,116 1.084 130 12.740
107 664,579 44,158 1.071 339 15.047
108 5,761,455 332,774 1.061 754 17.357
109 50,847,534 2,592,592 1.054 1,701 19.667
1010 455,052,511 20,758,029 1.048 3,104 21.975
1011 4,118,054,813 169,923,159 1.043 11,588 24.283
1012 37,607,912,018 1,416,705,193 1.039 38,263 26.590
1013 346,065,536,839 11,992,858,452 1.034 108,971 28.896
1014 3,204,941,750,802 102,838,308,636 1.033 314,890 31.202
1015 29,844,570,422,669 891,604,962,452 1.031 1,052,619 33.507
1016 279,238,341,033,925 7,804,289,844,393 1.029 3,214,632 35.812
1017 2,623,557,157,654,233 68,883,734,693,281 1.027 7,956,589 38.116
1018 24,739,954,287,740,860 612,483,070,893,536 1.025 21,949,555 40.420
1019 234,057,667,276,344,607 5,481,624,169,369,960 1.024 99,877,775 42.725
1020 2,220,819,602,560,918,840 49,347,193,044,659,701 1.023 222,744,644 45.028
1021 21,127,269,486,018,731,928 446,579,871,578,168,707 1.022 597,394,254 47.332
1022 201,467,286,689,315,906,290 4,060,704,006,019,620,994 1.021 1,932,355,208 49.636
1023 1,925,320,391,606,818,006,727 37,083,513,766,592,669,113 1.020 7,236,148,412 51.939

How do you find a very large and unpredictable prime number ?

Methods of generating all prime numbers with a range of interest, e.g. between 10295 and 10305 would run out of memory or time. Proving a very large random number to be prime turns out to be much easier than factorising the product of 2 very large primes.

Extrapolating the left and right columns of our prime frequency and gap table above, we can roughly estimate that the average gap between primes of 300 or so decimal digits will be around 660 - so we would need to test on average about 330 odd random numbers to find one prime of this magnitude.

We're going to need 2 primes of around this size when we generate a key, but they shouldn't be very close to each other, as this would make both of them close to the square root of their product. To do this for each prime, we are going to generate a large random number as our candidate prime, C, make sure its last bit is a one, so that it is an odd number, and then test if it is prime or not by using the Rabin Miller test. It would be insecure to accept a number C to test for primality from an untrusted source. The reason for this is that the Rabin Miller test is probabilistic. We can run it long enough quicky enough to give us a degree of confidence in a random number we have generated as good as we require, e.g. so that we know that there is a smaller chance of the random number not being prime than of our computer being hit by an asteroid within the next few seconds or so. However if we use a random number provided from an untrusted source, our attacker might have chosen a number to send us which has been carefully selected to pass an unusually large number of repeated attempts to find proof that the number isn't prime, as a very few composite numbers can take longer to be proved composite than others.

Of course the Rabin Miller test is very likely to tell us that our first candidate prime C is composite, in which case we have to choose another number and test that and continue until we have chosen our long and unpredictable random number which is prime. Fortunately there exist a great enough frequency of numbers which are prime and the test is fast enough that we can easily afford to run it often enough quickly to generate a very large prime number with unpredictability or entropy as good as that of our random number generator. One approach is to keep generating random numbers until we find one that is prime, or we might alternatively keep adding 2 to the first odd random number generated and testing it until we find a prime.

The Rabin Miller test depends upon testing a candidate prime with a number of small prime "witnesses" to the candidate being prime. Each witness used to test a candidate number which does not find the candidate to be composite gives the candidate a probability of 1/4 of it not being prime. 2 witnesses give probability of 1/16 of compositeness, 3 a probability of 1/64 and so on. Generating all small primes less than a few billion to use as possible witnesses can be accomplished quickly using a simple algorithm known as Eratothenes' Sieve or a more complex but faster variant of this known as Atkin's Sieve .

Using OpenSSL to check if a large number is prime

The openssl command line tool enables us to try various cryptographic primitive operations from the command line. It has a prime option, to test if a large number is prime or not, using a set of Rabin Miller tests. Example:

rich@saturn:~$ openssl prime
92938982342342198347213987237213423914192384241397283498234172139942317423921392173428317519
2D9FE6C6431358699F8490E1B3F2DF525DF375B5FC788690DF09B96E219DC46ED86CA74611D4F is not prime
rich@saturn:~$ openssl prime
92938982342342198347213987237213423914192384241397283498234172139942317423921392173428317521
2D9FE6C6431358699F8490E1B3F2DF525DF375B5FC788690DF09B96E219DC46ED86CA74611D51 is prime

This and other commands usable with openssl are documented in theOpenSSL Command-line HOWTO.

Modular exponentiation

The Diffie Hellman mathematics is similar to RSA. The operation of both requires the use of modular exponentiation. A slow and memory inefficient version of this would involve multiplying one number by itself by the number of times given by the exponent minus 1, and then taking an integer division remainder or modulus.

E.G. ag mod p

multiples a by itself g - 1 times, and takes the remainder when dividing the result by p. A simple example of this is if a is 5, g is 3 and p is 11 then

E.G. 53 mod 11 = 4 
( because 53 is 125, and 112 is 121, so the remainder is 125 - 121 = 4

A much faster version giving the same result involves reducing the number of multiplications by using repeated squaring, based on the relation:

if b is even

ab = (a(b/2))2,
where b/2 is the integer (no fractions) division, 
Example: 28 = (24)2 = 256 

if b is odd

ab = (a(b/2))2 . a 
Example: 29 = (24)2.2 = 512 
.

This is done recursively e.g.

237 = (218)2.2
218 = (29)2
29 = (24)2.2
24 = (22)2
22 = 2.2

Exponentiating this way involves 7 multiplications instead of 36. This number of multiplications is about log2g if g is the exponent. So instead of needing to multiply g - 1 times, we need to multiply by about 3 times the number of decimal digits in the number g, or about 1024 times if g is a 1024 bit binary number which is a 300 digit decimal one.

However, if we used this optimisation naively we would still end up with a number with so many digits (before taking the remainder on dividing by p) that we wouldn't have enough memory to store it, and the very long multiplications would then get too slow.

A second optimisation we can combine with this one is by repeatedly taking the remainder on division by p, which will prevent the intermediate results from growing too large. We can combine this second optimisation with the first because if:

f = g . h  and  f mod i = j
then (( g mod i ) . (h mod i)) mod i = j

This identity follows because:

if g = k + q where k = g - (g mod i) and q = g mod i
and h = m + n where m = h - (h mod i) and n = h mod i, because
f = g . h = (k + q).(m + n) = k.m + q.m + k.n + q.n . 

In this expression, as k and m are exact multiples of i,

k.m mod i = q.m mod i = k.n mod i = 0 . Therefore
j = g.h mod i = q.n mod i  

Example: start with f = 221, g = 13, h = 17 and i = 7.

221 = 13 . 17  and  221 mod 7 = 4  
( so j is 4 because 31 . 7 = 217 and 221 - 217 = 4)
13 mod 7 = 6   and   17 mod 7 = 10   and   (6 . 10) mod 7 = 4

In this example, we didn't need to get the remainder on dividing 221 by 7 and we didn't need to multiply 13 and 17 together. We got the same result keeping our numbers to multiply (6 and 10) smaller, because we used the remainder on division by 7 three times instead of once.

So for example, our terms g and h might both be the same number when we were multiplying to get the square of an intermediate value in our previous optimisation e.g: a(b/2). Here we need get the remainder

k = (a(b/2)) mod i once, then k . k mod i = m

will be the same result as
((a(b/2))2) mod i = m

The result of combining these optimisations into an algorithm that reduces both the number of multiplications and the size of these, is that modular exponentiation is very fast. This is also considered to be a one way function, in the sense that in the equation: f mod i = j knowledge of the result j and the exponent i does not enable an attacker to discover f , given suitable initial values for f, i and j.

The Diffie Hellman Key Exchange Protocol

This messaging protocol enables Alice and Bob to establish a shared secret suitable for one-off use to support a communication session without having any prior knowledge of each other. It doesn't allow their shared secret key to be used for authentication. However, it can be combined with other public-key protocols, e.g. by Bob and Alice signing the shared secret established using Diffie-Hellman using their semi-permanent GPG secret keys. Then if Alice and Bob both trust each others' keys as genuinely belonging to the other their signatures on the Diffie Hellman (DH) session key also authenticate them to each other.

The DH one-off shared secret or session key allows for what cryptographers call "perfect forward secrecy", which means that compromise of a long-duration key does not compromise previous session keys derived using this.

The Diffie Hellman protocol is probably best explained in more detail by the Wikipedia article which was copied here on 13 Feb 2007.

RSA Explained

This requires some more mathematics, probably most clearly explained in the Wikipedia RSA article from which a copy taken on 9 Feb 2006 is available here. RSA enables enables Bob and Alice to send encrypted and signed datagrams where a two directional communication session does not exist, and confirm identities in connection with use of Diffie Hellman session keys.

Quantum computers and RSA cracking

Conventional computers perform computations on bits. A bit has a value of one or zero. Quantum computers depend upon the validity of the branch of physics known as quantum theory. These use qubits, which can be a 0 or a 1, or a quantum superposition of both of these states. Shor's algorithm is theoretically capable of cracking RSA given a sufficiently large quantum computer.

D-Wave have recently announced a quantum computer with as many as 16 qubits. For some light amusement here is a userfriendly cartoon honouring this event.

Quantum Cryptography

The same area of physics which might one day be used to break RSA might also one day replace it for some purposes. Quantum cryptography derives from quantum states e.g. of single polarised photons transmittable over fibre optics, being dependant upon whether they are observed or not. Unfortunately this technique requires dedicated optic-fibre or free-space optical point to point communication links and expensive equipment. It is thought that this technique will not work over a routed network.

The Riemann Zeta function and prime factoring

'Those who believe mathematics holds the key to the Universe might do well to ponder a question that goes back to the ancients: What secrets are locked within the primes?"'

E. Klarreich, "Prime Time" (New Scientist, 11/11/00)

Reimann zeta function  picture

The zeta function (above) holds inside it the secrets of the primes. How come? Like any function, all it does is turn one number into another number. If n is 3, for example, you add up the infinity of terms in the formula to find out that (3) is roughly 1.2. As the formula shows, the zeta function can also be written as a product of infinitely many terms, each based on one prime number.

The true significance of this function emerges if you feed it complex numbers such as 24 + 13i — combinations of ordinary, real numbers and so-called imaginary numbers (where i is the square root of -1). Although they may sound abstruse, complex numbers are used to simplify practical calculations in everything from engineering to quantum mechanics. By plotting out real parts along the x-axis and imaginary parts along the y-axis, complex numbers can be visualised as points on a two-dimensional plane (right).

Reimann zeta function zeroes along x=1/2 vertical

For certain complex numbers, the zeta function is zero. All the known "zeros" lie along a line in the complex plane, with real parts equalling 1/2. Riemann's hypothesis is that every zero lies on this line. If they do, Riemann proved, the prime numbers must show up as if they are picked at random, but still following an underlying distribution.

Many other investigations in number theory also depend on the truth of the hypothesis, which is why mathematicians long to pin down the zeros of zeta.

The whole article the above extract is taken from is available here.

The TV thriller "prime suspect" was based on the idea that proving the Reimann hypothesis would result in breaking all Internet security - presumably by making it easier to factor products of large primes. Interestingly enough, there are deep connections between the Reiman zeta function, prime numbers and also quantum physics.

Further Reading

RFC4419 Diffie-Hellman Group Exchange for the Secure Shell (SSH) Transport Layer Protocol