Integer Factorization and RSA

This article is the final report of CSCI5440: Theory of Cryptography. I summarized the mathematical details of factoring algorithms and RSA.

Security

Preliminaries

Notations

We use \(\mathbb{Z}_N\) to denote the group \(\{0,1,...,N-1\}\) with respect to addition modulo \(N\). Then we define \(\mathbb{Z}_N^*\triangleq\{b\in\{1,...,N-1\} | gcd(b,N)=1\}\). We can proof that \(\mathbb{Z}_N^*\) is a group under multiplication modulo \(N\):

\[ab\triangleq[ab\mod N].\]

Euler's Phi Function

The Euler's phi function is defined as \(\phi(N)\triangleq|\mathbb{Z}_N^*|\), the order of the group \(\mathbb{Z}_N^*\). The value of \(\phi(N)\) is not difficult to compute. First consider the case when \(N=p\) is prime. Then all elements in \(\{1,...,p-1\}\) are relatively prime to \(p\), so \(\phi(N)=|\mathbb{Z}_N^*|=p-1\). Next consider the case when \(N=pq\), where \(p,q\) are distinct primes. If an integer \(a\in\{1,...,N-1\}\) is not relatively prime to \(N\), then either \(p|a\) or \(q|a\). The elements in \(\{1,...,N-1\}\) divisible by \(p\) are exactly the \(q-1\) elements \(p,2p,3p,...,(q-1)p\). The number of elements remaining is therefore given by \((N-1)-(q-1)-(p-1)=(p-1)(q-1)\).

Euler's Theorem

Take arbitrary integer \(N>1\) and \(a\in\mathbb{Z}_N^*\). Then

\[a^{\phi(N)}\equiv1\mod{N}.\]

The proof is based on the fact that for any element \(g\) in a finite group \(\mathbb{G}\) with group order \(|\mathbb{G}|=m\), we have \(g^m=1\).

Integer Factorization

Integer factorization, or just factoring, is one of the oldest problems in mathematics. It has been studied for hundreds of years without an efficient algorithm being found. Based on the hardness of this problem, people constructed the famous RSA crypto-system, which is one of the most important algorithms nowadays.

Definition

Given a composite integer \(N\), the factoring problem is to find integers \(p,q>1\) such that \(pq=N\). So far, no polynomial-time algorithm for factoring has been demonstrated despite hundreds of years of effort.

Factoring problem is believed to be hard, but not always. For example, it is easy to factor \(N\) if it is even. Even if \(N\) is odd, it might still have small prime factors that can be easily found. This suggests that the "hardest" numbers to factor are those having only large prime factors.

The Factoring Assumption

Let GenModulus be a polynomial-time algorithm that, on input \(n\), outputs \((N,p,q)\) where \(N=pq\), and \(p\) and \(q\) are \(n\)-bit primes except with probability negligible in \(n\). Then consider the following experiment for a given algorithm \(\mathcal{A}\) and parameter \(n\):

The factoring experiment \(Factor_{\mathcal{A},GenModulus}(n)\):

  1. Run GenModulus(n) to obtain \((N,p,q)\).
  2. \(\mathcal{A}\) is given \(N\), and outputs \(p',q'>1\).
  3. The output of the experiment is defined to be \(1\) if \(p'\cdot q'=N\), and \(0\) otherwise.

DEFINITION Factoring is hard relative to GenModulus if for all probabilistic polynomial-time algorithms \(\mathcal{A}\) there exists a negligible function \(negl\) such that

\[Pr\left[Factor_{\mathcal{A},GenModulus}(n)=1\right]\leq negl(n).\]

Here, we say a function \(f:\mathbb{N}\rightarrow\mathbb{R}^*\) is negligible if for every positive polynomial \(p\) there is an \(N\) such that for all integers \(n>N\) it holds that \(f(n)<\frac{1}{p(n)}\).

The factoring assumption is the assumption that there exists a GenModulus relative to which factoring is hard. This assumption basically says that no polynomial-time algorithm can efficiently factor a big number.

Generating Random Primes

Now let's talk about how GenModulus(n) works. This function needs to generate two \(n\)-bit prime numbers \(p\) and \(q\). A intuitive way to generate a random \(n\)-bit prime is to repeatedly choose random \(n\)-bit integers until we find a prime. There are two issues we need to consider:

  1. The probability that a uniform \(n\)-bit integer is prime.
  2. How to efficiently test whether an integer is prime.

The first issue is basically whether there is a reasonable change that a random number is prime. Luckily, mathematicians have proofed that this is the case:

THEOREM (Bertrand's postulate) For any \(n>1\), the fraction of \(n\)-bit integers that are prime is at least \(1/3n\).

This means if we iterate \(t=3n^2\) times, then the probability that a prime a not chosen in all iterations is negligible in \(n\):

\[\left(1-\frac{1}{3n}\right)^t=\left(\left(1-\frac{1}{3n}\right)^{3n}\right)^n\leq\left(e^{-1}\right)^n=e^{-n}.\]

For example, if we want a \(512\)-bit prime, we can iterate \(3\cdot512^2=786432\) times. This is not a difficult task for computers.

As for the second issue, the first idea could be to factor the number. But it does not work since \(p\) and \(q\) are too large. The practical algorithms that people use are probabilistic. That is, if the input \(p\) is prime, the algorithm will always output "p is prime"; if \(p\) is composite, then the algorithm has a high probability to output "p is not prime". Examples for these algorithms are the Fermat Primality Test, the Miller-Rabin Test, etc. Although in theory, there is a small change to make mistakes, however, no numbers are actually known that pass advanced probabilistic tests (such as Miller-Rabin) yet are actually composite.

RSA

The RSA Problem

Given a modulus \(N\) and an integer \(e>2\) relatively prime to \(\phi(N)\). If \(m^e\equiv c\mod{N}\) (\(m, c\in\mathbb{Z}_N^*\)), we define \(m\triangleq c^{1/e}\mod{N}\). The RSA problem is to compute \(\left[c^{1/e}\mod{N}\right]\) for a modulus \(N\) of unknown factorization.

The RSA problem is believed to be hard for any \(e\) that is relatively prime to \(\phi(N)\), and it yields a cryptosystem.

RSA Encryption

First we need a GenRSA algorithm that, on input \(n\), outputs a modulus \(N\) that is the product of two \(n\)-bit primes, along with integers \(e\), \(d\) satisfying \(ed\equiv 1\mod \phi(N)\). \(N\), \(p\), \(q\) can be generated by GenModulus. \(\phi(N)=(p-1)(q-1)\). \(e\) can be arbitrarily chosen as long as \(e>1\) and \(\gcd(e, \phi(N))=1\). \(d\) is computed as \(e^{-1}\mod \phi(N)\).

We define a public-key encryption scheme as follows:

The RSA encryption relies on the fact that someone who knows \(d\) can recover \(m\) from \(c\) by computing \(c^d\mod N\). This works because:

\[ \begin{aligned} &\left[c^d\mod N\right]\\ =&\left[m^{ed}\mod N\right]\\ =&\left[m^{ed\mod{\phi(N)}}\mod N\right]\\ =&\left[m^1\mod N\right]\\ =&\left[m\mod N\right]. \end{aligned} \]

The second equation is based on the fact that for any \(g\) in a finite group \(\mathbb{G}\) with group order \(|\mathbb{G}|=m>1\) and any integer \(x\), we have \(g^x=g^{[x\mod m]}\).

The following figure shows the encryption and decryption process. Everyone can encrypt a message \(m\) by raising it to the \(e\)th power. Then the person who knows the private key \(d\) can extract the \(e\)th root of the ciphertext \(c\), by simply raising \(c\) to the power of \(d\).

RSA Encryption

Since \(d\) is the inverse of \(e\), which can only be calculated by knowing the group order \(|\mathbb{Z}_N^*|=\phi(N)\), so the core idea of RSA is that everyone can compute in group \(\mathbb{Z}_N^*\), but the group order is kept secret and only the person who knows the private key can extract the \(e\)th root.

RSA Signature

Another use of RSA is digital signature. Signature is just the other side of encryption. In RSA encryption, we raise the \(e\)th power to encrypt, and extract the \(e\)th root to decrypt. In RSA signature, we extract the \(e\)th root to sign, and raise the \(e\)th power to verify. In this way, only the person who knows the private key can sign, but everyone can verify. The whole process is shown in the following figure.

RSA Signature

Relating the RSA and Factoring

If \(N\) can be factored, then we know \(p\) and \(q\) so we can compute \(\phi(N)=(p-1)(q-1)\) and use this to compute \(d=e^{-1}\mod \phi(N)\) for any given \(e\). In other words, if we can solve factoring problem, then we can also solve RSA problem. So if RSA is hard, then factoring must be hard. As for the other direction, it is still not known. The best we can proof is that computing \(d\) from \(N\) and \(e\) is as hard as factoring. But we still can not say if factoring is hard then RSA is hard, because computing \(d\) from \(N\) and \(e\) might not the only way to break RSA, although we do not know how to do it. So the best known attack for RSA still involves factoring \(N\).

Attacks for RSA

Trial Division (Brute-Force)

Trivially, by exhaustively checking whether \(2,3,...\lfloor\sqrt{N}\rfloor\) divides \(N\), this problem can be solved in time \(\mathcal{O}\left(\sqrt{N}\right)\), which is exponential in \(\|N\|=\lfloor\log{N}\rfloor+1\), the length of the binary representation of \(N\).

An interesting fact is that this naive algorithm is efficient for most of numbers. For example, by simply dividing \(N\) by \(2\) we can solve half of numbers. Formally, we define \(p(m)\) as the proportion of integers that are divisible by one (or more) of the first \(m\) prime numbers. It is easy to derive this formula:

\[p(m)=1-\prod_{k=1}^{m}\left(1-\frac{1}{p_k}\right).\]

Using this formula, we get some surprising facts: about 88% of all integers have a factor less than 100, and 92% have a factor less than 1000. Therefore, if we want to construct a general-purpose factoring algorithm, we can first try trial division, if we are not lucky then we try other advanced algorithms.

Pollard's \(p-1\) Algorithm

Suppose \(p\), \(q\) are prime and \(N=pq\). Fermat's little theorem:

\[a^{p-1} \equiv 1\mod{p}.\]

For all \(a\) relatively prime to \(p\). Suppose \(L=(p-1)k\), then

\[ \begin{aligned} &a^L=a^{(p-1)k} \equiv 1\mod{p}\\ \Rightarrow&a^L-1=pr\\ \Rightarrow&\gcd(a^L-1=pr, N=pq)=p \end{aligned} \]

This means, by simply calculating the gcd of \(L\) and \(N\), we get a factor \(p\). There is only one question left: how to find such an \(L\)?

From fundamental theorem of arithmetic, \(L\) can be written as a product of prime numbers:

\[ L=(p-1)k=p_1^{e_1}\cdot p_2^{e_2}\cdot p_3^{e_3}... \]

In this equation, some prime numbers correspond to \(p-1\), the other correspond to \(k\). Now we try to evaluate \(L\) by multiplying as much prime numbers as we can:

\[ L'=p_1^{e_1'}\cdot p_2^{e_2'}\cdot p_3^{e_3'}... \]

Here we also make the exponents sufficiently large. If \(p-1\) has only "small" prime factors, then \(L'\) includes all the prime numbers that correspond to \(p-1\). So \(L'\) is a multiple of \(p-1\), now we have found an \(L\) that we need:

\[L'=(p-1)k'\]

Pollard's \(p-1\) algorithm is thwarted if \(p\) is a safe prime. (Recall that \(p\) is a safe prime if \((p-1)/2\) is also prime, which means \(p-1\) has a very large prime factor.)

Pollard's Rho Algorithm

Pollard's Rho algorithm is based on the birthday attack, which tells the probability that two students have the same birthday. Formally, we randomly choose \(k\) numbers \(x_1,...,x_k\) in range \([1,365]\), then what is the probability that there exist distinct \(i,j\) with \(x_i=x_j\)? The conclusion of birthday attack is that, the probability is high if \(k=\sqrt{365}\).

Pollard's Rho algorithm is almost the same. We randomly choose \(k\) numbers \(x_1,...,x_k\) in \(\mathbb{Z}_N^*\), what is the probability that there exist distinct \(x, x'\) with \(x=x'\mod p\)? (\(p,q\) are \(n\)-bits primes, \(N=pq\)) The conclusion is also similar: the probability is high if \(k=2^{n/2}=\mathcal{O}(\sqrt{p})\). Finally, if \(x=x'\mod p\) then \(\gcd(x-x'=pr,N=pq)=p\). The complexity is also exponential: \(\mathcal{O}(N^{1/4})\).

Other Algorithms and the Complexity of Factoring Problem

We have talked about three algorithms but all of them are still exponential. Can we do better? In fact, the complexity of factoring problem is still not known. People always represent the heuristic running time using the \(L\)-notation:

\[L_n[u,v]=e^{v(\log n)^u(\log\log n)^{1-u}}.\]

If \(u=0\), then \(L_n[0,v]=(\log n)^v\), which is polynomial. If \(u=1\), then \(L_n[1,v]=(e^{\log n})^v\), which is exponential in the length of \(n\). If \(0<\)u\(<1\), then it is sub-exponential. Now people have proposed some algorithms that are sub-exponential, elliptic curve method and quadratic sieve algorithm can achieve \(L_n[1/2,1+o(1)]\). The fastest known general-purpose factoring algorithm is the general number field sieve, whose expected running time is \(L_n[1/3, \sqrt[3]{64/9}]\).