Big Integers: Multiplication
Devlog 2: Relearning second grade mathematics
In the last blog post, we discussed the internal representation of big integers and basic addition and subtraction. In this post, I will discuss the wonderful world of multiplication algorithms for big integers. This post will be somewhat theory-heavy, so buckle up and let’s dive right in.
Naive Method
Consider the long multiplication1 one learns in second grade. We will be using this as a starting point in multiplication algorithms. Let $a = 123$ and $b = 456$, and we will compute the product $a \cdot b$ using long multiplication:
$$ \large \def\arraystretch{1.5} \begin{array}{ccccc} && 1 & 2 & 3 \\ \times & & 4 & 5 & 6 \\ \hline && 7 & 3 & 8 \\ & 6 & 1 & 5 & \\ 4 & 9 & 2 & & \\ \hline 5 & 6 & 0 & 8 & 8 \end{array} $$
This algorithm works by multiplying each digit of $a$ with each digit of $b$ from right to left. Then, we shift the digits of each intermediate result to the left as appropriate, add the results, and carry as needed.
More precisely, let $a_i$ and $b_j$ denote the $i$-th and $j$-th digit of $a$ and $b$, counting from the right respectively. For each pair of indices $(i, j)$, the quantity $a_i \cdot b_j$ contributes to digit $i + j$ from the right in the product. By iterating through all pairs of $(i, j)$ to compute the product and carry as appropriate, we arrive at the answer.
Here’s the implementation of long multiplication in C++, with the storage vectors in base $2^{32}$ and stored in little-endian (least significant digit first):
using storage_t = std::vector<uint32_t>;
inline storage_t naive_mul_int(const storage_t& a, const storage_t& b) {
    assert(a.size() == b.size());
    const uint64_t BASE = (1ULL << 32);
    const size_t n = a.size();
    storage_t result(2 * n);
    for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < n; j++) {
            uint64_t ai = a[i];
            uint64_t bj = b[j];
            // Fits in 64 bits: (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1
            uint64_t prod = ai * bj + carry + (uint64_t) result[i + j];
            result[i + j] = static_cast<uint32_t>(prod & (BASE - 1));
            carry = prod >> 32;
        }
        // Propagate the leftover carry past the limbs written this round
        for (size_t j = i + n; carry != 0; j++) {
            uint64_t temp = carry + (uint64_t) result[j];
            result[j] = static_cast<uint32_t>(temp & (BASE - 1));
            carry = temp >> 32;
        }
    }
    while (result.back() == 0 && result.size() > 1)
        result.pop_back();
    return result;
}
It is not difficult to see that this algorithm runs in $O(n^2)$ time, where $n$ is the number of digits of $a$ and $b$: every digit of $a$ is multiplied with every digit of $b$. This is too slow for integers with many digits, such as those handled by big integer libraries. Thus, we must improve.
Fast Method
One common fast multiplication algorithm used in big integer libraries is the Toom-Cook algorithm, which uses divide-and-conquer to improve the time complexity. We will use it as our fast multiplication algorithm.
To understand this algorithm, we must first understand fast polynomial multiplication. Let $p$ and $q$ be quadratic polynomials such that
$$ p(t) = at^2 + bt + c \\ q(t) = dt^2 + et + f \\ (a \ne 0, d \ne 0) $$
Our task is to compute the product of the polynomials $r(t) = p(t) \cdot q(t)$. One naive way is to multiply each term of $p$ with each term of $q$. This takes $9$ multiplications, and still scales quadratically as the number of terms in the polynomials increases, but we can do better.
The trick is to realize that any polynomial of degree $k$ is uniquely determined by its values at $k + 1$ distinct points. In our case, the polynomial we are trying to compute, $r(t)$, is the product of two quadratics. Therefore, it has degree $4$, so we can recover its coefficients by evaluating it at $4 + 1 = 5$ distinct points and interpolating.
For this example, let us evaluate $r(t)$ at $0, 1, -1, 2$, and $\infty$. Note that in this context, “evaluating” $r(\infty)$ by convention just means taking its leading coefficient, since the leading term alone determines the polynomial’s end behavior.
$$ \def\arraystretch{1.33} \begin{array}{lclcl} r(0) & = & p(0) \cdot q(0) & = & cf \\ r(1) & = & p(1) \cdot q(1) & = & (a + b + c)(d + e + f) \\ r(-1) & = & p(-1) \cdot q(-1) & = & (a - b + c)(d - e + f) \\ r(2) & = & p(2) \cdot q(2) & = & (4a + 2b + c)(4d + 2e + f) \\ r(\infty) & & & = & ad \end{array} $$
As you can see, to evaluate $r(t)$ at the $5$ points, we only needed $5$ multiplications, which is an improvement over the $9$ multiplications from the naive term-by-term method. Now, going back to our problem, the last step is to find the coefficients of $r$ by interpolation.
We can do this by solving a linear system of equations2 involving these evaluation points. Let’s represent the system using matrix notation below.
$$ \def\arraystretch{1.33} \begin{array}{l l l} \begin{bmatrix} r(0) \\ r(1) \\ r(-1) \\ r(2) \\ r(\infty) \end{bmatrix} & = & \begin{bmatrix} (0)^4 & (0)^3 & (0)^2 & (0)^1 & (0)^0 \\ (1)^4 & (1)^3 & (1)^2 & (1)^1 & (1)^0 \\ (-1)^4 & (-1)^3 & (-1)^2 & (-1)^1 & (-1)^0 \\ (2)^4 & (2)^3 & (2)^2 & (2)^1 & (2)^0 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} r_4 \\ r_3 \\ r_2 \\ r_1 \\ r_0 \end{bmatrix} \\ \\ \begin{bmatrix} r(0) \\ r(1) \\ r(-1) \\ r(2) \\ r(\infty) \end{bmatrix} & = & \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 & 1 \\ 16 & 8 & 4 & 2 & 1 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} r_4 \\ r_3 \\ r_2 \\ r_1 \\ r_0 \end{bmatrix} \end{array} $$
Each $r_i$ is the $i$-th coefficient of the result polynomial. If the evaluation points were chosen suitably, this matrix will be invertible. Thus, to solve for the coefficients, we simply need to find and multiply by the inverse of the matrix.
$$ \def\arraystretch{1.33} \begin{array}{l l l} \begin{bmatrix} r_4 \\ r_3 \\ r_2 \\ r_1 \\ r_0 \end{bmatrix} & = & \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 & 1 \\ 16 & 8 & 4 & 2 & 1 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} ^{-1} \begin{bmatrix} r(0) \\ r(1) \\ r(-1) \\ r(2) \\ r(\infty) \end{bmatrix} \\ \\ \begin{bmatrix} r_4 \\ r_3 \\ r_2 \\ r_1 \\ r_0 \end{bmatrix} & = & \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ \tfrac 12 & -\tfrac 12 & -\tfrac 16 & \tfrac 16 & -2 \\ -1 & \tfrac 12 & \tfrac 12 & 0 & -1 \\ -\tfrac 12 & 1 & -\tfrac 13 & -\tfrac 16 & 2 \\ 1 & 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} r(0) \\ r(1) \\ r(-1) \\ r(2) \\ r(\infty) \end{bmatrix} \end{array} $$
By computing the matrix-vector product, we obtain the final coefficients3.
The Toom-Cook algorithm takes advantage of fast polynomial multiplication by realizing that integers are actually polynomials in disguise. Going back to our example of $a = 123$ and $b = 456$, we can rewrite them as
$$ p(t) = 1 \cdot t^2 + 2 \cdot t + 3 \\ q(t) = 4 \cdot t^2 + 5 \cdot t + 6 $$
Notice that when $t = 10$, the polynomials evaluate back to $a$ and $b$ respectively. This is because we are operating in base ten; if we were operating in any other base, we would evaluate the polynomials at that base to recover the numbers.
To find the product of the numbers, we just need to find the polynomial
$$ r(t) = p(t) \cdot q(t) $$
using the evaluate-and-interpolate scheme above. Notice that if the coefficients of $p$ and $q$ are themselves large, computing the pointwise products requires further big integer multiplications. Thus, we can apply the same scheme recursively for these computations, improving our time complexity. Finally, to get back our number, we simply evaluate $r(t)$ at our desired base.
General Toom-$k$
More generally, a Toom-$k$ algorithm computes the product of two integers by splitting each of them into $k$ parts4. The numbers are then represented as polynomials of degree $k - 1$.
Then, the algorithm recursively computes the product of the polynomials via evaluation and interpolation. Finally, it evaluates the resulting polynomial at $t = b^i$, where $b$ is the base of the integers and $i$ is the number of digits in each part. Because $t$ is the base raised to an integer power, this final step reduces to summing coefficients shifted left by different multiples of $i$ digits, making it fast.
A Toom-$k$ algorithm runs in $O(n^m)$ time, where $m = \log_k(2k - 1)$. In our toy example above, we used a Toom-3 algorithm, splitting the input into $k = 3$ parts, so $m = \log_3(5) \approx 1.465$. Thus, our algorithm runs in approximately $O(n^{1.465})$ time, an improvement over quadratic time.
When $n$ is small, however, long multiplication outperforms Toom-Cook, which carries a higher constant factor from the overhead of the many extra additions, subtractions, and shifts.
I proceeded to implement the Toom-3 multiplication algorithm in the big integer library vint, which this series of blog posts is based on. The entire implementation is too long to paste here verbatim, but feel free to take a look at the source code for details.
Next time, we’ll discuss division algorithms, and be another step closer to calculating $\pi$!
Footnotes
-
Sometimes also referred to as “schoolbook multiplication”, “grade-school multiplication”, etc. These terms all mean the same thing and are used interchangeably. ↩︎
-
See the Wikipedia article on polynomial interpolation, which goes into the math in much greater depth. We use the most basic method to interpolate the coefficients here. ↩︎
-
Although this final matrix contains fractions and negative numbers, the resulting coefficients will always be nonnegative integers. Therefore, we only need additions, subtractions, and exact divisions by small constants for the computations, all of which are fast linear time operations. ↩︎
-
Sometimes one number has significantly more digits than the other. In this case, we can split the integers into an uneven number of parts to save some operations. This changes our interpolation matrix (which can still be hard-coded), but not the overall strategy. ↩︎
Have a comment or a question about this post? Reply by email!