I've been taking a break in pi computations because I pretty much maximized the computation size that's practical with the computing resources I've got. Santa Clara University was very generous with their computer time and electric bills, but I've moved on to the University of Illinois, and here all the computing power is already in use working on much more practical projects. This would make a nice BOINC project. In the mean time, the code could use some cleaning up, and I'd like to optimize the CPU version of the code and port it to the Xeon Phi. Because in grad school, I've got a lot of spare time :-)

Here's a summary of my computations of π using CUDA:

Starting pointDigitsCompletion DateTotal timeGPU time5 * 10 ^{14}+ 64`8B223942697A609C4`

2012-09-16 37 days 154 days 1 * 10 ^{15}`8353CB3F7F0C9ACCFA9AA215F2`

2012-10-24 39 days 297 days 2 * 10 ^{15}`653728f1.................`

2013-01-22 26 days 694 days 4 * 10 ^{15}`5cc37dec.................`

2013-05-20 32 days 1434 days 10 * 10 ^{15}`9077e016.................`

2014-04-05 88 days 3772 days

If you're interested, I gave a presentation at the 2013 NVIDIA GPU Technology Conference. Here are my slides, and here is a video of the presentation either streaming or in downloadable FLV. Or you can find it here, under the title "Computing the Quadrillionth Digit of Pi: A Supercomputer in the Garage".

` 5cc37dec `

I've computed another 17 digits after those, and again, I'll keep those under wraps to help confirm other people's computations.

OK, the odds of guessing that are pretty good, so how about more details?

Starting at the two quadrillionth hexadecimal digit (where each hexadecimal digit is four bits) the next eight digits of π are.`653728f1`

All the computations were done on graphics cards rather than on regular CPUs. I spread the job across three sets of graphics card-enabled computers:

- One computer with four NVIDIA GTX 690 graphics cards.
- One computer with two NVIDIA GTX 680 graphics cards.
- 24 computers (at Santa Clara University Design Center) with one NVIDIA GTX 570 graphics card each.

The doublecheck run showed that 25 digits of the initial run were accurate. So I've got 25 new digits, but I'm only making the first eight public for now. That way, if and when someone else extends this result, I can help verify that they really did get accurate results in their computation.

I've been working on a program that uses CUDA to compute digits of π. CUDA is a programming system that lets you offload work from the main processor to the graphics card(s), which often have far more computing horsepower than the main processor. It started as a project for a course I took at Santa Clara University. On August 27th, I completed a run that set a new world record for the computation of π, and I intend to double that record within two months.

The previous record was set in 2011 by a team from Yahoo, and it was computed using a cluster of 1000 computers. I performed a similar computation, but starting a few decimal places further, on a single computer in my garage. All the processing was done on four NVIDIA GTX 690 graphics cards installed in that machine. Yahoo's run took 23 days; mine took 37 days.

Normally when computer and math geeks talk about computing π they're talking about computing every digit from the first decimal place to some target digit many decimal places later. The current record for that type of computation is 10 trillion digits, computed in 2011 by Alexander J. Yee and Shigeru Kondo.

In 1995 Bailey, Borwein and Plouffe discovered a new formula for π which makes it possible to compute a few digits of π without computing all the previous digits.

The Yahoo team used a variation on this formula to compute 64 hexadecimal (base-16) digits of π, starting at the 500 trillionth place:

0E6C1294 AED40403 F56D2D76 4026265B CA98511D 0FCFFAA1 0F4D28B1My calculation started at the 500,000,000,000,056th place, overlapping the Yahoo results by eight digits (to doublecheck my results):BB5392B8

Due to the nature of the computation, the last few digits of this result are bound to be inaccurate. I'm currently doing a second run, offset by four digits, to determine how many are accurate and thus by how much I have extended the record.BB5392B88B223942 697A609C 45CD4228

**Update:** Here are the results from the second run:

run 1: BB5392B8Seventeen of the new digits are correct. The final result, never computed before, is:8B223942 697A609C 45CD4228 run 2: 92B88B223942 697A609C 44D95A52FC3F

I'm going to make a few tweaks to the program, and then start a run twice as deep, starting at the one quadrillionth hexadecimal digit or four quadrillionth bit. The run should take about 40 days on my home machine. I'll be doublechecking the results simultaneously on a set of single-GPU computers at Santa Clara University.

**Update 2012-10-24:** The quadrillionth digit run finished!
Here are the results:

353CB3F7 F0C9ACCF A9AA215F 2556D630This doubles the previous record. My guess is that the first 24 digits are correct. I'm currently doing the doublechecking run, which should finish in about a week.

**Update 2012-11-07:** The results are good to 25 digits!
Here are the results of the doublechecking run:

Initial run, starting at 10^{15}+1:353CB3F7 F0C9ACCF A9AA215F 2556D630 Doublecheck, starting at 10^{15}+5:B3F7 F0C9ACCF A9AA215F 24E1CB4C7030

quadrillionth digit | v Triplecheck, starting at 10So, the new record result is:^{15}-7:1C23D488 353CB3F7 F0C9ACCF A8262A4B Initial run, starting at 10^{15}+1:353CB3F7 F0C9ACCF A9AA215F 2556D630 Doublecheck, starting at 10^{15}+5:B3F7 F0C9ACCF A9AA215F 24E1CB4C7030

After a couple months at a continuous draw of 1100 watts this power strip retired early. |

Each term is in the form 2

For these calculations, D and E are as large as 2 × 10

(let D = A-6-10i+N)

(let E = B i + C)

for (i=...) { D = ...Division is a slow operation, particularly on current NVIDIA GPUs, because they have no hardware division instruction. There is a single-precision floating point reciprocal instruction. For higher precision division, the compiler generates code that uses the reciprocal instruction and a series of Newton-Raphson approximations to achieve the desired precision. For a 32-bit integer divide, this results in 17 instructions. For 64 bits, 70 instructions.function of i... E = ...function of i... m = modularExponentiation(2, D, E); //pow(2,D) % Etotal += m / E; }

The standard modular exponentiation algorithm uses many modulo operations, which are normally just as expensive as division operations. In my modular exponentiaion routine, I used Montgomery reduction to convert the modular base to a power of two, turning all the modulo operations into simple bitwise AND operations.

I used a 128-bit integer datatype to accumulate the results of the operation, so the division at each iteration of the main loop needed 128 bit of precison. Like the compiler-generated code, I used a few rounds of Newton-Raphson approximations to compute a high-precision reciprocal, then multiplied by the dividend.

GPU conference slides in Powerpoint and PDF.

*Last updated 2013-3-14*