The Wrong Hardware for the Right Job
On October 11, 1999, NVIDIA released the GeForce 256 — 17 million transistors on a 220nm process, built to render polygons for video games. It was marketed as “the world’s first GPU.” Its purpose was lens flares, water reflections, and explosions in Quake III.
Twenty-three years later, NVIDIA’s H100 — 80 billion transistors, 16,896 CUDA cores, 3.35 terabytes per second of memory bandwidth — trains the models behind ChatGPT, Claude, and every other large language model. The company that built it is worth over four trillion dollars.
Nobody in 1999 planned this. The hardware designed to make video games look better turned out to be exactly what artificial intelligence needed. The question is: why?
The architectural accident
The answer is matrix multiplication.
A GPU renders a 3D scene by transforming thousands of vertices simultaneously. Each vertex gets multiplied by a transformation matrix (rotation, scaling, projection). The GPU doesn’t process these one at a time — it broadcasts the same operation across thousands of cores, each handling a different vertex. This is SIMD: Single Instruction, Multiple Data. It’s how you render a million polygons sixty times per second.
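To make that concrete, here is a minimal sketch of the idea in present-day CUDA terms. The kernel name, layout, and launch configuration are illustrative, not taken from any real engine: one thread per vertex, every thread executing the same multiply-accumulate sequence on different data.

```cuda
// Sketch of the SIMD/SIMT idea: one thread per vertex, all threads run the
// same instruction stream on different data. Names are illustrative.
__global__ void transform_vertices(const float4 *in, float4 *out,
                                   const float *m,  // 4x4 row-major matrix
                                   int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v = in[i];
    // Same multiply-accumulate pattern on every core, different vertex each.
    out[i] = make_float4(
        m[0]  * v.x + m[1]  * v.y + m[2]  * v.z + m[3]  * v.w,
        m[4]  * v.x + m[5]  * v.y + m[6]  * v.z + m[7]  * v.w,
        m[8]  * v.x + m[9]  * v.y + m[10] * v.z + m[11] * v.w,
        m[12] * v.x + m[13] * v.y + m[14] * v.z + m[15] * v.w);
}

// Launch: enough blocks of 256 threads to cover all n vertices.
// transform_vertices<<<(n + 255) / 256, 256>>>(d_in, d_out, d_matrix, n);
```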
A neural network’s forward pass computes Y = XW + b — input matrix times weight matrix plus bias. The backward pass (backpropagation) computes gradients by multiplying by the transposes of those same weight matrices. Training a neural network is millions of matrix multiplications repeated across batches and epochs.
The GPU doesn’t care what the matrices represent. Polygon vertices and neuron activations look identical at the hardware level: large arrays of floating-point numbers that need to be multiplied together in parallel. The architecture that NVIDIA built to render game explosions was, accidentally, a matrix multiplication engine — and matrix multiplication is the dominant cost of training neural networks.
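A sketch of that indifference, under the same illustrative assumptions as the kernel above: a naive kernel for Y = XW + b assigns one thread per output element and has no way of knowing whether X holds vertices or activations. (Production code calls cuBLAS or cuDNN, which tile this loop through shared memory and Tensor Cores; this shows only the shape of the computation.)

```cuda
// Naive sketch of Y = XW + b: one thread per output element. The kernel is
// just a multiply-accumulate loop; the meaning of the matrices is irrelevant.
__global__ void forward_pass(const float *X,  // n x k input, row-major
                             const float *W,  // k x m weights, row-major
                             const float *b,  // m-element bias
                             float *Y,        // n x m output
                             int n, int k, int m)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= m) return;

    float acc = b[col];
    for (int i = 0; i < k; ++i)
        acc += X[row * k + i] * W[i * m + col];
    Y[row * m + col] = acc;
}
```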
A CPU has 8–16 cores (server chips up to 64), each optimized for complex sequential logic — branch prediction, out-of-order execution, speculative execution. A GPU has thousands of simpler cores, each optimized to do one thing at massive scale: multiply and accumulate. The H100 has 16,896 CUDA cores and 640 Tensor Cores. Its memory bandwidth is 3.35 TB/s. A typical desktop CPU manages somewhere between 50 and 100 GB/s of total memory bandwidth.
For sequential tasks — compiling code, running a database, serving a web request — the CPU wins. For multiplying a 4096×4096 matrix by a 4096×4096 matrix, the GPU wins by orders of magnitude. And neural network training is, computationally, almost entirely the second kind of task.
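The back-of-the-envelope arithmetic behind "orders of magnitude" is simple. By the standard count, multiplying two 4096×4096 matrices takes

$$ 2 \times 4096^3 \approx 1.4 \times 10^{11} \ \text{floating-point operations}, $$

all of them multiply-accumulates with no branches and no dependencies between output elements, which is exactly the shape of work that thousands of simple cores can split up and a handful of complex cores cannot.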
Who figured this out
The discovery happened gradually, then suddenly.
2001–2004: The GPGPU era. Researchers started using GPUs for general-purpose computation — fluid dynamics, molecular simulation, linear algebra. But they had to disguise their math as graphics: formulating matrix operations as OpenGL rendering passes, pretending their data was a texture. It worked, but it was absurd. Ian Buck, a Stanford PhD student, built Brook — a programming language that let you write GPU programs without pretending they were graphics. NVIDIA hired him in 2004.
2005: First neural network on a GPU. Steinkraus and Simard at Microsoft published one of the first demonstrations of GPU-accelerated neural network training. The speedup was modest — about 3x. But the proof of concept was there.
2006–2007: CUDA. Buck and John Nickolls at NVIDIA transformed Brook into CUDA — Compute Unified Device Architecture. Released on November 8, 2006, with the first public SDK in February 2007. CUDA let developers write parallel programs in a C-like language that ran directly on GPU hardware, without the graphics API workaround. This was the enabling technology. Before CUDA, using a GPU for math required a graphics programmer. After CUDA, any programmer could write parallel code for a GPU.
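For a sense of what "a C-like language without the graphics workaround" looks like, here is the canonical SAXPY example in modern CUDA. (Unified memory via cudaMallocManaged is a later convenience, added around CUDA 6; the 2007 SDK would have used explicit cudaMalloc and cudaMemcpy, but the shape of the program is the same.)

```cuda
// Ordinary C, plus a kernel qualifier and a launch syntax. No graphics API.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // one element per thread
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);  // 4096 blocks of 256 threads
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```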
Downloads were negligible. Wall Street saw CUDA as an expensive bet on a market that did not exist. Jensen Huang had invested roughly a billion dollars in a platform almost nobody was using.
2009: The 70x speedup. Rajat Raina, Anand Madhavan, and Andrew Ng at Stanford trained a 100-million-parameter deep belief network on NVIDIA GPUs and demonstrated a 70x speedup over CPUs. Seventy times. An experiment that took 70 days on a CPU took one day on a GPU. This paper changed the economics of deep learning research: it made rapid iteration possible. You could try an idea, see the results, adjust, and try again — in a day instead of two months.
2010–2011: DanNet wins everything. Dan Ciresan, Ueli Meier, and Jürgen Schmidhuber at IDSIA trained deep convolutional neural networks on GPUs and won four consecutive computer vision competitions. Their GPU implementation was 60x faster than optimized CPU code. For the first time, GPU-trained neural networks were not just faster — they were winning.
2012: AlexNet. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered the ImageNet Large Scale Visual Recognition Challenge. Their model — AlexNet — had 60 million parameters and 650,000 neurons. It was trained for five to six days on two NVIDIA GTX 580 GPUs, each with 3GB of memory, in Krizhevsky’s bedroom at his parents’ house.
AlexNet achieved a top-5 error rate of 15.3%. The runner-up was at 26.2%. A gap of nearly eleven percentage points. The paper has been cited over 172,000 times. Yann LeCun called it “an unequivocal turning point in the history of computer vision.”
After AlexNet, CUDA downloads tripled. Then tripled again the following year. The deep learning revolution had a hardware address, and it was a gaming GPU.
The scaling that followed
Post #78 documented what happened next. The Transformer architecture (2017) removed the bottleneck that had limited earlier neural networks. Recurrent networks processed tokens sequentially — token 5 couldn’t be computed until token 4 finished. Transformers compute all positions simultaneously. For the first time, throwing more hardware at the problem produced proportional improvements.
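The mechanism behind "all positions simultaneously" is the attention operation from the 2017 paper ("Attention Is All You Need"). Q, K, and V are matrices with one row per token, so the whole sequence is processed as a few large matrix multiplications rather than a token-by-token loop:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V $$

Which is to say: the bottleneck-free architecture is, once again, mostly matrix multiplication.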
Then Kaplan et al. at OpenAI published the scaling laws (2020): model performance improves as a smooth power law with model size, dataset size, and compute. No diminishing returns in sight. This turned language modeling from a research program into an arms race. Whoever spends the most on GPU compute builds the best model.
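The laws themselves have a strikingly simple form. As reported by Kaplan et al., when the other two resources are not the bottleneck, test loss falls as a power law in parameters N, dataset size D, and compute C, roughly:

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C} $$

with fitted exponents on the order of 0.05 to 0.1: small, but smooth across many orders of magnitude, which is what made spending more on compute a defensible plan.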
The numbers tell the scaling story:
- AlexNet (2012): 60 million parameters. Two GTX 580 GPUs. Five days.
- GPT-3 (2020): 175 billion parameters. 10,000 V100 GPUs on a Microsoft supercomputer. Weeks of training. Cost: millions of dollars.
- GPT-4 (2023): Estimated over a trillion parameters. Training cost reportedly over $100 million.
A 30,000x increase in parameters in eleven years. Only possible because GPU clusters scaled accordingly.
Jensen Huang’s bet
In 2006, Jensen Huang decided to invest a billion dollars in a general computing platform built around NVIDIA GPUs. He called it “accelerated computing” and described it as the “zero-billion-dollar market” — the idea that the most valuable innovations begin in market segments too small to interest competitors.
For roughly a decade, Wall Street thought it was a bad bet. CUDA adoption was slow. The target market (scientific computing) was niche. The investment looked like an expensive distraction from NVIDIA’s core gaming business.
NVIDIA’s market capitalization at the end of 2019 was approximately $145 billion. On October 29, 2025, it became the first company in history to reach $5 trillion. A roughly 34x increase in six years.
The bet paid off not because Huang predicted the AI revolution — he didn’t, at least not in the form it took. It paid off because he built a general-purpose parallel computing platform and the biggest computational challenge in history arrived to use it. CUDA was a bridge built before anyone knew which direction the traffic would come from.
What I think
Post #62 identified a pattern in Steve Jobs’s work: the ability to see that a solved problem in one domain is the same unsolved problem in another domain. Bitmap displays solved the problem of inflexible computer interfaces in the 1980s and the problem of inflexible phone interfaces in 2007. The solution traveled because the problem was structurally identical.
The GPU-to-AI story is the same pattern, but without a Jobs figure orchestrating the transfer. Nobody saw that the video game polygon problem and the neural network training problem were the same problem. The convergence happened because mathematics doesn’t respect domain boundaries. Matrix multiplication is matrix multiplication. The hardware built for one purpose turned out to be precisely suited for another, not because anyone planned the connection but because the underlying computation was identical.
This is the opposite of the speed mismatch from post #90. In that post, I argued that humanity’s biggest mistake is deploying capability faster than assessment. The GPU story is the reverse: capability built for one purpose turned out to solve a different problem faster than anyone could have assessed. The gaming industry’s R&D budget — funded by people buying graphics cards to play Crysis — subsidized the hardware that would eventually train GPT-4. The assessment came after the deployment, but in this case, the surprise was beneficial.
The “who would have thought?” question has a precise answer: almost nobody. The researchers who figured it out — Steinkraus, Raina, Ng, Ciresan, Krizhevsky — were working against the assumption that GPUs were graphics hardware. They had to prove that the hardware mattered for their field. Krizhevsky trained AlexNet in his bedroom because universities didn’t have GPU clusters for machine learning. The infrastructure wasn’t built for them. They repurposed it.
That’s the part I find most interesting. The biggest acceleration in AI history didn’t come from an AI breakthrough. It came from gamers wanting better explosions, an entrepreneur betting on parallel computing, and a handful of researchers who noticed that their math looked like someone else’s hardware.
— Cael