Thread
Everybody wants their models to run faster. However, researchers often cargo-cult performance optimizations without a solid understanding of the underlying principles.
To address that, I wrote a post called "Making Deep Learning Go Brrrr From First Principles". (1/3)
horace.io/brrr_intro.html
For single-GPU performance, there are 3 main areas your model might be bottlenecked by. Those are: 1. Compute, 2. Memory-Bandwidth, and 3. Overhead. Correspondingly, the optimizations that matter *also* depend on which regime you're in. (2/3)
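The compute vs. memory-bandwidth distinction can be made concrete with a roofline-style back-of-the-envelope check: compare an op's arithmetic intensity (FLOPs per byte moved) against the hardware's break-even point. A minimal sketch, using illustrative (roughly A100-class) peak numbers that are assumptions, not measurements:

```python
# Rough roofline-style classifier: given an op's FLOP count and the bytes
# it moves to/from main memory, decide which regime it falls in.
# Hardware peaks below are illustrative (roughly A100-class), not exact.
PEAK_FLOPS = 312e12            # fp16 tensor-core peak, FLOP/s (assumed)
PEAK_BW = 1.5e12               # HBM bandwidth, bytes/s (assumed)
RIDGE = PEAK_FLOPS / PEAK_BW   # break-even arithmetic intensity, FLOP/byte

def regime(flops, bytes_moved):
    intensity = flops / bytes_moved  # FLOPs performed per byte of traffic
    return "compute-bound" if intensity > RIDGE else "memory-bound"

# Square fp16 matmul, N=4096: 2*N^3 FLOPs; reads/writes 3 N*N fp16 matrices.
N = 4096
print(regime(2 * N**3, 3 * N * N * 2))  # high intensity -> compute-bound

# Elementwise ReLU on the same matrix: ~1 FLOP per element, read + write.
print(regime(N * N, 2 * N * N * 2))     # low intensity  -> memory-bound
```

(Overhead, the third regime, doesn't show up in a model like this: it's the time spent in the framework and launch machinery before any GPU work runs, so it dominates when individual ops are too small for either peak to matter.)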
Andrej Karpathy @karpathy · Mar 15, 2022
Excellent and unintuitive read on GPUs. The chip doing the compute has a tiny amount of memory & is connected to the main memory literally through a straw. Most of the energy goes to data movement too. Many repercussions. E.g. latency better predicted by # activations than # flops
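Karpathy's last point (latency tracks activation count, not FLOPs) falls straight out of the same bandwidth reasoning: for memory-bound ops, time is set by bytes moved, so extra FLOPs per element are free. A hedged sketch with the same assumed (roughly A100-class) peak numbers:

```python
# Sketch: for memory-bound ops, a simple latency model depends on bytes
# moved (i.e. # activations), not on FLOP count.
# Hardware peaks are assumed, roughly A100-class.
PEAK_FLOPS = 312e12  # FLOP/s (assumed)
PEAK_BW = 1.5e12     # bytes/s (assumed)

def est_latency(flops, bytes_moved):
    # An op is limited by the slower of its compute and its memory traffic.
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

N = 4096 * 4096  # activations in a 4096x4096 fp16 tensor
# ReLU: ~1 FLOP/element. A GELU-like op: ~10x the FLOPs, same traffic
# (read + write 2 bytes per element either way).
relu_t = est_latency(1 * N, 2 * N * 2)
fancy_t = est_latency(10 * N, 2 * N * 2)
print(relu_t == fancy_t)  # True: 10x the FLOPs, identical predicted latency
```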