Thread
Everybody wants their models to run faster. However, researchers often cargo-cult performance optimizations without a solid understanding of the underlying principles.

To address that, I wrote a post called "Making Deep Learning Go Brrrr From First Principles". (1/3)

horace.io/brrr_intro.html
For single-GPU performance, there are 3 main regimes your model can be bottlenecked by: 1. Compute, 2. Memory Bandwidth, and 3. Overhead. Correspondingly, the optimizations that matter *also* depend on which regime you're in. (2/3)
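
To make the regimes concrete, here's a rough roofline-style sketch (the peak-FLOPs and bandwidth numbers are assumptions for an A100-like GPU, not from the thread): an op is compute-bound when its arithmetic intensity, i.e. FLOPs per byte moved, exceeds the hardware's FLOPs-to-bandwidth ratio, and memory-bandwidth-bound otherwise.

# Rough roofline sketch; peak numbers below are assumptions, not from the thread.
PEAK_FLOPS = 312e12   # assumed fp16 Tensor-Core peak for an A100-like GPU
PEAK_BW    = 1.5e12   # assumed HBM bandwidth in bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW   # ~208 FLOPs per byte

def regime(flops, bytes_moved):
    intensity = flops / bytes_moved
    return "compute-bound" if intensity > RIDGE else "memory-bandwidth-bound"

# Square fp16 matmul (N x N): 2*N^3 FLOPs, 3*N^2 elements * 2 bytes moved.
N = 4096
print(regime(2 * N**3, 3 * N**2 * 2))   # -> compute-bound

# Pointwise op on the same tensor: ~1 FLOP per element, one read + one write.
print(regime(N * N, 2 * N * N * 2))     # -> memory-bandwidth-bound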
As an example: if you're memory-bandwidth bound, then increasing your compute throughput by activating Tensor Cores isn't likely to help!
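
As a minimal sketch of that point (assuming PyTorch 2.x and a CUDA GPU; the specific ops and sizes are just illustrative): a chain of pointwise ops spends its time moving memory, so fusing the chain into one kernel, e.g. with torch.compile, is the lever that helps, not more math throughput.

# Minimal timing sketch, assuming PyTorch >= 2.0 and a CUDA device.
import torch

def unfused(x):
    # Three pointwise ops -> three kernels, each reading/writing the full tensor.
    return torch.relu(torch.sin(x) + torch.cos(x))

fused = torch.compile(unfused)   # fuses the pointwise chain into one kernel

x = torch.randn(2**26, device="cuda")

def bench(fn):
    fn(x)                        # warm-up (also triggers compilation)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(10):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 10   # ms per call

print("eager:", bench(unfused), "ms")
print("fused:", bench(fused), "ms")

The fused version wins by cutting reads and writes of intermediate tensors, not by doing the math any faster.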

If you wanna learn more, check out the blog post :P
(3/3)