Transformers were introduced to replace Recurrent Neural Networks (RNNs) in natural language processing.

Here are THREE reasons why the transformer architecture is better than the RNN architecture.

-- A Thread --
🧵
First, a brief on RNNs vs. Transformers

RNNs such as LSTMs use gates (input, forget, and output) and a memory cell that give the network a form of long-term memory.

Transformers achieve the same using self-attention.

Transformers relay information across the input sequence through self-attention.
For example, consider this sentence.

"The animal didn't drink the water because it was too dirty." Does "it" refer to the water or the animal? Water.

The transformer uses self-attention to work out that "it" refers to the water by assigning "water" more attention weight than the other tokens.
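
If you want to see the mechanics, here's a minimal NumPy sketch of scaled dot-product self-attention on that sentence. The embeddings and projection matrices are random stand-ins rather than trained weights, so the printed weights are only illustrative; in a trained model the row for "it" would concentrate on "water".

```python
# Minimal scaled dot-product self-attention in NumPy.
# Embeddings and projections are random stand-ins, not trained weights,
# so the exact numbers below are illustrative only.
import numpy as np

tokens = ["The", "animal", "didn't", "drink", "the", "water",
          "because", "it", "was", "too", "dirty"]
d = 8                                      # toy embedding size
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d))      # (seq_len, d) token embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)              # every token attends to every token
scores -= scores.max(-1, keepdims=True)    # numerically stable softmax
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)
out = weights @ V                          # each output mixes all value vectors

# In a trained transformer, the row for "it" would put most of its
# weight on "water"; here the weights are random but the mechanics match.
it = tokens.index("it")
print(dict(zip(tokens, weights[it].round(2))))
```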
Reason 1: Transformers handle long-term dependencies better than RNNs.

Attention makes it possible to model longer-term dependencies because each layer has access to the entire input, unlike an RNN, which sees the sequence one step at a time.
Reason 2: Transformers are parallelizable, unlike RNNs, which makes them a good fit for accelerators such as GPUs and TPUs.

Parallelization is possible because self-attention computes the outputs for all positions in a layer at once, instead of one time step at a time.
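
Here's a rough sketch of that contrast in NumPy (toy sizes, untrained random weights): the RNN-style update has to run one step after another, while the attention-style computation is just a few matrix products over the whole sequence.

```python
# Contrast: sequential RNN-style updates vs. one attention-style pass
# over the whole sequence (NumPy, toy sizes, untrained random weights).
import numpy as np

seq_len, d = 512, 64
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d))          # the full input sequence

# RNN-style: step t needs the hidden state from step t-1,
# so the time steps cannot run in parallel.
Wh, Wx = rng.normal(size=(d, d)), rng.normal(size=(d, d))
h, states = np.zeros(d), []
for x_t in X:                              # strictly one step after another
    h = np.tanh(h @ Wh + x_t @ Wx)
    states.append(h)
rnn_out = np.stack(states)                 # (seq_len, d)

# Attention-style: all positions go through the same matrix products,
# which accelerators such as GPUs and TPUs execute in parallel.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
scores -= scores.max(-1, keepdims=True)    # numerically stable softmax
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)
attn_out = weights @ V                     # (seq_len, d), computed at once
```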
Reason 3: Transformers train faster than RNNs because of this parallelism. In an RNN, each step must wait for the previous one, which makes processing slow; in a Transformer, the computation for the whole sequence happens in parallel.
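
A quick way to feel this on your own machine: the snippet below computes the same attention scores one position at a time and then as a single matrix product. It isn't literally an RNN vs. a Transformer, just the sequential vs. parallel formulation of the same math; timings are indicative only and the gap widens on GPUs and TPUs.

```python
# Timing the same score computation two ways: one position at a time
# vs. a single batched matrix product (pure NumPy on CPU; indicative only).
import time
import numpy as np

seq_len, d = 1024, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

# Sequential: one query position per loop iteration,
# the way an RNN is forced to walk through the sequence.
start = time.perf_counter()
scores_seq = np.stack([q @ K.T for q in Q])
t_sequential = time.perf_counter() - start

# Parallel: the same scores as one matrix product over all positions.
start = time.perf_counter()
scores_par = Q @ K.T
t_parallel = time.perf_counter() - start

assert np.allclose(scores_seq, scores_par)   # identical results
print(f"one position at a time: {t_sequential*1e3:.2f} ms")
print(f"all positions at once:  {t_parallel*1e3:.2f} ms")
```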

Follow @themwiti for more machine learning content.