Deep Learning for NLP – Part 5

Part 5: Efficient Transformer models

This course is part of the “Deep Learning for NLP” series. In this course, I will talk about various design schemes for efficient Transformer models. These techniques will be useful to both academic and industry practitioners. In industry use cases, Transformer models have achieved very high accuracy across many NLP tasks. However, their memory and computational complexity are quadratic in the sequence length, which makes them difficult to ship. This course, which focuses on methods to make Transformers efficient, is therefore important for anyone who wants to ship Transformer models as part of their products.

What you’ll learn

  • Deep Learning for Natural Language Processing.
  • Efficient Transformer Models: Star Transformers, Sparse Transformers, Reformer, Longformer, Linformer, Synthesizer.
  • Efficient Transformer Models: ETC (Extended Transformer Construction), Big Bird, Linear Attention Transformer, Performer, Sparse Sinkhorn Transformer, Routing Transformers.
  • Efficient Transformer benchmark: Long Range Arena.
  • Comparison of various efficient Transformer methods.

Course Content

  • Efficient Transformers: Part 1 (8 lectures, 1 hr 41 min).
  • Efficient Transformers: Part 2 (10 lectures, 1 hr 51 min).

Requirements

  • Basics of machine learning.
  • Basic understanding of Transformer based models and word embeddings.

Time and activation memory in Transformers grow quadratically with the sequence length. This is because, in every layer, every attention head computes a transformed representation for every position by “paying attention” to the tokens at every other position. In practice, quadratic complexity means that the maximum input size is rather limited, so we cannot extract semantic representations for long documents by passing them whole as input to Transformers. In this module, we will talk about methods that address this challenge.
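
To make the quadratic growth concrete, here is a minimal Python sketch that counts the query-key scores a single attention head computes in one layer. The function names and the `window` parameter are hypothetical and introduced only for illustration. The windowed variant is a simplified stand-in for the local attention pattern used by some of the efficient models covered in this course, not an implementation of any specific one.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Scores computed by one head in one layer with full self-attention.

    Every position attends to every position, so the count is seq_len squared.
    """
    return seq_len * seq_len


def windowed_attention_pairs(seq_len: int, window: int) -> int:
    """Scores computed when each position attends only to positions at most
    `window` steps away on either side (an illustrative local pattern).
    """
    total = 0
    for i in range(seq_len):
        lo = max(0, i - window)            # first position inside the window
        hi = min(seq_len - 1, i + window)  # last position inside the window
        total += hi - lo + 1
    return total


if __name__ == "__main__":
    # Doubling the sequence length quadruples the full-attention count,
    # while the windowed count grows roughly linearly.
    for n in (512, 1024, 2048, 4096):
        print(n, full_attention_pairs(n), windowed_attention_pairs(n, window=64))
```

Under these assumptions, doubling the sequence length quadruples the full-attention count, while the windowed count only roughly doubles. This is why restricting which positions attend to each other is one common route to longer inputs.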

The course consists of two main sections. Across them, I will talk about efficient Transformer models, an efficient Transformer benchmark, and a comparison of various efficient Transformer methods.

In the first section, I will talk about methods such as Star Transformers, Sparse Transformers, Reformer, Longformer, Linformer, and Synthesizer.

In the second section, I will talk about methods such as ETC (Extended Transformer Construction), Big Bird, Linear Attention Transformer, Performer, Sparse Sinkhorn Transformer, and Routing Transformers. Long Range Arena is a recent benchmark for evaluating models on long-sequence tasks with respect to accuracy, memory usage, and inference time. We will discuss Long Range Arena in detail and then wrap up with a philosophical categorization of the various efficient Transformer methods.

For each method, we will discuss its specific optimization scheme, its architecture, and the results obtained for pretraining as well as downstream tasks.
