Transformers in artificial intelligence

Parag Mahajani
Thoughtworks: e4r™ Tech Blogs
2 min read · Jan 16, 2024


Sci-Tech Neologisms-1.


What is a transformer in artificial intelligence?

  • A neural network architecture pioneered by the Google Brain and Google Research teams.
  • Introduced in the 2017 paper “Attention Is All You Need”, presented at the Neural Information Processing Systems (NIPS) conference.
  • Transforms one sequence of elements (e.g., sentences and words) into another.
  • Major components: encoder stack, decoder stack, and multi-head attention layers.
  • Attention layers capture relationships among the elements, including long-range dependencies across the sequence.
  • Process overview: The encoder reads the input sequence. The attention mechanism determines important parts of the input (context) and assigns weights to them. The decoder takes in the encoded sequence with the weights and completes the transformation.
  • The trained model ultimately predicts (rather than copies) the next element in a sequence.
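The attention step described above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation of scaled dot-product attention (the core operation inside the multi-head attention layers); the causal mask shown here is what lets the decoder predict the next element using only earlier ones. The function and variable names are this sketch's own, not from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq) pairwise similarity scores
    if causal:
        # Mask out future positions so each element attends only to
        # itself and earlier elements (used on the decoder side).
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)  # each row is a weighting over the input
    return weights @ V, weights

# Toy example: a sequence of 4 elements with model dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(X, X, X, causal=True)
```

Each row of `w` is the set of weights the mechanism assigns to the input elements for one output position, matching the "assigns weights to important parts of the input" step in the process overview.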

Why is it used?

  • Ability to deal with sequential tasks and long-term dependencies.
  • Well suited to natural language processing (NLP).
  • The backbone of large language models (LLMs) like GPT and Bard, and of generative models like DALL·E and Microsoft Copilot.
  • Easy to implement, and processes all input elements in parallel, which is efficient.
  • Uses self-attention instead of convolutions and recurrence, which simplifies the architecture.
  • Preferred in generative (synthesis) applications.
  • Applications: NLP, text generation, computer vision, image recognition, image creation, video processing, 3D imaging, time-series forecasting, pharmaceuticals, genetics, life sciences, and drug discovery.
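The parallelism and self-attention points above can be illustrated together: multi-head attention splits the model dimension across heads and computes every position in one batched matrix product, with no sequential loop over the sequence as a recurrent network would need. This is a simplified sketch; real implementations also apply learned projection matrices (W_Q, W_K, W_V and an output projection) per head, which are omitted here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads):
    """Simplified multi-head self-attention (no learned projections).

    All positions and all heads are processed in one batched matrix
    product -- nothing steps through the sequence one element at a time.
    """
    seq, d_model = X.shape
    d_head = d_model // n_heads
    # Split the feature dimension into heads: (heads, seq, d_head).
    H = X.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    scores = H @ H.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores)
    out = weights @ H                                    # (heads, seq, d_head)
    # Concatenate the head outputs back into the model dimension.
    return out.transpose(1, 0, 2).reshape(seq, d_model)

# Toy example: 6 elements, model dimension 16, 4 heads of size 4.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 16))
Y = multi_head_self_attention(X, n_heads=4)
```

Because each head attends over a smaller slice of the features, the heads can specialize in different kinds of relationships, while the batched computation keeps everything parallel.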

What are its current challenges?

  • Expensive in terms of time and memory; self-attention scales quadratically with sequence length.
  • Models are highly complex and large, making real-world applications at a smaller scale challenging.
  • Overfitting to the training data, poor generalisation performance.
  • Hard to control the long-term dependencies that span multiple tokens.
  • Attention mechanisms can be biased, giving spurious outcomes.
  • Difficult to interpret and understand due to intermediate representations.
  • Autoregressive model limitations: limited access to long memory and limited ability to update state.
  • Parallel processing restricts lower layers from accessing the higher-level representations computed for earlier positions, unlike recurrent models.

To probe further


Parag Mahajani

Sci-tech communicator, author, technical writer and public speaker of science and technology working for multinational corporates for more than 30 years.