What are transformers in Generative AI?

The Transformer architecture is pivotal in modern natural language processing (NLP), powering AI tools like ChatGPT. We explain what it is and how it works.

By Kesha Williams

Apr 15, 2024 • 8 Minute Read


In the world of artificial intelligence, one force has revolutionized the way we think about and interact with machines: Transformers. No, not those shape-shifting toys that morph into trucks or fighter jets! Transformers let AI models track relationships between chunks of data and derive meaning, much like you deciphering the words in this sentence. It’s a method that has breathed new life into natural language models and reshaped the AI landscape.

In this post, I’ll explain the Transformer architecture, how it powers AI models like GPT and BERT, and its impact on the future of Generative AI.

How does generative AI work?

Generative AI (GenAI) analyzes vast amounts of data, looking for patterns and relationships, then uses these insights to create new content that resembles the original dataset. It does this by leveraging machine learning models, especially unsupervised and semi-supervised algorithms.

So, what actually does the heavy lifting behind this capability? Neural networks. These networks, inspired by the human brain, ingest vast amounts of data through layers of interconnected nodes (neurons), which then process and decipher patterns in it. These insights can then be used to make predictions or decisions. With neural networks, we can create diverse content, from graphics and multimedia to text and even music.

These neural networks adapt and improve over time with experience, forming the backbone of modern artificial intelligence. Looping back to Transformers, that accumulated experience is like the Matrix of Leadership, which allows Optimus Prime to leverage the knowledge of his ancestors to inform his decisions.

There are three popular techniques for implementing Generative AI:

Generative Adversarial Networks (GANs)
Variational Autoencoders (VAEs)
Transformers

Examining how the first two work provides insight into how Transformers operate, so let’s delve a little deeper into GANs and VAEs.

What are Generative Adversarial Networks (GANs)?

Generative Adversarial Networks (GANs) are a type of generative model that has two main components: a generator and a discriminator. The generator tries to produce realistic data, while the discriminator tries to tell the generated samples apart from real ones.

Let’s use the analogy of the Autobots and Decepticons in the Transformers franchise. Think of the Autobots as "Generators," trying to mimic and transform into any vehicle or animal on Earth. On the opposite side, the Decepticons play the role of "Discriminators," trying to identify which vehicles and animals are truly Autobots. As they engage, the Autobots fine-tune their outputs, motivated by the discerning eyes of the Decepticons. Their continuous struggle improves the generator's ability to create data so convincing that the discriminator can't tell the real from the fake.
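To make that tug-of-war a little more concrete, here is a minimal sketch of the two losses a GAN typically balances. It assumes the standard binary cross-entropy formulation (with the commonly used non-saturating generator loss). The function names and probability values are made up for illustration, and no actual networks are trained here.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Low when the discriminator scores real data near 1 and fakes near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Low when the discriminator is fooled into scoring fakes near 1."""
    return -math.log(d_fake)

# Hypothetical discriminator outputs: the probability that a sample is real.
early_training = {"d_real": 0.9, "d_fake": 0.1}  # fakes are easy to spot
late_training = {"d_real": 0.5, "d_fake": 0.5}   # real and fake look alike

for name, p in [("early", early_training), ("late", late_training)]:
    print(name,
          "D loss:", round(discriminator_loss(p["d_real"], p["d_fake"]), 3),
          "G loss:", round(generator_loss(p["d_fake"]), 3))
```

As the generator improves, its loss falls while the discriminator's rises, which is the "continuous struggle" in numbers.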

GANs have several limitations and challenges. For instance, they can be difficult to train because of problems such as mode collapse, where the generator produces only a limited variety of samples, or even the same sample, regardless of the input. It might, for example, repeatedly generate the same type of image rather than a diverse range of outputs.

What are Variational Autoencoders (VAEs)?

Variational Autoencoders (VAEs) are a generative model used mainly in unsupervised machine learning. They can produce new data that looks like your input data. The main components of VAEs are the encoder, the decoder, and a loss function.

Within deep learning, consider VAEs as Cybertron's advanced transformation chambers. First, the encoder acts like a detailed scanner, capturing a Transformer's essence into latent variables. Then, the decoder aims to rebuild that form, often creating subtle variations. This reconstruction, governed by a loss function, ensures the result mirrors the original while allowing unique differences. Think of it as reconstructing Optimus Prime's truck form but with occasional custom modifications.

VAEs have their own limitations and challenges. For instance, the loss function is a balancing act between two terms: reconstruction (how closely the decoder's output matches the input) and regularization (how well-organized the latent space is). Weighting one too heavily hurts the other, and striking the right balance can be challenging.
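Here is a rough sketch of that balancing act. It assumes squared error for the reconstruction term and a standard normal prior for the regularization (KL divergence) term, which is a common textbook setup rather than anything specific to this post. The `beta` weight and all input values are hypothetical.

```python
import math

def vae_loss(x, x_reconstructed, mu, log_var, beta=1.0):
    """Return (total, reconstruction, kl) for one example.

    mu and log_var describe the encoder's latent distribution;
    beta is an illustrative knob for weighting the regularization term.
    """
    # Reconstruction: how far is the decoder's output from the input?
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    # Regularization: closed-form KL divergence between N(mu, sigma^2) and N(0, 1).
    kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
    return reconstruction + beta * kl, reconstruction, kl

total, reconstruction, kl = vae_loss(
    x=[1.0, 2.0],
    x_reconstructed=[0.9, 2.2],
    mu=[1.0, 0.0],
    log_var=[0.0, 0.0],
)
print(round(reconstruction, 3), round(kl, 3), round(total, 3))
```

Raising `beta` pushes the model toward a tidier latent space at the cost of blurrier reconstructions, and lowering it does the opposite.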

How Transformers are different from GANs and VAEs

The Transformer architecture introduced several groundbreaking innovations that set it apart from Generative AI techniques like GANs and VAEs. Transformer models understand the interplay of words in a sentence, capturing context. Unlike earlier sequence models such as recurrent neural networks (RNNs), which handle sequences one step at a time, Transformers process all parts of the input simultaneously, making them efficient and GPU-friendly.

Imagine the first time you watched Optimus Prime transform from a truck into a formidable Autobot leader. That’s the leap AI made when transitioning from traditional models to the Transformer architecture. Well-known models such as Google’s BERT and OpenAI’s GPT-3 and GPT-4 (the latter two among the most powerful generative AI models) are based on the Transformer architecture. Models like these can be used to generate human-like text, help with coding tasks, translate from one language to another, and even answer questions on almost any topic.

Additionally, the Transformer architecture's versatility extends beyond text, showing promise in areas like vision. Transformers' ability to learn from vast data sources and then be fine-tuned for specific tasks like chat has ushered in a new era of NLP that includes groundbreaking tools like ChatGPT. In short, with Transformers, there’s more than meets the eye!

How does the Transformer architecture work?

The Transformer is a neural network architecture that takes a text sequence as input and produces another text sequence as output, for example when translating from English (“Good Morning”) to Portuguese (“Bom Dia”). Many popular language models are built using this architectural approach.

The input

The input is a sequence of tokens, which can be words or subwords, extracted from the text provided. In our example, that’s “Good Morning.” Tokens are just chunks of text that hold meaning. In this case, “Good” and “Morning” are both tokens, and if you added an “!”, that would be a token too.
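As a rough illustration of that splitting step, here is a tiny word-and-punctuation tokenizer. It is only a sketch: production models typically use learned subword tokenizers, and the `simple_tokenize` function and the toy vocabulary below are my own assumptions for demonstration.

```python
import re

def simple_tokenize(text):
    # Runs of letters/digits become one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Good Morning!")
print(tokens)  # ['Good', 'Morning', '!']

# Models work with integer IDs, so each distinct token gets an index
# in a (here, tiny and made-up) vocabulary.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]
print(token_ids)  # [1, 2, 0]
```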

The embeddings

Once the input is received, the sequence is converted into numerical vectors, known as embeddings, which capture the meaning of each token. These embeddings allow models to process textual data mathematically and understand the intricate details and relationships of language. Similar words or tokens will have similar embeddings.

For example, the word “Good” might be represented by a set of numbers that capture its positive sentiment and common use as an adjective. That means it would be positioned closely to other positive or similar-meaning words like “great” or “pleasant”, allowing the model to understand how these words are related.
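One common way to measure how "close" two embeddings are is cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration. Real embeddings are learned during training and usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 or below mean unrelated or opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings, chosen so that the positive words sit close together.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.95, 0.75, 0.15],
    "terrible": [-0.9, 0.7, 0.1],
}

good_vs_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
good_vs_terrible = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])
print(round(good_vs_great, 3), round(good_vs_terrible, 3))
```

With these toy numbers, "good" scores much higher against "great" than against "terrible", which is the kind of relationship the model relies on.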

Positional embeddings are also included to help the model understand the position of a token within a sequence, ensuring the order and relative positions of tokens are understood and considered during processing. After all, “hot dog” means something entirely different from “dog hot” - position matters!
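There are several ways to build positional embeddings. The sketch below uses the fixed sine and cosine scheme from the original Transformer paper (other models learn theirs instead), and the "dog" vector is a made-up placeholder.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: one row of d_model numbers per position."""
    encoding = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding

pe = positional_encoding(seq_len=2, d_model=4)

# The same (hypothetical) token embedding ends up different at each position.
dog = [0.5, 0.1, -0.3, 0.8]
dog_first = [e + p for e, p in zip(dog, pe[0])]   # "dog hot"
dog_second = [e + p for e, p in zip(dog, pe[1])]  # "hot dog"
print(dog_first != dog_second)  # True
```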

The encoder

Now that our tokens have been appropriately marked, they pass through the encoder. The encoder helps process and prepare the input data (words, in our case) by understanding its structure and nuances. It relies on two mechanisms: a self-attention mechanism and a feed-forward mechanism.

The self-attention mechanism relates every word in the input sequence to every other word, allowing the model to focus on the most important words. It's like giving each word a score that represents how much attention it should pay to every other word in the sentence.
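Here is a minimal sketch of that scoring idea, using scaled dot-product attention in plain Python. To keep it short, it assumes the token vectors themselves act as queries, keys, and values, whereas a real Transformer first passes them through learned projections. The two-dimensional vectors are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted blend of all the value vectors.
        blended = [sum(w * v[j] for w, v in zip(weights, values))
                   for j in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings for "Good" and "Morning".
x = [[1.0, 0.5], [0.5, 1.0]]
outputs, weights = self_attention(x, x, x)
for token, row in zip(["Good", "Morning"], weights):
    print(token, [round(w, 3) for w in row])
```

Each printed row shows how much one token attends to "Good" and "Morning" respectively, and every row sums to 1.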

The feed-forward mechanism is like your fine-tuner. It takes the output of the self-attention process and further refines the representation of each word, one position at a time, ensuring the subtle nuances are captured accurately.
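In the original Transformer, this step is a small two-layer network applied to each token's vector independently. The sketch below follows that shape (linear, ReLU, linear), but the weights are hand-picked toy values rather than anything learned.

```python
def linear(x, weights, bias):
    """One dense layer: each row of weights produces one output number."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]

def relu(vector):
    return [max(0.0, v) for v in vector]

def feed_forward(x, w1, b1, w2, b2):
    """Expand to a wider hidden layer, apply ReLU, then project back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Toy weights: 2 inputs -> 3 hidden units -> 2 outputs.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.0, 0.0]

token_vector = [1.0, -2.0]  # hypothetical output of self-attention for one token
refined = feed_forward(token_vector, w1, b1, w2, b2)
print(refined)  # [1.0, 0.0]
```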

The decoder

At the culmination of every epic Transformers battle, there's usually a transformation, a change that turns the tide. The Transformer architecture is no different! After the encoder has done its part, the decoder takes the stage. It uses its own previous outputs (the output embeddings from the previous time step of the decoder) and the processed input from the encoder.

This dual input strategy ensures that the decoder takes into account both the original data and what it has produced thus far. The goal is to create a coherent and contextually appropriate final output sequence.
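That loop of "look at the encoder's result, look at what I've written so far, pick the next token" can be sketched as follows. The `toy_next_token` function is a hypothetical stand-in for a trained decoder: it just reads from a hard-coded lookup table, so only the shape of the loop is realistic.

```python
def toy_next_token(encoder_output, generated):
    """Stand-in for a trained decoder: returns the next token from a fixed table."""
    translations = {("Good", "Morning"): ["Bom", "Dia", "<eos>"]}
    target = translations[tuple(encoder_output)]
    return target[len(generated)]

def greedy_decode(encoder_output, max_len=10):
    generated = []
    while len(generated) < max_len:
        # Both inputs are used at every step: the encoded source
        # and everything the decoder has produced so far.
        token = toy_next_token(encoder_output, generated)
        if token == "<eos>":  # end-of-sequence marker stops generation
            break
        generated.append(token)
    return generated

print(greedy_decode(["Good", "Morning"]))  # ['Bom', 'Dia']
```

In a real model, each step would produce a probability for every token in the vocabulary, and the decoder would pick (or sample) from those.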

The output

At this stage, we’ve got “Bom Dia”: a new sequence of tokens representing the translated text. It's just like the final roar of victory from Optimus Prime after a hard-fought battle! Hopefully, you’ve now got a bit more of an idea of how the Transformer architecture works.

Transformer Architecture: It’s ChatGPT’s AllSpark

In the Transformers series, the shape-shifting robots were animated by an ancient artifact called the AllSpark. In much the same way, the Transformer architecture is ChatGPT’s AllSpark: the core technology that “brings it to life” (at least in the sense of allowing it to process and coherently generate language).

The Generative Pre-trained Transformer (GPT) is a model built using the Transformer architecture, and ChatGPT is a specialized version of GPT, fine-tuned for conversational engagement. Thus, the Transformer architecture is to GPT what the AllSpark is to Transformers: the source that imbues them with their capabilities.

What’s next for Transformers and tools like ChatGPT?

The Transformer architecture has already brought about significant changes in the AI field, particularly in NLP, and it could drive even more innovation in Generative AI. Here are some possible directions:

Interactive Content Creation: Generative AI models based on Transformers could be used in real-time content creation settings, such as video games, where environments, narratives, or characters are generated on the fly based on player actions.
Real-world Simulations: Generative models can be used for simulations. These simulations could become highly realistic, aiding in scientific research, architecture, and even medical training.
Personalized Generations: Given the adaptability of Transformers, generative models might produce content personalized to individual tastes, preferences, or past experiences. Think of music playlists, stories, or artworks generated based on personal moods or past interactions.
Ethical and Societal Implications: With increased generative capabilities come challenges. Deepfakes, misinformation, and intellectual property concerns are just a few. The evolution of generative AI will require mechanisms to detect generated content and ensure ethical use.

Conclusion: An architecture changing AI as we know it

The Transformer architecture is poised to significantly advance the capabilities and applications of Generative AI, pushing the boundaries of what machines can create and how they assist in the creative process. Having read this article, you should now have a better grasp of how it works.

Want to transform your GenAI skills?

If you’re just beginning with Generative AI and ChatGPT, learning prompt engineering is a great starting point. Take a look at my course "Prompt Engineering for Improved Performance" to master advanced techniques in prompt engineering!

Kesha Williams is an Atlanta-based AWS Machine Learning Hero, Alexa Champion, and Director of Cloud Engineering who leads engineering teams building cloud-native solutions with a focus on growing early career technologists into cloud professionals and engineering leaders. She holds multiple AWS certifications and has leadership training from Harvard Business School. Find her on Topmate at https://topmate.io/kesha_williams.