A Guide to Transformer Architecture:
The Brain Behind ChatGPT


A new age of the AI revolution began when OpenAI put the power of AI into the hands of ordinary people through ChatGPT. Before that, AI was synonymous with post-apocalyptic robots going wild, like in Arnold Schwarzenegger films. But how did this sudden change come about?

It wasn’t actually that sudden. The ball started rolling when researchers at Google published the paper “Attention Is All You Need”, which laid down the building blocks that make the Transformer such a potent and powerful architecture in the neural network domain.

Overall, a Transformer neural network architecture consists of many elements that together allow it to understand the meaning of the input text and generate an appropriate response. At a high level, though, we can separate the transformer-based architecture into two parts:

  1. Encoder
  2. Decoder


In the original paper, the Transformer model architecture is designed with a stack of 6 Encoder blocks and a stack of 6 Decoder blocks, but this configuration is adjustable.

The Encoder is good at understanding text, so it is generally used to build a representation of the given input. The Decoder is good at generating text, which is why most GPT models use a decoder-only architecture.

Then there is the Encoder-Decoder architecture, used by models like T5, which is good at both understanding the prompt and generating text.

Transformer Architecture Explained In Detail:

In this section, we break down the Transformer architecture into its two key components: the Encoder and the Decoder. Each follows a structured process with multiple stages to transform input data into meaningful outputs. Let’s start by exploring the Encoder, its steps, and how it builds context-rich representations, before moving to the Decoder, where these representations are used to generate the final output.

Encoder

Now let’s focus on the Encoder part, which processes the input sequence through embeddings, self-attention, and feed-forward layers to generate a set of context-rich representations. For a better understanding of this part of the transformer model architecture, the simplest flow is shown in the following visual.

I. Input Embedding

Although we are used to writing natural language words on our computers, did you know that your computer cannot understand them unless they are converted into a numerical format?

So how does an LLM understand the text we insert as a prompt?

We need to convert that text into a numerical format via vectorization.

This conversion takes two steps:

1. Tokenization: Converting a whole sentence into a list of tokens. Note that one word does not always equal one token.

2. Embedding: Based on a learned model, each token is assigned a vector that places it at a suitable point in the vector space relative to the other words.
E.g., the distance between Man and Woman will be similar to the distance between Male and Female, while Man and Male will be closer to each other than Man and Female, as the sketch after this list illustrates.
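
To make this concrete, below is a minimal NumPy sketch of the idea. The 3-dimensional vectors are made-up values chosen purely for illustration; real models learn embeddings with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values for illustration only;
# real models learn vectors with hundreds or thousands of dimensions).
embeddings = {
    "man":    np.array([0.9, 0.1, 0.4]),
    "woman":  np.array([0.9, 0.8, 0.4]),
    "male":   np.array([0.8, 0.1, 0.5]),
    "female": np.array([0.8, 0.8, 0.5]),
}

def cosine_similarity(a, b):
    """Similarity of two vectors: values near 1.0 mean 'close in meaning'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "man" is closer to "male" than to "female" in this toy space.
print(cosine_similarity(embeddings["man"], embeddings["male"]))    # ~0.99
print(cosine_similarity(embeddings["man"], embeddings["female"]))  # ~0.82
```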

At this point, we have the words and their meanings processed in parallel. However, this process lacks the positional information of the words, which is crucial in natural language understanding. Therefore, the next step in the Transformer architecture model is to incorporate positional information.

II. Positional Encoding

A straightforward way to incorporate positional information is to assign numerical indexes like 1, 2, 3, and so on to the tokens. However, as the sequence length increases, these raw indexes grow without bound, which makes them difficult for the model to work with. To address this, a positional encoding vector is generated and added to the word vector obtained in the previous step.

This approach enables the model to effectively represent the position of each word within the sequence.
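
The original paper uses fixed sinusoidal functions to build this encoding vector. Here is a minimal NumPy sketch of that scheme; the sequence length and model dimension below are arbitrary example values.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed positional encodings from the original Transformer paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# The encoding is simply added to the word embeddings, element-wise:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```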

III. Attention

Attention is a mechanism that determines the importance of different parts of the input data by assigning numerical values to them. It helps the transformer NN architecture focus on the most relevant parts of a sentence when processing information, by dynamically assigning weights according to relative importance or relevance.

The process described above applies to a single attention head. If the transformer architecture has, say, 8 heads, the same process is carried out 8 times in parallel and the results are concatenated for further processing.
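
To make the computation concrete, here is a minimal sketch of one such head: scaled dot-product attention, as defined in the original paper. The random matrices are stand-ins for learned projection weights and input embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how relevant each token is to every other
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted mix of the value vectors

# Toy setup: 4 tokens, model dimension 8 (random stand-ins for learned values).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # token representations
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): one context-aware vector per token
```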

IV. Add & Norm

In the attention mechanism, as the model learns the relationships between words, the contribution of important words can get diluted. To address this, we add the vector that entered the attention layer back to the output of the attention layer (a residual connection) and normalize the result.
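
A minimal sketch of this Add & Norm step, assuming plain layer normalization with the learned scale and shift parameters omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance
    (learned scale and shift parameters omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization: adding the
    original input back ensures earlier information is not lost."""
    return layer_norm(x + sublayer_output)
```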

V. Feedforward Neural Network

It is essentially a straightforward neural network, consisting of an input layer, an output layer, and one or more hidden layers, applied to each position independently. During training, the weights of its interconnected neurons are updated with each pass.

In this way, we obtain learned embeddings that are more accurate. Embeddings retrieved from BERT are of this nature.
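
A minimal NumPy sketch of the position-wise feed-forward block described above, using the dimensions from the original paper (d_model = 512, d_ff = 2048); the random weights are stand-ins for learned parameters.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network, applied to each token vector
    independently: expand to a wider hidden layer, apply ReLU, project back.
    FFN(x) = max(0, x W1 + b1) W2 + b2
    """
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048  # dimensions used in the original paper
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
x = rng.normal(size=(4, d_model))               # 4 token vectors
print(feed_forward(x, W1, b1, W2, b2).shape)    # (4, 512)
```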

The output from the Encoders is then passed into the Decoder section, where it is processed to generate the final output.


Decoder

The Decoder generates output step by step, unlike the Encoder, which processes the entire input at once. It uses Masked Attention to ensure the model only considers past words, preventing it from seeing future ones. The Decoder also takes context from the Encoder to refine its predictions.

Now, like Lego blocks, certain mechanisms are repeated or reused in the Decoder as well. The main difference is that in the Encoder the entire input text is completely visible to the model, whereas on the Decoder side part of it is hidden by Masked Attention, so that the model learns to generate the next best word. Here the Decoder has multiple inputs: one comes from the Encoder and, in essence, provides the context or meaning of the given input, while the other is the text itself, which, in a loop, keeps generating the next best word.

Let’s look in detail at how this works and what differentiates the Decoder from the Encoder.

I. Masked Attention

Here, as shown in the gif above, the output is predicted one word at a time. For each new word, the entire available context (the segment in green) is used as input, and the next best word is generated as output. This cycle continues until the end-of-sequence token is encountered.
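
Mechanically, the masking is done by adding a causal mask to the attention scores before the softmax, so each position can only attend to itself and the positions before it. Here is a minimal sketch; the generation loop at the end is commented pseudocode, and tokenize, decoder, and EOS are hypothetical names.

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular -inf mask: position i may attend only to positions <= i.
    Added to the attention scores, -inf becomes weight 0 after the softmax."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
    return np.where(future, -np.inf, 0.0)

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]

# The generation loop itself, in pseudocode: feed the prompt, pick the next
# token, append it, and repeat until the end-of-sequence token appears.
#
#   tokens = tokenize(prompt)
#   while tokens[-1] != EOS:
#       logits = decoder(tokens)         # causal mask applied inside attention
#       tokens.append(argmax(logits))    # greedy choice of the next best word
```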

And this is how output is generated by our beloved ChatGPT, a generative language model based on the Transformer architecture.

Conclusion

Here we have tried to reveal the inner workings of LLMs. No matter whether it is ChatGPT, Gemini, or any other LLM, the Transformer architecture is the basic building block behind it.

Hence, it is advisable to check the output of generative models: it is produced through complex calculations by the AI models and can be strongly biased by the data on which they were trained.

At Triveni Global Software Services, we specialize in leveraging the power of Generative AI and Transformer architecture to build innovative Gen AI applications. From custom solutions to cutting-edge advancements in AI, we offer tailored app development services that empower businesses to harness the full potential of AI-driven technologies. Whether you’re looking to integrate GPT-based models or create unique AI-powered experiences, we are here to turn your vision into reality with Gen AI App Development.

FAQs

1. What Is ChatGPT Transformer Architecture?
ChatGPT uses the Transformer architecture, which processes text using an attention mechanism. This allows the model to focus on relevant parts of the input and generate coherent, context-aware text, making it effective for tasks like conversation and text generation.
2. Does Transformer Architecture Work For Images?
Yes, Transformer architecture can be applied to images. Vision Transformers (ViTs) treat images as sequences of patches and use self-attention to capture spatial relationships, enabling efficient image classification and object detection.
3. What Is the Structure of AI Transformer Architecture?
The Transformer architecture consists of an encoder-decoder structure. The encoder processes input data into context-rich representations, while the decoder generates the output sequence, with both components utilizing self-attention and feed-forward networks.
4. What Is the Machine Translation Encoder-Decoder Model?
The encoder-decoder model for machine translation converts input text into a context-rich representation, which the decoder uses to generate the translated sentence. Attention mechanisms help improve translation by allowing the model to focus on relevant input parts.
5. What Is the Multimodal Approach Using Transformer-based Architectures?
The multimodal approach in the transformer architecture model combines different types of data, like text, images, and audio, using Transformer models. This enables tasks such as image captioning and text-to-image generation, as the model learns cross-modal relationships.
