Beyond the Black Box of Generative LLMs

GPT is a buzzword that intimidates freshers these days. Technical freshers feel anxious after witnessing its capabilities, worried that it may threaten their jobs. Meanwhile, both technical and non-technical people are amazed, wondering, "How can a machine accomplish all this?" Let's delve deeper to understand the mechanism behind it.
What is Generative?
The term "generative" refers to the ability to create or produce something. As we have already experienced, unlike traditional systems that retrieve information from the web, these large language models (LLMs) are designed to generate content independently.
What is Pre-Trained?
As the term suggests, these LLMs are pre-trained on vast amounts of data, which enables them to generate responses. Generating a response simply means repeatedly predicting the next token through mathematical calculations, not magic.
What are Transformers?
This term represents the entire mechanism behind how GPTs function. The transformer is the core neural network architecture that GPT models are built upon. Let's understand the underlying mechanisms step by step, referencing Google's research paper, "Attention Is All You Need," which introduced this architecture.

Input and Encoding
This is the initial stage of interacting with LLMs, whether for training or inference. This step involves receiving input from the user, converting it into machine language, and understanding the actual context the user is referring to. Here are the detailed steps involved in this:
Tokenization: As we know, machines only understand numbers. Therefore, it's essential to convert every user input into numbers first. This step is called Tokenization, and these numbers are called Tokens.
The high-level architecture of tokenization involves breaking down a sentence into chunks of words, symbols, or sometimes even small sentences. These chunks are then replaced with corresponding numbers from their vocabulary dictionary. Each LLM has its own dictionary for replacing these chunks.
For example, let's create an imaginary vocabulary dictionary and tokenized sentences to see how this process might work.
🚨 This example is meant to provide a better understanding of tokenization and does not reflect how tokenization works in the real world. I want to simulate the traditional multi-tap process on old phone keypads, where pressing '2' once gives 'a', twice gives 'b', three times gives 'c', and so on.

Here are the steps for this:
Iterate through each character of the string.
If the character is a letter (a-z), replace it with its corresponding multi-tap digit sequence.
If the character is a space, append '0'.
If the character is any other special character, append '1'.
Example:
Hello Hashnode! will become ['44', '33', '555', '555', '666', '0', '44', '2', '7777', '44', '66', '666', '3', '33', '1'], where 44 represents H, 33 represents e, and so on.
Here's the equivalent Python script so you can try it out yourself.
t9_map = {
    'a': '2', 'b': '22', 'c': '222',
    'd': '3', 'e': '33', 'f': '333',
    'g': '4', 'h': '44', 'i': '444',
    'j': '5', 'k': '55', 'l': '555',
    'm': '6', 'n': '66', 'o': '666',
    'p': '7', 'q': '77', 'r': '777', 's': '7777',
    't': '8', 'u': '88', 'v': '888',
    'w': '9', 'x': '99', 'y': '999', 'z': '9999',
}

# Reverse mapping for detokenization
reverse_t9_map = {v: k for k, v in t9_map.items()}

def tokenize(text):
    text_lower = text.lower()
    tokens = []
    for char in text_lower:
        if 'a' <= char <= 'z':
            tokens.append(t9_map[char])
        elif char == ' ':
            tokens.append('0')
        else:
            tokens.append('1')
    return tokens

def detokenize(tokens):
    result = ""
    for token in tokens:
        if token == '0':
            result += ' '
        elif token == '1':
            result += '?'  # symbol placeholder
        else:
            result += reverse_t9_map.get(token, '?')  # fallback for unknown tokens
    return result

# Example usage
input_string = "Hello Hashnode!"
tokens = tokenize(input_string)
print("Tokens:", tokens)
decoded = detokenize(tokens)
print("Detokenized:", decoded)

"""
Output:
Tokens: ['44', '33', '555', '555', '666', '0', '44', '2', '7777', '44', '66', '666', '3', '33', '1']
Detokenized: hello hashnode?
"""
Want to understand how OpenAI tokenizes a message? Here's the code.
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4o')  # GPT model
print("Vocab Size:", encoder.n_vocab)  # 200,019 (~200K)

text = "Hello Hashnode"
tokens = encoder.encode(text)
print("Tokens:", tokens)  # Tokens: [13225, 10242, 7005]

my_tokens = [13225, 10242, 7005]
decoded = encoder.decode(my_tokens)
print("Decoded:", decoded)  # Decoded: Hello Hashnode
Vector Embedding: Vector embedding maps the semantic meaning of words in a sentence to multi-dimensional coordinates (often visualized in 2D or 3D). For example, in the sentences, Monkey eats banana and Man eats rice, ‘monkey' and 'man' are both animals, while 'banana' and 'rice' are food items. As a result, 'monkey' and 'man' would be positioned close to each other in one region of the space, and 'banana' and 'rice' in another. Moreover, the vector from 'monkey' to 'banana' would be similar in direction and magnitude to the vector from 'man' to 'rice', reflecting similar semantic relationships. It’s just mathematical matrices.
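To make this concrete, here is a toy sketch in pure Python with invented 2D vectors (real embeddings are learned from data and have hundreds or thousands of dimensions). It checks that 'monkey' and 'man' sit close together, and that the monkey→banana direction is nearly parallel to the man→rice direction.

```python
import math

# Toy 2D "embeddings" -- these numbers are invented for illustration only.
embeddings = {
    "monkey": [1.0, 4.0],
    "man":    [1.2, 3.8],   # close to "monkey" (both animate beings)
    "banana": [4.0, 1.0],
    "rice":   [4.2, 0.8],   # close to "banana" (both food items)
}

def cosine_similarity(a, b):
    # 1.0 means same direction, 0.0 means unrelated directions
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def difference(a, b):
    # vector pointing from b to a
    return [x - y for x, y in zip(a, b)]

# Animate words sit near each other...
animate_sim = cosine_similarity(embeddings["monkey"], embeddings["man"])
print(animate_sim)  # close to 1.0

# ...and the "eats" relationship points in a similar direction for both pairs:
v1 = difference(embeddings["banana"], embeddings["monkey"])
v2 = difference(embeddings["rice"], embeddings["man"])
relation_sim = cosine_similarity(v1, v2)
print(relation_sim)  # also close to 1.0
```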

Positional Encoding: Positional Encoding involves adding positional information to token embeddings to help the model understand the order of words in a sentence. For example, consider the sentences The man is eating rice and The rice is eating man. Although the words (and thus their embeddings) are the same in both sentences, their meanings are entirely different due to the change in word order. Since vector embeddings alone do not capture positional context, positional encoding is crucial—it allows the model to distinguish between such cases by encoding each token's position in the sequence, thereby enabling a better understanding of the actual context.
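The original Transformer paper encodes position with fixed sine and cosine waves of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal pure-Python sketch of that formula:

```python
import math

def positional_encoding(position, d_model=8):
    # Each even index gets a sine, each odd index the matching cosine,
    # with frequencies that decrease along the embedding dimension.
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Every position produces a unique pattern, so "man eats rice" and
# "rice eats man" yield different model inputs even with identical tokens.
for pos in range(3):
    print(pos, [round(x, 3) for x in positional_encoding(pos)])
```

This vector is simply added to each token's embedding before the encoder stack.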
Attention and Feed Forwarding (Encoding Phase)
Attention and Feed Forwarding is the next phase in the processing pipeline of Large Language Models (LLMs). At this stage, the model focuses on determining which parts of the input are most relevant to each token using the attention mechanism, and then applies feed-forward neural networks to transform these representations. This phase helps the model capture complex relationships between words and introduces non-linearity, allowing it to understand context and meaning beyond simple sequential patterns.
Multi-Head Attention: Multi-Head Attention builds upon the concept of self-attention, where each token in a sequence can interact with every other token to better understand contextual relationships. For example, consider the sentences 'The river bank' and 'The HDFC bank'. In both cases, the word 'bank' has the same token and embedding, and even its positional encoding would be similar since it appears at the end of the sentence. However, the meaning of 'bank' differs in each context. Self-attention helps the model capture these nuances by allowing the token 'bank' to attend to other tokens like 'river' or 'HDFC' for disambiguation.
Multi-Head Attention enhances this process by using multiple attention heads in parallel. Each head learns different types of relationships or focuses on different aspects of the input, enabling the model to capture richer and more diverse contextual information.
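Underneath each head sits the same scaled dot-product attention formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Here is a pure-Python sketch with invented toy vectors for the 'river bank' example (real models use learned projection matrices and batched tensor operations):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Invented toy vectors: the query for "bank" matches the key for "river"
# more strongly, pulling its representation toward the river sense.
Q = [[1.0, 0.0]]               # query for "bank"
K = [[1.0, 0.0], [0.0, 1.0]]   # keys for "river", "the"
V = [[5.0, 0.0], [0.0, 5.0]]   # values for "river", "the"
attn_out = scaled_dot_product_attention(Q, K, V)
print(attn_out)
```

Multi-head attention simply runs several such computations in parallel with different learned projections and concatenates the results.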
Feed Forwarding: Feed Forwarding in Large Language Models introduces non-linearity into the processing pipeline, allowing the model to interpret the context from multiple perspectives. For instance, imagine a scene where a dog is looking out the window while traveling in a car. Different parts of our brain might focus on various aspects of this moment: 'The car was white', 'The dog was a Labrador', 'The family was going on a trip', 'The dog was fascinated by the scenery', and so on. Similarly, during this phase, the model processes the contextual information through multiple dense layers to extract and represent diverse interpretations and deeper meaning.
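A minimal sketch of the position-wise feed-forward network from the Transformer paper, FFN(x) = max(0, xW1 + b1)W2 + b2, with invented toy weights. The ReLU is what introduces the non-linearity described above; it is applied to each token's vector independently.

```python
def relu(x):
    # ReLU non-linearity: negative values are zeroed out
    return max(0.0, x)

def feed_forward(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    # Weight matrices are stored as lists of columns.
    hidden = [relu(sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(W1, b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(W2, b2)]

# Invented toy weights for a 2 -> 2 -> 1 network:
x = [1.0, -1.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]  # two hidden units
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]              # one output unit combining both
b2 = [0.0]
ffn_out = feed_forward(x, W1, b1, W2, b2)
print(ffn_out)  # the ReLU has discarded the negative component
```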
This is what happens during the Input phase of LLM interaction, marking the end of the Encoding phase. Now, let's turn our attention to the Output phase and explore what happens in the decoder.
Output Embedding & Positional Encoding
The decoding phase is iterative — it generates one token at a time, and each newly generated token is used to predict the next one.
It begins with the tokens that have already been generated so far (this is often referred to as "shifted right" input). These tokens are passed through an output embedding layer, just like on the encoder side. Then, positional encoding is added to retain the order of the tokens and the combined representation (embedding + position) forms the input to the decoder stack.
Example:
Let’s say the user input is: “How are you?”
After the input is fully processed by the encoder, the decoder starts generating the response:
The decoder is triggered with a start token: [<start>].
It predicts the first word: "I" → Output so far: [<start>, I].
This gets fed back in → predicts "am" → [<start>, I, am].
This repeats until the model outputs <end> → [<start>, I, am, fine, <end>].
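This iterative loop can be sketched as follows, with a hypothetical stub (`fake_model`) standing in for the entire decoder stack:

```python
def generate(prompt_tokens, predict_next, max_len=10):
    # Auto-regressive decoding: feed generated tokens back in
    # until an end token appears or a length limit is hit.
    tokens = ["<start>"]
    while len(tokens) < max_len:
        next_token = predict_next(prompt_tokens, tokens)
        tokens.append(next_token)
        if next_token == "<end>":
            break
    return tokens

# Hypothetical stub that "answers" the example from the text;
# a real model would compute this from the full context.
canned = {"<start>": "I", "I": "am", "am": "fine", "fine": "<end>"}

def fake_model(prompt, generated):
    return canned[generated[-1]]

result = generate(["How", "are", "you", "?"], fake_model)
print(result)  # ['<start>', 'I', 'am', 'fine', '<end>']
```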
Masked Multi-Head Attention
This is the first step in the decoder stack. It's very similar to the multi-head attention used in the encoder, with one key difference — masking.
Masking ensures that the model can't look ahead. While generating the third word, for example, it shouldn't peek at the fourth. This keeps the generation process auto-regressive, i.e., predicting the next token using only the known ones. For example, while predicting the third word in [<start>, I, am], the model must not access "fine" or <end> yet. Masking hides those future tokens during attention.
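A sketch of how such a causal mask is commonly built: positions a token is not allowed to attend to receive negative infinity, so that after the softmax they get exactly zero weight.

```python
def causal_mask(n):
    # n x n mask: position i may attend only to positions j <= i.
    # Future positions get -inf, which softmax turns into zero weight.
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(n)]
            for i in range(n)]

# For a 4-token sequence, each row shows what that position may see:
for row in causal_mask(4):
    print(row)
```

This mask is added to the attention scores before the softmax inside every decoder attention head.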
Multi-Head Attention
This layer allows the decoder to attend to the encoder’s output — meaning it connects what’s being generated with what the user actually asked. It helps the model align the generated response with the input context.
Feed Forward + Add & Norm
Same as in the encoder — this adds non-linearity and enables the model to understand richer patterns in the data. Each token is passed through a Feed Forward Neural Network and an Add & Norm step (a residual connection followed by layer normalization) for stability and better learning.
Linear → Softmax
After decoding, the final token representations are passed through a Linear layer, which converts them into a large vector the same size as the vocabulary. Then, a Softmax layer turns this vector into a probability distribution over all possible next words. For example, at some point the model might see a distribution like [I: 2%, am: 87%, have: 4%, was: 1%, ...] and choose "am" as the predicted word.
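A small sketch of this final step, with an invented four-word vocabulary and made-up logits (the raw scores produced by the Linear layer):

```python
import math

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1.
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented toy vocabulary and logits for illustration:
vocab = ["I", "am", "have", "was"]
logits = [1.0, 4.0, 1.5, 0.5]          # output of the Linear layer
probs = softmax(logits)

# Greedy decoding picks the highest-probability token:
predicted = vocab[probs.index(max(probs))]
print(predicted)  # "am"
```

In practice, models often sample from this distribution (controlled by settings like temperature) rather than always taking the single most likely word.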
This wraps up the explanation of Gen-AI. I hope you found it interesting. Thank you.
This article explores the fundamentals of how Generative Pre-trained Transformers (GPTs) function, focusing on key concepts such as tokenization, vector embedding, positional encoding, and attention mechanisms. By breaking down the encoding and decoding phases of Large Language Models (LLMs), it elucidates how these systems generate contextually relevant responses. Through examples and explanations, readers gain insight into the architecture and processes that enable GPTs to produce human-like text.