<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Gen-AI]]></title><description><![CDATA[Gen-AI]]></description><link>https://blog.raushan.info</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 17:52:09 GMT</lastBuildDate><atom:link href="https://blog.raushan.info/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding HyDE: A Guide to Hypothetical Document Embeddings]]></title><description><![CDATA[In the previous section, we saw that Shreya didn’t get the expected response when she asked:

Explain the difference between waveguide and coaxial cable in practical applications.

The system returned partial matches or generic definitions—not the cr...]]></description><link>https://blog.raushan.info/rag-hyde</link><guid isPermaLink="true">https://blog.raushan.info/rag-hyde</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Wed, 23 Apr 2025 06:46:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/f0JGorLOkw0/upload/2728d29c07ee0985daa73efc36fd79aa.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous section, we saw that Shreya didn’t get the expected response when she asked:</p>
<blockquote>
<p><strong><em>Explain the difference between waveguide and coaxial cable in practical applications.</em></strong></p>
</blockquote>
<p>The system returned <em>partial matches</em> or <em>generic definitions</em>—not the crisp, real-world comparison she expected.</p>
<h1 id="heading-hyde-hypothetical-document-embeddings">HyDE – Hypothetical Document Embeddings</h1>
<p>Shreya realized that the issue wasn’t the retrieval model or the LLM. It was that her query was too <em>real-world</em>, and her dataset was full of <strong>exam-oriented phrasing</strong>.</p>
<p>This is where <strong>Hypothetical Document Embeddings (HyDE)</strong> came to the rescue.</p>
<p>Instead of searching the vector database with the raw user query, <strong>HyDE first asks the LLM to generate a “document”</strong>—a short, hypothetical paragraph that might resemble the <em>ideal answer to the question</em>. Then <strong>that generated paragraph is embedded and used for retrieval</strong>.</p>
<h2 id="heading-steps">Steps</h2>
<p>Here are the steps followed in this approach:</p>
<ol>
<li><p>Take the user's query as input.</p>
</li>
<li><p>Provide it to an <code>LLM</code> and ask it to write a <code>Document</code> on the topic.</p>
</li>
<li><p>Use this document to perform a <code>similarity_search</code>.</p>
</li>
<li><p>Retrieve the <code>chunks</code> from the <code>similarity_search</code> in <code>Step-3</code> and provide them to the <code>LLM</code> along with the user's <code>original</code> query.</p>
</li>
<li><p>Return the response given by the <code>LLM</code> to the user.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745389736457/8f352284-aa1b-454a-8013-6ba445098dd1.png" alt class="image--center mx-auto" /></p>
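<p>The steps above boil down to a single substitution: embed an LLM-generated document instead of the raw query. Here is a minimal, dependency-free sketch; <code>generate_doc</code> and <code>embed_search</code> are hypothetical stand-ins for the LLM call and the vector-store search implemented in the full code further down:</p>

```python
def hyde_search(generate_doc, embed_search, user_query):
    """HyDE retrieval: search with an LLM-written hypothetical answer, not the raw query."""
    # Step 2: ask the LLM to write a short document on the topic
    hypothetical_doc = generate_doc(user_query)
    # Step 3: embed that document and use it for similarity search
    return embed_search(hypothetical_doc)
```

<p>Everything after this point, i.e. feeding the retrieved chunks plus the original query back to the LLM, is the same generation step used throughout the series.</p>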
<h2 id="heading-why-will-it-work">Why will it work?</h2>
<p>Before answering <code>Why will it work?</code>, let’s first recall the actual <code>issue</code> from the previous section that kept the system from working.</p>
<p>The user's query used casual, real-world phrasing, while her documents were full of <code>technical phrases</code> and <code>jargon</code>, so the system struggled. When it applied <code>similarity_search</code> directly to the user's query, the matching chunks it returned were not very good, leading to a lower-quality response from the <code>LLM</code>.</p>
<p>Now, instead of directly using the user's query for <code>similarity_search</code>, we ask an <code>LLM</code> to write a document on the topic. The document created by the <code>LLM</code> will include all the <code>technical phrases</code> and <code>jargon</code> used in the industry. So, when we perform <code>similarity_search</code> on this document, the matching documents will be much more accurate and will cover the topic thoroughly. This ultimately leads to a better response from the <code>LLM</code>.</p>
<h2 id="heading-how-to-do">How to do it?</h2>
<p>If you have followed the series up to this point, implementing this certainly won’t be a big challenge for you. Still, here’s the full code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader
<span class="hljs-keyword">from</span> langchain_text_splitters <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter
<span class="hljs-keyword">from</span> langchain_google_genai <span class="hljs-keyword">import</span> GoogleGenerativeAIEmbeddings
<span class="hljs-keyword">from</span> langchain_qdrant <span class="hljs-keyword">import</span> QdrantVectorStore
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

GOOGLE_API_KEY = os.getenv(<span class="hljs-string">"GOOGLE_API_KEY"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_and_split_documents</span>(<span class="hljs-params">pdf_path</span>):</span>
    <span class="hljs-string">"""Load PDF and split into chunks"""</span>
    loader = PyPDFLoader(file_path=pdf_path)
    docs = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=<span class="hljs-number">1000</span>,
        chunk_overlap=<span class="hljs-number">200</span>
    )

    split_docs = text_splitter.split_documents(docs)

    print(<span class="hljs-string">"Number of documents before splitting:"</span>, len(docs))
    print(<span class="hljs-string">"Number of documents after splitting:"</span>, len(split_docs))

    <span class="hljs-keyword">return</span> split_docs

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">setup_vector_store</span>(<span class="hljs-params">split_docs, embedder</span>):</span>
    <span class="hljs-string">"""Initialize vector store with documents"""</span>
    vector_store = QdrantVectorStore.from_documents(
        documents=split_docs,
        url=<span class="hljs-string">"http://localhost:6333"</span>,
        collection_name=<span class="hljs-string">"learning_langchain"</span>,
        embedding=embedder
    )
    <span class="hljs-keyword">return</span> vector_store

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_document</span>(<span class="hljs-params">client, user_query</span>):</span>
    <span class="hljs-string">"""Generate a hypothetical document answering the user's query"""</span>
    GENERATE_DOCUMENT_SYSTEM_PROMPT = <span class="hljs-string">"""
    You are a helpful assistant. You will be provided with a question and you need to write a proper document on the topics included in it. Use proper technical phrases and terms used in the related industry. 
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-1.5-flash"</span>,
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: GENERATE_DOCUMENT_SYSTEM_PROMPT},
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
                <span class="hljs-string">"content"</span>: user_query
            }
        ]
    )
    content = response.choices[<span class="hljs-number">0</span>].message.content
    print(<span class="hljs-string">"Generate Document response:"</span>, content)

    <span class="hljs-keyword">return</span> content

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">similarity_search</span>(<span class="hljs-params">vector_store, query</span>):</span>
    <span class="hljs-string">"""Perform similarity search for a given query"""</span>
    relevant_chunks = vector_store.similarity_search(query=query)
    <span class="hljs-keyword">return</span> relevant_chunks

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">retrieval_generation</span>(<span class="hljs-params">client, query, context_docs</span>):</span>
    <span class="hljs-string">"""Generate an answer based on query and context"""</span>
    <span class="hljs-comment"># Format context from documents</span>
    context = <span class="hljs-string">"\n\n"</span>.join([doc.page_content <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> context_docs])
    print(context)

    GENERATION_SYSTEM_PROMPT = <span class="hljs-string">f"""
    You are a helpful assistant. You will be provided with a question and relevant context filtered according to user's query. 
    Your task is to provide a concise answer based on the context.

    Context: <span class="hljs-subst">{context}</span>
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-2.0-flash"</span>,
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: GENERATION_SYSTEM_PROMPT},
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: query}
        ]
    )
    <span class="hljs-keyword">return</span> response.choices[<span class="hljs-number">0</span>].message.content

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-comment"># Initialize components</span>
    pdf_path = Path(<span class="hljs-string">"./nodejs.pdf"</span>)
    split_docs = load_and_split_documents(pdf_path)

    embedder = GoogleGenerativeAIEmbeddings(
        model=<span class="hljs-string">"models/text-embedding-004"</span>,
        google_api_key=GOOGLE_API_KEY,
    )

    vector_store = setup_vector_store(split_docs, embedder)

    <span class="hljs-comment"># Create client for chatting</span>
    client = OpenAI(
        api_key=GOOGLE_API_KEY,
        base_url=<span class="hljs-string">"https://generativelanguage.googleapis.com/v1beta/openai/"</span>
    )

    <span class="hljs-comment"># Main interaction loop</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        user_query = input(<span class="hljs-string">"&gt;&gt; "</span>)
        <span class="hljs-keyword">if</span> user_query.lower() <span class="hljs-keyword">in</span> [<span class="hljs-string">"exit"</span>, <span class="hljs-string">"quit"</span>, <span class="hljs-string">"q"</span>]:
            <span class="hljs-keyword">break</span>

        <span class="hljs-comment"># Generate a hypothetical document for the query (HyDE)</span>
        content = generate_document(client, user_query)

        <span class="hljs-comment"># Retrieve chunks using the hypothetical document, then generate the final answer</span>
        relevant_chunks = similarity_search(vector_store, content)
        print(<span class="hljs-string">f"Retrieval: <span class="hljs-subst">{len(relevant_chunks)}</span> relevant chunks found."</span>)
        final_generation = retrieval_generation(client, user_query, relevant_chunks)
        print(<span class="hljs-string">f"Final Answer: <span class="hljs-subst">{final_generation}</span>"</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<blockquote>
<p>In the article, I explore the use of Hypothetical Document Embeddings (HyDE) to improve document retrieval and information extraction from large datasets, especially when dealing with real-world queries that differ significantly from the technical jargon in the dataset. By generating a hypothetical document that fits the technical tone of industry-standard language, HyDE enhances the accuracy of similarity searches, leading to more relevant document retrieval and improved responses from language models. The article includes a detailed breakdown of the steps in this process and provides an implementation using Python, LangChain, and OpenAI’s generative AI models.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Chain of Thoughts rescue Shreya]]></title><description><![CDATA[In the previous section, we saw that although Shreya made a good improvement in her system and it worked well for a few prompts, it still struggled with complex tasks. In such cases, the LLM was hallucinating and not performing well.
Chain of Thought...]]></description><link>https://blog.raushan.info/rags-cot</link><guid isPermaLink="true">https://blog.raushan.info/rags-cot</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Wed, 23 Apr 2025 06:02:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/hJUl5BAhJec/upload/109994390d8c27df4c9d3c12023ddd6c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous section, we saw that although Shreya made a good improvement in her system and it worked well for a few prompts, it still struggled with complex tasks. In such cases, the LLM was hallucinating and not performing well.</p>
<h1 id="heading-chain-of-thoughts-less-abstract-query-transformation-technique">Chain of Thoughts / Less Abstract Query Transformation Technique</h1>
<p>As Shreya sat at her drawing board, pondering a solution, she remembered her mother's advice:</p>
<blockquote>
<p>Break complex problems into smaller subproblems and solve them one by one.</p>
</blockquote>
<p>Shreya had an idea and felt confident it might work. Here's her plan:</p>
<p>The question she had asked last night was:</p>
<blockquote>
<p><strong><em>Trace how digital logic topics expanded in the last five years.</em></strong></p>
</blockquote>
<p>What if the LLM takes her mother's advice seriously and uses this approach? For example, the question above can be broken down into these steps:</p>
<blockquote>
<ul>
<li><p>Identify syllabus changes per year</p>
</li>
<li><p>Summarize each trend</p>
</li>
<li><p>Stitch them into a timeline</p>
</li>
</ul>
</blockquote>
<p>After breaking it into subproblems, she is fully confident that her system will be able to perform the task perfectly.</p>
<p>In short, the main idea is to break the query into multiple, <code>less abstract</code> subqueries. This way, the <code>LLM</code> can better understand its task.</p>
<p>Let’s look at another example from <code>Google's white paper</code>. It suggests breaking down the prompt</p>
<blockquote>
<p>Think Machine Learning</p>
</blockquote>
<p>into:</p>
<blockquote>
<ul>
<li><p>First, think about the machine.</p>
</li>
<li><p>Next, think about learning.</p>
</li>
<li><p>Finally, think about machine learning.</p>
</li>
</ul>
</blockquote>
<p>Let's explore her approach in detail using the flow diagram she created:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745360539686/51a4abf1-ffae-4e5b-a153-d4e6b904eb4f.png" alt class="image--center mx-auto" /></p>
<p>Here are the steps she wants her system to follow:</p>
<ol>
<li><p>Take the user's query as input.</p>
</li>
<li><p>Give the query to the <code>LLM</code> and ask it to break it down into smaller <code>subproblems</code> or <code>steps</code> that can be solved easily.</p>
</li>
<li><p>Perform the next steps sequentially, one after another.</p>
</li>
<li><p>Take the first step given by the <code>LLM</code> and perform <code>Retrieval</code> and <code>Generation</code> steps just as in previous sections. You can use any of the <code>Fan-out</code>, <code>Reciprocal-rank fusion</code>, or even a simple generation technique. Let's say the generation was <code>G1</code>.</p>
</li>
<li><p>Now take the second step given by the <code>LLM</code>, append <code>G1</code> to it, and pass it to <code>Retrieval</code> and <code>Generation</code> steps, as done in step <code>4</code>. Let's say the generation was <code>G2</code>.</p>
</li>
<li><p>Follow the same pattern for all the steps given by the <code>LLM</code>. For example, <code>G2</code> will be appended in <code>Step-3</code>, <code>G3</code> in <code>Step-4</code>, and so on.</p>
</li>
<li><p>In the end, the final generation <code>Gn</code> will be given to the <code>LLM</code> along with the <code>original user's query</code> for the final generation.</p>
</li>
<li><p>This response can then be directly provided to the user.</p>
</li>
</ol>
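<p>The steps above can be sketched as a small driver loop. Here, <code>retrieve</code> and <code>generate</code> are hypothetical stand-ins for whichever retrieval and generation functions you already use (fan-out, reciprocal-rank fusion, or a plain similarity search):</p>

```python
def chain_steps(steps, retrieve, generate, original_query):
    """Solve each sub-step in order, feeding the previous generation (G1, G2, ...) into the next."""
    generation = ""
    for step in steps:
        # Append the previous step's output to the current sub-query
        query = step if not generation else f"{step}\n\nFindings so far:\n{generation}"
        chunks = retrieve(query)              # retrieval, as in earlier sections
        generation = generate(query, chunks)  # generation Gi for this step
    # Final polish: answer the user's original query using the last generation Gn
    return generate(original_query, [generation])
```

<p>Because each call receives the previous generation as context, the final answer is built from data that has already been gathered and refined step by step.</p>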
<h2 id="heading-why-will-it-work">Why will it work?</h2>
<p>Some of you might be wondering why this unusual approach would even work. Let's use Shreya's query as an example to understand it better.</p>
<p>The query was:</p>
<blockquote>
<p><strong><em>Trace how digital logic topics expanded in the last five years.</em></strong></p>
</blockquote>
<p>Since this query wasn't simple enough to just find some chunks from a database, analyze a few paragraphs, and return an answer, it required a lot of computation before reaching a conclusion.</p>
<p>Let’s assume, during the <code>query-breaking phase</code>, the <code>LLM</code> broke the query down in these steps:</p>
<blockquote>
<ol>
<li><p>Identify syllabus changes per year.</p>
</li>
<li><p>Summarize each trend.</p>
</li>
<li><p>Stitch them into a timeline.</p>
</li>
</ol>
</blockquote>
<h3 id="heading-step-1-identify-syllabus-changes-per-year">Step-1 (Identify syllabus changes per year.)</h3>
<p>Let's start with the first step: <code>Identify syllabus changes per year</code>. When this query is processed through the <code>similarity_search</code> and generation steps, don't you think that, with the accuracy Shreya's system has achieved so far, it will be able to answer this efficiently? Yes.</p>
<h3 id="heading-step-2-summarize-each-trend">Step-2 (Summarize each trend.)</h3>
<p>After successfully completing <code>Step-1</code>, the <code>Generation</code> has gathered all the data on how the <code>syllabus has changed</code> over the years. Now, if that data is provided along with <code>this step’s query</code>, which is <code>Summarize each trend</code>, don't you think the <code>LLM</code> will effectively summarize it and provide a clear response? Absolutely.</p>
<h3 id="heading-step-3-stitch-them-into-a-timeline">Step-3 (Stitch them into a timeline.)</h3>
<p>After successfully completing <code>Step-2</code>, the <code>Generation</code> has collected all the data on the syllabus changes and how these trends have developed. With this information, the <code>LLM</code> can certainly create a timeline. Do you agree?</p>
<p>After finishing this step, we are left with fully contextual data that has passed through several rounds of filtering. Now we just need to <code>Polish</code> it according to the user’s <code>original</code> query, and that’s exactly what we do.</p>
<p>Pass the <code>Generation</code> of the final step along with the user’s <code>original</code> query to the <code>LLM</code> and return the response to the user.</p>
<h2 id="heading-how-to-do">How to do it?</h2>
<p>Implementing this is quite simple if you've been following the series up to this point. Here is the code snippet for breaking the query into smaller steps:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_steps</span>(<span class="hljs-params">client, user_query</span>):</span>
    <span class="hljs-string">"""Break out the user query into multiple smaller steps"""</span>
    GENERATE_STEPS_SYSTEM_PROMPT = <span class="hljs-string">"""
    You are a helpful assistant. You will be provided with a question and you need to break it into 3 simpler &amp; sequential steps to solve the problem. What steps do you think would be best to solve the problem?

    Rules:
    - Follow the output JSON format.
    - The `content` in output JSON must be a list of steps.

    Example:
    User Query: How to handle file-uploads on server?
    Output: { "type": "steps", "content": ["Accept file from req.files. Take help of multer to do that.", "Upload file to the S3 bucket or any other db and take out public url", "Store that public url in actual database"] }
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-1.5-flash"</span>,
        response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: GENERATE_STEPS_SYSTEM_PROMPT},
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
                <span class="hljs-string">"content"</span>: user_query
            }
        ]
    )
    content = response.choices[<span class="hljs-number">0</span>].message.content
    print(<span class="hljs-string">"Query Breaker response:"</span>, content)

    <span class="hljs-comment"># Parse the JSON response</span>
    parsed_response = json.loads(content)

    <span class="hljs-comment"># Extract the steps</span>
    steps = parsed_response[<span class="hljs-string">"content"</span>]
    print(<span class="hljs-string">"Generated steps:"</span>, steps)

    <span class="hljs-keyword">return</span> steps
</code></pre>
<h2 id="heading-get-the-full-code-herehttpsgithubcomrnkp755blogsblobmainrag-cotpy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-cot.py">Get the full code here…</a></h2>
<h2 id="heading-issue">Issue</h2>
<p>Shreya was on a roll. With her system now able to answer complex queries using fan-out retrieval and even create thoughtful summaries through chain-of-thought prompts, she felt almost unstoppable.</p>
<p>One evening, while revising Electromagnetics, she typed:</p>
<blockquote>
<p>“Explain the difference between waveguide and coaxial cable in practical applications.”</p>
</blockquote>
<p>To her surprise, the system returned <em>partial matches</em> or <em>generic definitions</em>—not the crisp, real-world comparison she expected.</p>
<p>This made her realize that <strong>the job isn't done yet!</strong> She'll be back in the next chapter with a possible solution.</p>
<blockquote>
<p>Shreya encountered limitations in her language model system when faced with complex tasks. Inspired by her mother's advice, she developed a method to break down these tasks into smaller, manageable subproblems. Her approach involves using a less abstract query transformation technique to enhance the model's comprehension and performance. By iteratively processing each subproblem, Shreya's system aims to deliver a polished final response. Although she made significant progress, an issue with generating specific comparisons highlighted the ongoing challenge of refining the system's capabilities.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Reciprocal Rank Fusion Aids Shreya]]></title><description><![CDATA[In the previous section, we saw that Shreya faced a problem. When she asked a simple question expecting a straightforward answer, she was overwhelmed with too much information and responses she didn't request. To recap, the question was:

What was th...]]></description><link>https://blog.raushan.info/rags-reciprocal-rank-fusion</link><guid isPermaLink="true">https://blog.raushan.info/rags-reciprocal-rank-fusion</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Tue, 22 Apr 2025 21:24:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/ia68CL7u88g/upload/6cf216455e5b438f84457fa68b62509d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous section, we saw that Shreya faced a problem. When she asked a simple question expecting a straightforward answer, she was overwhelmed with too much information and responses she didn't request. To recap, the question was:</p>
<blockquote>
<p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</blockquote>
<p>The response she received included:</p>
<blockquote>
<ul>
<li><p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</li>
<li><p><strong><em>What are control systems and its usage.</em></strong></p>
</li>
<li><p><strong><em>What important topics does control systems include?</em></strong></p>
</li>
<li><p><strong><em>What was the most common topics asked in last 5 years? (From other subjects as well)</em></strong></p>
</li>
</ul>
</blockquote>
<p>She then started looking for a better approach and believes she has found one.</p>
<h1 id="heading-reciprocal-rank-fusion">Reciprocal Rank Fusion</h1>
<p>In the previous section, we transformed the user's query and found documents from the vector database using similarity search. Unlike before, where we dumped all the matched documents into the LLM, we will now rank the documents based on how often they appear across the transformed queries' result lists and how early they appear within each list.</p>
<p>Here's the flowchart of the architecture. Let's go through this step by step:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745343176697/4aeb5aec-0e73-4467-98c2-c64272003616.png" alt="Reciprocal Rank Fusion Flowchart" class="image--center mx-auto" /></p>
<p>Before the retrieval step, as you can see, the flowchart is exactly the same as it was in the previous section. We take the user's query as input and transform it into several other queries.</p>
<h2 id="heading-retrieval-step">Retrieval Step</h2>
<p>The retrieval step is similar to the previous section, with a few minor changes highlighted in the flowchart. When we pass a query to our <code>similarity_search</code> method, the function might return multiple chunks because different parts of the PDF may relate to the query. The different chunks found by the function are shown in various colors in the diagram, such as <code>Red</code>, <code>Green</code>, <code>Yellow</code>, and <code>Blue</code>. We will refer to them by their color in the following paragraphs.</p>
<p>In this technique, I have also included the user's original query in the <code>similarity_search</code>, which is almost always a good idea.</p>
<p>Now, based on how often the chunks appear across the <code>similarity_search</code> results and their position within them, each chunk will be assigned a reciprocal rank fusion score (<code>rrf_score</code>). Only the <code>higher-ranked</code> chunks will be sent to the LLM, while the rest will be ignored. The threshold <code>rrf_score</code> for including documents in the LLM call is chosen by the developer according to the use case.</p>
<h3 id="heading-ranking-algorithm">Ranking Algorithm</h3>
<p>The method for calculating <code>rrf_score</code> can vary depending on the use case. Here, I will explain the most common approach that is generally used.</p>
<p>What’s our goal?</p>
<ol>
<li><p>Favor documents that appear <strong>more frequently</strong>.</p>
</li>
<li><p>Favor documents that appear <strong>earlier (higher) in the lists</strong>.</p>
</li>
</ol>
<p>Instead of just counting how many times a document appears or averaging ranks, <strong>this approach uses a formula that gives more weight to documents that appear early in any list</strong>.</p>
<p>The formula is:</p>
<p>$$\text{Score}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_{i,d}}$$</p><ul>
<li><p><code>rank i,d</code> is the rank of document <code>d</code> in list <code>i</code> (starting from 1).</p>
</li>
<li><p><code>k</code> is a constant to make sure that scores aren't dominated by rank 1s (usually <code>k=60</code> is used).</p>
</li>
<li><p>If a document doesn't appear in a list, it contributes <strong>zero</strong>.</p>
</li>
</ul>
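<p>The formula translates directly into a few lines of Python. <code>rrf_score</code> is a toy helper (a hypothetical name, not a library function) that takes a document's 1-based ranks across all the result lists it appears in:</p>

```python
def rrf_score(ranks, k=60):
    """RRF score for one document, given its 1-based ranks across the lists it appears in."""
    # Lists where the document is absent contribute nothing, so they are simply omitted
    return sum(1.0 / (k + r) for r in ranks)
```

<p>With <code>k=60</code>, a document ranked 1st in one list and 2nd in another scores 1/61 + 1/62 ≈ 0.0325.</p>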
<p>Let’s take an example to better understand this approach:</p>
<p>Assume we have 3 query results:</p>
<pre><code class="lang-plaintext">Query 1: [A, B, C, D]
Query 2: [B, C, E]
Query 3: [C, A, F]
</code></pre>
<p>Let’s use <code>k=60</code>.</p>
<blockquote>
<p>The <code>k</code> keeps the impact of one very early rank from being too extreme (so rank 1 isn’t 1.0 but more like 0.016).</p>
</blockquote>
<p>We now calculate scores:</p>
<h4 id="heading-for-document-a">For document A:</h4>
<ul>
<li><p>Appears in Query 1 at rank 1 → 1 / (60 + 1) = 1/61</p>
</li>
<li><p>Appears in Query 3 at rank 2 → 1 / (60 + 2) = 1/62</p>
</li>
<li><p>Total: ~0.0164 + ~0.0161 = <strong>0.0325</strong></p>
</li>
</ul>
<h4 id="heading-for-document-c">For document C:</h4>
<ul>
<li><p>Appears in Query 1 at rank 3 → 1 / (60 + 3) = 1/63</p>
</li>
<li><p>Appears in Query 2 at rank 2 → 1 / (60 + 2) = 1/62</p>
</li>
<li><p>Appears in Query 3 at rank 1 → 1 / (60 + 1) = 1/61</p>
</li>
<li><p>Total: ~0.0159 + 0.0161 + 0.0164 = <strong>0.0484</strong></p>
</li>
</ul>
<p>Similarly, we can do this for all documents and sort them by total score.</p>
<p>Once all scores are computed, we keep only the <strong>best</strong> ones by filtering out those whose score falls below a certain threshold.</p>
<p>Now that we have a good understanding of the ranking algorithm, let's see how we can implement this in code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">reciprocal_rank_fusion</span>(<span class="hljs-params">rankings, k=<span class="hljs-number">60</span>, threshold=<span class="hljs-number">0.0</span></span>):</span>
    rrf_scores = defaultdict(float)

    <span class="hljs-comment"># Iterate over each ranking list</span>
    <span class="hljs-keyword">for</span> ranking <span class="hljs-keyword">in</span> rankings:
        <span class="hljs-keyword">for</span> rank, doc_id <span class="hljs-keyword">in</span> enumerate(ranking):
            <span class="hljs-comment"># Calculate RRF score: 1 / (k + rank + 1)</span>
            rrf_scores[doc_id] += <span class="hljs-number">1</span> / (k + rank + <span class="hljs-number">1</span>)

    <span class="hljs-comment"># Filter by threshold and sort by score in descending order</span>
    filtered = [(doc_id, score) <span class="hljs-keyword">for</span> doc_id, score <span class="hljs-keyword">in</span> rrf_scores.items() <span class="hljs-keyword">if</span> score &gt;= threshold]
    <span class="hljs-keyword">return</span> sorted(filtered, key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Example input: 3 ranked lists of documents from different queries</span>
rankings = [
    [<span class="hljs-string">"A"</span>, <span class="hljs-string">"B"</span>, <span class="hljs-string">"C"</span>, <span class="hljs-string">"D"</span>],  <span class="hljs-comment"># Query 1</span>
    [<span class="hljs-string">"B"</span>, <span class="hljs-string">"C"</span>, <span class="hljs-string">"E"</span>],        <span class="hljs-comment"># Query 2</span>
    [<span class="hljs-string">"C"</span>, <span class="hljs-string">"A"</span>, <span class="hljs-string">"F"</span>]         <span class="hljs-comment"># Query 3</span>
]

<span class="hljs-comment"># Apply RRF to combine rankings and filter based on threshold</span>
result = reciprocal_rank_fusion(rankings, k=<span class="hljs-number">60</span>, threshold=<span class="hljs-number">0.02</span>) <span class="hljs-comment"># Adjust threshold as needed</span>

<span class="hljs-comment"># Print the results</span>
print(<span class="hljs-string">"Ranked Documents based on Reciprocal Rank Fusion:"</span>)
<span class="hljs-keyword">for</span> doc, score <span class="hljs-keyword">in</span> result:
    print(<span class="hljs-string">f"Document: <span class="hljs-subst">{doc}</span>, RRF Score: <span class="hljs-subst">{score:<span class="hljs-number">.4</span>f}</span>"</span>)
</code></pre>
<p>The above code can be used to filter out chunks based on their <code>rrf_score</code>.</p>
<h2 id="heading-generation-step">Generation Step</h2>
<p>After filtering down to the <code>most relevant</code> chunks, the process is the same as it has been since chapter 1: pass those chunks to the LLM along with the user’s query, let the LLM handle the rest, and return its response to the user.</p>
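<p>As a rough sketch of that step, under stated assumptions: <code>ranked_docs</code> is the output of <code>reciprocal_rank_fusion</code> above, <code>client</code> is the Gemini-backed OpenAI client from the first chapter, and <code>fetch_chunk_text</code> is a hypothetical helper that maps a <code>doc_id</code> back to its chunk text (none of these names are fixed by the original code):</p>

```python
# Sketch of the generation step. `fetch_chunk_text` is a hypothetical helper
# that maps a doc_id back to its chunk text; `client` is assumed to be the
# Gemini-backed OpenAI client created in the first chapter.
def generate_answer(user_query, ranked_docs, fetch_chunk_text, client, top_n=5):
    # Keep only the top-ranked chunks after RRF and rebuild their text
    context = "\n\n".join(fetch_chunk_text(doc_id) for doc_id, _ in ranked_docs[:top_n])

    SYSTEM_PROMPT = f"""
    You are a helpful assistant. Answer the question using only the context below.
    Context: {context}
    """

    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content
```

<p>The <code>top_n</code> cutoff here plays the same role as the <code>threshold</code> parameter: both keep only the strongest chunks out of the fused ranking.</p>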
<h2 id="heading-click-here-to-get-the-full-codehttpsgithubcomrnkp755blogsblobmainrag-reciprocal-rank-fusionpy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-reciprocal-rank-fusion.py">Click here to get the full code…</a></h2>
<h2 id="heading-issue">Issue</h2>
<p>After doing this, Shreya was over the moon because she thought her system was now super powerful and could answer any type of query. She played with it for a few minutes, but her smile vanished when she asked:</p>
<blockquote>
<p>Trace how digital logic topics expanded in the last five years.</p>
</blockquote>
<p>The system didn't respond as she expected. However, since she is brave and a great problem-solver, she wasn't disheartened and went back to the drawing board to improve the system further.</p>
<p>Let's see what she did in the next chapter.</p>
<blockquote>
<p>In this article, we explore a solution to improve the retrieval and ranking of documents in response to user queries using the Reciprocal Rank Fusion (RRF) algorithm. Shreya's original problem was receiving overwhelming responses to her query about common control systems topics over the last five years. To address this, the RRF algorithm was introduced to rank documents based on their frequency and order of appearance across transformed queries. This process filters out less relevant documents before sending them to a language model for a concise response. Despite initially feeling triumphant with the implementation, Shreya encountered limitations when the system failed to adequately trace the expansion of digital logic topics over five years, prompting further refinement to enhance the system's capabilities.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Shreya Finds Excitement Again with Fan-Out Magic]]></title><description><![CDATA[In the last chapter, we saw how Shreya was discouraged by her system's response to one of her questions. She then returned to the drawing board to find ways to improve her system. Let's see what she discovered and whether it solved her issue.
Let me ...]]></description><link>https://blog.raushan.info/rags-fan-out-technique</link><guid isPermaLink="true">https://blog.raushan.info/rags-fan-out-technique</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Tue, 22 Apr 2025 14:42:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/fyWHYqu7D6A/upload/9f2d64d29741cb3c44e7b3a873cee072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the last chapter, we saw how Shreya was discouraged by her system's response to one of her questions. She then returned to the drawing board to find ways to improve her system. Let's see what she discovered and whether it solved her issue.</p>
<p>Let me remind you, this was the question she asked:</p>
<blockquote>
<p>Give me definitions, examples, plus tricky MCQs on LTI systems?</p>
</blockquote>
<p>And the system responded with only definitions.</p>
<h1 id="heading-parallel-query-retrieval-fan-out-technique">Parallel Query Retrieval / Fan-out Technique</h1>
<p>In Parallel Query Retrieval, we create several different versions of the user's query, each focusing on a different aspect. Not clear yet? Don't worry, let's look at an example.</p>
<p>User’s Query:</p>
<blockquote>
<p>How does garbage collection work in Python?</p>
</blockquote>
<p>We'll input this query into an LLM and ask it to create multiple queries, each focusing on different aspects of the original question, such as:</p>
<blockquote>
<ul>
<li><p>What triggers garbage collection in Python?</p>
</li>
<li><p>What are Garbage collection algorithms in Python?</p>
</li>
<li><p>How memory leaks relate to GC?</p>
</li>
</ul>
</blockquote>
<p>This technique is called <code>Parallel Query Retrieval</code> or <code>Fan-out</code> technique.</p>
<h2 id="heading-how-will-this-help">How will this help?</h2>
<p>As we saw in Shreya’s case, when we directly passed her detailed query about <code>LTI Systems</code> to the LLM, it didn't respond well. Now, let's apply the transformation mentioned above to this question.</p>
<p>The actual query was:</p>
<blockquote>
<p>Give me definitions, examples, plus tricky MCQs on LTI systems?</p>
</blockquote>
<p>The transformed queries would be something like:</p>
<blockquote>
<ul>
<li><p>Give me definitions of LTI systems.</p>
</li>
<li><p>Give me examples of LTI systems.</p>
</li>
<li><p>Give me tricky MCQs on LTI systems.</p>
</li>
</ul>
</blockquote>
<p>Now, if I pass these three queries individually through the <code>Retrieval</code> and <code>Generation</code> steps learned in the previous chapter, don't you think Shreya will get a better response?</p>
<ul>
<li><p>Query 1 will focus only on the definitions.</p>
</li>
<li><p>Query 2 will focus only on the examples.</p>
</li>
<li><p>Query 3 will focus only on the MCQs.</p>
</li>
</ul>
<p>After compiling the LLM responses for all three queries, Shreya will achieve her objective.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">In the example, I've transformed the user's query into three distinct queries, but this number can be adjusted based on the specific requirements of the use case.</div>
</div>

<p>Understand the whole flow here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745327078381/22fa8b94-4d3e-48b9-9ecb-faa0855544e6.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-how-to-do">How to do?</h2>
<p>Implementing this is quite simple by following these steps:</p>
<ol>
<li><p>Take the user's file input.</p>
</li>
<li><p>Perform the indexing process as described in the previous chapter.</p>
</li>
<li><p>Take the user's query.</p>
</li>
<li><p>Make an LLM call with an effective <code>SYSTEM_PROMPT</code> and ask it to regenerate the query into <code>3</code> or <code>n</code> queries, each focusing on different aspects of the original query.</p>
</li>
<li><p>Follow the Retrieval &amp; Generation steps from the last chapter again.</p>
</li>
</ol>
<p>Here’s a code snippet of Query Transformation:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json    <span class="hljs-comment"># Needed to parse the LLM's JSON output; `client` and `retrieval_generation` come from the previous chapter's code</span>

finalResponse = <span class="hljs-string">""</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fan_out</span>():</span>
    user_query = input(<span class="hljs-string">"&gt;&gt; "</span>)

    <span class="hljs-keyword">global</span> finalResponse
    finalResponse = <span class="hljs-string">""</span>  <span class="hljs-comment"># Reset the final response for each new query</span>

    FAN_OUT_SYSTEM_PROMPT = <span class="hljs-string">"""
    You are a helpful assistant. You will be provided with a question, and you need to generate 3 questions out of it, each focusing on a different aspect of it or related to it. Focus on what the user might be interested in but couldn't ask directly.

    Rules:
    - Follow the output JSON format.

    Example:
    User Query: How does garbage collection work in Python?
    Output: {{ "q1": "What triggers garbage collection in python?", "q2": "Garbage collection algorithms in Python?", "q3": "How memory leaks relate to GC?" }}
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-2.0-flash"</span>,
        response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: FAN_OUT_SYSTEM_PROMPT},
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
                <span class="hljs-string">"content"</span>: user_query
            }
        ]
    )
    content = response.choices[<span class="hljs-number">0</span>].message.content
    print(<span class="hljs-string">"Fan out response:"</span>, content)
    <span class="hljs-comment"># Parse the JSON response</span>
    parsed_response = json.loads(content)
    <span class="hljs-comment"># Extract the questions</span>
    questions = [parsed_response[<span class="hljs-string">"q1"</span>], parsed_response[<span class="hljs-string">"q2"</span>], parsed_response[<span class="hljs-string">"q3"</span>]]
    print(<span class="hljs-string">"Questions:"</span>, questions)
    <span class="hljs-comment"># Call the retrieval_generation function for each question</span>
    <span class="hljs-keyword">for</span> question <span class="hljs-keyword">in</span> questions:
        retrieval_generation(question)

    print(<span class="hljs-string">"Final response:"</span>, finalResponse)
</code></pre>
<h2 id="heading-get-the-full-code-herehttpsgithubcomrnkp755blogsblobmainrag-fan-outpy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-fan-out.py">Get the full Code here…</a></h2>
<p>This setup was working well for Shreya, and she was jumping on the sofa in excitement.</p>
<h2 id="heading-issue">Issue</h2>
<p>Did she face any issues again? Yes, her joy was short-lived, and soon she encountered another problem. She realized she was receiving a lot more content that she wasn't interested in and hadn't asked for. She was frustrated to see such long responses, even when she asked a simple question.</p>
<p>For example, now if she asks:</p>
<blockquote>
<p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</blockquote>
<p>In response, she’s getting:</p>
<blockquote>
<ul>
<li><p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</li>
<li><p>What are control systems and its usage.</p>
</li>
<li><p>What important topics does control systems include?</p>
</li>
<li><p>What was the most common topics asked in last 5 years? (From other subjects as well)</p>
</li>
</ul>
</blockquote>
<p>And she thought, this isn't a foolproof solution, and she needs to make more improvements. Let's see in the next chapter what idea she comes up with.</p>
<blockquote>
<p>Shreya, facing unsatisfactory responses from her system, explores the Parallel Query Retrieval or Fan-out Technique to enhance the quality of information retrieval. This approach involves breaking down queries into multiple focused sub-queries, which individually target different aspects of the original question. For instance, a comprehensive question on LTI systems is divided into queries asking for definitions, examples, and tricky MCQs. This method initially proves effective, but eventually leads to excessive and irrelevant information. The narrative outlines Shreya's ongoing challenge to refine her system's response quality.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Understanding RAG: A Comprehensive Intro and Shreya's Story]]></title><description><![CDATA[Welcome to the first blog of the series RAG—A powerful technique that enhances the accuracy and relevance of responses generated by large language models (LLMs) by incorporating information from external data sources relevant to user’s query.

💡
Pre...]]></description><link>https://blog.raushan.info/rags-basic-overview</link><guid isPermaLink="true">https://blog.raushan.info/rags-basic-overview</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Mon, 21 Apr 2025 20:33:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/P5mCQ4KACbM/upload/a86ad6c2fe2cc303e8f209df563b263f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to the first blog of the series <strong>RAG</strong>—A powerful technique that enhances the accuracy and relevance of responses generated by large language models (LLMs) by incorporating information from external data sources relevant to user’s query.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Prerequisite: To understand this series better, it's important to have a basic understanding of how large language models (LLMs) function. If you're not, <a target="_self" href="https://blog.raushan.info/inside-genai">click here</a> to get it.</div>
</div>

<h1 id="heading-meet-shreya-a-gate-aspirant">Meet Shreya: A Gate Aspirant</h1>
<p>To make this series more relatable and easy to understand, meet Shreya, a GATE aspirant. She has a PDF of previous years' GATE questions from the last 15 years, along with their answers. Now, she wants to know:</p>
<blockquote>
<p>What was the most common control systems topic asked in last 5 years?</p>
</blockquote>
<p>Manually going through such a long PDF isn't practical for her, so what's the solution? Should she copy and paste the entire document into ChatGPT? Of course not. The model has a limited context window and can't handle unlimited text at once. Even if she could somehow input all 500 pages, it would be very inefficient, and the model would return bloated or irrelevant answers. She only needs a focused answer, likely based on just 50 to 60 pages. Loading unnecessary data not only wastes computing resources but also reduces the quality of the output. This is where a RAG pipeline becomes essential.</p>
<hr />
<h1 id="heading-how-retrieval-augmented-generation-rag-solves-this">How Retrieval-Augmented Generation (RAG) solves this</h1>
<p>With a RAG-based setup, when Shreya wants to ask something from the PDF, she first needs to upload it to the system. The system then:</p>
<ol>
<li><p><strong>Chunks</strong> the whole PDF into small parts.</p>
</li>
<li><p><strong>Embeds</strong> those chunks into a high-dimensional vector space.</p>
</li>
<li><p><strong>Stores</strong> them in a vector database like Pinecone, Chroma, or Qdrant.</p>
</li>
</ol>
<p>Now the system is ready to answer Shreya’s questions. Let’s say she asks the same question again. The system will:</p>
<ol>
<li><p><strong>Take</strong> her query as input.</p>
</li>
<li><p><strong>Perform a similarity search</strong> on the vector database to fetch only the relevant chunks.</p>
</li>
<li><p><strong>Feed</strong> those filtered chunks to the LLM.</p>
</li>
<li><p><strong>Generate</strong> a sharp, focused response.</p>
</li>
</ol>
<p>It’s the same as giving a cheat sheet to the model that’s been auto-curated for the specific question.</p>
<hr />
<p>Now let’s understand all these steps one by one and try to code it:</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">To run the code snippets in the blog, you'll need a <code>GEMINI_API_KEY</code>. Don't worry, it's free, so go ahead and get one. Also, you’ll need <code>Docker</code> installed on the system. So install that as well, if you haven't already. You can follow the Installation guide from <a target="_self" href="https://docs.docker.com/engine/install/">here</a>.</div>
</div>

<h2 id="heading-chunking">Chunking</h2>
<h3 id="heading-what-is-chunking">What is Chunking?</h3>
<p><strong>Chunking</strong>—Breaking something into small, manageable pieces, like eating a burger one bite at a time or splitting a long PDF into smaller sections in this case.</p>
<p>The logic for splitting can vary based on needs and circumstances. It can be done page by page, paragraph by paragraph, or even two paragraphs per chunk, or 1000 characters per chunk and so on. It completely depends on the developer.</p>
<h3 id="heading-why-is-it-needed">Why is it needed?</h3>
<p>Since Shreya’s PDF was very large and she wanted a precise answer, likely based on just a few pages, dumping the whole PDF into the LLM wasn’t a great idea. As discussed above, it would lead to bloated or irrelevant answers and waste computing resources.</p>
<h3 id="heading-what-issue-may-come">What issue may come?</h3>
<p><strong>Context Loss</strong> is the main issue when chunking. In multi-page PDFs, there might be sentences that start on, let's say, page 1 and continue on page 2. In these cases, chunking by page would lose the sentence's context. Let's understand this better with the help of an image:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745247471680/1f0ac102-5660-450c-892a-f6039a6cde34.png" alt class="image--center mx-auto" /></p>
<p>As you can see in the above image, chunk 1 doesn’t know which other positions <code>Hitesh Choudhary</code> holds, and chunk 2 doesn’t know who this content creator and CTO is.</p>
<p>To solve this, we’ll overlap some characters while chunking, i.e., include some characters from Chunk 1 in Chunk 2. As we can see in the next image, both chunks now have enough context about their content.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745247588411/3616a1c3-851c-4ba9-a3a5-d2c917483aad.png" alt class="image--center mx-auto" /></p>
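<p>The idea can be shown in a few lines of plain Python before reaching for any library. This is a hypothetical character-window splitter, not the one LangChain uses internally, and the <code>chunk_size</code> and <code>overlap</code> values are purely illustrative:</p>

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of `chunk_size` characters, where each chunk
    repeats the last `overlap` characters of the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Small values so the overlap is visible in the printed output
chunks = chunk_text("Hitesh Choudhary is a content creator and CTO.",
                    chunk_size=30, overlap=10)
for c in chunks:
    print(repr(c))
```

<p>Each printed chunk starts with the last 10 characters of the previous one, so no sentence boundary is lost entirely.</p>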
<h3 id="heading-how-to-do">How to do?</h3>
<p>To do this, we’ll use some built-in helpers from <code>langchain</code> (in case you don’t know, LangChain provides utilities that let developers perform common tasks in the world of LLMs).</p>
<pre><code class="lang-python"><span class="hljs-string">"""
DO INSTALL NECESSARY PACKAGES IN VIRTUAL ENVIRONMENT
- python -m venv venv
- source venv/bin/activate    (on Windows: venv\Scripts\activate)
- pip install langchain_community pypdf langchain_text_splitters 
"""</span>

<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader
<span class="hljs-keyword">from</span> langchain_text_splitters <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter

pdf_path = Path(<span class="hljs-string">"./gate-pyqs.pdf"</span>)    <span class="hljs-comment"># Put the path of the PDF</span>

loader = PyPDFLoader(file_path = pdf_path)    <span class="hljs-comment"># PyPDFLoader is a langchain utility class which helps to load the PDF</span>
docs = loader.load()    <span class="hljs-comment"># Loads the PDF</span>

text_splitter = RecursiveCharacterTextSplitter(    <span class="hljs-comment"># RecursiveCharacterTextSplitter is a utility function which helps to split the PDF in chunks based on characters</span>
    chunk_size=<span class="hljs-number">1000</span>,    <span class="hljs-comment"># Split 1000 characters per chunk</span>
    chunk_overlap=<span class="hljs-number">200</span>    <span class="hljs-comment"># Overlap 200 characters per chunk to avoid context loss</span>
)

split_docs = text_splitter.split_documents(docs)    <span class="hljs-comment"># Split the PDF / Chunking</span>

print(<span class="hljs-string">"Number of documents before splitting:"</span>, len(docs))
print(docs[<span class="hljs-number">0</span>])  <span class="hljs-comment"># docs is a list of Document objects</span>
print(<span class="hljs-string">"Number of documents after splitting:"</span>, len(split_docs))
print(split_docs[<span class="hljs-number">0</span>])    <span class="hljs-comment"># split_docs is a list of Document objects</span>
</code></pre>
<h2 id="heading-vector-embedding">Vector Embedding</h2>
<h3 id="heading-what-is-vector-embedding">What is Vector Embedding?</h3>
<p>It maps the semantic meaning of words in a sentence to multi-dimensional coordinates (often visualized in 2D or 3D). For example, in the sentences, <strong><em>Monkey eats banana</em></strong> and <strong><em>Man eats rice</em></strong>, ‘monkey' and 'man' are both animals, while 'banana' and 'rice' are food items. As a result, 'monkey' and 'man' would be positioned close to each other in one region of the space, and 'banana' and 'rice' in another.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744063638604/5e55df93-2553-4000-8f55-9511d0a5b9a6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-why-is-it-needed-1">Why is it needed?</h3>
<p>To understand a sentence well, a model needs to relate its words to one another, and vector embeddings capture these relationships effectively.</p>
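<p>A toy illustration of this idea: the three-component vectors below are hand-crafted for demonstration (a real model such as <code>text-embedding-004</code> produces hundreds of dimensions), but they show how cosine similarity scores semantically close words higher.</p>

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-crafted toy vectors, NOT real embeddings
vectors = {
    "monkey": [0.9, 0.1, 0.0],
    "man":    [0.8, 0.2, 0.1],
    "banana": [0.1, 0.9, 0.2],
}

print(cosine_similarity(vectors["monkey"], vectors["man"]))     # high: both animals
print(cosine_similarity(vectors["monkey"], vectors["banana"]))  # low: unrelated
```

<p>Similarity search in a vector database is, at its core, this same comparison performed efficiently over millions of stored vectors.</p>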
<h3 id="heading-how-to-do-1">How to do?</h3>
<p>We’ll again use a utility function from LangChain to create vector embeddings.</p>
<pre><code class="lang-python"><span class="hljs-string">"""
- pip install langchain-google-genai
"""</span>

<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> langchain_google_genai <span class="hljs-keyword">import</span> GoogleGenerativeAIEmbeddings

GOOGLE_API_KEY = os.getenv(<span class="hljs-string">"GOOGLE_API_KEY"</span>)    <span class="hljs-comment"># Get a Gemini API Key</span>

embedder = GoogleGenerativeAIEmbeddings(    <span class="hljs-comment"># GoogleGenerativeAIEmbeddings is an utility function to create vector embeddings</span>
    model=<span class="hljs-string">"models/text-embedding-004"</span>,    <span class="hljs-comment"># Google's embedding model</span>
    google_api_key=GOOGLE_API_KEY,
    )
</code></pre>
<h2 id="heading-storing-in-vector-database">Storing in Vector Database</h2>
<h3 id="heading-what-is-a-vector-database">What is a vector database?</h3>
<p>A vector database is a specialized type of database designed to store, index, and query vector embeddings.</p>
<h3 id="heading-how-to-installuse-a-vector-database">How to install/use a vector database?</h3>
<p>There are various vector databases available in the market, for example, ChromaDB, Pinecone, Qdrant, etc. Here we’ll go with Qdrant, because it’s lightweight &amp; open-source.</p>
<p>Make sure that you have Docker installed and it’s up and running. Run the following commands in the terminal.</p>
<pre><code class="lang-yaml"><span class="hljs-string">docker</span> <span class="hljs-string">pull</span> <span class="hljs-string">qdrant/qdrant</span>    <span class="hljs-comment"># Pull QdrantDB</span>
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-string">docker</span> <span class="hljs-string">run</span> <span class="hljs-string">-p</span> <span class="hljs-number">6333</span><span class="hljs-string">:6333</span> <span class="hljs-string">-d</span> <span class="hljs-string">qdrant/qdrant</span>    <span class="hljs-comment"># Run the iamge in detach mode &amp; Port Mapping</span>
</code></pre>
<p>Now go to <code>http://localhost:6333/dashboard</code>. You’ll find a pre-built Qdrant dashboard running there.</p>
<h3 id="heading-how-to-make-vector-embeddings">How to make vector embeddings?</h3>
<pre><code class="lang-python"><span class="hljs-string">"""
- pip install langchain_qdrant
"""</span>

<span class="hljs-keyword">from</span> langchain_qdrant <span class="hljs-keyword">import</span> QdrantVectorStore    <span class="hljs-comment"># langchain_qdrant is a utility package for interacting with QdrantDB</span>

vector_store = QdrantVectorStore.from_documents(    <span class="hljs-comment"># Creates a collection and store embeddings into the database</span>
    documents = split_docs,
    url = <span class="hljs-string">"http://localhost:6333"</span>,
    collection_name = <span class="hljs-string">"learning_langchain"</span>,
    embedding = embedder
)

retriever = QdrantVectorStore.from_existing_collection(    <span class="hljs-comment"># Creates a retriever to do query operations on db</span>
    url = <span class="hljs-string">"http://localhost:6333"</span>,
    collection_name = <span class="hljs-string">"learning_langchain"</span>,
    embedding = embedder
)
</code></pre>
<p>Now the system is ready to answer Shreya's boring questions. I hope you haven’t forgotten that GATE aspirant.</p>
<hr />
<h2 id="heading-take-shreyas-query-as-input-amp-perform-a-similarity-search-retrieval">Take Shreya’s query as input &amp; perform a Similarity Search (Retrieval)</h2>
<p>After storing the embeddings of her data source in a vector database, it's time to take her questions and find relevant content from the database. This content can then be provided to the LLM so the model can deliver precise and accurate answers.</p>
<pre><code class="lang-python">user_query = input(<span class="hljs-string">"&gt;&gt; "</span>)

relevant_chunks  = retriever.similarity_search(    <span class="hljs-comment"># similarity_search is a function to find similar embeddings</span>
    query = user_query
)

print(<span class="hljs-string">"Search result:"</span>, relevant_chunks)
</code></pre>
<h2 id="heading-generate-a-response-from-llm-generation">Generate a response from LLM (Generation)</h2>
<p>Now that we have the relevant data source and the user query, it's time to create a suitable <code>SYSTEM_PROMPT</code> and provide everything to the LLM. The LLM will handle the rest, and we'll receive the response we want.</p>
<pre><code class="lang-python"><span class="hljs-string">"""
- pip install openai
"""</span>

<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

<span class="hljs-comment"># Create client for chatting</span>
client = OpenAI(
    api_key=GOOGLE_API_KEY,    <span class="hljs-comment"># Provide Gemini API key here</span>
    base_url=<span class="hljs-string">"https://generativelanguage.googleapis.com/v1beta/openai/"</span>
)

SYSTEM_PROMPT = <span class="hljs-string">"""
You are a helpful assistant. You will be provided with a question and relevant context from a document. Your task is to provide a concise answer based on the context.
Context: {relevant_chunks}
"""</span>

response = client.chat.completions.create(
    model=<span class="hljs-string">"gemini-2.0-flash"</span>,
    messages=[
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM_PROMPT.format(relevant_chunks=relevant_chunks)},
        {
            <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
            <span class="hljs-string">"content"</span>: user_query
        }
    ]
)

print(response.choices[<span class="hljs-number">0</span>].message)
</code></pre>
<p>Now, if Shreya asks the same question, that is,</p>
<blockquote>
<p>What was the most common control systems topic asked in last 5 years?</p>
</blockquote>
<p>The <code>similarity_search</code> function will go to the database and find the relevant chunks (control systems topics from the last 5 years) from the data source, and the LLM will now be able to answer it easily.</p>
<p>To summarize, here’s a flow of the whole process:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745267484871/bb9ee4ad-8bb3-425c-af28-803e3473befe.png" alt class="image--center mx-auto" /></p>
<p><img src="https://media.datacamp.com/legacy/v1704459771/image_552d84ab56.png" alt="What is Retrieval Augmented Generation (RAG)? | DataCamp" /></p>
<h2 id="heading-click-here-to-get-the-full-codehttpsgithubcomrnkp755blogsblobmainrag-onepy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-one.py">Click here to get the full code</a></h2>
<h2 id="heading-issues">Issues</h2>
<p>Shreya was excited about the results she got while experimenting with her model. However, her excitement didn't last long because, during her testing, she asked another question:</p>
<blockquote>
<p>Give me definitions, examples, plus tricky MCQs on LTI systems?</p>
</blockquote>
<p>And in response, the model provided only definitions, which made Shreya return to the drawing board to adjust the architecture. In the next chapter, we'll see what Shreya did to improve her system.</p>
<blockquote>
<p>This article introduces Retrieval-Augmented Generation (RAG), a technique that enhances the accuracy of large language models by incorporating external data. Using a real-life scenario of a GATE aspirant named Shreya, the article explains how RAG efficiently processes large documents by chunking and embedding them in a vector database. This approach retrieves relevant information to provide sharp, focused responses. The article also details the coding process for implementing RAG and highlights potential challenges in fine-tuning the system for comprehensive results.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[DIY Mini Cursor: Simple Creation Guide]]></title><description><![CDATA[Cursor is an AI-powered code editor you might already be familiar with. But have you ever paused to wonder how it actually works under the hood? What kind of "magic" powers an intelligent code companion? In this post, we’ll uncover the concept behind...]]></description><link>https://blog.raushan.info/build-your-own-cursor</link><guid isPermaLink="true">https://blog.raushan.info/build-your-own-cursor</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[llm]]></category><category><![CDATA[mcp]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Sat, 12 Apr 2025 13:05:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/oXlXu2qukGE/upload/783d7dc6667e52d4d9d127659fc2f24d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Cursor</strong> is an AI-powered code editor you might already be familiar with. But have you ever paused to wonder <em>how</em> it actually works under the hood? What kind of "magic" powers an intelligent code companion? In this post, we’ll uncover the concept behind such tools and build a <strong>mini version of Cursor</strong> ourselves to truly understand the mechanics.</p>
<h2 id="heading-what-is-agentic-ai">What is Agentic AI?</h2>
<p>Before we dive into coding, it's important to understand the concept of <strong>AI Agents</strong>—a fundamental part of what makes tools like Cursor work.</p>
<p>At a high level, <strong>AI agents</strong> are intelligent systems enhanced with tools. These tools are built by us (developers) to extend the AI's native capabilities. While the base AI model provides reasoning, context understanding, and natural language generation, these agents can <strong>decide when and how to use specific tools</strong> to accomplish a goal.</p>
<hr />
<h2 id="heading-real-world-example-a-weather-agent">Real-World Example: A Weather Agent</h2>
<p>Let’s understand this with a practical use case.</p>
<p>Imagine you're building an AI agent that provides real-time weather updates.</p>
<p>By default, large language models (LLMs) like GPT or Gemini don’t have access to the internet or real-time data. But you can overcome this limitation by giving the model access to a <strong>tool</strong>—an external API endpoint that fetches live weather data.</p>
<h3 id="heading-how-it-works">How It Works</h3>
<p>You can define a simple instruction like this for your AI agent:</p>
<blockquote>
<p>"If someone asks for weather information, call the <code>/get-weather?city=&lt;CITY_NAME&gt;</code> API and return the result."</p>
</blockquote>
<p>Now, if someone says:</p>
<blockquote>
<p>"What's the weather in Mohali right now?"</p>
</blockquote>
<p>The AI will:</p>
<ol>
<li><p>Detect the user intent (<code>weather inquiry</code>).</p>
</li>
<li><p>Trigger the API with the appropriate city (<code>Mohali</code>).</p>
</li>
<li><p>Parse the API response.</p>
</li>
<li><p>Respond with something like:<br /> <code>"The current temperature in Mohali is 23°C with clear skies."</code></p>
</li>
</ol>
<p>This is the core of <strong>agentic AI</strong>—giving your LLM the autonomy to use tools <em>intelligently</em>.</p>
<hr />
<h3 id="heading-next-step-make-it-real">Next Step: Make It Real</h3>
<p>To bring this to life, you'll need an <strong>OpenAI API key</strong> (or you can use Gemini with a few small changes). We'll write a simple script where the AI uses an external weather API to answer real-time queries—just like an actual agent.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

<span class="hljs-comment"># Load environment variables for the OpenAI API key</span>
load_dotenv()

client = OpenAI()    <span class="hljs-comment"># Create an OpenAI client</span>

<span class="hljs-comment"># Tool function: fetch real-time weather data from an external API</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_weather</span>(<span class="hljs-params">city: str</span>):</span>    
    print(<span class="hljs-string">"⛏️Tool Called: get_weather for : "</span>, city)
    url = <span class="hljs-string">f"https://wttr.in/<span class="hljs-subst">{city}</span>?format=%C+%t"</span>
    response = requests.get(url)
    <span class="hljs-keyword">if</span> response.status_code == <span class="hljs-number">200</span>:
        <span class="hljs-keyword">return</span> response.text

    <span class="hljs-keyword">return</span> <span class="hljs-string">"Something went wrong. Couldn't fetch weather"</span>

<span class="hljs-comment"># Make a dictionary of available tools</span>
available_tools = {
    <span class="hljs-string">"get_weather"</span>: {
        <span class="hljs-string">"fn"</span>: get_weather,
        <span class="hljs-string">"description"</span>: <span class="hljs-string">"Takes a city name as an input and returns the current weather for the city"</span>
    }
}

<span class="hljs-comment"># Give a detailed system prompt to customize the behaviour of AI</span>
system_prompt = <span class="hljs-string">f"""
    You are a helpful AI assistant specialized in resolving user queries.
    You work in start, plan, action, observe mode.
    For the given user query and the available tools, plan the step-by-step execution. Based on the plan,
    select the relevant tool from the available tools, and perform an action to call that tool.
    Wait for the observation, and resolve the user query based on the observation from the tool call.

    Rules:
    - Follow the Output JSON Format.
    - Always perform one step at a time and wait for next input
    - Carefully analyse the user query

    Output JSON Format:
    {{
        "step": "string",
        "content": "string",
        "function": "The name of function if the step is action",
        "input": "The input parameter for the function",
    }}

    Available Tools:
    - get_weather: Takes a city name as an input and returns the current weather for the city

    Example:
    User Query: What is the weather of new york?
    Output: {{ "step": "plan", "content": "The user is interested in the weather data of New York" }}
    Output: {{ "step": "plan", "content": "From the available tools I should call get_weather" }}
    Output: {{ "step": "action", "function": "get_weather", "input": "new york" }}
    Output: {{ "step": "observe", "output": "12 Degree Cel" }}
    Output: {{ "step": "output", "content": "The weather for new york seems to be 12 degrees." }}
"""</span>

<span class="hljs-comment"># A messages list to store the conversation</span>
messages = [
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span> : system_prompt}
]

<span class="hljs-comment">#This is where all magic happens</span>
<span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
    user_query = input(<span class="hljs-string">'&gt; '</span>)
    messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_query})

    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>: 
        response = client.chat.completions.create(
            model=<span class="hljs-string">"gpt-4o"</span>,
            response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
            messages = messages,
        )
        parsed_response = json.loads(response.choices[<span class="hljs-number">0</span>].message.content)

        messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps(parsed_response)})

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"plan"</span>:
            print(<span class="hljs-string">f"🧠 Thinking: "</span>, parsed_response.get(<span class="hljs-string">"content"</span>))
            <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"action"</span>: 
            tool_name = parsed_response.get(<span class="hljs-string">"function"</span>)
            <span class="hljs-keyword">if</span> tool_name <span class="hljs-keyword">in</span> available_tools:
                fn_output = available_tools[tool_name][<span class="hljs-string">"fn"</span>](parsed_response.get(<span class="hljs-string">"input"</span>))
                messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps({ <span class="hljs-string">"step"</span>: <span class="hljs-string">"observe"</span>, <span class="hljs-string">"output"</span>:  fn_output})})
                <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"output"</span>:
            print(<span class="hljs-string">f"🤖: <span class="hljs-subst">{parsed_response.get(<span class="hljs-string">'content'</span>)}</span>"</span>)
            <span class="hljs-keyword">break</span>
</code></pre>
<hr />
<h2 id="heading-mini-cursor">Mini Cursor</h2>
<p>Now, let’s take the weather agent concept one step further—and this is where it starts getting exciting.</p>
<p>Imagine replacing the weather API with an API that interacts with <strong>your own terminal</strong>. That’s right—your AI agent can now send commands directly to your system through a controlled backend API. This is the core idea behind building a <strong>mini version of Cursor</strong>.</p>
<hr />
<h3 id="heading-how-it-works-1">How It Works</h3>
<p>Here’s what happens behind the scenes:</p>
<ol>
<li><p><strong>The AI decides</strong> what needs to be done (e.g., create a folder, write a file, run a script).</p>
</li>
<li><p>It <strong>sends a request</strong> to your API.</p>
</li>
<li><p>Your API <strong>executes the command</strong> on your local machine (via shell or OS-level commands).</p>
</li>
<li><p>The <strong>result</strong> is sent back to the model, which presents the output or continues the workflow.</p>
</li>
</ol>
<hr />
<h3 id="heading-example-actions">Example Actions</h3>
<p>Let’s say the AI wants to:</p>
<ul>
<li><p>Create a folder:<br />  It calls the API → API runs <code>mkdir my-folder</code> → Folder is created.</p>
</li>
<li><p>Write a file:<br />  It sends file content + path → API gets OS-level permission → File is written.</p>
</li>
<li><p>Start a server:<br />  It calls the API → API runs <code>npm start</code> → Server starts running.</p>
</li>
</ul>
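Executing whatever command the model chooses is powerful but risky. As one illustrative safeguard (a sketch using a hypothetical allowlist, not part of this post's script), the tool could refuse anything outside a fixed set of executables:

```python
import shlex
import subprocess

# Hypothetical allowlist -- the command names here are illustrative only.
ALLOWED_COMMANDS = {"mkdir", "ls", "cat", "echo", "npm", "node"}

def safe_run(command: str) -> str:
    """Run a command only if its executable is on the allowlist."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        refused = parts[0] if parts else ""
        return f"Refused: '{refused}' is not an allowed command"
    result = subprocess.run(parts, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else result.stderr
```

Plugging a guard like this into the command-running tool keeps the agent useful while bounding the damage a bad plan can do.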
<p>In short, the LLM becomes an intelligent assistant that not only <em>suggests</em> code but also <strong>executes real commands</strong>, acting like an automated developer sidekick.</p>
<h3 id="heading-ready-to-try-it">Ready to Try It?</h3>
<p>Let’s turn this concept into code. Below is a script that lets your AI agent run terminal commands and write files through tool calls.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> subprocess
<span class="hljs-keyword">import</span> platform
<span class="hljs-keyword">import</span> shlex
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

load_dotenv()

client = OpenAI()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_command</span>(<span class="hljs-params">command: str, background=False</span>):</span>
    print(<span class="hljs-string">"⛏️Tool Called: run_command for : "</span>, command)

    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> background:
        <span class="hljs-comment"># Run in foreground (blocking) - original behavior</span>
        result = os.system(command)
        <span class="hljs-keyword">return</span> result
    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># Run in background (non-blocking) </span>
        <span class="hljs-comment"># Commands like `npm start` need to keep running. If you run such a command on the primary terminal, it will block the terminal and you won't be able to chat further</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Create a detached process based on the OS</span>
            <span class="hljs-keyword">if</span> platform.system() == <span class="hljs-string">"Windows"</span>:
                <span class="hljs-comment"># For Windows, use CREATE_NEW_CONSOLE flag</span>
                full_command = <span class="hljs-string">f'start /min cmd /c "<span class="hljs-subst">{command}</span>"'</span>
                subprocess.Popen(full_command, shell=<span class="hljs-literal">True</span>)
                print(<span class="hljs-string">f"Process started in background with Windows 'start' command"</span>)
                <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
            <span class="hljs-keyword">else</span>:
                <span class="hljs-comment"># For Unix/Linux/Mac, use setsid to create new session</span>
                command_parts = shlex.split(command)
                process = subprocess.Popen(
                    command_parts,
                    preexec_fn=os.setsid,  <span class="hljs-comment"># Detaches from parent process</span>
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL
                )
                print(<span class="hljs-string">f"Process started in background with PID: <span class="hljs-subst">{process.pid}</span>"</span>)

            <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error running command in background: <span class="hljs-subst">{e}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_to_file</span>(<span class="hljs-params">input_json</span>):</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">if</span> isinstance(input_json, str):
            params = json.loads(input_json)
        <span class="hljs-keyword">else</span>:
            params = input_json

        filename = params.get(<span class="hljs-string">"filename"</span>)
        content = params.get(<span class="hljs-string">"content"</span>)
        print(<span class="hljs-string">"⛏️Tool Called: write_to_file for : "</span>, filename)


        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> filename <span class="hljs-keyword">or</span> content <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
            print(<span class="hljs-string">"Error: Missing filename or content in parameters"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

        os.makedirs(os.path.dirname(filename) <span class="hljs-keyword">or</span> <span class="hljs-string">'.'</span>, exist_ok=<span class="hljs-literal">True</span>)

        <span class="hljs-keyword">with</span> open(filename, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> file:
            file.write(content)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error writing to file: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

available_tools = {
    <span class="hljs-string">"run_command"</span>: {
        <span class="hljs-string">"fn"</span>: run_command,
        <span class="hljs-string">"description"</span>: <span class="hljs-string">"Takes two parameters, 'command: string' and 'background: boolean', and executes the command on the system. If the command needs to keep running in the background (like npm start), pass 'background' as True; it then returns True if the process was launched successfully. Otherwise 'background' defaults to False and the result of the command is returned."</span>
    },
    <span class="hljs-string">"write_to_file"</span>: {
        <span class="hljs-string">"fn"</span>: write_to_file,
        <span class="hljs-string">"description"</span>: <span class="hljs-string">"Takes a JSON input with 'filename' and 'content' keys. Creates or overwrites the file with the specified content and returns True or False according to the status."</span>
    }
}

system_prompt = <span class="hljs-string">f"""
    You are a helpful AI assistant specialized in resolving user queries.
    You work in start, plan, action, observe mode.
    For the given user query and the available tools, plan the step-by-step execution. Based on the plan,
    select the relevant tool from the available tools, and perform an action to call that tool.
    Wait for the observation, and resolve the user query based on the observation from the tool call.

    Rules:
    - Follow the Output JSON Format.
    - Always perform one step at a time and wait for next input
    - Carefully analyse the user query
    - Follow best folder structure and coding practices.
    - A new project should always be created in a separate folder, and all subsequent commands should run inside that folder. Eg: cd new_folder &amp;&amp; npm i
    - Create separate folders for database-related files, controllers, middlewares, routes etc. in backend projects.
    - Create utilities to send structured API responses and errors for backend projects.
    - Create separate folders for components, hooks etc. in frontend projects.
    - Use the '-y' flag in commands wherever possible to reduce manual interruptions.
    - Never install a dependency by editing the package.json file directly. Use the npm install &lt;pkg&gt; command
    - To activate a virtual environment use the 'cd new-folder &amp;&amp; .\\venv_name\\Scripts\\activate' command
    - Use nodemon or --watch kind of tools to watch for changes.

    Output JSON Format:
    {{
        "step": "string",
        "content": "string",
        "function": "The name of function if the step is action",
        "input": "The input parameter for the function",
    }}

    Available Tools:
    - run_command: Takes two parameters, 'command: string' and 'background: boolean', and executes the command on the system. If the command needs to keep running in the background (like npm start, uvicorn main:app --reload etc.), pass 'background' as True; it then returns True if the process was launched successfully. Otherwise 'background' defaults to False and the result of the command is returned.
    - write_to_file: Takes a JSON input with 'filename' and 'content' keys. Creates or overwrites the file with the specified content.    

    Example:
    User Query: Create a basic react project?
    Output: {{ "step": "plan", "content": "The user is interested in creating a basic React project" }}
    Output: {{ "step": "plan", "content": "Let me check if Node is installed on the user's system or not." }}
    Output: {{ "step": "action", "function": "run_command", "input": "node -v" }}
    Output: {{ "step": "observe", "output": "v22.14.0" }}
    Output: {{ "step": "plan", "content": "Since node -v returned a version, Node is installed. Now I should call run_command again to create a React project in a separate folder" }}
    Output: {{ "step": "action", "function": "run_command", "input": "npx create-react-app my-app -y" }}
    Output: {{ "step": "observe", "output": "Success! Created my-app" }}
    Output: {{ "step": "plan", "content": "Now I need to start the app after navigating to my-app directory" }}
    Output: {{ "step": "action", "function": "run_command", "input": "cd my-app &amp;&amp; npm start, True" }}
    Output: {{ "step": "observe", "output": "True" }}
    Output: {{ "step": "output", "content": "Project created successfully!" }}

    Example:
    User Query: Create a test.txt file in temp folder and write Hello with each character in new line.
    Output: {{ "step": "plan", "content": "The user is interested in creating a test.txt file in the temp folder and writing Hello in it" }}
    Output: {{ "step": "plan", "content": "The available tool I found is write_to_file" }}
    Output: {{ "step": "action", "function": "write_to_file", "input": "{{"filename": "temp/test.txt", "content": "H\\ne\\nl\\nl\\no"}} }}
    Output: {{ "step": "observe", "output": "True" }}
    Output: {{ "step": "output", "content": "File created successfully" }}

"""</span>

messages = [
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span> : system_prompt}
]


<span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
    user_query = input(<span class="hljs-string">'&gt; '</span>)
    messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_query})

    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>: 
        response = client.chat.completions.create(
            model=<span class="hljs-string">"gpt-4o-mini"</span>,
            response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
            messages = messages,
        )
        parsed_response = json.loads(response.choices[<span class="hljs-number">0</span>].message.content)

        messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps(parsed_response)})

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"plan"</span>:
            print(<span class="hljs-string">f"🧠 Thinking: "</span>, parsed_response.get(<span class="hljs-string">"content"</span>))
            <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"action"</span>: 
            tool_name = parsed_response.get(<span class="hljs-string">"function"</span>)
            <span class="hljs-keyword">if</span> tool_name <span class="hljs-keyword">in</span> available_tools:
                fn_output = available_tools[tool_name][<span class="hljs-string">"fn"</span>](parsed_response.get(<span class="hljs-string">"input"</span>))
                messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps({ <span class="hljs-string">"step"</span>: <span class="hljs-string">"observe"</span>, <span class="hljs-string">"output"</span>:  fn_output})})
                <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"output"</span>:
            print(<span class="hljs-string">f"🤖: <span class="hljs-subst">{parsed_response.get(<span class="hljs-string">'content'</span>)}</span>"</span>)
            <span class="hljs-keyword">break</span>
</code></pre>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>If you've carefully followed the code and logic, you'll notice something interesting—<strong>the core architecture hasn’t changed at all.</strong></p>
<p>All we did was swap out the tools (APIs) the agent uses:</p>
<ul>
<li><p>First, it was a weather API.</p>
</li>
<li><p>Then, it became a terminal command executor.</p>
</li>
</ul>
<p>That’s it.</p>
<p>And just like that, you’ve built your own <strong>Mini Cursor</strong>.</p>
<hr />
<h3 id="heading-its-not-magic-its-engineering">It’s Not Magic, It’s Engineering</h3>
<p>Cursor isn’t some black-box sorcery—it’s simply a well-orchestrated system of:</p>
<ul>
<li><p>AI + tool access (via APIs),</p>
</li>
<li><p>Structured system prompts,</p>
</li>
<li><p>And intelligent orchestration logic.</p>
</li>
</ul>
<p>Now that you understand the concept and have hands-on experience, you can imagine just how powerful things can get when you scale this architecture.</p>
<h3 id="heading-see-it-in-action">See It in Action</h3>
<p>Curious how my version of Cursor works in real life?<br />Watch the demo video here:</p>
<div class="embed-wrapper"><a class="embed-card" href="https://twitter.com/rnkp_755/status/1910814493363384416">https://twitter.com/rnkp_755/status/1910814493363384416</a></div>
]]></content:encoded></item><item><title><![CDATA[Beyond the Black Box of Generative LLMs]]></title><description><![CDATA[GPT is a buzzword that is intimidating for freshers these days. Technical freshers feel anxious after witnessing its capabilities, concerned that it may threaten their jobs. In contrast, both technical and non-technical individuals are amazed, ponder...]]></description><link>https://blog.raushan.info/inside-genai</link><guid isPermaLink="true">https://blog.raushan.info/inside-genai</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[gpt]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Tue, 08 Apr 2025 10:10:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/nGoCBxiaRO0/upload/5148c41165cee7735f6f71b4a4eb4fa9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>GPT is a buzzword that is intimidating for freshers these days. Technical freshers feel anxious after witnessing its capabilities, concerned that it may threaten their jobs. In contrast, both technical and non-technical individuals are amazed, pondering, "How can a machine accomplish all this?" Let's delve deeper to understand the mechanism behind it.</p>
<hr />
<h2 id="heading-what-is-generative">What is Generative?</h2>
<p>The term "generative" refers to the ability to create or produce something. Unlike traditional systems that retrieve information from the web, these large language models (LLMs), as we have already experienced, are designed to generate content on their own.</p>
<h2 id="heading-what-is-pre-trained">What is Pre-Trained?</h2>
<p>As the term suggests, these LLMs are pre-trained on large amounts of data, which enables them to generate responses. Generating a response simply means predicting the next token repeatedly through mathematical calculations, not magic.</p>
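To make "predicting the next token repeatedly" concrete, here is a toy sketch (the corpus and the bigram lookup table are invented for illustration; real LLMs use trained neural networks, not frequency tables):

```python
from collections import Counter, defaultdict

# Tiny invented corpus -- real models train on vastly more text.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows another: a lookup-table "model".
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start: str, steps: int) -> list:
    """Greedily append the most frequent next word, one token at a time."""
    out = [start]
    for _ in range(steps):
        followers = bigrams[out[-1]]
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out

print(generate("the", 3))  # ['the', 'cat', 'sat', 'on']
```

An LLM runs the same loop at inference time, except the "most likely next token" comes from a neural network rather than a frequency count.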
<h2 id="heading-what-are-transformers">What are Transformers?</h2>
<p>This term represents the entire mechanism behind how GPTs function. This is the core neural network architecture that GPT models are built upon. Let’s understand the underlying mechanisms step by step by referencing the <a target="_blank" href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">research work by Google</a> itself.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744102270683/9deaa63c-1731-4d21-a04d-1f9dee3b579a.png" alt="The Transformer - model architecture." class="image--center mx-auto" /></p>
<h3 id="heading-input-and-encoding">Input and Encoding</h3>
<p>This is the initial stage of interacting with LLMs, whether for training or inference. This step involves receiving input from the user, converting it into machine language, and understanding the actual context the user is referring to. Here are the detailed steps involved in this:</p>
<ol>
<li><p><strong>Tokenization</strong>: As we know, machines only understand numbers. Therefore, it's essential to convert every user input into numbers first. This step is called <strong>Tokenization</strong>, and these numbers are called <strong>Tokens.</strong></p>
<p> The high-level architecture of tokenization involves breaking down a sentence into chunks of words, symbols, or sometimes even small sentences. These chunks are then replaced with corresponding numbers from their vocabulary dictionary. Each LLM has its own dictionary for replacing these chunks.</p>
<p> <strong>For example</strong>, let's create an imaginary vocabulary dictionary and tokenized sentences to see how this process might work.</p>
 <div data-node-type="callout">
 <div data-node-type="callout-emoji">🚨</div>
 <div data-node-type="callout-text">This example is meant to provide a better understanding of tokenization and does not reflect how tokenization occurs in the real world.</div>
 </div>

<p> I want to simulate the traditional multi-tap process on old phone keypads where hitting '2' once gives 'a', twice gives 'b', thrice gives 'c', etc.</p>
<p> <img src="https://www.researchgate.net/profile/Shumin-Zhai/publication/221518150/figure/fig1/AS:305488823635968@1449845619238/The-standard-12-key-telephone-keypad-character-layout-follows-the-ITU-E161-standard-8_Q320.jpg" alt="The standard 12-key telephone keypad, character layout ..." class="image--center mx-auto" /></p>
<p> Here are the steps for this:</p>
<ul>
<li><p>Iterate through <em>each character</em> of the string.</p>
</li>
<li><p>If the character is a letter (a-z), replace it with its corresponding multi-tap digit sequence.</p>
</li>
</ul>
</li>
</ol>
<p>    If the character is a space, append <code>0</code>; if it is any other special character, append <code>1</code>.</p>
<p>    Example:</p>
<p>    <strong>Hello Hashnode</strong> will become ['44', '33', '555', '555', '666', '0', '44', '2', '7777', '44', '66', '666', '3', '33', '1'], where <code>44</code> represents <code>H</code>, <code>33</code> maps to <code>e</code>, and so on.</p>
<p>    Here's the equivalent Python script to do the same and try it out.</p>
<pre><code class="lang-python">    <span class="hljs-keyword">import</span> string

    t9_map = {
        <span class="hljs-string">'a'</span>: <span class="hljs-string">'2'</span>, <span class="hljs-string">'b'</span>: <span class="hljs-string">'22'</span>, <span class="hljs-string">'c'</span>: <span class="hljs-string">'222'</span>,
        <span class="hljs-string">'d'</span>: <span class="hljs-string">'3'</span>, <span class="hljs-string">'e'</span>: <span class="hljs-string">'33'</span>, <span class="hljs-string">'f'</span>: <span class="hljs-string">'333'</span>,
        <span class="hljs-string">'g'</span>: <span class="hljs-string">'4'</span>, <span class="hljs-string">'h'</span>: <span class="hljs-string">'44'</span>, <span class="hljs-string">'i'</span>: <span class="hljs-string">'444'</span>,
        <span class="hljs-string">'j'</span>: <span class="hljs-string">'5'</span>, <span class="hljs-string">'k'</span>: <span class="hljs-string">'55'</span>, <span class="hljs-string">'l'</span>: <span class="hljs-string">'555'</span>,
        <span class="hljs-string">'m'</span>: <span class="hljs-string">'6'</span>, <span class="hljs-string">'n'</span>: <span class="hljs-string">'66'</span>, <span class="hljs-string">'o'</span>: <span class="hljs-string">'666'</span>,
        <span class="hljs-string">'p'</span>: <span class="hljs-string">'7'</span>, <span class="hljs-string">'q'</span>: <span class="hljs-string">'77'</span>, <span class="hljs-string">'r'</span>: <span class="hljs-string">'777'</span>, <span class="hljs-string">'s'</span>: <span class="hljs-string">'7777'</span>,
        <span class="hljs-string">'t'</span>: <span class="hljs-string">'8'</span>, <span class="hljs-string">'u'</span>: <span class="hljs-string">'88'</span>, <span class="hljs-string">'v'</span>: <span class="hljs-string">'888'</span>,
        <span class="hljs-string">'w'</span>: <span class="hljs-string">'9'</span>, <span class="hljs-string">'x'</span>: <span class="hljs-string">'99'</span>, <span class="hljs-string">'y'</span>: <span class="hljs-string">'999'</span>, <span class="hljs-string">'z'</span>: <span class="hljs-string">'9999'</span>,
    }

    <span class="hljs-comment"># Reverse mapping for detokenization</span>
    reverse_t9_map = {v: k <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> t9_map.items()}

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tokenize</span>(<span class="hljs-params">text</span>):</span>
        text_lower = text.lower()
        tokens = []

        <span class="hljs-keyword">for</span> char <span class="hljs-keyword">in</span> text_lower:
            <span class="hljs-keyword">if</span> <span class="hljs-string">'a'</span> &lt;= char &lt;= <span class="hljs-string">'z'</span>:
                tokens.append(t9_map[char])
            <span class="hljs-keyword">elif</span> char == <span class="hljs-string">' '</span>:
                tokens.append(<span class="hljs-string">'0'</span>)
            <span class="hljs-keyword">else</span>:
                tokens.append(<span class="hljs-string">'1'</span>)

        <span class="hljs-keyword">return</span> tokens

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">detokenize</span>(<span class="hljs-params">tokens</span>):</span>
        result = <span class="hljs-string">""</span>
        <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> tokens:
            <span class="hljs-keyword">if</span> token == <span class="hljs-string">'0'</span>:
                result += <span class="hljs-string">' '</span>
            <span class="hljs-keyword">elif</span> token == <span class="hljs-string">'1'</span>:
                result += <span class="hljs-string">'?'</span>  <span class="hljs-comment"># symbol placeholder</span>
            <span class="hljs-keyword">else</span>:
                result += reverse_t9_map.get(token, <span class="hljs-string">'?'</span>)  <span class="hljs-comment"># fallback in case of unknown</span>
        <span class="hljs-keyword">return</span> result

    <span class="hljs-comment"># Example usage</span>
    input_string = <span class="hljs-string">"Hello Hashnode!"</span>
    tokens = tokenize(input_string)
    print(<span class="hljs-string">"Tokens:"</span>, tokens)

    decoded = detokenize(tokens)
    print(<span class="hljs-string">"Detokenized:"</span>, decoded)

    <span class="hljs-string">"""
    Output:
    Tokens: ['44', '33', '555', '555', '666', '0', '44', '2', '7777', '44', '66', '666', '3', '33', '1']
    Detokenized: hello hashnode?
    """</span>
</code></pre>
    <div data-node-type="callout">
    <div data-node-type="callout-emoji">💡</div>
    <div data-node-type="callout-text">To experience how tokenization works in the real world, you can visit <a target="_self" href="https://tiktokenizer.vercel.app/">Tiktokenizer</a>.</div>
    </div>

<p>    Want to understand how OpenAI tokenizes a message? Here's the code.</p>
<pre><code class="lang-python">    <span class="hljs-keyword">import</span> tiktoken

    encoder = tiktoken.encoding_for_model(<span class="hljs-string">'gpt-4o'</span>)    <span class="hljs-comment"># gpt model</span>

    print(<span class="hljs-string">"Vocab Size"</span>, encoder.n_vocab) <span class="hljs-comment"># 200,019 (~200K)</span>

    text = <span class="hljs-string">"Hello Hashnode"</span>
    tokens = encoder.encode(text)

    print(<span class="hljs-string">"Tokens: "</span>, tokens) <span class="hljs-comment"># Tokens: [13225, 10242, 7005]</span>

    my_tokens = [<span class="hljs-number">13225</span>, <span class="hljs-number">10242</span>, <span class="hljs-number">7005</span>]
    decoded = encoder.decode(my_tokens)
    print(<span class="hljs-string">"Decoded: "</span>, decoded)    <span class="hljs-comment"># Decoded: Hello Hashnode</span>
</code></pre>
<ol start="2">
<li><p><strong>Vector Embedding:</strong> Vector embedding maps the semantic meaning of words to points in a high-dimensional space (often visualized in 2D or 3D). For example, in the sentences <strong><em>Monkey eats banana</em></strong> and <strong><em>Man eats rice</em></strong>, 'monkey' and 'man' are both living beings, while 'banana' and 'rice' are both food items. As a result, 'monkey' and 'man' would be positioned close to each other in one region of the space, and 'banana' and 'rice' in another. Moreover, the vector from 'monkey' to 'banana' would be similar in direction and magnitude to the vector from 'man' to 'rice', reflecting the shared semantic relationship (eater → food). Under the hood, these embeddings are just matrices of numbers.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744063638604/5e55df93-2553-4000-8f55-9511d0a5b9a6.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Positional Encoding</strong>: Positional encoding adds position information to token embeddings so the model can understand the order of words in a sentence. Consider the sentences <em>The man is eating rice</em> and <em>The rice is eating man</em>. The words (and thus their embeddings) are identical in both, yet the meanings are entirely different because of the word order. Since vector embeddings alone do not capture position, positional encoding is crucial: it encodes each token's position in the sequence, letting the model distinguish between such cases and understand the actual context.</p>
</li>
</ol>
<h3 id="heading-attention-and-feed-forwarding-encoding-phase">Attention and Feed Forwarding (Encoding Phase)</h3>
<p><strong>Attention and Feed Forwarding</strong> is the next phase in the processing pipeline of Large Language Models (LLMs). At this stage, the model focuses on determining which parts of the input are most relevant to each token using the attention mechanism, and then applies feed-forward neural networks to transform these representations. This phase helps the model capture complex relationships between words and introduces non-linearity, allowing it to understand context and meaning beyond simple sequential patterns.</p>
<ol>
<li><p>Multi-Head Attention: <strong>Multi-Head Attention</strong> builds upon the concept of <strong>self-attention</strong>, where each token in a sequence can interact with every other token to better understand contextual relationships. For example, consider the sentences <em>'The river bank'</em> and <em>'The HDFC bank'</em>. In both cases, the word <em>'bank'</em> has the same token and embedding, and even its positional encoding would be similar since it appears at the end of the sentence. However, the meaning of <em>'bank'</em> differs in each context. Self-attention helps the model capture these nuances by allowing the token <em>'bank'</em> to attend to other tokens like <em>'river'</em> or <em>'HDFC'</em> for disambiguation.</p>
<p> <strong>Multi-Head Attention</strong> enhances this process by using multiple attention heads in parallel. Each head learns different types of relationships or focuses on different aspects of the input, enabling the model to capture richer and more diverse contextual information.</p>
</li>
<li><p>Feed Forwarding: <strong>Feed Forwarding</strong> in Large Language Models introduces <strong>non-linearity</strong> into the processing pipeline, allowing the model to interpret the context from multiple perspectives. For instance, imagine a scene where a dog is looking out the window while traveling in a car. Different parts of our brain might focus on various aspects of this moment: <em>'The car was white'</em>, <em>'The dog was a Labrador'</em>, <em>'The family was going on a trip'</em>, <em>'The dog was fascinated by the scenery'</em>, and so on. Similarly, during this phase, the model processes the contextual information through multiple dense layers to extract and represent diverse interpretations and deeper meaning.</p>
</li>
</ol>
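<p>The attention step described above can be sketched in a few lines of plain Python. This is a single attention head with no learned projections (Q = K = V = the raw embeddings) and made-up toy vectors; real models use learned weight matrices and many heads in parallel.</p>

```python
# A toy sketch of scaled dot-product self-attention: each token scores
# itself against every token, and its new representation is the
# attention-weighted average of all token vectors.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    d = len(embeddings[0])
    out = []
    for q in embeddings:
        # scaled dot products of this token against every token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)  # attention distribution over the sequence
        # weighted sum of all token vectors
        out.append([sum(w * v[j] for w, v in zip(weights, embeddings))
                    for j in range(d)])
    return out

# toy embeddings for 'the river bank': the last vector ('bank')
# absorbs context from its neighbours
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(self_attention(tokens))
```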
<p>This is what happens during the Input phase of LLM interaction, marking the end of the Encoding phase. Now, let's turn our attention to the Output phase and explore what happens in the decoder.</p>
<hr />
<h3 id="heading-output-embedding-amp-positional-encoding">Output Embedding &amp; Positional Encoding</h3>
<p>The decoding phase is <strong>iterative</strong> — it generates one token at a time, and each newly generated token is used to predict the next one.</p>
<p>It begins with the tokens that have already been generated so far (this is often referred to as "shifted right" input). These tokens are passed through an <strong>output embedding layer</strong>, just like on the encoder side. Then, <strong>positional encoding</strong> is added to retain the order of the tokens and the combined representation (embedding + position) forms the input to the decoder stack.</p>
<p>Example:</p>
<p>Let’s say the user input is: <strong>“How are you?”</strong><br />After the input is fully processed by the encoder, the decoder starts generating the response:</p>
<ol>
<li><p>The decoder is triggered with a start token: <code>[&lt;start&gt;]</code>.</p>
</li>
<li><p>It predicts the first word: <code>"I"</code> → Output so far: <code>[&lt;start&gt;, I]</code>.</p>
</li>
<li><p>This gets fed back in → predicts <code>"am"</code> → <code>[&lt;start&gt;, I, am]</code>.</p>
</li>
<li><p>Repeats until the model outputs <code>&lt;end&gt;</code> → <code>[&lt;start&gt;, I, am, fine, &lt;end&gt;]</code>.</p>
</li>
</ol>
<h3 id="heading-masked-multi-head-attention">Masked Multi-Head Attention</h3>
<p>This is the first step in the decoder stack. It's very similar to the multi-head attention used in the encoder, with <strong>one key difference</strong> — <strong>masking</strong>.</p>
<p>Masking ensures that the model <strong>can’t look ahead</strong>. While generating the third word, for example, it shouldn't peek at the fourth. This keeps the generation process <strong>auto-regressive</strong>, i.e., predicting the next token using only the known ones. Concretely, while predicting the third word in <code>[&lt;start&gt;, I, am]</code>, the model <strong>must not</strong> access <code>"fine"</code> or <code>&lt;end&gt;</code> yet. Masking hides those future tokens during attention.</p>
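<p>As a sketch, the causal mask is just a matrix added to the attention scores: future positions get negative infinity, so after the softmax they receive zero attention weight.</p>

```python
# A sketch of the causal (look-ahead) mask: position i may only attend to
# positions <= i; future positions are set to -inf before the softmax.
def causal_mask(seq_len):
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# [0.0, -inf, -inf, -inf]
# [0.0, 0.0, -inf, -inf]
# [0.0, 0.0, 0.0, -inf]
# [0.0, 0.0, 0.0, 0.0]
```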
<h3 id="heading-multi-head-attention">Multi-Head Attention</h3>
<p>This layer allows the decoder to <strong>attend to the encoder’s output</strong> — meaning it connects what’s being generated with what the user actually asked. It helps the model align the generated response with the input context.</p>
<h3 id="heading-feed-forward-add-amp-norm">Feed Forward + Add &amp; Norm</h3>
<p>Same as in the encoder — this adds non-linearity and enables the model to learn richer patterns in the data. Each token is passed through a <strong><em>Feed Forward Neural Network</em></strong> followed by an <strong><em>Add &amp; Layer Normalization</em></strong> step for stability and better learning.</p>
<h3 id="heading-linear-softmax">Linear → Softmax</h3>
<p>After decoding, the final token representations are passed through a <strong>Linear layer</strong>, which converts each of them into a large vector (the same size as the vocabulary). A <strong>Softmax</strong> layer then turns this vector into a <strong>probability distribution</strong> over all possible next words. For example, if at some point the model sees a distribution like <code>[I: 2%, am: 87%, have: 4%, was: 1%, ...]</code>, it chooses <code>"am"</code> as the predicted word.</p>
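<p>A minimal sketch of this last step, with made-up logit values standing in for the Linear layer's output over a tiny four-word vocabulary:</p>

```python
# Linear -> Softmax sketch: raw decoder scores (logits) become a
# probability distribution over the vocabulary; greedy decoding then
# picks the highest-probability token. The logits are hypothetical.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

vocab = ["I", "am", "have", "was"]
logits = [1.2, 4.9, 1.8, 0.5]   # hypothetical output of the Linear layer
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
print(predicted)  # am
```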
<hr />
<p>This wraps up the explanation of Gen-AI. I hope you found it interesting. Thank you.</p>
<blockquote>
<p>This article explores the fundamentals of how Generative Pre-trained Transformers (GPTs) function, focusing on key concepts such as tokenization, vector embedding, positional encoding, and attention mechanisms. By breaking down the encoding and decoding phases of Large Language Models (LLMs), it elucidates how these systems generate contextually relevant responses. Through examples and explanations, readers gain insight into the architecture and processes that enable GPTs to produce human-like text.</p>
</blockquote>
]]></content:encoded></item></channel></rss>