<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Gen-AI]]></title><description><![CDATA[Gen-AI]]></description><link>https://blog.raushan.info</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 17:52:09 GMT</lastBuildDate><atom:link href="https://blog.raushan.info/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Understanding HyDE: A Guide to Hypothetical Document Embeddings]]></title><description><![CDATA[In the previous section, we saw that Shreya didn’t get the expected response when she asked:

Explain the difference between waveguide and coaxial cable in practical applications.

The system returned partial matches or generic definitions—not the cr...]]></description><link>https://blog.raushan.info/rag-hyde</link><guid isPermaLink="true">https://blog.raushan.info/rag-hyde</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Wed, 23 Apr 2025 06:46:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/f0JGorLOkw0/upload/2728d29c07ee0985daa73efc36fd79aa.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous section, we saw that Shreya didn’t get the expected response when she asked:</p>
<blockquote>
<p><strong><em>Explain the difference between waveguide and coaxial cable in practical applications.</em></strong></p>
</blockquote>
<p>The system returned <em>partial matches</em> or <em>generic definitions</em>—not the crisp, real-world comparison she expected.</p>
<h1 id="heading-hyde-hypothetical-document-embeddings">HyDE – Hypothetical Document Embeddings</h1>
<p>Shreya realized that the issue wasn’t the retrieval model or the LLM. It was that her query was too <em>real-world</em>, and her dataset was full of <strong>exam-oriented phrasing</strong>.</p>
<p>This is where <strong>Hypothetical Document Embeddings (HyDE)</strong> came to the rescue.</p>
<p>Instead of searching the vector database with the raw user query, <strong>HyDE first asks the LLM to generate a “document”</strong>—a short, hypothetical paragraph that might resemble the <em>ideal answer to the question</em>. Then <strong>that generated paragraph is embedded and used for retrieval</strong>.</p>
<h2 id="heading-steps">Steps</h2>
<p>Here are the steps followed in this approach:</p>
<ol>
<li><p>Take the user's query as input.</p>
</li>
<li><p>Provide it to an <code>LLM</code> and ask it to write a <code>Document</code> on the topic.</p>
</li>
<li><p>Use this document to perform a <code>similarity_search</code>.</p>
</li>
<li><p>Retrieve the <code>chunks</code> from the <code>similarity_search</code> in <code>Step-3</code> and provide them to the <code>LLM</code> along with the user's <code>original</code> query.</p>
</li>
<li><p>Return the response given by the <code>LLM</code> to the user.</p>
</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745389736457/8f352284-aa1b-454a-8013-6ba445098dd1.png" alt class="image--center mx-auto" /></p>
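<p>The steps above boil down to a single substitution: embed an LLM-generated document instead of the raw query. Here is a minimal, dependency-free sketch; <code>generate_doc</code> and <code>embed_search</code> are hypothetical stand-ins for the LLM call and the vector-store search implemented in the full code further down:</p>

```python
def hyde_search(generate_doc, embed_search, user_query):
    """HyDE retrieval: search with an LLM-written hypothetical answer, not the raw query."""
    # Step 2: ask the LLM to write a short document on the topic
    hypothetical_doc = generate_doc(user_query)
    # Step 3: embed that document and use it for similarity search
    return embed_search(hypothetical_doc)
```

<p>Everything after this point, i.e. feeding the retrieved chunks plus the original query back to the LLM, is the same generation step used throughout the series.</p>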
<h2 id="heading-why-will-it-work">Why will it work?</h2>
<p>Before answering <code>Why will it work?</code>, let’s first recall the actual <code>issue</code> from the previous section that kept the system from working.</p>
<p>The user's query used casual, real-world phrasing, while her documents were full of <code>technical phrases</code> and <code>jargon</code>, so the system struggled. When it applied <code>similarity_search</code> directly to the user's query, the matching chunks it returned were not very good, leading to a lower-quality response from the <code>LLM</code>.</p>
<p>Now, instead of directly using the user's query for <code>similarity_search</code>, we ask an <code>LLM</code> to write a document on the topic. The document created by the <code>LLM</code> will include all the <code>technical phrases</code> and <code>jargon</code> used in the industry. So, when we perform <code>similarity_search</code> on this document, the matching documents will be much more accurate and will cover the topic thoroughly. This ultimately leads to a better response from the <code>LLM</code>.</p>
<h2 id="heading-how-to-do">How to do it?</h2>
<p>If you have followed the series up to this point, implementing this certainly won’t be a big challenge for you. Still, here’s the full code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader
<span class="hljs-keyword">from</span> langchain_text_splitters <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter
<span class="hljs-keyword">from</span> langchain_google_genai <span class="hljs-keyword">import</span> GoogleGenerativeAIEmbeddings
<span class="hljs-keyword">from</span> langchain_qdrant <span class="hljs-keyword">import</span> QdrantVectorStore
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

GOOGLE_API_KEY = os.getenv(<span class="hljs-string">"GOOGLE_API_KEY"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_and_split_documents</span>(<span class="hljs-params">pdf_path</span>):</span>
    <span class="hljs-string">"""Load PDF and split into chunks"""</span>
    loader = PyPDFLoader(file_path=pdf_path)
    docs = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=<span class="hljs-number">1000</span>,
        chunk_overlap=<span class="hljs-number">200</span>
    )

    split_docs = text_splitter.split_documents(docs)

    print(<span class="hljs-string">"Number of documents before splitting:"</span>, len(docs))
    print(<span class="hljs-string">"Number of documents after splitting:"</span>, len(split_docs))

    <span class="hljs-keyword">return</span> split_docs

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">setup_vector_store</span>(<span class="hljs-params">split_docs, embedder</span>):</span>
    <span class="hljs-string">"""Initialize vector store with documents"""</span>
    vector_store = QdrantVectorStore.from_documents(
        documents=split_docs,
        url=<span class="hljs-string">"http://localhost:6333"</span>,
        collection_name=<span class="hljs-string">"learning_langchain"</span>,
        embedding=embedder
    )
    <span class="hljs-keyword">return</span> vector_store

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_document</span>(<span class="hljs-params">client, user_query</span>):</span>
    <span class="hljs-string">"""Generate a hypothetical document answering the user's query"""</span>
    GENERATE_DOCUMENT_SYSTEM_PROMPT = <span class="hljs-string">"""
    You are a helpful assistant. You will be provided with a question and you need to write a proper document on the topics included in it. Use proper technical phrases and terms used in the related industry. 
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-1.5-flash"</span>,
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: GENERATE_DOCUMENT_SYSTEM_PROMPT},
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
                <span class="hljs-string">"content"</span>: user_query
            }
        ]
    )
    content = response.choices[<span class="hljs-number">0</span>].message.content
    print(<span class="hljs-string">"Generate Document response:"</span>, content)

    <span class="hljs-keyword">return</span> content

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">similarity_search</span>(<span class="hljs-params">vector_store, query</span>):</span>
    <span class="hljs-string">"""Perform similarity search for a given query"""</span>
    relevant_chunks = vector_store.similarity_search(query=query)
    <span class="hljs-keyword">return</span> relevant_chunks

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">retrieval_generation</span>(<span class="hljs-params">client, query, context_docs</span>):</span>
    <span class="hljs-string">"""Generate an answer based on query and context"""</span>
    <span class="hljs-comment"># Format context from documents</span>
    context = <span class="hljs-string">"\n\n"</span>.join([doc.page_content <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> context_docs])
    print(context)

    GENERATION_SYSTEM_PROMPT = <span class="hljs-string">f"""
    You are a helpful assistant. You will be provided with a question and relevant context filtered according to user's query. 
    Your task is to provide a concise answer based on the context.

    Context: <span class="hljs-subst">{context}</span>
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-2.0-flash"</span>,
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: GENERATION_SYSTEM_PROMPT},
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: query}
        ]
    )
    <span class="hljs-keyword">return</span> response.choices[<span class="hljs-number">0</span>].message.content

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-comment"># Initialize components</span>
    pdf_path = Path(<span class="hljs-string">"./nodejs.pdf"</span>)
    split_docs = load_and_split_documents(pdf_path)

    embedder = GoogleGenerativeAIEmbeddings(
        model=<span class="hljs-string">"models/text-embedding-004"</span>,
        google_api_key=GOOGLE_API_KEY,
    )

    vector_store = setup_vector_store(split_docs, embedder)

    <span class="hljs-comment"># Create client for chatting</span>
    client = OpenAI(
        api_key=GOOGLE_API_KEY,
        base_url=<span class="hljs-string">"https://generativelanguage.googleapis.com/v1beta/openai/"</span>
    )

    <span class="hljs-comment"># Main interaction loop</span>
    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
        user_query = input(<span class="hljs-string">"&gt;&gt; "</span>)
        <span class="hljs-keyword">if</span> user_query.lower() <span class="hljs-keyword">in</span> [<span class="hljs-string">"exit"</span>, <span class="hljs-string">"quit"</span>, <span class="hljs-string">"q"</span>]:
            <span class="hljs-keyword">break</span>

        <span class="hljs-comment"># Generate a hypothetical document for the query (HyDE)</span>
        content = generate_document(client, user_query)

        <span class="hljs-comment"># Retrieve chunks using the hypothetical document, then generate the final answer</span>
        relevant_chunks = similarity_search(vector_store, content)
        print(<span class="hljs-string">f"Retrieval: <span class="hljs-subst">{len(relevant_chunks)}</span> relevant chunks found."</span>)
        final_generation = retrieval_generation(client, user_query, relevant_chunks)
        print(<span class="hljs-string">f"Final Answer: <span class="hljs-subst">{final_generation}</span>"</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    main()
</code></pre>
<blockquote>
<p>In the article, I explore the use of Hypothetical Document Embeddings (HyDE) to improve document retrieval and information extraction from large datasets, especially when dealing with real-world queries that differ significantly from the technical jargon in the dataset. By generating a hypothetical document that fits the technical tone of industry-standard language, HyDE enhances the accuracy of similarity searches, leading to more relevant document retrieval and improved responses from language models. The article includes a detailed breakdown of the steps in this process and provides an implementation using Python, LangChain, and OpenAI’s generative AI models.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Chain of Thoughts rescue Shreya]]></title><description><![CDATA[In the previous section, we saw that although Shreya made a good improvement in her system and it worked well for a few prompts, it still struggled with complex tasks. In such cases, the LLM was hallucinating and not performing well.
Chain of Thought...]]></description><link>https://blog.raushan.info/rags-cot</link><guid isPermaLink="true">https://blog.raushan.info/rags-cot</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Wed, 23 Apr 2025 06:02:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/hJUl5BAhJec/upload/109994390d8c27df4c9d3c12023ddd6c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous section, we saw that although Shreya made a good improvement in her system and it worked well for a few prompts, it still struggled with complex tasks. In such cases, the LLM was hallucinating and not performing well.</p>
<h1 id="heading-chain-of-thoughts-less-abstract-query-transformation-technique">Chain of Thoughts / Less Abstract Query Transformation Technique</h1>
<p>As Shreya sat at her drawing board, pondering a solution, she remembered her mother's advice:</p>
<blockquote>
<p>Break complex problems into smaller subproblems and solve them one by one.</p>
</blockquote>
<p>Shreya had an idea and felt confident it might work. Here's her plan:</p>
<p>The question she had asked last night was:</p>
<blockquote>
<p><strong><em>Trace how digital logic topics expanded in the last five years.</em></strong></p>
</blockquote>
<p>What if the LLM takes her mother's advice seriously and uses this approach? For example, the question above can be broken down into these steps:</p>
<blockquote>
<ul>
<li><p>Identify syllabus changes per year</p>
</li>
<li><p>Summarize each trend</p>
</li>
<li><p>Stitch them into a timeline</p>
</li>
</ul>
</blockquote>
<p>After breaking it into subproblems, she is fully confident that her system will be able to perform the task perfectly.</p>
<p>In short, the main idea is to break the query into multiple, <code>less abstract</code> subqueries. This way, the <code>LLM</code> can better understand its task.</p>
<p>Let’s look at another example from <code>Google's white paper</code>. It suggests breaking down the prompt</p>
<blockquote>
<p>Think Machine Learning</p>
</blockquote>
<p>into:</p>
<blockquote>
<ul>
<li><p>First, think about the machine.</p>
</li>
<li><p>Next, think about learning.</p>
</li>
<li><p>Finally, think about machine learning.</p>
</li>
</ul>
</blockquote>
<p>Let's explore her approach in detail using the flow diagram she created:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745360539686/51a4abf1-ffae-4e5b-a153-d4e6b904eb4f.png" alt class="image--center mx-auto" /></p>
<p>Here are the steps she wants her system to follow:</p>
<ol>
<li><p>Take the user's query as input.</p>
</li>
<li><p>Give the query to the <code>LLM</code> and ask it to break it down into smaller <code>subproblems</code> or <code>steps</code> that can be solved easily.</p>
</li>
<li><p>Perform the next steps sequentially, one after another.</p>
</li>
<li><p>Take the first step given by the <code>LLM</code> and perform <code>Retrieval</code> and <code>Generation</code> steps just as in previous sections. You can use any of the <code>Fan-out</code>, <code>Reciprocal-rank fusion</code>, or even a simple generation technique. Let's say the generation was <code>G1</code>.</p>
</li>
<li><p>Now take the second step given by the <code>LLM</code>, append <code>G1</code> to it, and pass it to <code>Retrieval</code> and <code>Generation</code> steps, as done in step <code>4</code>. Let's say the generation was <code>G2</code>.</p>
</li>
<li><p>Follow the same pattern for all the steps given by the <code>LLM</code>. For example, <code>G2</code> will be appended in <code>Step-3</code>, <code>G3</code> in <code>Step-4</code>, and so on.</p>
</li>
<li><p>In the end, the final generation <code>Gn</code> will be given to the <code>LLM</code> along with the <code>original user's query</code> for the final generation.</p>
</li>
<li><p>This response can then be directly provided to the user.</p>
</li>
</ol>
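<p>The steps above can be sketched as a small driver loop. Here, <code>retrieve</code> and <code>generate</code> are hypothetical stand-ins for whichever retrieval and generation functions you already use (fan-out, reciprocal-rank fusion, or a plain similarity search):</p>

```python
def chain_steps(steps, retrieve, generate, original_query):
    """Solve each sub-step in order, feeding the previous generation (G1, G2, ...) into the next."""
    generation = ""
    for step in steps:
        # Append the previous step's output to the current sub-query
        query = step if not generation else f"{step}\n\nFindings so far:\n{generation}"
        chunks = retrieve(query)              # retrieval, as in earlier sections
        generation = generate(query, chunks)  # generation Gi for this step
    # Final polish: answer the user's original query using the last generation Gn
    return generate(original_query, [generation])
```

<p>Because each call receives the previous generation as context, the final answer is built from data that has already been gathered and refined step by step.</p>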
<h2 id="heading-why-will-it-work">Why will it work?</h2>
<p>Some of you might be wondering why this unusual approach would even work. Let's use Shreya's query as an example to understand it better.</p>
<p>The query was:</p>
<blockquote>
<p><strong><em>Trace how digital logic topics expanded in the last five years.</em></strong></p>
</blockquote>
<p>Since this query wasn't simple enough to just find some chunks from a database, analyze a few paragraphs, and return an answer, it required a lot of computation before reaching a conclusion.</p>
<p>Let’s assume, during the <code>query-breaking phase</code>, the <code>LLM</code> broke the query down in these steps:</p>
<blockquote>
<ol>
<li><p>Identify syllabus changes per year.</p>
</li>
<li><p>Summarize each trend.</p>
</li>
<li><p>Stitch them into a timeline.</p>
</li>
</ol>
</blockquote>
<h3 id="heading-step-1-identify-syllabus-changes-per-year">Step-1 (Identify syllabus changes per year.)</h3>
<p>Let's start with the first step: <code>Identify syllabus changes per year</code>. When this query is processed through the <code>similarity_search</code> and generation steps, don't you think that, with the accuracy Shreya's system has achieved so far, it will be able to answer this efficiently? Yes.</p>
<h3 id="heading-step-2-summarize-each-trend">Step-2 (Summarize each trend.)</h3>
<p>After successfully completing <code>Step-1</code>, the <code>Generation</code> has gathered all the data on how the <code>syllabus has changed</code> over the years. Now, if that data is provided along with <code>this step’s query</code>, which is <code>Summarize each trend</code>, don't you think the <code>LLM</code> will effectively summarize it and provide a clear response? Absolutely.</p>
<h3 id="heading-step-3-stitch-them-into-a-timeline">Step-3 (Stitch them into a timeline.)</h3>
<p>After successfully completing <code>Step-2</code>, the <code>Generation</code> has collected all the data on the syllabus changes and how these trends have developed. With this information, the <code>LLM</code> can certainly create a timeline. Do you agree?</p>
<p>After finishing this step, we are left with fully contextual data that has passed through several rounds of filtering. Now we just need to <code>Polish</code> it according to the user’s <code>original</code> query, and that’s exactly what we do.</p>
<p>Pass the <code>Generation</code> of the final step along with the user’s <code>original</code> query to the <code>LLM</code> and return the response to the user.</p>
<h2 id="heading-how-to-do">How to do it?</h2>
<p>Implementing this is quite simple if you've been following the series up to this point. Here is the code snippet for breaking the query into smaller steps:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_steps</span>(<span class="hljs-params">client, user_query</span>):</span>
    <span class="hljs-string">"""Break out the user query into multiple smaller steps"""</span>
    GENERATE_STEPS_SYSTEM_PROMPT = <span class="hljs-string">"""
    You are a helpful assistant. You will be provided with a question and you need to break it into 3 simpler &amp; sequential steps to solve the problem. What steps do you think would be best to solve the problem?

    Rules:
    - Follow the output JSON format.
    - The `content` in output JSON must be a list of steps.

    Example:
    User Query: How to handle file-uploads on server?
    Output: { "type": "steps", "content": ["Accept file from req.files. Take help of multer to do that.", "Upload file to the S3 bucket or any other db and take out public url", "Store that public url in actual database"] }
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-1.5-flash"</span>,
        response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: GENERATE_STEPS_SYSTEM_PROMPT},
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
                <span class="hljs-string">"content"</span>: user_query
            }
        ]
    )
    content = response.choices[<span class="hljs-number">0</span>].message.content
    print(<span class="hljs-string">"Query Breaker response:"</span>, content)

    <span class="hljs-comment"># Parse the JSON response</span>
    parsed_response = json.loads(content)

    <span class="hljs-comment"># Extract the steps</span>
    steps = parsed_response[<span class="hljs-string">"content"</span>]
    print(<span class="hljs-string">"Generated steps:"</span>, steps)

    <span class="hljs-keyword">return</span> steps
</code></pre>
<h2 id="heading-get-the-full-code-herehttpsgithubcomrnkp755blogsblobmainrag-cotpy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-cot.py">Get the full code here…</a></h2>
<h2 id="heading-issue">Issue</h2>
<p>Shreya was on a roll. With her system now able to answer complex queries using fan-out retrieval and even create thoughtful summaries through chain-of-thought prompts, she felt almost unstoppable.</p>
<p>One evening, while revising Electromagnetics, she typed:</p>
<blockquote>
<p>“Explain the difference between waveguide and coaxial cable in practical applications.”</p>
</blockquote>
<p>To her surprise, the system returned <em>partial matches</em> or <em>generic definitions</em>—not the crisp, real-world comparison she expected.</p>
<p>This made her realize that <strong>the job isn't done yet!</strong> She'll be back in the next chapter with a possible solution.</p>
<blockquote>
<p>Shreya encountered limitations in her language model system when faced with complex tasks. Inspired by her mother's advice, she developed a method to break down these tasks into smaller, manageable subproblems. Her approach involves using a less abstract query transformation technique to enhance the model's comprehension and performance. By iteratively processing each subproblem, Shreya's system aims to deliver a polished final response. Although she made significant progress, an issue with generating specific comparisons highlighted the ongoing challenge of refining the system's capabilities.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Reciprocal Rank Fusion Aids Shreya]]></title><description><![CDATA[In the previous section, we saw that Shreya faced a problem. When she asked a simple question expecting a straightforward answer, she was overwhelmed with too much information and responses she didn't request. To recap, the question was:

What was th...]]></description><link>https://blog.raushan.info/rags-reciprocal-rank-fusion</link><guid isPermaLink="true">https://blog.raushan.info/rags-reciprocal-rank-fusion</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Tue, 22 Apr 2025 21:24:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/ia68CL7u88g/upload/6cf216455e5b438f84457fa68b62509d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous section, we saw that Shreya faced a problem. When she asked a simple question expecting a straightforward answer, she was overwhelmed with too much information and responses she didn't request. To recap, the question was:</p>
<blockquote>
<p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</blockquote>
<p>The response she received included:</p>
<blockquote>
<ul>
<li><p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</li>
<li><p><strong><em>What are control systems and its usage.</em></strong></p>
</li>
<li><p><strong><em>What important topics does control systems include?</em></strong></p>
</li>
<li><p><strong><em>What was the most common topics asked in last 5 years? (From other subjects as well)</em></strong></p>
</li>
</ul>
</blockquote>
<p>She then started looking for a better approach and believes she has found one.</p>
<h1 id="heading-reciprocal-rank-fusion">Reciprocal Rank Fusion</h1>
<p>In the previous section, we transformed the user's query and found documents from the vector database using similarity search. Unlike before, where we dumped all the matched documents into the LLM, we will now rank the documents based on how often they appear across the transformed queries' result lists and how early they appear within each list.</p>
<p>Here's the flowchart of the architecture. Let's go through this step by step:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745343176697/4aeb5aec-0e73-4467-98c2-c64272003616.png" alt="Reciprocal Rank Fusion Flowchart" class="image--center mx-auto" /></p>
<p>Before the retrieval step, as you can see, the flowchart is exactly the same as it was in the previous section. We take the user's query as input and transform it into several other queries.</p>
<h2 id="heading-retrieval-step">Retrieval Step</h2>
<p>The retrieval step is similar to the previous section, with a few minor changes highlighted in the flowchart. When we pass a query to our <code>similarity_search</code> method, the function might return multiple chunks because different parts of the PDF may relate to the query. The different chunks found by the function are shown in various colors in the diagram, such as <code>Red</code>, <code>Green</code>, <code>Yellow</code>, and <code>Blue</code>. We will refer to them by their color in the following paragraphs.</p>
<p>In this technique, I have also included the user's original query in the <code>similarity_search</code>, which is almost always a good idea.</p>
<p>Now, based on how often the chunks appear across the <code>similarity_search</code> results and their position within them, each chunk will be assigned a reciprocal rank fusion score (<code>rrf_score</code>). Only the <code>higher-ranked</code> chunks will be sent to the LLM, while the rest will be ignored. The threshold <code>rrf_score</code> for including documents in the LLM call is chosen by the developer according to the use case.</p>
<h3 id="heading-ranking-algorithm">Ranking Algorithm</h3>
<p>The method for calculating <code>rrf_score</code> can vary depending on the use case. Here, I will explain the most common approach that is generally used.</p>
<p>What’s our goal?</p>
<ol>
<li><p>Favor documents that appear <strong>more frequently</strong>.</p>
</li>
<li><p>Favor documents that appear <strong>earlier (higher) in the lists</strong>.</p>
</li>
</ol>
<p>Instead of just counting how many times a document appears or averaging ranks, <strong>this approach uses a formula that gives more weight to documents that appear early in any list</strong>.</p>
<p>The formula is:</p>
<p>$$\text{Score}(d) = \sum_{i=1}^{n} \frac{1}{k + \text{rank}_{i,d}}$$</p><ul>
<li><p><code>rank i,d</code> is the rank of document <code>d</code> in list <code>i</code> (starting from 1).</p>
</li>
<li><p><code>k</code> is a constant to make sure that scores aren't dominated by rank 1s (usually <code>k=60</code> is used).</p>
</li>
<li><p>If a document doesn't appear in a list, it contributes <strong>zero</strong>.</p>
</li>
</ul>
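<p>The formula translates directly into a few lines of Python. <code>rrf_score</code> is a toy helper (a hypothetical name, not a library function) that takes a document's 1-based ranks across all the result lists it appears in:</p>

```python
def rrf_score(ranks, k=60):
    """RRF score for one document, given its 1-based ranks across the lists it appears in."""
    # Lists where the document is absent contribute nothing, so they are simply omitted
    return sum(1.0 / (k + r) for r in ranks)
```

<p>With <code>k=60</code>, a document ranked 1st in one list and 2nd in another scores 1/61 + 1/62 ≈ 0.0325.</p>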
<p>Let’s take an example to better understand this approach:</p>
<p>Assume we have 3 query results:</p>
<pre><code class="lang-plaintext">Query 1: [A, B, C, D]
Query 2: [B, C, E]
Query 3: [C, A, F]
</code></pre>
<p>Let’s use <code>k=60</code>.</p>
<blockquote>
<p>The <code>k</code> keeps the impact of one very early rank from being too extreme (so rank 1 isn’t 1.0 but more like 0.016).</p>
</blockquote>
<p>We now calculate scores:</p>
<h4 id="heading-for-document-a">For document A:</h4>
<ul>
<li><p>Appears in Query 1 at rank 1 → 1 / (60 + 1) = 1/61</p>
</li>
<li><p>Appears in Query 3 at rank 2 → 1 / (60 + 2) = 1/62</p>
</li>
<li><p>Total: ~0.0164 + ~0.0161 = <strong>0.0325</strong></p>
</li>
</ul>
<h4 id="heading-for-document-c">For document C:</h4>
<ul>
<li><p>Appears in Query 1 at rank 3 → 1 / (60 + 3) = 1/63</p>
</li>
<li><p>Appears in Query 2 at rank 2 → 1 / (60 + 2) = 1/62</p>
</li>
<li><p>Appears in Query 3 at rank 1 → 1 / (60 + 1) = 1/61</p>
</li>
<li><p>Total: ~0.0159 + 0.0161 + 0.0164 = <strong>0.0484</strong></p>
</li>
</ul>
<p>Similarly, we can do this for all documents and sort them by total score.</p>
<p>Once all scores are computed, we keep only the <strong>best</strong> ones by filtering out those whose score falls below a certain threshold.</p>
<p>Now that we have a good understanding of the ranking algorithm, let's see how we can implement this in code:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> defaultdict

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">reciprocal_rank_fusion</span>(<span class="hljs-params">rankings, k=<span class="hljs-number">60</span>, threshold=<span class="hljs-number">0.0</span></span>):</span>
    rrf_scores = defaultdict(float)

    <span class="hljs-comment"># Iterate over each ranking list</span>
    <span class="hljs-keyword">for</span> ranking <span class="hljs-keyword">in</span> rankings:
        <span class="hljs-keyword">for</span> rank, doc_id <span class="hljs-keyword">in</span> enumerate(ranking):
            <span class="hljs-comment"># Calculate RRF score: 1 / (k + rank + 1)</span>
            rrf_scores[doc_id] += <span class="hljs-number">1</span> / (k + rank + <span class="hljs-number">1</span>)

    <span class="hljs-comment"># Filter by threshold and sort by score in descending order</span>
    filtered = [(doc_id, score) <span class="hljs-keyword">for</span> doc_id, score <span class="hljs-keyword">in</span> rrf_scores.items() <span class="hljs-keyword">if</span> score &gt;= threshold]
    <span class="hljs-keyword">return</span> sorted(filtered, key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>], reverse=<span class="hljs-literal">True</span>)

<span class="hljs-comment"># Example input: 3 ranked lists of documents from different queries</span>
rankings = [
    [<span class="hljs-string">"A"</span>, <span class="hljs-string">"B"</span>, <span class="hljs-string">"C"</span>, <span class="hljs-string">"D"</span>],  <span class="hljs-comment"># Query 1</span>
    [<span class="hljs-string">"B"</span>, <span class="hljs-string">"C"</span>, <span class="hljs-string">"E"</span>],        <span class="hljs-comment"># Query 2</span>
    [<span class="hljs-string">"C"</span>, <span class="hljs-string">"A"</span>, <span class="hljs-string">"F"</span>]         <span class="hljs-comment"># Query 3</span>
]

<span class="hljs-comment"># Apply RRF to combine rankings and filter based on threshold</span>
result = reciprocal_rank_fusion(rankings, k=<span class="hljs-number">60</span>, threshold=<span class="hljs-number">0.02</span>) <span class="hljs-comment"># Adjust threshold as needed</span>

<span class="hljs-comment"># Print the results</span>
print(<span class="hljs-string">"Ranked Documents based on Reciprocal Rank Fusion:"</span>)
<span class="hljs-keyword">for</span> doc, score <span class="hljs-keyword">in</span> result:
    print(<span class="hljs-string">f"Document: <span class="hljs-subst">{doc}</span>, RRF Score: <span class="hljs-subst">{score:<span class="hljs-number">.4</span>f}</span>"</span>)
</code></pre>
<p>The above code can be used to filter out chunks based on their <code>rrf_score</code>.</p>
<h2 id="heading-generation-step">Generation Step</h2>
<p>After filtering down to the <code>most relevant</code> chunks, the process is the same as it has been since chapter 1: pass those chunks to the LLM along with the user’s query, let the LLM handle the rest, and return its response to the user.</p>
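<p>As a rough sketch of that step, under stated assumptions: <code>ranked_docs</code> is the output of <code>reciprocal_rank_fusion</code> above, <code>client</code> is the Gemini-backed OpenAI client from the first chapter, and <code>fetch_chunk_text</code> is a hypothetical helper that maps a <code>doc_id</code> back to its chunk text (none of these names are fixed by the original code):</p>

```python
# Sketch of the generation step. `fetch_chunk_text` is a hypothetical helper
# that maps a doc_id back to its chunk text; `client` is assumed to be the
# Gemini-backed OpenAI client created in the first chapter.
def generate_answer(user_query, ranked_docs, fetch_chunk_text, client, top_n=5):
    # Keep only the top-ranked chunks after RRF and rebuild their text
    context = "\n\n".join(fetch_chunk_text(doc_id) for doc_id, _ in ranked_docs[:top_n])

    SYSTEM_PROMPT = f"""
    You are a helpful assistant. Answer the question using only the context below.
    Context: {context}
    """

    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content
```

<p>The <code>top_n</code> cutoff here plays the same role as the <code>threshold</code> parameter: both keep only the strongest chunks out of the fused ranking.</p>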
<h2 id="heading-click-here-to-get-the-full-codehttpsgithubcomrnkp755blogsblobmainrag-reciprocal-rank-fusionpy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-reciprocal-rank-fusion.py">Click here to get the full code…</a></h2>
<h2 id="heading-issue">Issue</h2>
<p>After doing this, Shreya was over the moon because she thought her system was now super powerful and could answer any type of query. She played with it for a few minutes, but her smile vanished when she asked:</p>
<blockquote>
<p>Trace how digital logic topics expanded in the last five years.</p>
</blockquote>
<p>The system didn't respond as she expected. However, since she is brave and a great problem-solver, she wasn't disheartened and went back to the drawing board to improve the system further.</p>
<p>Let's see what she did in the next chapter.</p>
<blockquote>
<p>In this article, we explore a solution to improve the retrieval and ranking of documents in response to user queries using the Reciprocal Rank Fusion (RRF) algorithm. Shreya's original problem was receiving overwhelming responses to her query about common control systems topics over the last five years. To address this, the RRF algorithm was introduced to rank documents based on their frequency and order of appearance across transformed queries. This process filters out less relevant documents before sending them to a language model for a concise response. Despite initially feeling triumphant with the implementation, Shreya encountered limitations when the system failed to adequately trace the expansion of digital logic topics over five years, prompting further refinement to enhance the system's capabilities.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Shreya Finds Excitement Again with Fan-Out Magic]]></title><description><![CDATA[In the last chapter, we saw how Shreya was discouraged by her system's response to one of her questions. She then returned to the drawing board to find ways to improve her system. Let's see what she discovered and whether it solved her issue.
Let me ...]]></description><link>https://blog.raushan.info/rags-fan-out-technique</link><guid isPermaLink="true">https://blog.raushan.info/rags-fan-out-technique</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Tue, 22 Apr 2025 14:42:28 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/fyWHYqu7D6A/upload/9f2d64d29741cb3c44e7b3a873cee072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the last chapter, we saw how Shreya was discouraged by her system's response to one of her questions. She then returned to the drawing board to find ways to improve her system. Let's see what she discovered and whether it solved her issue.</p>
<p>Let me remind you, this was the question she asked:</p>
<blockquote>
<p>Give me definitions, examples, plus tricky MCQs on LTI systems?</p>
</blockquote>
<p>And the system responded with only definitions.</p>
<h1 id="heading-parallel-query-retrieval-fan-out-technique">Parallel Query Retrieval / Fan-out Technique</h1>
<p>In Parallel Query Retrieval, we create several different versions of the user's query, each focusing on a different aspect. Not clear yet? Don't worry, let's look at an example.</p>
<p>User’s Query:</p>
<blockquote>
<p>How does garbage collection work in Python?</p>
</blockquote>
<p>We'll input this query into an LLM and ask it to create multiple queries, each focusing on different aspects of the original question, such as:</p>
<blockquote>
<ul>
<li><p>What triggers garbage collection in Python?</p>
</li>
<li><p>What are Garbage collection algorithms in Python?</p>
</li>
<li><p>How memory leaks relate to GC?</p>
</li>
</ul>
</blockquote>
<p>This technique is called <code>Parallel Query Retrieval</code> or <code>Fan-out</code> technique.</p>
<h2 id="heading-how-will-this-help">How will this help?</h2>
<p>As we saw in Shreya’s case, when we directly passed her detailed query about <code>LTI Systems</code> to the LLM, it didn't respond well. Now, let's apply the transformation mentioned above to this question.</p>
<p>The actual query was:</p>
<blockquote>
<p>Give me definitions, examples, plus tricky MCQs on LTI systems?</p>
</blockquote>
<p>The transformed queries would be something like:</p>
<blockquote>
<ul>
<li><p>Give me definitions of LTI systems.</p>
</li>
<li><p>Give me examples of LTI systems.</p>
</li>
<li><p>Give me tricky MCQs on LTI systems.</p>
</li>
</ul>
</blockquote>
<p>Now, if I pass these three queries individually through the <code>Retrieval</code> and <code>Generation</code> steps learned in the previous chapter, don't you think Shreya will get a better response?</p>
<ul>
<li><p>Query 1 will focus only on the definitions.</p>
</li>
<li><p>Query 2 will focus only on the examples.</p>
</li>
<li><p>Query 3 will focus only on the MCQs.</p>
</li>
</ul>
<p>After compiling the LLM responses for all three queries, Shreya will achieve her objective.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">In the example, I've transformed the user's query into three distinct queries, but this number can be adjusted based on the specific requirements of the use case.</div>
</div>

<p>Understand the whole flow here:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745327078381/22fa8b94-4d3e-48b9-9ecb-faa0855544e6.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-how-to-do">How to do?</h2>
<p>Implementing this is quite simple by following these steps:</p>
<ol>
<li><p>Take the user's file input.</p>
</li>
<li><p>Perform the indexing process as described in the previous chapter.</p>
</li>
<li><p>Take the user's query.</p>
</li>
<li><p>Make an LLM call with an effective <code>SYSTEM_PROMPT</code> and ask it to regenerate the query into <code>3</code> or <code>n</code> queries, each focusing on different aspects of the original query.</p>
</li>
<li><p>Follow the Retrieval &amp; Generation steps from the last chapter again.</p>
</li>
</ol>
<p>Here’s a code snippet of Query Transformation:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> json    <span class="hljs-comment"># Needed to parse the LLM's JSON output; `client` and `retrieval_generation` come from the previous chapter's code</span>

finalResponse = <span class="hljs-string">""</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fan_out</span>():</span>
    user_query = input(<span class="hljs-string">"&gt;&gt; "</span>)

    <span class="hljs-keyword">global</span> finalResponse
    finalResponse = <span class="hljs-string">""</span>  <span class="hljs-comment"># Reset the final response for each new query</span>

    FAN_OUT_SYSTEM_PROMPT = <span class="hljs-string">"""
    You are a helpful assistant. You will be provided with a question, and you need to generate 3 questions out of it, each focusing on a different aspect of it or related to it. Focus on what the user might be interested in but couldn't ask directly.

    Rules:
    - Follow the output JSON format.

    Example:
    User Query: How does garbage collection work in Python?
    Output: {{ "q1": "What triggers garbage collection in python?", "q2": "Garbage collection algorithms in Python?", "q3": "How memory leaks relate to GC?" }}
    """</span>

    response = client.chat.completions.create(
        model=<span class="hljs-string">"gemini-2.0-flash"</span>,
        response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
        messages=[
            {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: FAN_OUT_SYSTEM_PROMPT},
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
                <span class="hljs-string">"content"</span>: user_query
            }
        ]
    )
    content = response.choices[<span class="hljs-number">0</span>].message.content
    print(<span class="hljs-string">"Fan out response:"</span>, content)
    <span class="hljs-comment"># Parse the JSON response</span>
    parsed_response = json.loads(content)
    <span class="hljs-comment"># Extract the questions</span>
    questions = [parsed_response[<span class="hljs-string">"q1"</span>], parsed_response[<span class="hljs-string">"q2"</span>], parsed_response[<span class="hljs-string">"q3"</span>]]
    print(<span class="hljs-string">"Questions:"</span>, questions)
    <span class="hljs-comment"># Call the retrieval_generation function for each question</span>
    <span class="hljs-keyword">for</span> question <span class="hljs-keyword">in</span> questions:
        retrieval_generation(question)

    print(<span class="hljs-string">"Final response:"</span>, finalResponse)
</code></pre>
<h2 id="heading-get-the-full-code-herehttpsgithubcomrnkp755blogsblobmainrag-fan-outpy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-fan-out.py">Get the full Code here…</a></h2>
<p>This setup was working well for Shreya, and she was jumping on the sofa in excitement.</p>
<h2 id="heading-issue">Issue</h2>
<p>Did she face any issues again? Yes, her joy was short-lived, and soon she encountered another problem. She realized she was receiving a lot more content that she wasn't interested in and hadn't asked for. She was frustrated to see such long responses, even when she asked a simple question.</p>
<p>For example, now if she asks:</p>
<blockquote>
<p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</blockquote>
<p>In response, she’s getting:</p>
<blockquote>
<ul>
<li><p><strong><em>What was the most common control systems topic asked in last 5 years?</em></strong></p>
</li>
<li><p>What are control systems and its usage.</p>
</li>
<li><p>What important topics does control systems include?</p>
</li>
<li><p>What was the most common topics asked in last 5 years? (From other subjects as well)</p>
</li>
</ul>
</blockquote>
<p>And she thought, this isn't a foolproof solution, and she needs to make more improvements. Let's see in the next chapter what idea she comes up with.</p>
<blockquote>
<p>Shreya, facing unsatisfactory responses from her system, explores the Parallel Query Retrieval or Fan-out Technique to enhance the quality of information retrieval. This approach involves breaking down queries into multiple focused sub-queries, which individually target different aspects of the original question. For instance, a comprehensive question on LTI systems is divided into queries asking for definitions, examples, and tricky MCQs. This method initially proves effective, but eventually leads to excessive and irrelevant information. The narrative outlines Shreya's ongoing challenge to refine her system's response quality.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Understanding RAG: A Comprehensive Intro and Shreya's Story]]></title><description><![CDATA[Welcome to the first blog of the series RAG—A powerful technique that enhances the accuracy and relevance of responses generated by large language models (LLMs) by incorporating information from external data sources relevant to user’s query.

💡
Pre...]]></description><link>https://blog.raushan.info/rags-basic-overview</link><guid isPermaLink="true">https://blog.raushan.info/rags-basic-overview</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[RAG ]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Mon, 21 Apr 2025 20:33:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/P5mCQ4KACbM/upload/a86ad6c2fe2cc303e8f209df563b263f.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to the first blog of the series <strong>RAG</strong>—A powerful technique that enhances the accuracy and relevance of responses generated by large language models (LLMs) by incorporating information from external data sources relevant to user’s query.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Prerequisite: To understand this series better, it's important to have a basic understanding of how large language models (LLMs) function. If you're not, <a target="_self" href="https://blog.raushan.info/inside-genai">click here</a> to get it.</div>
</div>

<h1 id="heading-meet-shreya-a-gate-aspirant">Meet Shreya: A Gate Aspirant</h1>
<p>To make this series more relatable and easy to understand, meet Shreya, a GATE aspirant. She has a PDF of previous years' GATE questions from the last 15 years, along with their answers. Now, she wants to know:</p>
<blockquote>
<p>What was the most common control systems topic asked in last 5 years?</p>
</blockquote>
<p>Manually going through such a long PDF isn't practical for her, so what's the solution? Should she copy and paste the entire document into ChatGPT? Of course not. The model has a limited context window and can't handle unlimited text at once. Even if she could somehow input all 500 pages, it would be very inefficient, and the model would return bloated or irrelevant answers. She only needs a focused answer, likely based on just 50 to 60 pages. Loading unnecessary data not only wastes computing resources but also reduces the quality of the output. This is where a RAG pipeline becomes essential.</p>
<hr />
<h1 id="heading-how-retrieval-augmented-generation-rag-solves-this">How Retrieval-Augmented Generation (RAG) solves this</h1>
<p>With a RAG-based setup, when Shreya wants to ask something from the PDF, she first needs to upload it to the system. The system then:</p>
<ol>
<li><p><strong>Chunks</strong> the whole PDF into small parts.</p>
</li>
<li><p><strong>Embeds</strong> those chunks into a high-dimensional vector space.</p>
</li>
<li><p><strong>Stores</strong> them in a vector database like Pinecone, Chroma, or Qdrant.</p>
</li>
</ol>
<p>Now the system is ready to answer Shreya’s questions. Let’s say she asks the same question again. The system will:</p>
<ol>
<li><p><strong>Take</strong> her query as input.</p>
</li>
<li><p><strong>Perform a similarity search</strong> on the vector database to fetch only the relevant chunks.</p>
</li>
<li><p><strong>Feed</strong> those filtered chunks to the LLM.</p>
</li>
<li><p><strong>Generate</strong> a sharp, focused response.</p>
</li>
</ol>
<p>It’s the same as giving a cheat sheet to the model that’s been auto-curated for the specific question.</p>
<hr />
<p>Now let’s understand all these steps one by one and try to code it:</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">To run the code snippets in the blog, you'll need a <code>GEMINI_API_KEY</code>. Don't worry, it's free, so go ahead and get one. Also, you’ll need <code>Docker</code> installed on the system. So install that as well, if you haven't already. You can follow the Installation guide from <a target="_self" href="https://docs.docker.com/engine/install/">here</a>.</div>
</div>

<h2 id="heading-chunking">Chunking</h2>
<h3 id="heading-what-is-chunking">What is Chunking?</h3>
<p><strong>Chunking</strong>—Breaking something into small, manageable pieces, like eating a burger one bite at a time or splitting a long PDF into smaller sections in this case.</p>
<p>The logic for splitting can vary based on needs and circumstances. It can be done page by page, paragraph by paragraph, or even two paragraphs per chunk, or 1000 characters per chunk and so on. It completely depends on the developer.</p>
<h3 id="heading-why-is-it-needed">Why is it needed?</h3>
<p>Since Shreya’s PDF was very large and she wanted a precise answer, likely based on just a few pages, dumping the whole PDF into the LLM wasn’t a great idea. As discussed above, it would lead to bloated or irrelevant answers and waste computing resources.</p>
<h3 id="heading-what-issue-may-come">What issue may come?</h3>
<p><strong>Context Loss</strong> is the main issue when chunking. In multi-page PDFs, there might be sentences that start on, let's say, page 1 and continue on page 2. In these cases, chunking by page would lose the sentence's context. Let's understand this better with the help of an image:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745247471680/1f0ac102-5660-450c-892a-f6039a6cde34.png" alt class="image--center mx-auto" /></p>
<p>As you can see in the above image, chunk 1 doesn’t know which other positions <code>Hitesh Choudhary</code> holds, and chunk 2 doesn’t know who this content creator and CTO is.</p>
<p>To solve this, we’ll overlap some characters while chunking, i.e., include some characters from Chunk 1 in Chunk 2. As we can see in the next image, both chunks now have enough context about their content.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745247588411/3616a1c3-851c-4ba9-a3a5-d2c917483aad.png" alt class="image--center mx-auto" /></p>
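<p>The idea can be shown in a few lines of plain Python before reaching for any library. This is a hypothetical character-window splitter, not the one LangChain uses internally, and the <code>chunk_size</code> and <code>overlap</code> values are purely illustrative:</p>

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of `chunk_size` characters, where each chunk
    repeats the last `overlap` characters of the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Small values so the overlap is visible in the printed output
chunks = chunk_text("Hitesh Choudhary is a content creator and CTO.",
                    chunk_size=30, overlap=10)
for c in chunks:
    print(repr(c))
```

<p>Each printed chunk starts with the last 10 characters of the previous one, so no sentence boundary is lost entirely.</p>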
<h3 id="heading-how-to-do">How to do?</h3>
<p>To do this, we’ll use some built-in helpers from <code>langchain</code> (in case you don’t know, LangChain provides utilities that let developers perform common tasks in the world of LLMs).</p>
<pre><code class="lang-python"><span class="hljs-string">"""
DO INSTALL NECESSARY PACKAGES IN VIRTUAL ENVIRONMENT
- python -m venv venv
- source venv/bin/activate    (on Windows: venv\Scripts\activate)
- pip install langchain_community pypdf langchain_text_splitters 
"""</span>

<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> PyPDFLoader
<span class="hljs-keyword">from</span> langchain_text_splitters <span class="hljs-keyword">import</span> RecursiveCharacterTextSplitter

pdf_path = Path(<span class="hljs-string">"./gate-pyqs.pdf"</span>)    <span class="hljs-comment"># Put the path of the PDF</span>

loader = PyPDFLoader(file_path = pdf_path)    <span class="hljs-comment"># PyPDFLoader is a langchain utility class which helps to load the PDF</span>
docs = loader.load()    <span class="hljs-comment"># Loads the PDF</span>

text_splitter = RecursiveCharacterTextSplitter(    <span class="hljs-comment"># RecursiveCharacterTextSplitter is a utility function which helps to split the PDF in chunks based on characters</span>
    chunk_size=<span class="hljs-number">1000</span>,    <span class="hljs-comment"># Split 1000 characters per chunk</span>
    chunk_overlap=<span class="hljs-number">200</span>    <span class="hljs-comment"># Overlap 200 characters per chunk to avoid context loss</span>
)

split_docs = text_splitter.split_documents(docs)    <span class="hljs-comment"># Split the PDF / Chunking</span>

print(<span class="hljs-string">"Number of documents before splitting:"</span>, len(docs))
print(docs[<span class="hljs-number">0</span>])  <span class="hljs-comment"># docs is a list of Document objects</span>
print(<span class="hljs-string">"Number of documents after splitting:"</span>, len(split_docs))
print(split_docs[<span class="hljs-number">0</span>])    <span class="hljs-comment"># split_docs is a list of Document objects</span>
</code></pre>
<h2 id="heading-vector-embedding">Vector Embedding</h2>
<h3 id="heading-what-is-vector-embedding">What is Vector Embedding?</h3>
<p>It maps the semantic meaning of words in a sentence to multi-dimensional coordinates (often visualized in 2D or 3D). For example, in the sentences, <strong><em>Monkey eats banana</em></strong> and <strong><em>Man eats rice</em></strong>, ‘monkey' and 'man' are both animals, while 'banana' and 'rice' are food items. As a result, 'monkey' and 'man' would be positioned close to each other in one region of the space, and 'banana' and 'rice' in another.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744063638604/5e55df93-2553-4000-8f55-9511d0a5b9a6.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-why-is-it-needed-1">Why is it needed?</h3>
<p>To understand a sentence well, a model needs to relate its words to one another, and vector embeddings capture these relationships effectively.</p>
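<p>A toy illustration of this idea: the three-component vectors below are hand-crafted for demonstration (a real model such as <code>text-embedding-004</code> produces hundreds of dimensions), but they show how cosine similarity scores semantically close words higher.</p>

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-crafted toy vectors, NOT real embeddings
vectors = {
    "monkey": [0.9, 0.1, 0.0],
    "man":    [0.8, 0.2, 0.1],
    "banana": [0.1, 0.9, 0.2],
}

print(cosine_similarity(vectors["monkey"], vectors["man"]))     # high: both animals
print(cosine_similarity(vectors["monkey"], vectors["banana"]))  # low: unrelated
```

<p>Similarity search in a vector database is, at its core, this same comparison performed efficiently over millions of stored vectors.</p>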
<h3 id="heading-how-to-do-1">How to do?</h3>
<p>We’ll again use a utility function from LangChain to create vector embeddings.</p>
<pre><code class="lang-python"><span class="hljs-string">"""
- pip install langchain-google-genai
"""</span>

<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> langchain_google_genai <span class="hljs-keyword">import</span> GoogleGenerativeAIEmbeddings

GOOGLE_API_KEY = os.getenv(<span class="hljs-string">"GOOGLE_API_KEY"</span>)    <span class="hljs-comment"># Get a Gemini API Key</span>

embedder = GoogleGenerativeAIEmbeddings(    <span class="hljs-comment"># GoogleGenerativeAIEmbeddings is an utility function to create vector embeddings</span>
    model=<span class="hljs-string">"models/text-embedding-004"</span>,    <span class="hljs-comment"># Google's embedding model</span>
    google_api_key=GOOGLE_API_KEY,
    )
</code></pre>
<h2 id="heading-storing-in-vector-database">Storing in Vector Database</h2>
<h3 id="heading-what-is-a-vector-database">What is a vector database?</h3>
<p>A vector database is a specialized type of database designed to store, index, and query vector embeddings.</p>
<h3 id="heading-how-to-installuse-a-vector-database">How to install/use a vector database?</h3>
<p>There are various vector databases available in the market, for example, ChromaDB, Pinecone, Qdrant, etc. Here we’ll go with Qdrant, because it’s lightweight &amp; open-source.</p>
<p>Make sure that you have Docker installed and it’s up and running. Run the following commands in the terminal.</p>
<pre><code class="lang-yaml"><span class="hljs-string">docker</span> <span class="hljs-string">pull</span> <span class="hljs-string">qdrant/qdrant</span>    <span class="hljs-comment"># Pull QdrantDB</span>
</code></pre>
<pre><code class="lang-yaml"><span class="hljs-string">docker</span> <span class="hljs-string">run</span> <span class="hljs-string">-p</span> <span class="hljs-number">6333</span><span class="hljs-string">:6333</span> <span class="hljs-string">-d</span> <span class="hljs-string">qdrant/qdrant</span>    <span class="hljs-comment"># Run the iamge in detach mode &amp; Port Mapping</span>
</code></pre>
<p>Now go to <code>http://localhost:6333/dashboard</code>. You’ll find a pre-built Qdrant dashboard running there.</p>
<h3 id="heading-how-to-make-vector-embeddings">How to make vector embeddings?</h3>
<pre><code class="lang-python"><span class="hljs-string">"""
- pip install langchain_qdrant
"""</span>

<span class="hljs-keyword">from</span> langchain_qdrant <span class="hljs-keyword">import</span> QdrantVectorStore    <span class="hljs-comment"># langchain_qdrant is a utility package for interacting with QdrantDB</span>

vector_store = QdrantVectorStore.from_documents(    <span class="hljs-comment"># Creates a collection and store embeddings into the database</span>
    documents = split_docs,
    url = <span class="hljs-string">"http://localhost:6333"</span>,
    collection_name = <span class="hljs-string">"learning_langchain"</span>,
    embedding = embedder
)

retriever = QdrantVectorStore.from_existing_collection(    <span class="hljs-comment"># Creates a retriever to do query operations on db</span>
    url = <span class="hljs-string">"http://localhost:6333"</span>,
    collection_name = <span class="hljs-string">"learning_langchain"</span>,
    embedding = embedder
)
</code></pre>
<p>Now the system is ready to answer Shreya's boring questions. I hope you haven’t forgotten that GATE aspirant.</p>
<hr />
<h2 id="heading-take-shreyas-query-as-input-amp-perform-a-similarity-search-retrieval">Take Shreya’s query as input &amp; perform a Similarity Search (Retrieval)</h2>
<p>After storing the embeddings of her data source in a vector database, it's time to take her questions and find relevant content from the database. This content can then be provided to the LLM so the model can deliver precise and accurate answers.</p>
<pre><code class="lang-python">user_query = input(<span class="hljs-string">"&gt;&gt; "</span>)

relevant_chunks  = retriever.similarity_search(    <span class="hljs-comment"># similarity_search is a function to find similar embeddings</span>
    query = user_query
)

print(<span class="hljs-string">"Search result:"</span>, relevant_chunks)
</code></pre>
<h2 id="heading-generate-a-response-from-llm-generation">Generate a response from LLM (Generation)</h2>
<p>Now that we have the relevant data source and the user query, it's time to create a suitable <code>SYSTEM_PROMPT</code> and provide everything to the LLM. The LLM will handle the rest, and we'll receive the response we want.</p>
<pre><code class="lang-python"><span class="hljs-string">"""
- pip install openai
"""</span>

<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

<span class="hljs-comment"># Create client for chatting</span>
client = OpenAI(
    api_key=GOOGLE_API_KEY,    <span class="hljs-comment"># Provide Gemini API key here</span>
    base_url=<span class="hljs-string">"https://generativelanguage.googleapis.com/v1beta/openai/"</span>
)

SYSTEM_PROMPT = <span class="hljs-string">"""
You are a helpful assistant. You will be provided with a question and relevant context from a document. Your task is to provide a concise answer based on the context.
Context: {relevant_chunks}
"""</span>

response = client.chat.completions.create(
    model=<span class="hljs-string">"gemini-2.0-flash"</span>,
    messages=[
        {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span>: SYSTEM_PROMPT.format(relevant_chunks=relevant_chunks)},
        {
            <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
            <span class="hljs-string">"content"</span>: user_query
        }
    ]
)

print(response.choices[<span class="hljs-number">0</span>].message)
</code></pre>
<p>Now, if Shreya asks the same question, that is,</p>
<blockquote>
<p>What was the most common control systems topic asked in last 5 years?</p>
</blockquote>
<p>The <code>similarity_search</code> function will go to the database and find the relevant chunks (control systems topics from the last 5 years) from the data source, and the LLM will now be able to answer it easily.</p>
<p>To summarize, here’s a flow of the whole process:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745267484871/bb9ee4ad-8bb3-425c-af28-803e3473befe.png" alt class="image--center mx-auto" /></p>
<p><img src="https://media.datacamp.com/legacy/v1704459771/image_552d84ab56.png" alt="What is Retrieval Augmented Generation (RAG)? | DataCamp" /></p>
<h2 id="heading-click-here-to-get-the-full-codehttpsgithubcomrnkp755blogsblobmainrag-onepy"><a target="_blank" href="https://github.com/rnkp755/blogs/blob/main/rag-one.py">Click here to get the full code</a></h2>
<h2 id="heading-issues">Issues</h2>
<p>Shreya was excited about the results she got while experimenting with her model. However, her excitement didn't last long because, during her testing, she asked another question:</p>
<blockquote>
<p>Give me definitions, examples, plus tricky MCQs on LTI systems?</p>
</blockquote>
<p>And in response, the model provided only definitions, which made Shreya return to the drawing board to adjust the architecture. In the next chapter, we'll see what Shreya did to improve her system.</p>
<blockquote>
<p>This article introduces Retrieval-Augmented Generation (RAG), a technique that enhances the accuracy of large language models by incorporating external data. Using a real-life scenario of a GATE aspirant named Shreya, the article explains how RAG efficiently processes large documents by chunking and embedding them in a vector database. This approach retrieves relevant information to provide sharp, focused responses. The article also details the coding process for implementing RAG and highlights potential challenges in fine-tuning the system for comprehensive results.</p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[DIY Mini Cursor: Simple Creation Guide]]></title><description><![CDATA[Cursor is an AI-powered code editor you might already be familiar with. But have you ever paused to wonder how it actually works under the hood? What kind of "magic" powers an intelligent code companion? In this post, we’ll uncover the concept behind...]]></description><link>https://blog.raushan.info/build-your-own-cursor</link><guid isPermaLink="true">https://blog.raushan.info/build-your-own-cursor</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[ChaiCohort]]></category><category><![CDATA[llm]]></category><category><![CDATA[mcp]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Sat, 12 Apr 2025 13:05:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/oXlXu2qukGE/upload/783d7dc6667e52d4d9d127659fc2f24d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Cursor</strong> is an AI-powered code editor you might already be familiar with. But have you ever paused to wonder <em>how</em> it actually works under the hood? What kind of "magic" powers an intelligent code companion? In this post, we’ll uncover the concept behind such tools and build a <strong>mini version of Cursor</strong> ourselves to truly understand the mechanics.</p>
<h2 id="heading-what-is-agentic-ai">What is Agentic AI?</h2>
<p>Before we dive into coding, it's important to understand the concept of <strong>AI Agents</strong>—a fundamental part of what makes tools like Cursor work.</p>
<p>At a high level, <strong>AI agents</strong> are intelligent systems enhanced with tools. These tools are built by us (developers) to extend the AI's native capabilities. While the base AI model provides reasoning, context understanding, and natural language generation, these agents can <strong>decide when and how to use specific tools</strong> to accomplish a goal.</p>
<hr />
<h2 id="heading-real-world-example-a-weather-agent">Real-World Example: A Weather Agent</h2>
<p>Let’s understand this with a practical use case.</p>
<p>Imagine you're building an AI agent that provides real-time weather updates.</p>
<p>By default, large language models (LLMs) like GPT or Gemini don’t have access to the internet or real-time data. But you can overcome this limitation by giving the model access to a <strong>tool</strong>—an external API endpoint that fetches live weather data.</p>
<h3 id="heading-how-it-works">How It Works</h3>
<p>You can define a simple instruction like this for your AI agent:</p>
<blockquote>
<p>"If someone asks for weather information, call the <code>/get-weather?city=&lt;CITY_NAME&gt;</code> API and return the result."</p>
</blockquote>
<p>Now, if someone says:</p>
<blockquote>
<p>"What's the weather in Mohali right now?"</p>
</blockquote>
<p>The AI will:</p>
<ol>
<li><p>Detect the user intent (<code>weather inquiry</code>).</p>
</li>
<li><p>Trigger the API with the appropriate city (<code>Mohali</code>).</p>
</li>
<li><p>Parse the API response.</p>
</li>
<li><p>Respond with something like:<br /> <code>"The current temperature in Mohali is 23°C with clear skies."</code></p>
</li>
</ol>
<p>This is the core of <strong>agentic AI</strong>—giving your LLM the autonomy to use tools <em>intelligently</em>.</p>
<hr />
<h3 id="heading-next-step-make-it-real">Next Step: Make It Real</h3>
<p>To bring this to life, you'll need an <strong>OpenAI API key</strong> (or you can use Gemini with a few small changes). We'll write a simple script where the AI uses an external weather API to answer real-time queries—just like an actual agent.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

<span class="hljs-comment"># Load environment variables for the OpenAI API key</span>
load_dotenv()

client = OpenAI()    <span class="hljs-comment"># Create an OpenAI client</span>

<span class="hljs-comment"># Tool function: fetch real-time weather data from an external API</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_weather</span>(<span class="hljs-params">city: str</span>):</span>    
    print(<span class="hljs-string">"⛏️Tool Called: get_weather for : "</span>, city)
    url = <span class="hljs-string">f"https://wttr.in/<span class="hljs-subst">{city}</span>?format=%C+%t"</span>
    response = requests.get(url)
    <span class="hljs-keyword">if</span> response.status_code == <span class="hljs-number">200</span>:
        <span class="hljs-keyword">return</span> response.text

    <span class="hljs-keyword">return</span> <span class="hljs-string">"Something went wrong. Couldn't fetch weather"</span>

<span class="hljs-comment"># Make a dictionary of available tools</span>
available_tools = {
    <span class="hljs-string">"get_weather"</span>: {
        <span class="hljs-string">"fn"</span>: get_weather,
        <span class="hljs-string">"description"</span>: <span class="hljs-string">"Takes a city name as an input and returns the current weather for the city"</span>
    }
}

<span class="hljs-comment"># Give a detailed system prompt to customize the behaviour of AI</span>
system_prompt = <span class="hljs-string">f"""
    You are a helpful AI assistant specialized in resolving user queries.
    You work in start, plan, action, observe mode.
    For the given user query and the available tools, plan the step-by-step execution. Based on the plan,
    select the relevant tool from the available tools, and perform an action to call that tool.
    Wait for the observation, and resolve the user query based on the observation from the tool call.

    Rules:
    - Follow the Output JSON Format.
    - Always perform one step at a time and wait for next input
    - Carefully analyse the user query

    Output JSON Format:
    {{
        "step": "string",
        "content": "string",
        "function": "The name of function if the step is action",
        "input": "The input parameter for the function",
    }}

    Available Tools:
    - get_weather: Takes a city name as an input and returns the current weather for the city

    Example:
    User Query: What is the weather of new york?
    Output: {{ "step": "plan", "content": "The user is interested in the weather data of New York" }}
    Output: {{ "step": "plan", "content": "From the available tools I should call get_weather" }}
    Output: {{ "step": "action", "function": "get_weather", "input": "new york" }}
    Output: {{ "step": "observe", "output": "12 Degree Cel" }}
    Output: {{ "step": "output", "content": "The weather for new york seems to be 12 degrees." }}
"""</span>

<span class="hljs-comment"># A messages list to store the conversation</span>
messages = [
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span> : system_prompt}
]

<span class="hljs-comment">#This is where all magic happens</span>
<span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
    user_query = input(<span class="hljs-string">'&gt; '</span>)
    messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_query})

    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>: 
        response = client.chat.completions.create(
            model=<span class="hljs-string">"gpt-4o"</span>,
            response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
            messages = messages,
        )
        parsed_response = json.loads(response.choices[<span class="hljs-number">0</span>].message.content)

        messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps(parsed_response)})

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"plan"</span>:
            print(<span class="hljs-string">f"🧠 Thinking: "</span>, parsed_response.get(<span class="hljs-string">"content"</span>))
            <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"action"</span>: 
            tool_name = parsed_response.get(<span class="hljs-string">"function"</span>)
            <span class="hljs-keyword">if</span> tool_name <span class="hljs-keyword">in</span> available_tools:
                fn_output = available_tools[tool_name][<span class="hljs-string">"fn"</span>](parsed_response.get(<span class="hljs-string">"input"</span>))
                messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps({ <span class="hljs-string">"step"</span>: <span class="hljs-string">"observe"</span>, <span class="hljs-string">"output"</span>:  fn_output})})
                <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"output"</span>:
            print(<span class="hljs-string">f"🤖: <span class="hljs-subst">{parsed_response.get(<span class="hljs-string">'content'</span>)}</span>"</span>)
            <span class="hljs-keyword">break</span>
</code></pre>
<hr />
<h2 id="heading-mini-cursor">Mini Cursor</h2>
<p>Now, let’s take the weather agent concept one step further—and this is where it starts getting exciting.</p>
<p>Imagine replacing the weather API with an API that interacts with <strong>your own terminal</strong>. That’s right—your AI agent can now send commands directly to your system through a controlled backend API. This is the core idea behind building a <strong>mini version of Cursor</strong>.</p>
<hr />
<h3 id="heading-how-it-works-1">How It Works</h3>
<p>Here’s what happens behind the scenes:</p>
<ol>
<li><p><strong>The AI decides</strong> what needs to be done (e.g., create a folder, write a file, run a script).</p>
</li>
<li><p>It <strong>sends a request</strong> to your API.</p>
</li>
<li><p>Your API <strong>executes the command</strong> on your local machine (via shell or OS-level commands).</p>
</li>
<li><p>The <strong>result</strong> is sent back to the model, which presents the output or continues the workflow.</p>
</li>
</ol>
<hr />
<h3 id="heading-example-actions">Example Actions</h3>
<p>Let’s say the AI wants to:</p>
<ul>
<li><p>Create a folder:<br />  It calls the API → API runs <code>mkdir my-folder</code> → Folder is created.</p>
</li>
<li><p>Write a file:<br />  It sends file content + path → API gets OS-level permission → File is written.</p>
</li>
<li><p>Start a server:<br />  It calls the API → API runs <code>npm start</code> → Server starts running.</p>
</li>
</ul>
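Executing whatever command the model chooses is powerful but risky. As one illustrative safeguard (a sketch using a hypothetical allowlist, not part of this post's script), the tool could refuse anything outside a fixed set of executables:

```python
import shlex
import subprocess

# Hypothetical allowlist -- the command names here are illustrative only.
ALLOWED_COMMANDS = {"mkdir", "ls", "cat", "echo", "npm", "node"}

def safe_run(command: str) -> str:
    """Run a command only if its executable is on the allowlist."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED_COMMANDS:
        refused = parts[0] if parts else ""
        return f"Refused: '{refused}' is not an allowed command"
    result = subprocess.run(parts, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else result.stderr
```

Plugging a guard like this into the command-running tool keeps the agent useful while bounding the damage a bad plan can do.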
<p>In short, the LLM becomes an intelligent assistant that not only <em>suggests</em> code but also <strong>executes real commands</strong>, acting like an automated developer sidekick.</p>
<h3 id="heading-ready-to-try-it">Ready to Try It?</h3>
<p>Let’s turn this concept into code. Below is a script that lets your AI agent run terminal commands and write files through tool calls.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> subprocess
<span class="hljs-keyword">import</span> platform
<span class="hljs-keyword">import</span> shlex
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> dotenv <span class="hljs-keyword">import</span> load_dotenv
<span class="hljs-keyword">from</span> openai <span class="hljs-keyword">import</span> OpenAI

load_dotenv()

client = OpenAI()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run_command</span>(<span class="hljs-params">command: str, background=False</span>):</span>
    print(<span class="hljs-string">"⛏️Tool Called: run_command for : "</span>, command)

    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> background:
        <span class="hljs-comment"># Run in foreground (blocking) - original behavior</span>
        result = os.system(command)
        <span class="hljs-keyword">return</span> result
    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># Run in background (non-blocking) </span>
        <span class="hljs-comment"># Commands like `npm start` need to keep running. If you run such a command on the primary terminal, it will block the terminal and you won't be able to chat further</span>
        <span class="hljs-keyword">try</span>:
            <span class="hljs-comment"># Create a detached process based on the OS</span>
            <span class="hljs-keyword">if</span> platform.system() == <span class="hljs-string">"Windows"</span>:
                <span class="hljs-comment"># For Windows, use CREATE_NEW_CONSOLE flag</span>
                full_command = <span class="hljs-string">f'start /min cmd /c "<span class="hljs-subst">{command}</span>"'</span>
                subprocess.Popen(full_command, shell=<span class="hljs-literal">True</span>)
                print(<span class="hljs-string">f"Process started in background with Windows 'start' command"</span>)
                <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
            <span class="hljs-keyword">else</span>:
                <span class="hljs-comment"># For Unix/Linux/Mac, use setsid to create new session</span>
                command_parts = shlex.split(command)
                process = subprocess.Popen(
                    command_parts,
                    preexec_fn=os.setsid,  <span class="hljs-comment"># Detaches from parent process</span>
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL
                )
                print(<span class="hljs-string">f"Process started in background with PID: <span class="hljs-subst">{process.pid}</span>"</span>)

            <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error running command in background: <span class="hljs-subst">{e}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">write_to_file</span>(<span class="hljs-params">input_json</span>):</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">if</span> isinstance(input_json, str):
            params = json.loads(input_json)
        <span class="hljs-keyword">else</span>:
            params = input_json

        filename = params.get(<span class="hljs-string">"filename"</span>)
        content = params.get(<span class="hljs-string">"content"</span>)
        print(<span class="hljs-string">"⛏️Tool Called: write_to_file for : "</span>, filename)


        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> filename <span class="hljs-keyword">or</span> content <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
            print(<span class="hljs-string">"Error: Missing filename or content in parameters"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

        os.makedirs(os.path.dirname(filename) <span class="hljs-keyword">or</span> <span class="hljs-string">'.'</span>, exist_ok=<span class="hljs-literal">True</span>)

        <span class="hljs-keyword">with</span> open(filename, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> file:
            file.write(content)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error writing to file: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

available_tools = {
    <span class="hljs-string">"run_command"</span>: {
        <span class="hljs-string">"fn"</span>: run_command,
        <span class="hljs-string">"description"</span>: <span class="hljs-string">"Takes two parameters, 'command: string' and 'background: boolean', and executes the command on the system. If the command needs to keep running in the background (like npm start), pass 'background' as True; it then returns True if the process was launched successfully. Otherwise 'background' defaults to False and the result of the command is returned."</span>
    },
    <span class="hljs-string">"write_to_file"</span>: {
        <span class="hljs-string">"fn"</span>: write_to_file,
        <span class="hljs-string">"description"</span>: <span class="hljs-string">"Takes a JSON input with 'filename' and 'content' keys. Creates or overwrites the file with the specified content and returns True or False according to the status."</span>
    }
}

system_prompt = <span class="hljs-string">f"""
    You are a helpful AI assistant specialized in resolving user queries.
    You work in start, plan, action, observe mode.
    For the given user query and the available tools, plan the step-by-step execution. Based on the plan,
    select the relevant tool from the available tools, and perform an action to call that tool.
    Wait for the observation, and resolve the user query based on the observation from the tool call.

    Rules:
    - Follow the Output JSON Format.
    - Always perform one step at a time and wait for next input
    - Carefully analyse the user query
    - Follow best folder structure and coding practices.
    - A new project should always be created in a separate folder, and all subsequent commands should run inside that folder. Eg: cd new_folder &amp;&amp; npm i
    - Create separate folders for database-related files, controllers, middlewares, routes etc. in backend projects.
    - Create utilities to send structured API responses and errors for backend projects.
    - Create separate folders for components, hooks etc. in frontend projects.
    - Use the '-y' flag in commands wherever possible to reduce manual interruptions.
    - Never install a dependency by editing the package.json file directly. Use the npm install &lt;pkg&gt; command
    - To activate a virtual environment use the 'cd new-folder &amp;&amp; .\\venv_name\\Scripts\\activate' command
    - Use nodemon or --watch kind of tools to watch for changes.

    Output JSON Format:
    {{
        "step": "string",
        "content": "string",
        "function": "The name of function if the step is action",
        "input": "The input parameter for the function",
    }}

    Available Tools:
    - run_command: Takes two parameters, 'command: string' and 'background: boolean', and executes the command on the system. If the command needs to keep running in the background (like npm start, uvicorn main:app --reload etc.), pass 'background' as True; it then returns True if the process was launched successfully. Otherwise 'background' defaults to False and the result of the command is returned.
    - write_to_file: Takes a JSON input with 'filename' and 'content' keys. Creates or overwrites the file with the specified content.    

    Example:
    User Query: Create a basic react project?
    Output: {{ "step": "plan", "content": "The user is interested in creating a basic React project" }}
    Output: {{ "step": "plan", "content": "Let me check if Node is installed on the user's system or not." }}
    Output: {{ "step": "action", "function": "run_command", "input": "node -v" }}
    Output: {{ "step": "observe", "output": "v22.14.0" }}
    Output: {{ "step": "plan", "content": "Since node -v returned a version, Node is installed. Now I should call run_command again to create a React project in a separate folder" }}
    Output: {{ "step": "action", "function": "run_command", "input": "npx create-react-app my-app -y" }}
    Output: {{ "step": "observe", "output": "Success! Created my-app" }}
    Output: {{ "step": "plan", "content": "Now I need to start the app after navigating to my-app directory" }}
    Output: {{ "step": "action", "function": "run_command", "input": "cd my-app &amp;&amp; npm start, True" }}
    Output: {{ "step": "observe", "output": "True" }}
    Output: {{ "step": "output", "content": "Project created successfully!" }}

    Example:
    User Query: Create a test.txt file in temp folder and write Hello with each character in new line.
    Output: {{ "step": "plan", "content": "The user is interested in creating a test.txt file in the temp folder and writing Hello in it" }}
    Output: {{ "step": "plan", "content": "The available tool I found is write_to_file" }}
    Output: {{ "step": "action", "function": "write_to_file", "input": "{{"filename": "temp/test.txt", "content": "H\\ne\\nl\\nl\\no"}} }}
    Output: {{ "step": "observe", "output": "True" }}
    Output: {{ "step": "output", "content": "File created successfully" }}

"""</span>

messages = [
    {<span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>, <span class="hljs-string">"content"</span> : system_prompt}
]


<span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
    user_query = input(<span class="hljs-string">'&gt; '</span>)
    messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>, <span class="hljs-string">"content"</span>: user_query})

    <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>: 
        response = client.chat.completions.create(
            model=<span class="hljs-string">"gpt-4o-mini"</span>,
            response_format={<span class="hljs-string">"type"</span>: <span class="hljs-string">"json_object"</span>},
            messages = messages,
        )
        parsed_response = json.loads(response.choices[<span class="hljs-number">0</span>].message.content)

        messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps(parsed_response)})

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"plan"</span>:
            print(<span class="hljs-string">f"🧠 Thinking: "</span>, parsed_response.get(<span class="hljs-string">"content"</span>))
            <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"action"</span>: 
            tool_name = parsed_response.get(<span class="hljs-string">"function"</span>)
            <span class="hljs-keyword">if</span> tool_name <span class="hljs-keyword">in</span> available_tools:
                fn_output = available_tools[tool_name][<span class="hljs-string">"fn"</span>](parsed_response.get(<span class="hljs-string">"input"</span>))
                messages.append({<span class="hljs-string">"role"</span>: <span class="hljs-string">"assistant"</span>, <span class="hljs-string">"content"</span>: json.dumps({ <span class="hljs-string">"step"</span>: <span class="hljs-string">"observe"</span>, <span class="hljs-string">"output"</span>:  fn_output})})
                <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> parsed_response.get(<span class="hljs-string">"step"</span>) == <span class="hljs-string">"output"</span>:
            print(<span class="hljs-string">f"🤖: <span class="hljs-subst">{parsed_response.get(<span class="hljs-string">'content'</span>)}</span>"</span>)
            <span class="hljs-keyword">break</span>
</code></pre>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>If you've carefully followed the code and logic, you'll notice something interesting—<strong>the core architecture hasn’t changed at all.</strong></p>
<p>All we did was swap out the tools (APIs) the agent uses:</p>
<ul>
<li><p>First, it was a weather API.</p>
</li>
<li><p>Then, it became a terminal command executor.</p>
</li>
</ul>
<p>That’s it.</p>
<p>And just like that, you’ve built your own <strong>Mini Cursor</strong>.</p>
<hr />
<h3 id="heading-its-not-magic-its-engineering">It’s Not Magic, It’s Engineering</h3>
<p>Cursor isn’t some black-box sorcery—it’s simply a well-orchestrated system of:</p>
<ul>
<li><p>AI + tool access (via APIs),</p>
</li>
<li><p>Structured system prompts,</p>
</li>
<li><p>And intelligent orchestration logic.</p>
</li>
</ul>
<p>Now that you understand the concept and have hands-on experience, you can imagine just how powerful things can get when you scale this architecture.</p>
<h3 id="heading-see-it-in-action">See It in Action</h3>
<p>Curious how my version of Cursor works in real life?<br />Watch the demo video here:</p>
<div class="embed-wrapper"><a class="embed-card" href="https://twitter.com/rnkp_755/status/1910814493363384416">https://twitter.com/rnkp_755/status/1910814493363384416</a></div>
]]></content:encoded></item><item><title><![CDATA[Beyond the Black Box of Generative LLMs]]></title><description><![CDATA[GPT is a buzzword that is intimidating for freshers these days. Technical freshers feel anxious after witnessing its capabilities, concerned that it may threaten their jobs. In contrast, both technical and non-technical individuals are amazed, ponder...]]></description><link>https://blog.raushan.info/inside-genai</link><guid isPermaLink="true">https://blog.raushan.info/inside-genai</guid><category><![CDATA[ChaiCode]]></category><category><![CDATA[gpt]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Raushan Kumar Thakur]]></dc:creator><pubDate>Tue, 08 Apr 2025 10:10:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/nGoCBxiaRO0/upload/5148c41165cee7735f6f71b4a4eb4fa9.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>GPT is a buzzword that is intimidating for freshers these days. Technical freshers feel anxious after witnessing its capabilities, concerned that it may threaten their jobs. In contrast, both technical and non-technical individuals are amazed, pondering, "How can a machine accomplish all this?" Let's delve deeper to understand the mechanism behind it.</p>
<hr />
<h2 id="heading-what-is-generative">What is Generative?</h2>
<p>The term "generative" refers to the ability to create or produce something. Unlike traditional systems that retrieve information from the web, these large language models (LLMs), as we have already experienced, are designed to generate content on their own.</p>
<h2 id="heading-what-is-pre-trained">What is Pre-Trained?</h2>
<p>As the term suggests, these LLMs are pre-trained on large amounts of data, which enables them to generate responses. Generating a response simply means predicting the next token repeatedly through mathematical calculations, not magic.</p>
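To make "predicting the next token repeatedly" concrete, here is a toy sketch (the corpus and the bigram lookup table are invented for illustration; real LLMs use trained neural networks, not frequency tables):

```python
from collections import Counter, defaultdict

# Tiny invented corpus -- real models train on vastly more text.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows another: a lookup-table "model".
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start: str, steps: int) -> list:
    """Greedily append the most frequent next word, one token at a time."""
    out = [start]
    for _ in range(steps):
        followers = bigrams[out[-1]]
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out

print(generate("the", 3))  # ['the', 'cat', 'sat', 'on']
```

An LLM runs the same loop at inference time, except the "most likely next token" comes from a neural network rather than a frequency count.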
<h2 id="heading-what-are-transformers">What are Transformers?</h2>
<p>This term represents the entire mechanism behind how GPTs function. This is the core neural network architecture that GPT models are built upon. Let’s understand the underlying mechanisms step by step by referencing the <a target="_blank" href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">research work by Google</a> itself.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744102270683/9deaa63c-1731-4d21-a04d-1f9dee3b579a.png" alt="The Transformer - model architecture." class="image--center mx-auto" /></p>
<h3 id="heading-input-and-encoding">Input and Encoding</h3>
<p>This is the initial stage of interacting with LLMs, whether for training or inference. This step involves receiving input from the user, converting it into machine language, and understanding the actual context the user is referring to. Here are the detailed steps involved in this:</p>
<ol>
<li><p><strong>Tokenization</strong>: As we know, machines only understand numbers. Therefore, it's essential to convert every user input into numbers first. This step is called <strong>Tokenization</strong>, and these numbers are called <strong>Tokens.</strong></p>
<p> The high-level architecture of tokenization involves breaking down a sentence into chunks of words, symbols, or sometimes even small sentences. These chunks are then replaced with corresponding numbers from their vocabulary dictionary. Each LLM has its own dictionary for replacing these chunks.</p>
<p> <strong>For example</strong>, let's create an imaginary vocabulary dictionary and tokenized sentences to see how this process might work.</p>
 <div data-node-type="callout">
 <div data-node-type="callout-emoji">🚨</div>
 <div data-node-type="callout-text">This example is meant to provide a better understanding of tokenization and does not reflect how tokenization occurs in the real world.</div>
 </div>

<p> I want to simulate the traditional multi-tap process on old phone keypads where hitting '2' once gives 'a', twice gives 'b', thrice gives 'c', etc.</p>
<p> <img src="https://www.researchgate.net/profile/Shumin-Zhai/publication/221518150/figure/fig1/AS:305488823635968@1449845619238/The-standard-12-key-telephone-keypad-character-layout-follows-the-ITU-E161-standard-8_Q320.jpg" alt="The standard 12-key telephone keypad, character layout ..." class="image--center mx-auto" /></p>
<p> Here are the steps for this:</p>
<ul>
<li><p>Iterate through <em>each character</em> of the string.</p>
</li>
<li><p>If the character is a letter (a-z), replace it with its corresponding multi-tap digit sequence.</p>
</li>
</ul>
</li>
</ol>
<p>    If the character is a space, append <code>0</code>; if it is any other special character, append <code>1</code>.</p>
<p>    Example:</p>
<p>    <strong>Hello Hashnode</strong> will become ['44', '33', '555', '555', '666', '0', '44', '2', '7777', '44', '66', '666', '3', '33', '1'], where <code>44</code> represents <code>H</code>, <code>33</code> maps to <code>e</code>, and so on.</p>
<p>    Here's the equivalent Python script to do the same and try it out.</p>
<pre><code class="lang-python">    <span class="hljs-keyword">import</span> string

    t9_map = {
        <span class="hljs-string">'a'</span>: <span class="hljs-string">'2'</span>, <span class="hljs-string">'b'</span>: <span class="hljs-string">'22'</span>, <span class="hljs-string">'c'</span>: <span class="hljs-string">'222'</span>,
        <span class="hljs-string">'d'</span>: <span class="hljs-string">'3'</span>, <span class="hljs-string">'e'</span>: <span class="hljs-string">'33'</span>, <span class="hljs-string">'f'</span>: <span class="hljs-string">'333'</span>,
        <span class="hljs-string">'g'</span>: <span class="hljs-string">'4'</span>, <span class="hljs-string">'h'</span>: <span class="hljs-string">'44'</span>, <span class="hljs-string">'i'</span>: <span class="hljs-string">'444'</span>,
        <span class="hljs-string">'j'</span>: <span class="hljs-string">'5'</span>, <span class="hljs-string">'k'</span>: <span class="hljs-string">'55'</span>, <span class="hljs-string">'l'</span>: <span class="hljs-string">'555'</span>,
        <span class="hljs-string">'m'</span>: <span class="hljs-string">'6'</span>, <span class="hljs-string">'n'</span>: <span class="hljs-string">'66'</span>, <span class="hljs-string">'o'</span>: <span class="hljs-string">'666'</span>,
        <span class="hljs-string">'p'</span>: <span class="hljs-string">'7'</span>, <span class="hljs-string">'q'</span>: <span class="hljs-string">'77'</span>, <span class="hljs-string">'r'</span>: <span class="hljs-string">'777'</span>, <span class="hljs-string">'s'</span>: <span class="hljs-string">'7777'</span>,
        <span class="hljs-string">'t'</span>: <span class="hljs-string">'8'</span>, <span class="hljs-string">'u'</span>: <span class="hljs-string">'88'</span>, <span class="hljs-string">'v'</span>: <span class="hljs-string">'888'</span>,
        <span class="hljs-string">'w'</span>: <span class="hljs-string">'9'</span>, <span class="hljs-string">'x'</span>: <span class="hljs-string">'99'</span>, <span class="hljs-string">'y'</span>: <span class="hljs-string">'999'</span>, <span class="hljs-string">'z'</span>: <span class="hljs-string">'9999'</span>,
    }

    <span class="hljs-comment"># Reverse mapping for detokenization</span>
    reverse_t9_map = {v: k <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> t9_map.items()}

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">tokenize</span>(<span class="hljs-params">text</span>):</span>
        text_lower = text.lower()
        tokens = []

        <span class="hljs-keyword">for</span> char <span class="hljs-keyword">in</span> text_lower:
            <span class="hljs-keyword">if</span> <span class="hljs-string">'a'</span> &lt;= char &lt;= <span class="hljs-string">'z'</span>:
                tokens.append(t9_map[char])
            <span class="hljs-keyword">elif</span> char == <span class="hljs-string">' '</span>:
                tokens.append(<span class="hljs-string">'0'</span>)
            <span class="hljs-keyword">else</span>:
                tokens.append(<span class="hljs-string">'1'</span>)

        <span class="hljs-keyword">return</span> tokens

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">detokenize</span>(<span class="hljs-params">tokens</span>):</span>
        result = <span class="hljs-string">""</span>
        <span class="hljs-keyword">for</span> token <span class="hljs-keyword">in</span> tokens:
            <span class="hljs-keyword">if</span> token == <span class="hljs-string">'0'</span>:
                result += <span class="hljs-string">' '</span>
            <span class="hljs-keyword">elif</span> token == <span class="hljs-string">'1'</span>:
                result += <span class="hljs-string">'?'</span>  <span class="hljs-comment"># symbol placeholder</span>
            <span class="hljs-keyword">else</span>:
                result += reverse_t9_map.get(token, <span class="hljs-string">'?'</span>)  <span class="hljs-comment"># fallback in case of unknown</span>
        <span class="hljs-keyword">return</span> result

    <span class="hljs-comment"># Example usage</span>
    input_string = <span class="hljs-string">"Hello Hashnode!"</span>
    tokens = tokenize(input_string)
    print(<span class="hljs-string">"Tokens:"</span>, tokens)

    decoded = detokenize(tokens)
    print(<span class="hljs-string">"Detokenized:"</span>, decoded)

    <span class="hljs-string">"""
    Output:
    Tokens: ['44', '33', '555', '555', '666', '0', '44', '2', '7777', '44', '66', '666', '3', '33', '1']
    Detokenized: hello hashnode?
    """</span>
</code></pre>
    <div data-node-type="callout">
    <div data-node-type="callout-emoji">💡</div>
    <div data-node-type="callout-text">To experience how tokenization works in the real world, you can visit <a target="_self" href="https://tiktokenizer.vercel.app/">Tiktokenizer</a>.</div>
    </div>

<p>    Want to understand how OpenAI tokenizes a message? Here's the code.</p>
<pre><code class="lang-python">    <span class="hljs-keyword">import</span> tiktoken

    encoder = tiktoken.encoding_for_model(<span class="hljs-string">'gpt-4o'</span>)    <span class="hljs-comment"># gpt model</span>

    print(<span class="hljs-string">"Vocab Size"</span>, encoder.n_vocab) <span class="hljs-comment"># 200,019 (~200K)</span>

    text = <span class="hljs-string">"Hello Hashnode"</span>
    tokens = encoder.encode(text)

    print(<span class="hljs-string">"Tokens: "</span>, tokens) <span class="hljs-comment"># Tokens: [13225, 10242, 7005]</span>

    my_tokens = [<span class="hljs-number">13225</span>, <span class="hljs-number">10242</span>, <span class="hljs-number">7005</span>]
    decoded = encoder.decode(my_tokens)
    print(<span class="hljs-string">"Decoded: "</span>, decoded)    <span class="hljs-comment"># Decoded: Hello Hashnode</span>
</code></pre>
<ol start="2">
<li><p><strong>Vector Embedding:</strong> Vector embedding maps the semantic meaning of words to points in a high-dimensional space (often visualized in 2D or 3D). For example, in the sentences <strong><em>Monkey eats banana</em></strong> and <strong><em>Man eats rice</em></strong>, 'monkey' and 'man' are both living beings, while 'banana' and 'rice' are both food items. As a result, 'monkey' and 'man' would be positioned close to each other in one region of the space, and 'banana' and 'rice' in another. Moreover, the vector from 'monkey' to 'banana' would be similar in direction and magnitude to the vector from 'man' to 'rice', reflecting the shared semantic relationship (eater → food). Under the hood, these embeddings are just matrices of numbers.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744063638604/5e55df93-2553-4000-8f55-9511d0a5b9a6.png" alt class="image--center mx-auto" /></p>
</li>
<li><p><strong>Positional Encoding</strong>: Positional encoding adds position information to token embeddings so the model can understand the order of words in a sentence. Consider the sentences <em>The man is eating rice</em> and <em>The rice is eating man</em>. The words (and thus their embeddings) are identical in both, yet the meanings are entirely different because of the word order. Since vector embeddings alone do not capture position, positional encoding is crucial: it encodes each token's position in the sequence, letting the model distinguish between such cases and understand the actual context.</p>
</li>
</ol>
<h3 id="heading-attention-and-feed-forwarding-encoding-phase">Attention and Feed Forwarding (Encoding Phase)</h3>
<p><strong>Attention and Feed Forwarding</strong> is the next phase in the processing pipeline of Large Language Models (LLMs). At this stage, the model focuses on determining which parts of the input are most relevant to each token using the attention mechanism, and then applies feed-forward neural networks to transform these representations. This phase helps the model capture complex relationships between words and introduces non-linearity, allowing it to understand context and meaning beyond simple sequential patterns.</p>
<ol>
<li><p>Multi-Head Attention: <strong>Multi-Head Attention</strong> builds upon the concept of <strong>self-attention</strong>, where each token in a sequence can interact with every other token to better understand contextual relationships. For example, consider the sentences <em>'The river bank'</em> and <em>'The HDFC bank'</em>. In both cases, the word <em>'bank'</em> has the same token and embedding, and even its positional encoding would be similar since it appears at the end of the sentence. However, the meaning of <em>'bank'</em> differs in each context. Self-attention helps the model capture these nuances by allowing the token <em>'bank'</em> to attend to other tokens like <em>'river'</em> or <em>'HDFC'</em> for disambiguation.</p>
<p> <strong>Multi-Head Attention</strong> enhances this process by using multiple attention heads in parallel. Each head learns different types of relationships or focuses on different aspects of the input, enabling the model to capture richer and more diverse contextual information.</p>
</li>
<li><p>Feed Forwarding: <strong>Feed Forwarding</strong> in Large Language Models introduces <strong>non-linearity</strong> into the processing pipeline, allowing the model to interpret the context from multiple perspectives. For instance, imagine a scene where a dog is looking out the window while traveling in a car. Different parts of our brain might focus on various aspects of this moment: <em>'The car was white'</em>, <em>'The dog was a Labrador'</em>, <em>'The family was going on a trip'</em>, <em>'The dog was fascinated by the scenery'</em>, and so on. Similarly, during this phase, the model processes the contextual information through multiple dense layers to extract and represent diverse interpretations and deeper meaning.</p>
</li>
</ol>
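<p>The attention step described above can be sketched in a few lines of plain Python. This is a single attention head with no learned projections (Q = K = V = the raw embeddings) and made-up toy vectors; real models use learned weight matrices and many heads in parallel.</p>

```python
# A toy sketch of scaled dot-product self-attention: each token scores
# itself against every token, and its new representation is the
# attention-weighted average of all token vectors.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    d = len(embeddings[0])
    out = []
    for q in embeddings:
        # scaled dot products of this token against every token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights = softmax(scores)  # attention distribution over the sequence
        # weighted sum of all token vectors
        out.append([sum(w * v[j] for w, v in zip(weights, embeddings))
                    for j in range(d)])
    return out

# toy embeddings for 'the river bank': the last vector ('bank')
# absorbs context from its neighbours
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(self_attention(tokens))
```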
<p>This is what happens during the Input phase of LLM interaction, marking the end of the Encoding phase. Now, let's turn our attention to the Output phase and explore what happens in the decoder.</p>
<hr />
<h3 id="heading-output-embedding-amp-positional-encoding">Output Embedding &amp; Positional Encoding</h3>
<p>The decoding phase is <strong>iterative</strong> — it generates one token at a time, and each newly generated token is used to predict the next one.</p>
<p>It begins with the tokens that have already been generated so far (this is often referred to as "shifted right" input). These tokens are passed through an <strong>output embedding layer</strong>, just like on the encoder side. Then, <strong>positional encoding</strong> is added to retain the order of the tokens and the combined representation (embedding + position) forms the input to the decoder stack.</p>
<p>Example:</p>
<p>Let’s say the user input is: <strong>“How are you?”</strong><br />After the input is fully processed by the encoder, the decoder starts generating the response:</p>
<ol>
<li><p>The decoder is triggered with a start token: <code>[&lt;start&gt;]</code>.</p>
</li>
<li><p>It predicts the first word: <code>"I"</code> → Output so far: <code>[&lt;start&gt;, I]</code>.</p>
</li>
<li><p>This gets fed back in → predicts <code>"am"</code> → <code>[&lt;start&gt;, I, am]</code>.</p>
</li>
<li><p>Repeats until the model outputs <code>&lt;end&gt;</code> → <code>[&lt;start&gt;, I, am, fine, &lt;end&gt;]</code>.</p>
</li>
</ol>
<h3 id="heading-masked-multi-head-attention">Masked Multi-Head Attention</h3>
<p>This is the first step in the decoder stack. It's very similar to the multi-head attention used in the encoder, with <strong>one key difference</strong> — <strong>masking</strong>.</p>
<p>Masking ensures that the model <strong>can’t look ahead</strong>. While generating the third word, for example, it shouldn't peek at the fourth. This keeps the generation process <strong>auto-regressive</strong>, i.e., predicting the next token using only the known ones. Concretely, while predicting the third word in <code>[&lt;start&gt;, I, am]</code>, the model <strong>must not</strong> access <code>"fine"</code> or <code>&lt;end&gt;</code> yet. Masking hides those future tokens during attention.</p>
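<p>As a sketch, the causal mask is just a matrix added to the attention scores: future positions get negative infinity, so after the softmax they receive zero attention weight.</p>

```python
# A sketch of the causal (look-ahead) mask: position i may only attend to
# positions <= i; future positions are set to -inf before the softmax.
def causal_mask(seq_len):
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# [0.0, -inf, -inf, -inf]
# [0.0, 0.0, -inf, -inf]
# [0.0, 0.0, 0.0, -inf]
# [0.0, 0.0, 0.0, 0.0]
```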
<h3 id="heading-multi-head-attention">Multi-Head Attention</h3>
<p>This layer allows the decoder to <strong>attend to the encoder’s output</strong> — meaning it connects what’s being generated with what the user actually asked. It helps the model align the generated response with the input context.</p>
<h3 id="heading-feed-forward-add-amp-norm">Feed Forward + Add &amp; Norm</h3>
<p>Same as in the encoder — this adds non-linearity and enables the model to learn richer patterns in the data. Each token is passed through a <strong><em>Feed Forward Neural Network</em></strong> followed by an <strong><em>Add &amp; Layer Normalization</em></strong> step for stability and better learning.</p>
<h3 id="heading-linear-softmax">Linear → Softmax</h3>
<p>After decoding, the final token representations are passed through a <strong>Linear layer</strong>, which converts each of them into a large vector (the same size as the vocabulary). A <strong>Softmax</strong> layer then turns this vector into a <strong>probability distribution</strong> over all possible next words. For example, if at some point the model sees a distribution like <code>[I: 2%, am: 87%, have: 4%, was: 1%, ...]</code>, it chooses <code>"am"</code> as the predicted word.</p>
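<p>A minimal sketch of this last step, with made-up logit values standing in for the Linear layer's output over a tiny four-word vocabulary:</p>

```python
# Linear -> Softmax sketch: raw decoder scores (logits) become a
# probability distribution over the vocabulary; greedy decoding then
# picks the highest-probability token. The logits are hypothetical.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

vocab = ["I", "am", "have", "was"]
logits = [1.2, 4.9, 1.8, 0.5]   # hypothetical output of the Linear layer
probs = softmax(logits)
predicted = vocab[probs.index(max(probs))]
print(predicted)  # am
```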
<hr />
<p>This wraps up the explanation of Gen-AI. I hope you found it interesting. Thank you.</p>
<blockquote>
<p>This article explores the fundamentals of how Generative Pre-trained Transformers (GPTs) function, focusing on key concepts such as tokenization, vector embedding, positional encoding, and attention mechanisms. By breaking down the encoding and decoding phases of Large Language Models (LLMs), it elucidates how these systems generate contextually relevant responses. Through examples and explanations, readers gain insight into the architecture and processes that enable GPTs to produce human-like text.</p>
</blockquote>
]]></content:encoded></item></channel></rss>