Build a RAG Chatbot with memory
Take your chatbot to the next level with two powerful upgrades: personalized document uploads and memory-enhanced conversations for richer interactions.
Introduction
If you've been following along, you know we’ve been on a journey to explore how to build intelligent apps with LLMs using LangChain. Here’s a quick recap:
In Part 1, we explored how the LangChain framework simplifies building LLM-powered applications by providing modular components like chains, retrievers, embeddings, and vector stores.
In Part 2, we walked you through a hands-on tutorial on building your first LLM application using LangChain.
In Part 3, we built a Retrieval-Augmented Generation (RAG) powered chatbot using LangChain.
In Part 4, we deployed our RAG chatbot using a two-tier architecture and asked questions about the “Attention is All You Need” research paper.
Ready to level up your RAG chatbot? Now it's time for the version 2.0 upgrade! 🚀
In this article, we will show you how to:
a) Empower users to chat with their own documents: Imagine uploading your resume and asking it questions directly!
b) Give your chatbot a memory: We'll enable your app to remember past conversations, providing a richer, more contextual experience.
c) Master efficient memory management: Learn to handle both user session history and vector store data like a pro, keeping your app fast and scalable.
By the end of this post, you’ll have the blueprint for a more powerful, more conversational, and more user-friendly AI assistant.
Note - If you'd like to build this application yourself after reading the post, you can find the complete code in this GitHub repo.
About the Authors:
Arun Subramanian: Arun is an Associate Principal of Analytics & Insights at Amazon Ads, where he leads the development and deployment of innovative insights to optimize advertising performance at scale. He has over 12 years of experience and is skilled in crafting strategic analytics roadmaps, nurturing talent, collaborating with cross-functional teams, and communicating complex insights to diverse stakeholders.
Manisha Arora: Manisha is a Data Science Lead at Google Ads, where she leads the Measurement & Incrementality vertical across Search, YouTube, and Shopping. She has 12 years of experience in enabling data-driven decision making for product growth. Manisha is the founder of PrepVector, which aims to democratize data science through knowledge sharing, structured courses, and community building.
Uploading a new document
Imagine you've just landed on our awesome Q&A chatbot, ready to get some insights from a document you have. The first thing you'll likely do is upload that file, right? Let's see what happens behind the scenes when you hit that "Upload a document" button in our sleek Streamlit frontend!
Frontend handles the user upload
On the Streamlit side, it's designed to be super intuitive. We've got that handy st.file_uploader("Upload a document", type="pdf") component sitting there, waiting for your PDF. When you select a file (e.g., a resume) and it gets uploaded, Streamlit takes care of the initial handling. It reads the file content and keeps it ready for us. Now, just uploading it to the browser isn't enough; we need to get it to our brainy backend (FastAPI) for processing. That's where our "Process Document" button comes into play! When you click this button, it triggers a request to our FastAPI server, sending your precious PDF along for its transformation journey. We'll usually show a little "Processing document..." message to keep you in the loop – nobody likes a silent app!
# Upload a document
uploaded_file = st.file_uploader("Upload a document", type="pdf")

if uploaded_file is not None:
    if st.button("Process Document"):
        st.info("Processing document...")
        url = "http://localhost:8000/process_pdf/"
        files = {"file": uploaded_file.getvalue()}
        try:
            response = requests.post(url, files=files)
            response.raise_for_status()
            result = response.json()
            st.success(result["message"])
            # Indicate the file has been processed
            st.session_state.file_processed = True
            # Clear the question after a new file has been processed
            st.session_state.question = ""
        except requests.exceptions.RequestException as e:
            st.error(f"Error processing document: {e}")
            # Reset to False if processing fails
            st.session_state.file_processed = False
Backend processes the document
Over on the FastAPI side, our /process_pdf endpoint springs into action as soon as it receives your file. FastAPI, being the efficient workhorse it is, handles the incoming file seamlessly. Inside our process_pdf function, the real magic begins. First, we save the uploaded PDF temporarily – just so we can work with it properly. Then, we load the content using our trusty PyPDFLoader. Next up is breaking down that potentially massive document into smaller, digestible chunks using our RecursiveCharacterTextSplitter. Think of it like turning a giant book into bite-sized paragraphs.
Now comes the crucial part for keeping things relevant: we need to make sure we're focusing on this new document. That's where our vector store and document IDs come in. If we've processed a document before, we use the stored document IDs (prev_ids) to tell Chroma to delete the embeddings from the previous file. This ensures our chatbot's knowledge base is fresh and focused on your newly uploaded document. Likewise, we reset our chat history by clearing the contents of the session store, sessionstore. After clearing the old vector store and session store data, we generate embeddings for the new document using our HuggingFace Embeddings model. Finally, we add these new embeddings, along with their unique IDs, to our Chroma vector store. Once all this processing is done, our FastAPI backend sends a "PDF processed successfully!" message back to the Streamlit frontend, letting you know it's ready for your burning questions!
@app.post("/process_pdf")
async def process_pdf_endpoint(file: UploadFile = File(...)):
"""
Endpoint to process the uploaded PDF file and create a retrieval-augmented generation (RAG) chain.
"""
try:
# Save the uploaded file to the temp directory
file_path = os.path.join(temp_dir, file.filename)
if os.path.exists(file_path):
os.remove(file_path)
with open(file_path, "wb") as f:
f.write(await file.read())
# Process the PDF and create the RAG chain
global rag_chain_instance
if rag_chain_instance is not None:
# Reset the previous instance if it exists
rag_chain_instance = None
rag_chain_instance = process_pdf(file_path)
return JSONResponse(content={"message": "PDF processed successfully!"})
except Exception as e:
# Debug
print(f"Error in /process_pdf/: {e}")
# Raise an HTTP exception.
raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
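The endpoint above delegates the heavy lifting to a process_pdf helper. Here is a minimal sketch of what that helper could look like based on the steps described above; the embedding model name, chunk sizes, ID scheme, and the build_rag_chain helper (sketched later in the post) are illustrative assumptions, not necessarily the exact code in the repo.

from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Module-level state: the embeddings model, the Chroma vector store, the per-session
# chat histories, and the IDs of the chunks from the previously processed document.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(collection_name="uploaded_doc", embedding_function=embeddings)
sessionstore = {}   # session_id -> chat history
prev_ids = []       # chunk IDs of the last processed document

def process_pdf(file_path: str):
    """Load a PDF, refresh the vector store, and rebuild the RAG chain."""
    global prev_ids

    # Load the PDF and split it into bite-sized chunks
    docs = PyPDFLoader(file_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = splitter.split_documents(docs)

    # Delete the embeddings of the previously processed document, if any
    if prev_ids:
        vectorstore.delete(ids=prev_ids)

    # Reset the chat histories so old conversations don't leak into the new document
    sessionstore.clear()

    # Embed the new chunks and remember their IDs for the next cleanup
    new_ids = [f"chunk_{i}" for i in range(len(splits))]
    vectorstore.add_documents(documents=splits, ids=new_ids)
    prev_ids = new_ids

    # Rebuild the conversational RAG chain on top of the refreshed store
    return build_rag_chain(vectorstore)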
Asking Questions and Getting Answers
Alright, you've uploaded your document and seen that reassuring "processed successfully" message. Now comes the fun part: getting some answers! You type your question into that friendly text box in our Streamlit app and then, with anticipation, you click that "Get Response" button. Let's peek under the hood and see what happens next!
Frontend sends the query:
On the Streamlit side, when you type your question into the st.text_input field, we store that text in our session state. But the magic truly happens when you hit that "Get Response" button. This action triggers a request from the Streamlit frontend to our FastAPI backend's /invoke endpoint.
# Ask a question
input_text = st.text_input("Please enter your question below and click 'Get Response'")
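The text box only captures the question; the "Get Response" click is what actually calls the backend. Here's a minimal sketch of that step, assuming the same requests-based pattern as the upload snippet (the "input" key in the payload is an assumption and must match whatever the backend's InvokeRequest model expects):

# Send the question to the backend once the user clicks "Get Response"
if st.button("Get Response") and input_text:
    url = "http://localhost:8000/invoke"
    payload = {"input": input_text}  # field name is an assumption; match InvokeRequest
    try:
        response = requests.post(url, json=payload)
        response.raise_for_status()
        result = response.json()  # contains "answer" and "sources"
    except requests.exceptions.RequestException as e:
        st.error(f"Error getting response: {e}")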
Backend brain in action: Retrieval and Generation:
Over in our FastAPI backend, the /invoke endpoint receives your question. The first thing we do is check if a document has been processed for this session – we don't want to answer questions about thin air! Assuming a document has been processed, we dive into the core of our RAG system.
First, our "Memory Management" component kicks in. Using your session ID (set as "default"
), it retrieves the history of your conversation so far. This context is crucial for understanding follow-up questions. Next, your question is used to query our Chroma vector store. We generate an embedding for your question (using the same HuggingFace model as before) and then ask Chroma to find the most similar document chunks from the PDF you uploaded. This is the "Retrieval" part of RAG – finding the relevant information.
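Under the hood, that memory component can be as simple as a get_session_history helper that looks up (or creates) a chat history in the sessionstore dictionary from the processing step. A minimal sketch, assuming LangChain's in-memory ChatMessageHistory (the helper's exact shape is an assumption):

from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    # Create a fresh history the first time a session ID is seen,
    # then keep appending to it on every later turn.
    if session_id not in sessionstore:
        sessionstore[session_id] = ChatMessageHistory()
    return sessionstore[session_id]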
Now comes the "Generation" part. We take your original question, the relevant chunks retrieved from Chroma, and the conversation history from our memory, and we carefully construct a prompt. This prompt is then sent to our powerful language model, Groq's Gemma2-9b-It
. The language model reads the prompt and generates a concise and informative answer based on the provided context.
Once we have the answer from Groq, we update the conversation history in our memory, adding your question and the chatbot's response. Finally, our FastAPI backend packages up the generated answer and the source content (the specific chunks from your PDF that were used to answer the question) and sends it all back to the Streamlit frontend.
@app.post("/invoke")
async def invoke(request: InvokeRequest):
try:
global rag_chain_instance
if rag_chain_instance is None:
raise HTTPException(status_code=400, detail="No PDF processed. Please upload a PDF first.")
data = request.model_dump()
result = rag_chain_instance.invoke(data, config = {"configurable": {"session_id" : "default"}})
# Debug
print(f"{sessionstore}")
answer = result['answer']
# extract page contents from documents
sources = [doc.page_content for doc in result['context']]
return JSONResponse(content={"answer": answer, "sources": sources})
except Exception as e:
# Debug
print(f"Error in /invoke: {e}")
# Raise an HTTP exception
raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
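For completeness, here is a minimal sketch of how the rag_chain_instance used above could be wired together with LangChain's standard building blocks (create_history_aware_retriever, create_retrieval_chain, RunnableWithMessageHistory), reusing the get_session_history helper from the earlier snippet. The prompts and the build_rag_chain name are illustrative assumptions; they match what the /invoke endpoint expects (an "input" key going in, "answer" and "context" coming out) but are not necessarily the repo's exact code.

from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_groq import ChatGroq

def build_rag_chain(vectorstore):
    llm = ChatGroq(model="Gemma2-9b-It")
    retriever = vectorstore.as_retriever()

    # Rewrite follow-up questions into standalone queries using the chat history.
    contextualize_prompt = ChatPromptTemplate.from_messages([
        ("system", "Rephrase the latest question as a standalone question, given the chat history."),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])
    history_aware_retriever = create_history_aware_retriever(llm, retriever, contextualize_prompt)

    # Answer the question using the retrieved chunks as context.
    qa_prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer concisely using the following context:\n\n{context}"),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ])
    qa_chain = create_stuff_documents_chain(llm, qa_prompt)
    rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)

    # Wrap the chain so each session's history is loaded and updated automatically.
    return RunnableWithMessageHistory(
        rag_chain,
        get_session_history,
        input_messages_key="input",
        history_messages_key="chat_history",
        output_messages_key="answer",
    )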
Frontend displaying the wisdom:
Back in your Streamlit app, the response from FastAPI arrives, containing the answer and the sources. We then display the answer in a clear and readable format. But we don't stop there! To show you why the chatbot answered the way it did, we also display the relevant snippets from your uploaded document under a "Sources" heading. This transparency helps you understand the chatbot's reasoning and verify the information. And just like that, you've had a meaningful interaction with your own document!
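A minimal sketch of that display step, continuing from the result parsed in the earlier frontend snippet (the headings and the use of st.expander are illustrative layout choices):

# Show the generated answer, then the supporting chunks for transparency
st.subheader("Answer")
st.write(result["answer"])

st.subheader("Sources")
for i, source in enumerate(result["sources"], start=1):
    with st.expander(f"Source {i}"):
        st.write(source)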
Conclusion
Seriously, you HAVE to try this for yourself!
I uploaded my resume and was blown away by the insights – it pinpointed my strengths, nailed my working style, suggested ideal work cultures, and even gave me actionable tips to boost my resume!
It felt like having a super smart career coach... powered by AI.
Want to give it a spin? Here are a few prompts to try once you upload your resume:
Give me an elevator pitch about <your_name> and their skills?
What are their key strengths?
How can this resume be even better?
Based on this, what kind of roles should <your_name> be targeting?
What's their working style?
Resumes are one use case — but the same chatbot architecture can work for portfolios, research papers, product documentation, customer support manuals, and more.
If you come up with a creative use case, we’d LOVE to hear it.