AI Chatbot Pipeline Documentation
Pipeline Overview
User Query
↓
Encoder (Transformer)
↓
Vector Search
↓
FAQ priority match?
↓
Website content match?
↓
Answer synthesis (RAG)
This is a retrieval-first pipeline: answers are grounded in retrieved FAQ and website content, which keeps responses efficient and minimizes hallucination.
Tech Stack
- FastAPI – API layer
- Sentence-Transformers – text encoder
- FAISS – vector search
- Any LLM – for final answer synthesis (optional)
- FAQs stored separately from website content
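These map to the following Python packages (inferred from the imports used throughout this document; uvicorn is assumed as the ASGI server):

pip install fastapi uvicorn jinja2 sentence-transformers faiss-cpu numpy openai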
Project Structure
app/
├── main.py
├── embeddings.py
├── vector_store.py
├── rag.py
├── templates/
│   └── index.html
└── data/
    ├── faqs.json
    └── website_chunks.json
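The data files are assumed to have the following shape; the field names ("question", "answer", "text") are what the loading code below reads, and the example values are illustrative:

data/faqs.json:
[
  {"question": "What is your refund policy?",
   "answer": "We offer a full refund within 30 days of purchase."}
]

data/website_chunks.json:
[
  {"text": "Orders are usually shipped within 2 business days of purchase."}
]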
Load Encoder (Transformer)
# embeddings.py
from sentence_transformers import SentenceTransformer

# Small, fast general-purpose sentence encoder
model = SentenceTransformer("all-MiniLM-L6-v2")

def encode(text: str):
    # Normalized embeddings make inner-product search equivalent to cosine similarity
    return model.encode(text, normalize_embeddings=True)
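A quick sanity check of the encoder; all-MiniLM-L6-v2 produces 384-dimensional normalized vectors:

vec = encode("How do I track my order?")
print(vec.shape)   # (384,)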
Vector Store (FAISS)
# vector_store.py
import faiss
import numpy as np

class VectorStore:
    def __init__(self, embeddings, texts):
        self.texts = texts
        # Inner-product index; with normalized embeddings this scores cosine similarity
        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(embeddings.astype(np.float32))

    def search(self, query_embedding, k=3):
        scores, idx = self.index.search(np.array([query_embedding], dtype=np.float32), k)
        results = []
        for i, score in zip(idx[0], scores[0]):
            results.append((self.texts[i], float(score)))
        return results
Load FAQ & Website Content
# main.py (setup part)
import json
import numpy as np
from embeddings import encode
from vector_store import VectorStore

with open("data/faqs.json") as fh:
    faqs = json.load(fh)
with open("data/website_chunks.json") as fh:
    web = json.load(fh)

faq_questions = [f["question"] for f in faqs]
faq_answers = [f["answer"] for f in faqs]

faq_embeddings = np.array([encode(q) for q in faq_questions])
web_embeddings = np.array([encode(w["text"]) for w in web])

# FAQs are indexed by question text but return the stored answer
faq_store = VectorStore(faq_embeddings, faq_answers)
web_store = VectorStore(web_embeddings, [w["text"] for w in web])
RAG Answer Synthesis
# rag.py
def synthesize_answer(context_chunks, query):
    context = "\n".join(context_chunks)
    prompt = f"""
Answer the question using ONLY the information below.
If the answer is not present, say "I don't have enough information."

Context:
{context}

Question:
{query}
"""
    # call_llm is a placeholder -- wire in your LLM client here
    return call_llm(prompt)
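call_llm is left abstract above. A minimal sketch using the OpenAI client, mirroring the call made in the full implementation further below (model name and parameters are illustrative):

# rag.py (continued) -- illustrative sketch
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=200,
    )
    return response.choices[0].message.content.strip()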
Pros and Cons of the Fine-Tuned-Only Approach (No RAG)
Pros:
- Model can generate answers in a natural style.
- Works without retrieving documents, fully self-contained.
Cons:
- Very small dataset (50 Q&A pairs) → low coverage; users may phrase questions differently than the training pairs.
- Hard to update: every new FAQ requires retraining.
- Prone to hallucinations if question is outside trained Q&A.
How RAG Changes the Game
- You don’t need to train the model on the FAQs.
- You store your 50 FAQs (or hundreds more) in a vector database.
- When a user asks a question:
- The retriever finds the most relevant FAQ(s).
- The generator synthesizes the answer from the retrieved FAQ.
Benefits:
- The model is constrained to the retrieved content, which greatly reduces made-up answers.
- Easy to add new FAQs or website content without retraining.
- Can handle paraphrased or unseen questions better (see the similarity sketch below).
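A minimal sketch of why paraphrases match, using the encode helper defined earlier. Because embeddings are normalized, the dot product is cosine similarity; the example questions and score behavior are illustrative and depend on the model:

# paraphrase_check.py -- illustrative only
import numpy as np
from embeddings import encode

stored_faq = encode("What is your refund policy?")
paraphrase = encode("How do I get my money back?")
unrelated  = encode("Do you offer a mobile app?")

print(float(np.dot(stored_faq, paraphrase)))  # typically well above the FAQ threshold
print(float(np.dot(stored_faq, unrelated)))   # typically much lower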
Should You Keep the Fine-Tuned Model?
- If your current fine-tuned transformer works well for tone/style, you can keep it as the generator in RAG.
- But if coverage is low, RAG + a general-purpose pretrained LLM (like a base GPT or local model) is often better than fine-tuning on just 50 Q&A.
Suggested Transition
- Keep your FAQs in a vector database.
- Use your fine-tuned model (optional) for answer synthesis OR use a general LLM.
- Let RAG handle retrieval + synthesis:
- FAQs get high priority.
- Website content or docs are secondary.
- Optional: keep training/fine-tuning if you need a specific response style.
RAG + LLM Implementation
# main.py (full version)
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates
import json
import numpy as np
from openai import OpenAI
from embeddings import encode
from vector_store import VectorStore

# -------------------------
# App Setup
# -------------------------
app = FastAPI()
templates = Jinja2Templates(directory="templates")
client = OpenAI(api_key="YOUR_API_KEY")  # better: load the key from an environment variable

# -------------------------
# Load Data
# -------------------------
with open("data/faqs.json") as fh:
    faqs = json.load(fh)
with open("data/website_chunks.json") as fh:
    web = json.load(fh)

faq_questions = [f["question"] for f in faqs]
faq_answers = [f["answer"] for f in faqs]

faq_embeddings = np.array([encode(q) for q in faq_questions])
web_embeddings = np.array([encode(w["text"]) for w in web])

faq_store = VectorStore(faq_embeddings, faq_answers)
web_store = VectorStore(web_embeddings, [w["text"] for w in web])

# Similarity thresholds (cosine similarity, since embeddings are normalized)
FAQ_THRESHOLD = 0.62
WEB_THRESHOLD = 0.50
HIGH_CONFIDENCE = 0.85
# -------------------------
# Simple Cache
# -------------------------
cache = {}  # simple in-memory cache keyed by the raw query string
# -------------------------
# Keyword Routing
# -------------------------
def keyword_route(query: str):
    # Hard-coded answers for very common short queries; returns None if nothing matches
    q = query.lower()
    if "payment" in q or "pay" in q or "paypal" in q:
        return "We accept credit cards, PayPal, and bank transfers."
    if "contact" in q or "email" in q or "support" in q:
        return "You can contact our support team at support@company.com."
    if "refund" in q or "return" in q:
        return "We offer a full refund within 30 days of purchase."
    if "track" in q or "tracking" in q:
        return "Track your order using the tracking link sent to your email."
    return None
# -------------------------
# Context Trimming
# -------------------------
MAX_CONTEXT_CHARS = 1500  # rough character budget to keep the prompt small

def trim_context(contexts):
    # Keep whole chunks, in order, until the character budget would be exceeded
    trimmed = []
    total = 0
    for c in contexts:
        if total + len(c) > MAX_CONTEXT_CHARS:
            break
        trimmed.append(c)
        total += len(c)
    return trimmed
# -------------------------
# Model Selection
# -------------------------
def pick_model(query, contexts):
    # Cheap model for short queries, larger model when there is a lot of context
    if len(query.split()) <= 5:
        return "gpt-4o-mini"
    if len(contexts) > 3:
        return "gpt-4o"
    return "gpt-4o-mini"
# -------------------------
# RAG LLM Generation
# -------------------------
def generate_rag_response(query, contexts):
    contexts = trim_context(contexts)
    model = pick_model(query, contexts)
    context_text = "\n\n".join(contexts)

    prompt = f"""
You are a helpful customer support assistant.
Answer briefly (max 3 sentences).
Use ONLY the context below.
If the answer is not in the context, say "I don't know".

Context:
{context_text}

Question:
{query}

Answer:
"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
        max_tokens=200
    )
    return response.choices[0].message.content.strip()
# -------------------------
# Routes
# -------------------------
@app.get("/", response_class=HTMLResponse)
def home(request: Request):
return templates.TemplateResponse("index.html", {"request": request})
@app.get("/getChatBotResponse")
def get_bot_response(msg: str):
query = msg.strip()
# Cache check
if query in cache:
return {
"response": cache[query],
"source": "cache",
"confidence": 1.0
}
# Keyword routing (short queries)
if len(query.split()) <= 2:
keyword_answer = keyword_route(query)
if keyword_answer:
cache[query] = keyword_answer
return {
"response": keyword_answer,
"source": "keyword",
"confidence": 1.0
}
# Encode query
query_embedding = encode(query)
contexts = []
scores = []
# FAQ retrieval
faq_results = faq_store.search(query_embedding, k=2)
for ans, score in faq_results:
if score >= FAQ_THRESHOLD:
contexts.append(ans)
scores.append(score)
# High-confidence shortcut (skip LLM)
if faq_results and faq_results[0][1] >= HIGH_CONFIDENCE:
best_answer = faq_results[0][0]
cache[query] = best_answer
return {
"response": best_answer,
"source": "faq-direct",
"confidence": faq_results[0][1]
}
# Website retrieval
web_results = web_store.search(query_embedding, k=2)
for text, score in web_results:
if score >= WEB_THRESHOLD:
contexts.append(text)
scores.append(score)
# No context fallback
if not contexts:
return {
"response": "I don't have enough information to answer that.",
"source": "none",
"confidence": 0.0
}
# LLM generation (RAG)
final_answer = generate_rag_response(query, contexts)
# Save to cache
cache[query] = final_answer
return {
"response": final_answer,
"source": "rag+llm",
"confidence": max(scores) if scores else 0.5
}
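To run the service locally (assuming the project structure above; run from the app/ directory so the relative data/ paths resolve):

uvicorn main:app --reload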
Workflow
The query flow is:
- User sends query to /getChatBotResponse?msg=...
- Check cache: return cached response if exists (fastest)
- Keyword routing: handle short queries like "payment" or "refund"
- Encode query into embeddings using encode(query)
- Search FAQs via vector similarity
- If high confidence in FAQ → return FAQ answer directly
- Search website content chunks if needed
- Combine FAQ + website content as context
- Trim context to limit token usage
- Select model based on query and context
- Generate final answer via RAG + LLM
- Cache the response for future queries
- Return final answer with source and confidence score
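For example, a request and one possible response; the exact answer, source, and confidence depend on your data and model, so this is purely illustrative:

GET /getChatBotResponse?msg=What is your refund policy?

{
  "response": "We offer a full refund within 30 days of purchase.",
  "source": "faq-direct",
  "confidence": 0.91
}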