
AI Chatbot Pipeline Documentation



Pipeline Overview

User Query
   ↓
Encoder (Transformer)
   ↓
Vector Search
   ↓
FAQ priority match?
   ↓
Website content match?
   ↓
Answer synthesis (RAG)
        

This is a retrieval-first pipeline: answers are grounded in retrieved FAQs and website content, which keeps responses efficient and greatly reduces the risk of hallucination.

Tech Stack

  • FastAPI – API
  • Sentence-Transformers – encoder
  • FAISS – vector search
  • Any LLM – for final answer synthesis (optional)
  • FAQs stored separately from website content

Project Structure

app/
 ├── main.py
 ├── embeddings.py
 ├── vector_store.py
 ├── rag.py
 ├── data/
 │    ├── faqs.json
 │    ├── website_chunks.json
        

Load Encoder (Transformer)

# embeddings.py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def encode(text: str):
    return model.encode(text, normalize_embeddings=True)
        
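A numpy-only aside on why `normalize_embeddings=True` matters: once vectors are unit-length, the inner product used by the FAISS index in the next section is exactly cosine similarity (the vectors below are toy examples):

```python
import numpy as np

def normalize(v):
    # Scale a vector to unit length (L2 norm of 1).
    return v / np.linalg.norm(v)

a = normalize(np.array([3.0, 4.0]))
b = normalize(np.array([4.0, 3.0]))

inner = float(a @ b)  # inner product of the unit vectors

# Cosine similarity of the original (unnormalized) vectors.
cosine = float(
    np.dot([3.0, 4.0], [4.0, 3.0])
    / (np.linalg.norm([3.0, 4.0]) * np.linalg.norm([4.0, 3.0]))
)

print(round(inner, 6) == round(cosine, 6))  # True
```

This is why the pipeline can use a plain inner-product index and still rank by cosine similarity.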

Vector Store (FAISS)

# vector_store.py
import faiss
import numpy as np

class VectorStore:
    def __init__(self, embeddings, texts):
        self.texts = texts
        # FAISS expects float32; IndexFlatIP ranks by inner product,
        # which equals cosine similarity for normalized embeddings.
        embeddings = np.asarray(embeddings, dtype="float32")
        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(embeddings)

    def search(self, query_embedding, k=3):
        query = np.asarray([query_embedding], dtype="float32")
        scores, idx = self.index.search(query, k)
        return [(self.texts[i], float(s)) for i, s in zip(idx[0], scores[0])]
        
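What `IndexFlatIP.search` computes can be sketched with plain numpy (the embeddings and texts below are hand-made toy values, not real model outputs):

```python
import numpy as np

# Hypothetical toy embeddings: 3 documents, 4 dimensions, rows already unit-length.
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7071, 0.7071, 0.0, 0.0],
])
texts = ["alpha", "beta", "gamma"]

def brute_force_search(query, k=2):
    scores = docs @ query            # inner product against every document row
    order = np.argsort(-scores)[:k]  # indices of the top-k scores, best first
    return [(texts[i], float(scores[i])) for i in order]

print(brute_force_search(np.array([1.0, 0.0, 0.0, 0.0])))
# [('alpha', 1.0), ('gamma', 0.7071)]
```

FAISS does the same ranking, just with an optimized index instead of a full matrix product.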

Load FAQ & Website Content

# main.py (setup part)
import json
import numpy as np
from embeddings import encode
from vector_store import VectorStore

with open("data/faqs.json", encoding="utf-8") as f:
    faqs = json.load(f)
with open("data/website_chunks.json", encoding="utf-8") as f:
    web = json.load(f)

faq_questions = [f["question"] for f in faqs]
faq_answers = [f["answer"] for f in faqs]

faq_embeddings = np.array([encode(q) for q in faq_questions])
web_embeddings = np.array([encode(w["text"]) for w in web])

faq_store = VectorStore(faq_embeddings, faq_answers)
web_store = VectorStore(web_embeddings, [w["text"] for w in web])
        
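The loading code above assumes a specific shape for the two JSON files. A hypothetical example of each, with field names taken from the list comprehensions in main.py (your real files will contain your own questions, answers, and page text):

```python
import json

# Hypothetical contents of data/faqs.json: a list of question/answer objects.
example_faqs = [
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page."},
]

# Hypothetical contents of data/website_chunks.json: a list of text chunks.
example_chunks = [
    {"text": "Our support team is available Monday to Friday, 9am-5pm."},
]

print(json.dumps(example_faqs, indent=2))
```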

RAG Answer Synthesis

# rag.py
def synthesize_answer(context_chunks, query):
    context = "\n".join(context_chunks)

    prompt = f"""
Answer the question using ONLY the information below.
If the answer is not present, say "I don't have enough information."

Context:
{context}

Question:
{query}
"""
    # Call your LLM here
    return call_llm(prompt)
        
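`call_llm` is left undefined in rag.py. A hypothetical echo stub (standing in for any real chat-completion client) makes the flow runnable end-to-end without an API key:

```python
# Hypothetical stand-in for a real LLM client; it simply echoes the prompt
# so the retrieval -> prompt plumbing can be exercised in isolation.
def call_llm(prompt: str) -> str:
    return prompt

def synthesize_answer(context_chunks, query):
    context = "\n".join(context_chunks)
    prompt = f"""
Answer the question using ONLY the information below.
If the answer is not present, say "I don't have enough information."

Context:
{context}

Question:
{query}
"""
    return call_llm(prompt)

result = synthesize_answer(["Refunds take 5 days."], "How long do refunds take?")
print("Refunds take 5 days." in result)  # True
```

Swapping the stub for a real client (as in the full implementation below) changes nothing else in the module.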

Pros and Cons of the Fine-Tuning-Only Approach (No RAG)

Pros:

  • Model can generate answers in a natural style.
  • Works without retrieving documents, fully self-contained.

Cons:

  • Very small dataset (e.g., 50 Q&A pairs) → low coverage; users may phrase questions differently than the training examples.
  • Hard to update: every new FAQ requires retraining.
  • Prone to hallucination when a question falls outside the trained Q&A pairs.

How RAG Changes the Game

  • You don’t need to train the model on the FAQs.
  • You store your 50 FAQs (or hundreds more) in a vector database.
  • When a user asks a question:
    1. The retriever finds the most relevant FAQ(s).
    2. The generator synthesizes the answer from the retrieved FAQ.

Benefits:

  • Grounding answers in retrieved content makes fabricated responses far less likely.
  • Easy to add new FAQs or website content without retraining.
  • Can handle paraphrased or unseen questions better.

Should You Keep the Fine-Tuned Model?

  • If your current fine-tuned transformer works well for tone/style, you can keep it as the generator in RAG.
  • But if coverage is low, RAG + a general-purpose pretrained LLM (like a base GPT or local model) is often better than fine-tuning on just 50 Q&A.

Suggested Transition

  1. Keep your FAQs in a vector database.
  2. Use your fine-tuned model (optional) for answer synthesis OR use a general LLM.
  3. Let RAG handle retrieval + synthesis:
    • FAQs get high priority.
    • Website content or docs are secondary.
  4. Optional: keep training/fine-tuning if you need a specific response style.

RAG + LLM Implementation



# main.py (full app)
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates

import json
import numpy as np
from openai import OpenAI

from embeddings import encode
from vector_store import VectorStore

# -------------------------
# App Setup
# -------------------------
app = FastAPI()
templates = Jinja2Templates(directory="templates")

# Avoid hard-coding keys; the client reads the OPENAI_API_KEY environment variable.
client = OpenAI()

# -------------------------
# Load Data
# -------------------------
with open("data/faqs.json", encoding="utf-8") as f:
    faqs = json.load(f)
with open("data/website_chunks.json", encoding="utf-8") as f:
    web = json.load(f)

faq_questions = [f["question"] for f in faqs]
faq_answers = [f["answer"] for f in faqs]

faq_embeddings = np.array([encode(q) for q in faq_questions])
web_embeddings = np.array([encode(w["text"]) for w in web])

faq_store = VectorStore(faq_embeddings, faq_answers)
web_store = VectorStore(web_embeddings, [w["text"] for w in web])

# Cosine-similarity thresholds; starting points to tune on your own data
FAQ_THRESHOLD = 0.62
WEB_THRESHOLD = 0.50
HIGH_CONFIDENCE = 0.85  # above this, return the FAQ answer without calling the LLM

# -------------------------
# Simple Cache (in-memory and unbounded; consider an LRU in production)
# -------------------------
cache = {}

# -------------------------
# Keyword Routing
# -------------------------
def keyword_route(query: str):
    q = query.lower()

    if "payment" in q or "pay" in q or "paypal" in q:
        return "We accept credit cards, PayPal, and bank transfers."

    if "contact" in q or "email" in q or "support" in q:
        return "You can contact our support team at support@company.com."

    if "refund" in q or "return" in q:
        return "We offer a full refund within 30 days of purchase."

    if "track" in q or "tracking" in q:
        return "Track your order using the tracking link sent to your email."

    return None

# -------------------------
# Context Trimming
# -------------------------
MAX_CONTEXT_CHARS = 1500

def trim_context(contexts):
    trimmed = []
    total = 0

    for c in contexts:
        if total + len(c) > MAX_CONTEXT_CHARS:
            break
        trimmed.append(c)
        total += len(c)

    return trimmed

# -------------------------
# Model Selection
# -------------------------
def pick_model(query, contexts):
    if len(query.split()) <= 5:
        return "gpt-4o-mini"

    if len(contexts) > 3:
        return "gpt-4o"

    return "gpt-4o-mini"

# -------------------------
# RAG LLM Generation
# -------------------------
def generate_rag_response(query, contexts):
    contexts = trim_context(contexts)
    model = pick_model(query, contexts)

    context_text = "\n\n".join(contexts)

    prompt = f"""
You are a helpful customer support assistant.

Answer briefly (max 3 sentences).
Use ONLY the context below.
If the answer is not in the context, say "I don't know".

Context:
{context_text}

Question:
{query}

Answer:
"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
        max_tokens=200
    )

    return response.choices[0].message.content.strip()

# -------------------------
# Routes
# -------------------------
@app.get("/", response_class=HTMLResponse)
def home(request: Request):
    return templates.TemplateResponse("index.html", {"request": request})


@app.get("/getChatBotResponse")
def get_bot_response(msg: str):
    query = msg.strip()

    # Cache check
    if query in cache:
        return {
            "response": cache[query],
            "source": "cache",
            "confidence": 1.0
        }

    # Keyword routing (short queries)
    if len(query.split()) <= 2:
        keyword_answer = keyword_route(query)
        if keyword_answer:
            cache[query] = keyword_answer
            return {
                "response": keyword_answer,
                "source": "keyword",
                "confidence": 1.0
            }

    # Encode query
    query_embedding = encode(query)

    contexts = []
    scores = []

    # FAQ retrieval
    faq_results = faq_store.search(query_embedding, k=2)
    for ans, score in faq_results:
        if score >= FAQ_THRESHOLD:
            contexts.append(ans)
            scores.append(score)

    # High-confidence shortcut (skip LLM)
    if faq_results and faq_results[0][1] >= HIGH_CONFIDENCE:
        best_answer = faq_results[0][0]
        cache[query] = best_answer
        return {
            "response": best_answer,
            "source": "faq-direct",
            "confidence": faq_results[0][1]
        }

    # Website retrieval
    web_results = web_store.search(query_embedding, k=2)
    for text, score in web_results:
        if score >= WEB_THRESHOLD:
            contexts.append(text)
            scores.append(score)

    # No context fallback
    if not contexts:
        return {
            "response": "I don't have enough information to answer that.",
            "source": "none",
            "confidence": 0.0
        }

    # LLM generation (RAG)
    final_answer = generate_rag_response(query, contexts)

    # Save to cache
    cache[query] = final_answer

    return {
        "response": final_answer,
        "source": "rag+llm",
        "confidence": max(scores) if scores else 0.5
    } 
        

Workflow

The query flow is:

  • User sends query to /getChatBotResponse?msg=...
  • Check cache: return cached response if exists (fastest)
  • Keyword routing: handle short queries like "payment" or "refund"
  • Encode query into embeddings using encode(query)
  • Search FAQs via vector similarity
  • If high confidence in FAQ → return FAQ answer directly
  • Search website content chunks if needed
  • Combine FAQ + website content as context
  • Trim context to limit token usage
  • Select model based on query and context
  • Generate final answer via RAG + LLM
  • Cache the response for future queries
  • Return final answer with source and confidence score
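The cascade above can be condensed into a toy sketch. The stubbed retriever, its scores, and the example answers below are all hypothetical stand-ins for the real FAISS search:

```python
FAQ_THRESHOLD, HIGH_CONFIDENCE = 0.62, 0.85
cache = {}

def fake_faq_search(query):
    # Stubbed retriever: pretend "refund" matches an FAQ with high confidence.
    if "refund" in query:
        return [("We offer a full refund within 30 days.", 0.91)]
    return [("Generic answer.", 0.40)]

def route(query):
    # 1. Cache hit is the fastest path.
    if query in cache:
        return cache[query], "cache"
    # 2. Retrieve and check confidence against the thresholds.
    answer, score = fake_faq_search(query)[0]
    if score >= HIGH_CONFIDENCE:
        cache[query] = answer
        return answer, "faq-direct"
    if score >= FAQ_THRESHOLD:
        return answer, "rag+llm"  # would be sent to the LLM as context
    # 3. Nothing relevant retrieved: refuse rather than hallucinate.
    return "I don't have enough information.", "none"

print(route("refund policy")[1])  # faq-direct
print(route("refund policy")[1])  # cache (second call is served from cache)
print(route("weather")[1])        # none
```

The real endpoint follows the same order, with keyword routing inserted between the cache check and retrieval.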

