AI Chatbot Pipeline Documentation

Pipeline Overview

User Query
   ↓
Encoder (Transformer)
   ↓
Vector Search
   ↓
FAQ priority match?
   ↓
Website content match?
   ↓
Answer synthesis (RAG)
        

This retrieval-first design is efficient and keeps answers grounded in your own content, which greatly reduces (though cannot fully eliminate) hallucination.

Tech Stack

  • FastAPI – API
  • Sentence-Transformers – embedding encoder
  • FAISS – vector search
  • Any LLM – for final answer synthesis (optional)
  • FAQs stored separately from website content

Project Structure

app/
 ├── main.py
 ├── embeddings.py
 ├── vector_store.py
 ├── rag.py
 ├── data/
 │    ├── faqs.json
 │    ├── website_chunks.json
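
The loading code later reads these two JSON files. Their exact contents are not shown in this document; a minimal sketch of the shape that code assumes (keys `question`/`answer` for FAQs, `text` for website chunks), with placeholder entries:

```python
import json

# Illustrative data files. The keys ("question"/"answer" for FAQs, "text"
# for website chunks) are what the loading code below expects; the entries
# themselves are placeholders.
faqs = [
    {"question": "What payment methods do you accept?",
     "answer": "We accept credit cards, PayPal, and bank transfers."},
    {"question": "How do I contact support?",
     "answer": "You can contact our support team at support@company.com."},
]

website_chunks = [
    {"text": "We offer a full refund within 30 days of purchase."},
    {"text": "Orders ship within 2 business days of payment."},
]

# These would be saved as data/faqs.json and data/website_chunks.json:
print(json.dumps(faqs[:1], indent=2))
```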
        

Load Encoder (Transformer)

# embeddings.py
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def encode(text: str):
    return model.encode(text, normalize_embeddings=True)
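
Note the `normalize_embeddings=True`: with unit-length vectors, a plain inner product equals cosine similarity, which is what lets the FAISS index in the next section use `IndexFlatIP`. A toy numpy check with made-up 2-D vectors:

```python
import numpy as np

# Two arbitrary 2-D vectors standing in for embeddings.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity of the raw vectors...
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals the inner product once both are scaled to unit length.
inner = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(round(cosine, 4), round(inner, 4))  # 0.96 0.96
```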
        

Vector Store (FAISS)

# vector_store.py
import faiss
import numpy as np

class VectorStore:
    def __init__(self, embeddings, texts):
        self.texts = texts
        # FAISS requires float32. IndexFlatIP does exact inner-product
        # search, which equals cosine similarity here because the
        # embeddings are normalized.
        embeddings = np.asarray(embeddings, dtype="float32")
        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(embeddings)

    def search(self, query_embedding, k=3):
        query = np.asarray([query_embedding], dtype="float32")
        scores, idx = self.index.search(query, k)
        results = []
        for i, score in zip(idx[0], scores[0]):
            if i == -1:  # FAISS pads with -1 when fewer than k vectors exist
                continue
            results.append((self.texts[i], float(score)))
        return results
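
Under the hood, `IndexFlatIP` is an exact (brute-force) inner-product search. For intuition, here is the same top-k logic in plain numpy, with made-up 2-D vectors in place of real embeddings:

```python
import numpy as np

# Toy corpus: three "texts" with hand-picked 2-D vectors.
texts = ["refund policy", "shipping times", "contact support"]
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
], dtype="float32")

query = np.array([0.9, 0.1], dtype="float32")

scores = embeddings @ query                 # inner product against every row
top = np.argsort(-scores)[:2]               # indices of the two best scores
results = [(texts[i], float(scores[i])) for i in top]
print(results)  # "refund policy" scores highest, then "contact support"
```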
        

Load FAQ & Website Content

# main.py (setup part)
import json
import numpy as np
from embeddings import encode
from vector_store import VectorStore

with open("data/faqs.json", encoding="utf-8") as f:
    faqs = json.load(f)
with open("data/website_chunks.json", encoding="utf-8") as f:
    web = json.load(f)

faq_questions = [f["question"] for f in faqs]
faq_answers = [f["answer"] for f in faqs]

faq_embeddings = np.array([encode(q) for q in faq_questions])
web_embeddings = np.array([encode(w["text"]) for w in web])

faq_store = VectorStore(faq_embeddings, faq_answers)
web_store = VectorStore(web_embeddings, [w["text"] for w in web])
        

RAG Answer Synthesis

# rag.py
def synthesize_answer(context_chunks, query):
    context = "\n".join(context_chunks)

    prompt = f"""
Answer the question using ONLY the information below.
If the answer is not present, say "I don't have enough information."

Context:
{context}

Question:
{query}
"""
    # Call your LLM here
    return call_llm(prompt)
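
`call_llm` is left undefined above; any chat-completion client can fill that role. For offline testing, a trivial stand-in (purely hypothetical, always returning the fallback string) lets the synthesis path run end to end:

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call. In production this would forward the
    # prompt to your model of choice; here it always returns the fallback.
    return "I don't have enough information."

def synthesize_answer(context_chunks, query):
    # Same wiring as rag.py above, made self-contained for the demo.
    context = "\n".join(context_chunks)
    prompt = (
        "Answer the question using ONLY the information below.\n"
        "If the answer is not present, say \"I don't have enough information.\"\n\n"
        f"Context:\n{context}\n\nQuestion:\n{query}\n"
    )
    return call_llm(prompt)

answer = synthesize_answer(["We ship within 2 business days."], "How fast do you ship?")
print(answer)  # the stub's fixed fallback string
```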
        

Pros and Cons of a Fine-Tuning-Only Approach (no RAG)

Pros:

  • Model can generate answers in a natural style.
  • Works without retrieving documents, fully self-contained.

Cons:

  • Very small dataset (50 Q&A pairs) → low coverage. Users may ask questions slightly differently.
  • Hard to update: every new FAQ requires retraining.
  • Prone to hallucinations if question is outside trained Q&A.

How RAG Changes the Game

  • You don’t need to train the model on the FAQs.
  • You store your 50 FAQs (or hundreds more) in a vector database.
  • When a user asks a question:
    1. The retriever finds the most relevant FAQ(s).
    2. The generator synthesizes the answer from the retrieved FAQ.

Benefits:

  • The model is far less likely to make up answers, because generation is constrained to the retrieved content.
  • Easy to add new FAQs or website content without retraining.
  • Can handle paraphrased or unseen questions better.

Should You Keep the Fine-Tuned Model?

  • If your current fine-tuned transformer works well for tone/style, you can keep it as the generator in RAG.
  • But if coverage is low, RAG + a general-purpose pretrained LLM (like a base GPT or local model) is often better than fine-tuning on just 50 Q&A.

Suggested Transition

  1. Keep your FAQs in a vector database.
  2. Use your fine-tuned model (optional) for answer synthesis OR use a general LLM.
  3. Let RAG handle retrieval + synthesis:
    • FAQs get high priority.
    • Website content or docs are secondary.
  4. Optional: keep training/fine-tuning if you need a specific response style.

RAG + LLM Implementation



# main.py (full version, extending the setup above)
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates

import json
import numpy as np
from openai import OpenAI

from embeddings import encode
from vector_store import VectorStore

# -------------------------
# App Setup
# -------------------------
app = FastAPI()
templates = Jinja2Templates(directory="templates")

# Prefer loading the key from the OPENAI_API_KEY environment variable
# rather than hard-coding it; OpenAI() picks the variable up automatically.
client = OpenAI(api_key="YOUR_API_KEY")

# -------------------------
# Load Data
# -------------------------
with open("data/faqs.json", encoding="utf-8") as f:
    faqs = json.load(f)
with open("data/website_chunks.json", encoding="utf-8") as f:
    web = json.load(f)

faq_questions = [f["question"] for f in faqs]
faq_answers = [f["answer"] for f in faqs]

faq_embeddings = np.array([encode(q) for q in faq_questions])
web_embeddings = np.array([encode(w["text"]) for w in web])

faq_store = VectorStore(faq_embeddings, faq_answers)
web_store = VectorStore(web_embeddings, [w["text"] for w in web])

FAQ_THRESHOLD = 0.62
WEB_THRESHOLD = 0.50
HIGH_CONFIDENCE = 0.85

# -------------------------
# Simple Cache
# -------------------------
cache = {}  # in-memory and unbounded — fine for a demo; use an LRU/TTL cache in production

# -------------------------
# Keyword Routing
# -------------------------
def keyword_route(query: str):
    q = query.lower()

    if "payment" in q or "pay" in q or "paypal" in q:
        return "We accept credit cards, PayPal, and bank transfers."

    if "contact" in q or "email" in q or "support" in q:
        return "You can contact our support team at support@company.com."

    if "refund" in q or "return" in q:
        return "We offer a full refund within 30 days of purchase."

    if "track" in q or "tracking" in q:
        return "Track your order using the tracking link sent to your email."

    return None

# -------------------------
# Context Trimming
# -------------------------
MAX_CONTEXT_CHARS = 1500

def trim_context(contexts):
    trimmed = []
    total = 0

    for c in contexts:
        if total + len(c) > MAX_CONTEXT_CHARS:
            break
        trimmed.append(c)
        total += len(c)

    return trimmed

# -------------------------
# Model Selection
# -------------------------
def pick_model(query, contexts):
    if len(query.split()) <= 5:
        return "gpt-4o-mini"

    if len(contexts) > 3:
        return "gpt-4o"

    return "gpt-4o-mini"

# -------------------------
# RAG LLM Generation
# -------------------------
def generate_rag_response(query, contexts):
    contexts = trim_context(contexts)
    model = pick_model(query, contexts)

    context_text = "\n\n".join(contexts)

    prompt = f"""
You are a helpful customer support assistant.

Answer briefly (max 3 sentences).
Use ONLY the context below.
If the answer is not in the context, say "I don't know".

Context:
{context_text}

Question:
{query}

Answer:
"""

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
        max_tokens=200
    )

    return response.choices[0].message.content.strip()

# -------------------------
# Routes
# -------------------------
@app.get("/", response_class=HTMLResponse)
def home(request: Request):
    return templates.TemplateResponse("index.html", {"request": request})


@app.get("/getChatBotResponse")
def get_bot_response(msg: str):
    query = msg.strip()

    # Cache check
    if query in cache:
        return {
            "response": cache[query],
            "source": "cache",
            "confidence": 1.0
        }

    # Keyword routing (short queries)
    if len(query.split()) <= 2:
        keyword_answer = keyword_route(query)
        if keyword_answer:
            cache[query] = keyword_answer
            return {
                "response": keyword_answer,
                "source": "keyword",
                "confidence": 1.0
            }

    # Encode query
    query_embedding = encode(query)

    contexts = []
    scores = []

    # FAQ retrieval
    faq_results = faq_store.search(query_embedding, k=2)
    for ans, score in faq_results:
        if score >= FAQ_THRESHOLD:
            contexts.append(ans)
            scores.append(score)

    # High-confidence shortcut (skip LLM)
    if faq_results and faq_results[0][1] >= HIGH_CONFIDENCE:
        best_answer = faq_results[0][0]
        cache[query] = best_answer
        return {
            "response": best_answer,
            "source": "faq-direct",
            "confidence": faq_results[0][1]
        }

    # Website retrieval
    web_results = web_store.search(query_embedding, k=2)
    for text, score in web_results:
        if score >= WEB_THRESHOLD:
            contexts.append(text)
            scores.append(score)

    # No context fallback
    if not contexts:
        return {
            "response": "I don't have enough information to answer that.",
            "source": "none",
            "confidence": 0.0
        }

    # LLM generation (RAG)
    final_answer = generate_rag_response(query, contexts)

    # Save to cache
    cache[query] = final_answer

    return {
        "response": final_answer,
        "source": "rag+llm",
        "confidence": max(scores) if scores else 0.5
    } 
        

Workflow

The query flow is:

  • User sends query to /getChatBotResponse?msg=...
  • Check cache: return cached response if exists (fastest)
  • Keyword routing: handle short queries like "payment" or "refund"
  • Encode query into embeddings using encode(query)
  • Search FAQs via vector similarity
  • If high confidence in FAQ → return FAQ answer directly
  • Search website content chunks if needed
  • Combine FAQ + website content as context
  • Trim context to limit token usage
  • Select model based on query and context
  • Generate final answer via RAG + LLM
  • Cache the response for future queries
  • Return final answer with source and confidence score
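
The branching above can be distilled into one pure function for testing — a sketch that takes pre-computed retrieval results as arguments instead of calling the encoder and stores (thresholds copied from the implementation):

```python
FAQ_THRESHOLD = 0.62
WEB_THRESHOLD = 0.50
HIGH_CONFIDENCE = 0.85

def route(query, cached, keyword_hit, faq_results, web_results):
    """Return (source, payload), mirroring /getChatBotResponse.

    faq_results / web_results are lists of (text, score) pairs, as
    VectorStore.search would return them.
    """
    if cached is not None:
        return "cache", cached
    if keyword_hit is not None and len(query.split()) <= 2:
        return "keyword", keyword_hit
    if faq_results and faq_results[0][1] >= HIGH_CONFIDENCE:
        return "faq-direct", faq_results[0][0]
    contexts = [t for t, s in faq_results if s >= FAQ_THRESHOLD]
    contexts += [t for t, s in web_results if s >= WEB_THRESHOLD]
    if not contexts:
        return "none", "I don't have enough information to answer that."
    return "rag+llm", contexts  # contexts would feed generate_rag_response

print(route("refund", None, None, [("Full refund within 30 days.", 0.9)], []))
# → ('faq-direct', 'Full refund within 30 days.')
```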
