The Complete FAQ Chatbot Builder's Guide
Build a Fast, Accurate, Hardware-Efficient FAQ System Like HDFC's Ask Eva or ITR Filing Assistant
A beginner-friendly, end-to-end guide with everything you need to know
Part 1: Understanding the System
Introduction
What You're Building
A professional FAQ chatbot that:
- ✅ Runs on your laptop (no expensive cloud costs)
- ✅ Answers in milliseconds (not seconds)
- ✅ Never hallucinates (grounded in your documents)
- ✅ Handles 5-10 concurrent users easily
- ✅ Works completely offline (after setup)
Think of systems like:
- HDFC Bank's "Ask Eva" - answers banking questions instantly
- India's ITR Filing Assistant - guides tax filing with accuracy
- Company knowledge bases - internal FAQ systems
What You'll Learn
By the end of this guide, you'll know:
- How to structure knowledge for instant retrieval
- How to build a three-tier intelligent system
- When to use FAISS vs Chroma (and why it matters)
- How to minimize expensive AI calls
- How to deploy on modest hardware
Key Philosophy
Don't ask the AI first. Ask your own knowledge first.
This gives you:
- ⚡ Speed: Microseconds vs seconds
- 🎯 Accuracy: Deterministic, not probabilistic
- 💰 Cost: Minimal hardware, no cloud bills
- Control: You decide what it knows
- Predictability: Same question = same answer
Time Investment
- Reading this guide: 1-2 hours
- Basic implementation: 2-3 days
- Production-ready system: 1-2 weeks
Let's begin!
Core Concepts: RAG Explained Simply
What is RAG?
RAG = Retrieval-Augmented Generation
In plain English:
- Retrieval: Find relevant information first
- Augmented: Add that information to your prompt
- Generation: AI explains using only that information
Why Not Just Use ChatGPT Directly?
| Approach | Problem |
|---|---|
| Direct AI | May invent facts ("hallucinate") |
| Direct AI | Expensive (API costs add up) |
| Direct AI | Slow (5-10 seconds per query) |
| Direct AI | Unpredictable (different each time) |
| Our Approach | Benefit |
|---|---|
| Find exact FAQ | Instant (< 5ms) |
| Search docs | Accurate (from your sources) |
| AI explains last | Only when needed |
| Grounded | Cannot invent facts |
The Three-Tier Strategy
User asks: "How do I reset my password?"

Tier 1 (FAQ Bank):
   ↓
Found exact match → Return answer (2ms)
DONE! 60-80% of queries stop here.

Tier 2 (If unsure):
→ Found similar questions
→ "Did you mean: How to reset password?"
→ User picks → Return answer (10ms)
DONE! 15-25% resolve here.

Tier 3 (If still not found):
→ Search documents for "password reset"
→ Find relevant sections
→ AI explains using those sections (2-5 seconds)
DONE! Remaining 5-15% of queries.

Result: 95% of queries answer in under 50ms!
Visual Flow
┌────────────────────┐
│   User Question    │
└────────────────────┘
          ↓
┌──────────────────────┐
│  Normalize & Clean   │
└──────────────────────┘
          ↓
┌─────────────────────────────┐
│ LAYER 1: FAQ BANK           │ ← 60-80% stop here
│ Check: confidence ≥ 0.85?   │
└─────────────────────────────┘
          ↓ No
┌─────────────────────────────┐
│ "Did You Mean?"             │ ← 15-25% resolve here
│ Show 3-5 suggestions        │
└─────────────────────────────┘
          ↓ None matched
┌─────────────────────────────┐
│ LAYER 2: Vector Search      │ ← Find context
│ Retrieve relevant docs      │
└─────────────────────────────┘
          ↓
┌─────────────────────────────┐
│ LAYER 3: LLM                │ ← Explain using context
│ Generate answer             │
└─────────────────────────────┘
          ↓
┌──────────┐
│  Answer  │
└──────────┘
System Architecture Overview
The Complete Picture
┌───────────────────────────────────────────────────┐
│                  USER QUESTION                    │
└───────────────────────────────────────────────────┘
                        ↓
┌───────────────────────────────────────────────────┐
│  LAYER 1: FAQ BANK (SQLite + In-Memory)           │
│  • Technology: SQLite database                    │
│  • Storage: In-memory cache + disk                │
│  • Speed: < 5ms                                   │
│  • Coverage: 60-80% of all queries                │
│  • Purpose: Known, stable answers                 │
└───────────────────────────────────────────────────┘
                        ↓ (if confidence < 0.85)
┌───────────────────────────────────────────────────┐
│  "DID YOU MEAN?" CLARIFICATION LAYER              │
│  • Technology: Same FAQ database                  │
│  • Strategy: Show 3-5 best matches                │
│  • Speed: < 10ms                                  │
│  • Coverage: 15-25% resolve here                  │
│  • Purpose: User self-correction                  │
└───────────────────────────────────────────────────┘
                        ↓ (if no match selected)
┌───────────────────────────────────────────────────┐
│  LAYER 2: DOCUMENT SEARCH (Vector Database)       │
│  • Technology: FAISS or Chroma                    │
│  • Storage: Markdown docs → embeddings            │
│  • Speed: 50-150ms                                │
│  • Purpose: Find relevant context                 │
│  • Process: Semantic similarity search            │
└───────────────────────────────────────────────────┘
                        ↓
┌───────────────────────────────────────────────────┐
│  LAYER 3: LLM INFERENCE (Last Resort)             │
│  • Technology: Mistral-7B-Instruct (GGUF)         │
│  • Engine: llama.cpp (CPU optimized)              │
│  • Speed: 1-5 seconds                             │
│  • Purpose: Explain using retrieved context       │
│  • Rule: Cannot invent facts                      │
└───────────────────────────────────────────────────┘
Component Breakdown
| Component | What It Does | Technology | Speed |
|---|---|---|---|
| FAQ Database | Stores known Q&A pairs | SQLite | Microseconds |
| Memory Cache | Keeps FAQs in RAM | Python dict | Nanoseconds |
| Vector Store | Enables semantic search | FAISS/Chroma | Milliseconds |
| Embeddings | Convert text to numbers | sentence-transformers | One-time cost |
| LLM | Explains complex answers | Mistral-7B via llama.cpp | Seconds |
Hardware Requirements
Minimum Setup (Works, but slow)
- CPU: Intel i5 (10th gen) or equivalent
- RAM: 8 GB
- Storage: 20 GB free (SSD recommended)
- OS: Ubuntu 22.04, Windows 11 (WSL2), or macOS
Experience:
- FAQ queries: Fast
- LLM responses: 5-10 seconds
- Can serve 2-3 concurrent users
Recommended Setup (Smooth experience)
- CPU: Intel i7-1255U or better (12th gen+)
- RAM: 16 GB
- Storage: 30 GB free on SSD
- OS: Ubuntu 22.04 or WSL2
Experience:
- FAQ queries: < 5ms
- LLM responses: 1-5 seconds
- Can serve 5-10 concurrent users comfortably
What You DON'T Need
- ❌ GPU: Everything runs on CPU
- ❌ Cloud services: Fully offline after setup
- ❌ Expensive hardware: Laptops work great
- ❌ Constant internet: Only needed for initial downloads
Storage Breakdown
| Item | Size |
|---|---|
| LLM model (quantized) | 4-5 GB |
| Vector index | 100-500 MB |
| SQLite database | 5-50 MB |
| Python environment | 2-3 GB |
| Documentation | 10-100 MB |
| Total | ~7-9 GB |
Part 2: The Three-Layer Intelligence System
Layer 1: FAQ Bank (The Fast Path)
Purpose
Store and retrieve known, stable answers instantly.
When to Use This Layer
✅ Use for:
- "How do I reset my password?"
- "What are your business hours?"
- "Where is my data stored?"
- "Is internet required?"
❌ Don't use for:
- Complex questions needing reasoning
- Questions requiring multiple sources
- Subjective answers
- Rapidly changing information
Technology Stack
SQLite Database (on disk)
        ↓
In-Memory Cache (Python dict)
        ↓
< 5ms response time
Example: How It Works
User asks: "How do I reset my password?"
System does:
- Normalize: "how do i reset my password"
- Check cache: cache["how do i reset my password"] - Found! Return answer
- Total time: 2ms
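In code, the fast path is little more than a dictionary lookup. A minimal, self-contained sketch (the real schema and cache-loading code appear in Part 3; the sample answer here is just the FAQ from above):

```python
import re

# Tiny stand-in for the in-memory FAQ cache loaded at startup
faq_cache = {
    "how do i reset my password":
        "Click 'Forgot Password' on the login screen, then check your email for a reset link."
}

def normalize(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text.lower())   # lowercase, strip punctuation
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

def lookup_faq(question: str):
    # Exact match on the normalized question; None means "fall through to Layer 2"
    return faq_cache.get(normalize(question))

print(lookup_faq("How do I Reset my Password?"))
```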
Response:
{
"type": "faq_answer",
"answer": "Click 'Forgot Password' on the login screen, then check your email for reset link.",
"confidence": 1.0,
"source": "FAQ #12",
"response_time_ms": 2
}
Performance Characteristics
| Metric | Value |
|---|---|
| Average latency | 2-5 ms |
| Throughput | 1000+ queries/sec |
| Accuracy | 100% (deterministic) |
| Coverage | 60-80% of user queries |
Layer 2: Document Search (The Knowledge Base)
Purpose
Find relevant information from your documentation when it's not in the FAQ bank.
Technology Choice: Vector Database
This is where you need to choose between FAISS and Chroma.
Vector Database: FAISS vs Chroma
One of the most important decisions in your architecture
Quick Decision Guide
┌───────────────────────────────────────┐
│  Are you a beginner?                  │
│  Building your first chatbot?         │
│  Need to ship fast?                   │
├───────────────────────────────────────┤
│               ↓                       │
│         Choose CHROMA ✅              │
└───────────────────────────────────────┘

┌───────────────────────────────────────┐
│  Have you built RAG systems before?   │
│  Need maximum performance?            │
│  Handling 100K+ documents?            │
├───────────────────────────────────────┤
│               ↓                       │
│         Choose FAISS ⚡               │
└───────────────────────────────────────┘
What Are Vector Databases?
Simple explanation:
When you search Google, you type words and it finds pages with those exact words. That's keyword search.
Vector databases do semantic search:
- "How to reset password" matches "Forgot my login credentials"
- "Install software" matches "Setup instructions"
- Understands meaning, not just words
How it works:
- Convert text to numbers (embeddings): "reset password" → [0.23, -0.45, 0.67, ...]
- Similar meanings = similar numbers
- Search by finding closest numbers
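A quick sketch of that idea using sentence-transformers (the same embedding model used later in this guide); the similarity between the two password phrasings comes out much higher than against an unrelated question:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

reset = embedder.encode("How to reset password")
forgot = embedder.encode("Forgot my login credentials")
hours = embedder.encode("What are your business hours?")

print(util.cos_sim(reset, forgot).item())  # high score: same meaning, different words
print(util.cos_sim(reset, hours).item())   # low score: unrelated meaning
```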
FAISS: The Speed Champion
Created by: Meta (Facebook AI Research)
Age: Since 2017 (battle-tested)
Focus: Maximum speed and efficiency
FAISS Strengths
| Feature | Rating | Why |
|---|---|---|
| Speed | ★★★★★ | Optimized C++, SIMD instructions |
| Memory | ★★★★★ | Very efficient, quantization support |
| Scale | ★★★★★ | Handles billions of vectors |
| Maturity | ★★★★★ | Production-proven at Meta scale |
FAISS Weaknesses
| Feature | Rating | Issue |
|---|---|---|
| Ease of use | ★★★ | Steeper learning curve |
| Metadata | ★★ | Manual tracking required |
| Setup | ★★★ | More complex installation |
When to Choose FAISS
✅ Choose FAISS if:
- You need maximum speed (< 1ms queries)
- Working with large scale (100K+ documents)
- Memory is limited (need quantization)
- Building for production (proven reliability)
- You have technical experience
FAISS Quick Example
import faiss
import numpy as np
# 1. Create index
dimension = 384 # embedding size
index = faiss.IndexFlatL2(dimension)
# 2. Add vectors
embeddings = get_embeddings(documents) # shape: (N, 384)
index.add(embeddings.astype('float32'))
# 3. Search
query_embedding = get_embedding("reset password")
distances, indices = index.search(query_embedding, k=5)
# 4. Save (manual persistence)
faiss.write_index(index, "vector.index")
# 5. Load
index = faiss.read_index("vector.index")
Metadata handling (the tricky part):
# You need a separate structure
metadata_store = {
0: {"text": "Doc 1", "category": "auth"},
1: {"text": "Doc 2", "category": "setup"},
# ... manual tracking
}
# After search, lookup metadata
for idx in indices[0]:
print(metadata_store[idx])
Chroma: The Developer's Choice
Created by: Chroma team
Age: Since 2022 (modern, actively developed)
Focus: Developer experience and ease of use
Chroma Strengths
| Feature | Rating | Why |
|---|---|---|
| Ease of use | ★★★★★ | Pythonic, intuitive API |
| Metadata | ★★★★★ | Built-in filtering and queries |
| Setup | ★★★★★ | pip install chromadb and go |
| Persistence | ★★★★★ | Automatic, no manual save/load |
| Features | ★★★★★ | Collections, updates, deletes |
Chroma Weaknesses
| Feature | Rating | Issue |
|---|---|---|
| Speed | ★★★★ | 3-5x slower than FAISS |
| Scale | ★★★★ | Best for < 1M vectors |
| Maturity | ★★★ | Newer (less production history) |
When to Choose Chroma
✅ Choose Chroma if:
- You're a beginner (easier learning curve)
- Building an MVP/prototype (ship faster)
- Need metadata filtering (category, date, priority)
- Want simpler code (less boilerplate)
- Working with < 100K documents (performance is fine)
Chroma Quick Example
import chromadb
# 1. Setup (automatic persistence!)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("faqs")
# 2. Add documents WITH metadata (built-in!)
collection.add(
documents=["How to reset password", "Installation guide"],
metadatas=[
{"category": "auth", "priority": "high"},
{"category": "setup", "priority": "medium"}
],
ids=["doc1", "doc2"]
)
# 3. Query with filtering (this is the magic!)
results = collection.query(
query_texts=["forgot password"],
n_results=5,
where={"category": "auth"} # Built-in metadata filter!
)
# 4. Update easily
collection.update(
ids=["doc1"],
documents=["How to reset your password (updated)"]
)
# 5. Delete easily
collection.delete(ids=["doc2"])
# NO MANUAL SAVE NEEDED - it's automatic!
Performance Comparison
Speed Benchmarks (10,000 documents, 384 dimensions)
| Operation | FAISS | Chroma | Winner |
|---|---|---|---|
| Single query | 0.5-1 ms | 2-5 ms | FAISS (5x faster) |
| Batch 100 queries | ~100 ms | ~500 ms | FAISS |
| Index creation | ~50 ms | ~200 ms | FAISS |
| Add 1000 docs | ~10 ms | ~100 ms | FAISS |
Memory Usage (100,000 documents)
| Component | FAISS | Chroma |
|---|---|---|
| Index only | ~150 MB | ~180 MB |
| With metadata | +external DB | ~200 MB (built-in) |
| Total | ~200 MB | ~250 MB |
Code Complexity
| Task | FAISS | Chroma | Winner |
|---|---|---|---|
| Basic setup | 20 lines | 5 lines | Chroma (4x simpler) |
| With metadata | 50+ lines | 10 lines | Chroma |
| Updates/deletes | Complex | 2 lines | Chroma |
Scenario-Based Decisions
Scenario 1: First-Time Builder, Small FAQ System
Requirements:
- 500-5,000 FAQs
- Need to ship in 1-2 weeks
- Team: 1-2 developers (beginners)
Recommendation: Chroma 🎯
Why?:
- 5x faster development
- Performance is plenty (2-5ms is fine)
- Metadata filtering is easy
- Less code to debug
Expected build time: 3-5 days
Scenario 2: Production System, Medium Scale
Requirements:
- 10,000-50,000 FAQs
- Need category filtering
- Expect 100+ users/day
- Budget: 1 month build time
Recommendation: Chroma 🎯
Why?:
- Built-in metadata perfect for categories
- Performance still good at this scale
- Easier maintenance
- Team velocity matters
Migration note: Can move to FAISS later if needed
Scenario 3: Large Scale, Performance-Critical
Requirements:
- 100,000+ documents
- Need < 50ms total response time
- High traffic (1000+ queries/sec)
- Team: Experienced developers
Recommendation: FAISS ⚡
Why?:
- 5x speed advantage critical at scale
- Better memory efficiency
- Proven at billion-vector scale
- Worth the complexity investment
Build time: 2-4 weeks
Scenario 4: Complex Filtering Needs
Requirements:
- Filter by multiple attributes (category AND date AND priority)
- Frequent updates to documents
- Need audit trails
Recommendation: Chroma 🎯
Why?:
# Chroma: Simple
results = collection.query(
query_texts=["password"],
where={
"$and": [
{"category": "auth"},
{"priority": {"$gte": 3}},
{"date": {"$gte": "2024-01-01"}}
]
}
)
# FAISS: Complex
# 1. Search FAISS
# 2. Load metadata from separate DB
# 3. Filter results manually
# 4. Sort and return
# = 30+ lines of code
Our Recommendation for Your FAQ Chatbot
Phase 1 (Weeks 1-4): Start with Chroma
Reasons:
- ✅ Faster to build: 3-5 days vs 1-2 weeks
- ✅ Easier to debug: Simpler code
- ✅ Built-in features: Metadata, persistence, updates
- ✅ Good enough: 2-5ms is fast for a FAQ chatbot
- ✅ Lower risk: Less to go wrong
Phase 2 (If needed): Migrate to FAISS
When to migrate:
- ⚠️ Hit 100K+ documents
- ⚠️ Need < 1ms query time
- ⚠️ Running out of memory
- ⚠️ Serving 500+ queries/second
Migration effort: 1-2 days (straightforward)
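A rough sketch of what that migration could look like: export documents, embeddings, and metadata from Chroma, build a FAISS index, and keep the metadata in a plain dict keyed by row position. This assumes the collection was created with Chroma's default embedding function (so the stored embeddings can be reused); names and paths are illustrative.

```python
import faiss
import numpy as np
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("faqs")

# Pull everything out of Chroma, including the stored embeddings
data = collection.get(include=["documents", "embeddings", "metadatas"])
embeddings = np.array(data["embeddings"], dtype="float32")

# Build and persist the FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "vector.index")

# FAISS has no metadata store, so keep it alongside the index
metadata_store = {
    i: {"id": data["ids"][i], "text": data["documents"][i], **(data["metadatas"][i] or {})}
    for i in range(len(data["ids"]))
}
```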
Installation & Setup
Chroma (Recommended for beginners)
# Installation (super simple!)
pip install chromadb
# Optional: Add server mode
pip install chromadb[server]
First code (ready in 5 minutes):
import chromadb
from sentence_transformers import SentenceTransformer
# 1. Setup
client = chromadb.PersistentClient(path="./faq_db")
collection = client.get_or_create_collection("faqs")
# 2. Add your FAQs
faqs = [
"How do I reset my password?",
"What are your business hours?",
"Where is my data stored?"
]
metadata = [
{"category": "auth", "priority": 5},
{"category": "info", "priority": 3},
{"category": "data", "priority": 4}
]
collection.add(
documents=faqs,
metadatas=metadata,
ids=[f"faq_{i}" for i in range(len(faqs))]
)
# 3. Query
results = collection.query(
query_texts=["forgot password"],
n_results=3,
where={"category": "auth"}
)
print(results)
That's it! You're up and running.
FAISS (For experienced developers)
# Installation
pip install faiss-cpu # or faiss-gpu if you have GPU
# From source (for latest features)
git clone https://github.com/facebookresearch/faiss.git
cd faiss
cmake -B build .
make -C build
First code (takes longer to set up):
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
# 1. Setup embedder
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Create index
dimension = 384
index = faiss.IndexFlatL2(dimension)
# 3. Prepare data
faqs = ["How do I reset password?", "Business hours?"]
embeddings = embedder.encode(faqs).astype('float32')
index.add(embeddings)
# 4. Manual metadata tracking
metadata = {
0: {"text": faqs[0], "category": "auth"},
1: {"text": faqs[1], "category": "info"}
}
# 5. Query
query = "forgot password"
query_emb = embedder.encode([query]).astype('float32')
distances, indices = index.search(query_emb, k=3)
# 6. Retrieve with metadata
for idx in indices[0]:
print(metadata[idx])
# 7. Save manually
faiss.write_index(index, "faiss.index")
import json
with open("metadata.json", "w") as f:
json.dump(metadata, f)
Notice: More code, more manual work.
Summary: FAISS vs Chroma
| Aspect | FAISS | Chroma | For Beginners |
|---|---|---|---|
| Speed | ⚡⚡⚡⚡⚡ | ⚡⚡⚡⚡ | Chroma (fast enough) |
| Ease | ★★★ | ★★★★★ | Chroma (much easier) |
| Setup time | 2-3 hours | 30 minutes | Chroma |
| Code complexity | High | Low | Chroma |
| Metadata | Manual | Built-in | Chroma |
| Scale limit | Billions | Millions | FAISS (future-proof) |
| Best for | Production scale | MVP/Learning | Depends |
Our verdict for this guide: Start with Chroma ✅
You can always migrate to FAISS later if you need maximum performance.
Layer 3: LLM Inference (The Explainer)
Purpose
When the question isn't in your FAQ bank and requires explanation or synthesis, use the LLM.
Critical Rules
- LLM cannot search: It only explains what you give it
- Must use context: Only information from Layer 2
- Cannot invent: Strictly grounded in documents
- Last resort: Most expensive and slowest
Technology: Mistral-7B-Instruct
Why Mistral?
- ✅ Best quality-per-parameter for 7B models
- ✅ Excellent instruction following
- ✅ Runs well on CPU
- ✅ Permissive license
- ✅ Quantized versions available (~4 GB)
Via llama.cpp:
- CPU-optimized inference
- GGUF quantization support
- Fast and efficient
- No GPU needed
How It Works
User asks: "Why does login fail after password reset?"
        ↓
FAQ Bank: No exact match (confidence: 0.42)
        ↓
Document Search: Finds 3 relevant sections:
  1. "Password reset process"
  2. "Common login errors"
  3. "Cache clearing instructions"
        ↓
Build prompt:
"""
Answer using ONLY this context:

CONTEXT:
[Section 1: Password reset process...]
[Section 2: Common login errors...]
[Section 3: Cache clearing instructions...]

QUESTION: Why does login fail after password reset?

RULES:
- Use only the context above
- If not answered in context, say "I don't know"
- Be concise and accurate

ANSWER:
"""
        ↓
LLM generates: "Login may fail after password reset if
your browser cache contains old credentials. Try
clearing your browser cache and cookies, then log
in again with your new password."
        ↓
Return answer with sources cited
Performance
| Metric | Value |
|---|---|
| Latency | 1-5 seconds |
| Throughput | 1-2 requests/sec |
| Accuracy | High (when grounded) |
| Usage | 5-15% of queries |
The "Did You Mean?" Feature
Why This Is Critical
Problem: User asks "forgot my login info"
System thinks: Not confident enough (score: 0.73)
Without this feature:
- ❌ System searches documents (slower)
- ❌ Maybe invokes LLM (expensive)
- ❌ User frustrated (should've been simple)
With this feature:
- ✅ Show 3 similar FAQs
- ✅ User picks correct one
- ✅ Instant answer (10ms)
- ✅ Better experience
When to Trigger
Confidence thresholds:
- ≥ 0.85: Answer directly (high confidence)
- 0.65-0.84: Show "Did you mean?" (medium confidence)
- < 0.65: Escalate to document search (low confidence)
Example Flow
User: "forgot my login info"
        ↓
System calculates confidence: 0.73
        ↓
Trigger: "Did you mean?"
        ↓
Show user:
  1. How do I reset my password?
  2. What happens if login fails?
  3. How do I retrieve my username?
  4. None of these - search documentation
        ↓
User clicks: 1
        ↓
Return FAQ #12 answer instantly
        ↓
Total time: 8ms (instead of 2000ms with LLM!)
Implementation
def answer_question(user_question):
normalized = normalize(user_question)
matches = score_all_faqs(normalized)
best = matches[0]
# High confidence: answer directly
if best.score >= 0.85:
return {
"type": "direct_answer",
"answer": best.answer
}
# Medium confidence: ask for clarification
if best.score >= 0.65:
return {
"type": "did_you_mean",
"message": "I want to be sure I understand correctly.",
"suggestions": [
{"id": m.id, "question": m.question}
for m in matches[:3]
]
}
# Low confidence: search documents
return search_documents(user_question)
Impact
| Metric | Improvement |
|---|---|
| LLM calls | -40% to -70% |
| Average latency | -60% |
| User satisfaction | +35% |
| Hardware load | -50% |
Part 3: Technical Design
Database Schema & Indexing Strategy
FAQ Table (SQLite)
CREATE TABLE faq (
id INTEGER PRIMARY KEY AUTOINCREMENT,
question TEXT NOT NULL,
normalized_question TEXT NOT NULL, -- lowercase, no punctuation
answer TEXT NOT NULL,
category TEXT, -- 'auth', 'setup', etc.
keywords TEXT, -- 'reset,password,login'
priority INTEGER DEFAULT 1, -- 1-5 (for tie-breaking)
last_updated DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Performance indexes (CRITICAL!)
CREATE INDEX idx_faq_normalized ON faq(normalized_question);
CREATE INDEX idx_faq_category ON faq(category);
CREATE INDEX idx_faq_keywords ON faq(keywords);
Why Each Column Matters
| Column | Purpose | Example |
|---|---|---|
| normalized_question | Fast exact matching | "how do i reset my password" |
| keywords | Quick filtering | "reset,password,credentials" |
| category | Group related FAQs | "authentication" |
| priority | Resolve ties | 5 = critical, 1 = low |
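A small sketch of how the exact-match path uses that index, querying SQLite directly (in the running system the same lookup is served from the in-memory cache; file name and data follow the scripts in Part 4):

```python
import sqlite3

def lookup_exact(normalized_question: str):
    conn = sqlite3.connect("faq.db")
    conn.row_factory = sqlite3.Row
    # Equality on normalized_question is served by idx_faq_normalized
    row = conn.execute(
        "SELECT answer, category, priority FROM faq WHERE normalized_question = ?",
        (normalized_question,),
    ).fetchone()
    conn.close()
    return dict(row) if row else None

print(lookup_exact("how do i reset my password"))
```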
Normalization Function
import re
def normalize(text: str) -> str:
"""Prepare text for matching."""
text = text.lower()
text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
text = re.sub(r'\s+', ' ', text) # Collapse spaces
return text.strip()
# Examples:
normalize("How do I Reset my Password?")
# → "how do i reset my password"
normalize("What's your business hours?!")
# → "whats your business hours"
Confidence Scoring System
The Formula
final_score = (
    0.5 × exact_match_score +
    0.3 × keyword_overlap_score +
    0.2 × embedding_similarity_score
)
Component Breakdown
1. Exact Match (50% weight)
def exact_match_score(user_q, faq_q):
if normalize(user_q) == normalize(faq_q):
return 1.0
else:
return 0.0
Why 50%? Exact matches should strongly favor direct answers.
2. Keyword Overlap (30% weight)
def keyword_overlap_score(user_q, faq_keywords):
user_words = set(user_q.lower().split())
faq_words = set(faq_keywords.split(','))
common = user_words & faq_words
return len(common) / len(user_words) if user_words else 0.0
Example:
- User: "reset password login"
- FAQ keywords: "reset,password,credentials"
- Common: {reset, password}
- Score: 2/3 = 0.67
3. Embedding Similarity (20% weight)
from sentence_transformers import SentenceTransformer, util
embedder = SentenceTransformer('all-MiniLM-L6-v2')
def embedding_similarity_score(user_q, faq_q):
emb1 = embedder.encode(user_q)
emb2 = embedder.encode(faq_q)
similarity = util.cos_sim(emb1, emb2).item()
return similarity
Why 20%? Catches semantic similarity but less reliable than exact/keyword.
Complete Example
User question: "how to reset password"
FAQ question: "how do i reset my password"

Calculations:
exact_match = 0.0 (not identical)
keyword_overlap = 1.0 (all words match)
embedding_sim = 0.92 (very similar)

Final score:
= (0.5 × 0.0) + (0.3 × 1.0) + (0.2 × 0.92)
= 0.0 + 0.3 + 0.184
= 0.484

Decision: 0.484 < 0.65 → Escalate to document search
Caching Policies
What to Cache
| Item | Storage | Lifetime | Why Cache? |
|---|---|---|---|
| All FAQs | In-memory dict | App lifetime | 1000x faster than disk |
| FAQ embeddings | NumPy array | App lifetime | Avoid recomputation |
| Recent queries | LRU cache (256) | 1 hour | Repeat queries common |
| Document chunks | In-memory + disk | Until update | Balance speed/memory |
Cache Architecture
Application Startup
        ↓
┌───────────────────────────┐
│  Load SQLite → Memory     │  (5-50 MB)
└───────────────────────────┘
        ↓
┌───────────────────────────┐
│  Precompute Embeddings    │  (20-100 MB)
└───────────────────────────┘
        ↓
┌───────────────────────────┐
│  Load Vector Index        │  (100-500 MB)
└───────────────────────────┘
        ↓
Ready to serve queries in < 5ms!
Implementation
from functools import lru_cache
import sqlite3
# Global cache
faq_cache = {}
faq_embeddings = {}
def load_faq_cache():
"""Load all FAQs into memory at startup."""
    conn = sqlite3.connect('faq.db')
    conn.row_factory = sqlite3.Row  # so rows support row['column'] access
    cursor = conn.execute("SELECT * FROM faq")
for row in cursor:
key = row['normalized_question']
faq_cache[key] = {
'id': row['id'],
'question': row['question'],
'answer': row['answer'],
'category': row['category'],
'priority': row['priority']
}
print(f"Loaded {len(faq_cache)} FAQs into memory")
# LRU cache for repeat queries
@lru_cache(maxsize=256)
def get_answer(question: str):
"""Cache recent answers."""
normalized = normalize(question)
return faq_cache.get(normalized)
Cache Invalidation
| Event | Action |
|---|---|
| FAQ updated | Clear FAQ cache, reload |
| Document updated | Rebuild vector index |
| App restart | Reload all caches |
| Memory pressure | Evict LRU items |
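A minimal sketch of the "FAQ updated" row, reusing the faq_cache, load_faq_cache() and get_answer() helpers defined in the caching snippet above:

```python
def invalidate_faq_cache():
    """Call after inserting or editing rows in the faq table."""
    faq_cache.clear()         # drop stale in-memory entries
    load_faq_cache()          # reload everything from SQLite
    get_answer.cache_clear()  # flush the LRU of recently answered questions
```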
⚡ Concurrency & Throughput Design
The Challenge
Different layers have vastly different performance characteristics:
| Layer | Latency | CPU Usage | Concurrency |
|---|---|---|---|
| FAQ lookup | < 5ms | Minimal | Unlimited |
| Vector search | 50-150ms | Moderate | 10-20 |
| LLM inference | 1-5 sec | Heavy | 1-2 max |
Thread Pool Architecture
FastAPI Server (Uvicorn)
│
├── FAQ Worker Pool
│   ├── ThreadPoolExecutor(10 workers)
│   └── In-memory lookups (ultra-fast)
│
├── Vector Search Pool
│   ├── ThreadPoolExecutor(5 workers)
│   └── FAISS/Chroma queries (medium)
│
└── LLM Queue (CRITICAL!)
    ├── Semaphore(2) → MAX 2 CONCURRENT
    └── llama.cpp inference (slow, CPU-heavy)
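A sketch of how the vector-search pool could be wired: Chroma's query() is a blocking call, so it is handed to a dedicated executor instead of stalling the FastAPI event loop. The collection path and name mirror the main.py listing later in this guide.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import chromadb

chroma_collection = chromadb.PersistentClient(path="./data/chroma_db") \
    .get_or_create_collection("faq_documents")
vector_pool = ThreadPoolExecutor(max_workers=5)  # matches the diagram above

async def search_documents(question: str, n_results: int = 5):
    loop = asyncio.get_running_loop()
    # Run the blocking Chroma query in the pool so the event loop stays free
    return await loop.run_in_executor(
        vector_pool,
        lambda: chroma_collection.query(query_texts=[question], n_results=n_results),
    )
```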
LLM Gatekeeping (Most Important!)
Problem: LLM uses 100% CPU for 1-5 seconds.
Solution: Strict concurrency limit.
import asyncio
from asyncio import Semaphore
# Allow max 2 concurrent LLM calls
llm_semaphore = Semaphore(2)
llm_queue_size = 0
LLM_QUEUE_MAX = 5
async def call_llm(prompt: str):
global llm_queue_size
# Check queue size
if llm_queue_size >= LLM_QUEUE_MAX:
return {
"error": "System busy",
"message": "Too many requests. Try again in 10 seconds.",
"retry_after": 10
}
# Gate concurrency
llm_queue_size += 1
async with llm_semaphore:
try:
result = await run_llama_cpp(prompt)
return result
finally:
llm_queue_size -= 1
Expected Throughput
| Query Type | Latency | Throughput | Concurrency |
|---|---|---|---|
| FAQ hit | 2-5ms | 1000+ req/sec | Unlimited |
| Did you mean | 5-10ms | 500+ req/sec | High |
| Doc search | 50-150ms | 50-100 req/sec | 10-20 |
| LLM call | 1-5 sec | 1-2 req/sec | 2 max |
Overall system: Serves 5-10 concurrent users comfortably.
Part 4: Implementation Guide
Step-by-Step Setup
Phase 1: Environment Setup (30 minutes)
1.1 Install System Dependencies
Ubuntu/WSL2:
sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential cmake python3 python3-venv python3-pip
macOS:
brew install python3 cmake git
1.2 Create Python Environment
cd ~
python3 -m venv faq-chatbot-env
source faq-chatbot-env/bin/activate # Linux/Mac
# faq-chatbot-env\Scripts\activate # Windows
pip install --upgrade pip
Phase 2: Install llama.cpp (30 minutes)
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build
make
# Test
./main --help
Phase 3: Download Model (30 minutes)
Recommended: Mistral-7B-Instruct (GGUF Q4)
mkdir -p ~/models/mistral-7b
# Download from Hugging Face (TheBloke)
# Visit: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
# Download: mistral-7b-instruct-v0.2.Q4_K_M.gguf (~4.4 GB)
# Move to models directory
mv mistral-7b-instruct-v0.2.Q4_K_M.gguf ~/models/mistral-7b/
Test the model:
cd ~/llama.cpp
./main -m ~/models/mistral-7b/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
-p "Hello, how are you?" \
-n 50 \
-t 8
Phase 4: Install Python Packages (15 minutes)
source ~/faq-chatbot-env/bin/activate
# Core packages
pip install \
chromadb \
sentence-transformers \
fastapi \
uvicorn \
python-multipart
# Optional: FAISS (if you decide to use it instead)
# pip install faiss-cpu
Phase 5: Create Database (30 minutes)
5.1 Create Schema
# create_database.py
import sqlite3
conn = sqlite3.connect('faq.db')
cursor = conn.cursor()
# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS faq (
id INTEGER PRIMARY KEY AUTOINCREMENT,
question TEXT NOT NULL,
normalized_question TEXT NOT NULL,
answer TEXT NOT NULL,
category TEXT,
keywords TEXT,
priority INTEGER DEFAULT 1,
last_updated DATETIME DEFAULT CURRENT_TIMESTAMP
)
''')
# Create indexes
cursor.execute('CREATE INDEX IF NOT EXISTS idx_faq_normalized ON faq(normalized_question)')
cursor.execute('CREATE INDEX IF NOT EXISTS idx_faq_category ON faq(category)')
conn.commit()
print("Database created successfully!")
Run it:
python create_database.py
5.2 Add Sample FAQs
# populate_faqs.py
import sqlite3
import re
def normalize(text):
text = text.lower()
text = re.sub(r'[^\w\s]', '', text)
text = re.sub(r'\s+', ' ', text)
return text.strip()
faqs = [
{
"question": "How do I reset my password?",
"answer": "Click 'Forgot Password' on the login screen, then check your email for a reset link.",
"category": "authentication",
"keywords": "reset,password,forgot,login,credentials",
"priority": 5
},
{
"question": "What are your business hours?",
"answer": "We are open Monday-Friday, 9 AM to 6 PM (EST).",
"category": "information",
"keywords": "hours,open,time,schedule",
"priority": 3
},
{
"question": "Where is my data stored?",
"answer": "All data is stored locally on your device. We do not use cloud storage.",
"category": "data",
"keywords": "data,storage,privacy,local,cloud",
"priority": 4
},
]
conn = sqlite3.connect('faq.db')
cursor = conn.cursor()
for faq in faqs:
cursor.execute('''
INSERT INTO faq (question, normalized_question, answer, category, keywords, priority)
VALUES (?, ?, ?, ?, ?, ?)
''', (
faq['question'],
normalize(faq['question']),
faq['answer'],
faq['category'],
faq['keywords'],
faq['priority']
))
conn.commit()
print(f"Added {len(faqs)} FAQs to database!")
Run it:
python populate_faqs.py
Phase 6: Set Up Vector Database (30 minutes)
Using Chroma (Recommended)
# setup_vector_db.py
import chromadb
from sentence_transformers import SentenceTransformer
# Initialize
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
name="faq_documents",
metadata={"description": "FAQ chatbot documentation"}
)
# Sample documents (organize in docs/ folder as .md files)
documents = [
"To reset your password, navigate to the login page and click 'Forgot Password'. Enter your email address and you'll receive a reset link within 5 minutes.",
"Our customer service is available Monday through Friday from 9 AM to 6 PM Eastern Time. We respond to emails within 24 hours.",
"All user data is stored locally on your device using SQLite. We do not transmit your data to external servers or use cloud storage."
]
metadata = [
{"file": "auth.md", "section": "Password Reset", "category": "authentication"},
{"file": "info.md", "section": "Business Hours", "category": "information"},
{"file": "data.md", "section": "Data Storage", "category": "data"}
]
# Add to Chroma
collection.add(
documents=documents,
metadatas=metadata,
ids=[f"doc_{i}" for i in range(len(documents))]
)
print("Vector database set up successfully!")
print(f"Added {len(documents)} documents")
# Test query
results = collection.query(
query_texts=["How to reset password?"],
n_results=3
)
print("\nTest query results:")
print(results)
Run it:
python setup_vector_db.py
Document Organization Best Practices
Folder Structure
faq_docs/
├── 01_overview.md            # System introduction
├── 02_getting_started.md     # Quick start guide
├── 03_authentication.md      # Login, password, security
├── 04_installation.md        # Setup instructions
├── 05_basic_usage.md         # Common operations
├── 06_advanced_features.md   # Power user guide
├── 07_troubleshooting.md     # Common issues
├── 08_api_reference.md       # Technical details
├── 90_faq_core.md            # Dedicated FAQ file
└── 99_glossary.md            # Definitions
Writing Good Markdown
❌ Bad (Hard to Retrieve)
## Installation
Here's everything about installation including prerequisites,
step-by-step process, common errors, fixes, and advanced options...
[500 lines of mixed content]
✅ Good (Easy to Retrieve)
## Installation Prerequisites
Before installing, ensure you have:
- Python 3.10 or higher
- 20 GB free disk space
- Git installed

## Installation Steps
1. Clone the repository:
   ```bash
   git clone https://github.com/example/repo.git
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Run setup:
   ```bash
   python setup.py
   ```
Common Installation Errors
Error: Module Not Found
Symptom: ModuleNotFoundError: No module named 'xyz'
Cause: Missing dependency
Fix: Run pip install -r requirements.txt
Error: Permission Denied
Symptom: Permission denied when writing to directory
Cause: Insufficient permissions
Fix: Run with sudo or adjust directory permissions
### Key Principles
1. **One heading = one idea**
2. **Use hierarchical structure** (##, ###, ####)
3. **Keep sections under 200-500 words**
4. **Be specific and concrete**
5. **Include examples**
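Following these principles, each level-2 section becomes one retrievable chunk. A sketch of a loader that splits every file in faq_docs/ on "## " headings and indexes the chunks in Chroma (the one-heading-per-chunk rule is a simplification; paths and names are illustrative):

```python
import re
from pathlib import Path

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("faq_documents")

docs, metas, ids = [], [], []
for path in sorted(Path("faq_docs").glob("*.md")):
    text = path.read_text(encoding="utf-8")
    # Split on level-2 headings; each section becomes one chunk
    for i, section in enumerate(re.split(r"\n(?=## )", text)):
        section = section.strip()
        if not section:
            continue
        heading = section.splitlines()[0].lstrip("# ").strip()
        docs.append(section)
        metas.append({"file": path.name, "section": heading})
        ids.append(f"{path.stem}_{i}")

collection.add(documents=docs, metadatas=metas, ids=ids)
print(f"Indexed {len(docs)} chunks")
```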
---
## Production-Grade Folder Structure

faq_chatbot/
├── app/
│   ├── __init__.py
│   ├── main.py                  # Application entry
│   │
│   ├── api/                     # API layer
│   │   ├── __init__.py
│   │   ├── routes.py            # FastAPI endpoints
│   │   └── models.py            # Pydantic models
│   │
│   ├── core/                    # Business logic
│   │   ├── __init__.py
│   │   ├── faq_engine.py        # FAQ matching
│   │   ├── confidence.py        # Scoring
│   │   ├── cache.py             # Caching
│   │   └── llm_gate.py          # LLM concurrency
│   │
│   ├── retrieval/               # Document layer
│   │   ├── __init__.py
│   │   ├── embeddings.py
│   │   ├── vector_store.py      # Chroma interface
│   │   └── doc_loader.py
│   │
│   ├── prompts/                 # Prompt templates
│   │   └── system_prompt.txt
│   │
│   └── utils/
│       ├── __init__.py
│       ├── text.py              # Normalization
│       └── logger.py
│
├── data/
│   ├── faq.db                   # SQLite
│   ├── docs/                    # Markdown files
│   └── chroma_db/               # Vector store
│
├── models/
│   └── mistral-7b/
│       └── model.gguf
│
├── scripts/
│   ├── create_database.py
│   ├── populate_faqs.py
│   └── setup_vector_db.py
│
├── tests/
│   ├── test_faq.py
│   ├── test_confidence.py
│   └── test_retrieval.py
│
├── requirements.txt
├── config.yaml
└── README.md
---
## Complete Code Implementation
### main.py (Application Entry)
```python
# app/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import sqlite3
import chromadb
from sentence_transformers import SentenceTransformer
import subprocess
import asyncio
from asyncio import Semaphore
app = FastAPI(title="FAQ Chatbot")
# Global state
faq_cache = {}
embedder = SentenceTransformer('all-MiniLM-L6-v2')
chroma_client = None
chroma_collection = None
llm_semaphore = Semaphore(2) # Max 2 concurrent LLM calls
# Models
class Query(BaseModel):
question: str
class Answer(BaseModel):
type: str # 'faq_answer', 'did_you_mean', 'llm_answer'
answer: Optional[str] = None
suggestions: Optional[List[dict]] = None
sources: Optional[List[str]] = None
confidence: Optional[float] = None
# Startup
@app.on_event("startup")
async def startup():
global faq_cache, chroma_client, chroma_collection
# Load FAQs into memory
print("Loading FAQ database...")
conn = sqlite3.connect('data/faq.db')
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM faq")
for row in cursor:
faq_cache[row['normalized_question']] = dict(row)
print(f"Loaded {len(faq_cache)} FAQs into memory")
# Initialize Chroma
print("Loading vector database...")
chroma_client = chromadb.PersistentClient(path="./data/chroma_db")
chroma_collection = chroma_client.get_or_create_collection("faq_documents")
print("Vector database loaded")
# Helper functions
def normalize(text: str) -> str:
import re
text = text.lower()
text = re.sub(r'[^\w\s]', '', text)
text = re.sub(r'\s+', ' ', text)
return text.strip()
def score_faq(user_q: str, faq: dict) -> float:
# Exact match
if user_q == faq['normalized_question']:
exact_score = 1.0
else:
exact_score = 0.0
# Keyword overlap
user_words = set(user_q.split())
faq_words = set(faq['keywords'].split(','))
common = user_words & faq_words
keyword_score = len(common) / len(user_words) if user_words else 0.0
# Embedding similarity
user_emb = embedder.encode(user_q)
faq_emb = embedder.encode(faq['normalized_question'])
from sentence_transformers import util
embedding_score = util.cos_sim(user_emb, faq_emb).item()
# Final score
final = 0.5 * exact_score + 0.3 * keyword_score + 0.2 * embedding_score
return final
async def call_llm(prompt: str) -> str:
async with llm_semaphore:
result = subprocess.run(
[
"./llama.cpp/main",
"-m", "models/mistral-7b/model.gguf",
"-p", prompt,
"-n", "256",
"-t", "8",
"--temp", "0.2"
],
capture_output=True,
text=True,
timeout=30
)
return result.stdout.strip()
# Main endpoint
@app.post("/chat", response_model=Answer)
async def chat(query: Query):
user_q = query.question
normalized_q = normalize(user_q)
# Layer 1: FAQ Bank
scores = []
for faq in faq_cache.values():
score = score_faq(normalized_q, faq)
scores.append((score, faq))
scores.sort(reverse=True, key=lambda x: x[0])
best_score, best_faq = scores[0]
# High confidence: direct answer
if best_score >= 0.85:
return Answer(
type="faq_answer",
answer=best_faq['answer'],
confidence=best_score,
sources=[f"FAQ #{best_faq['id']}"]
)
# Medium confidence: "Did you mean?"
if best_score >= 0.65:
suggestions = [
{"id": faq['id'], "question": faq['question'], "score": score}
for score, faq in scores[:3]
]
return Answer(
type="did_you_mean",
suggestions=suggestions
)
# Layer 2: Document search
results = chroma_collection.query(
query_texts=[user_q],
n_results=5
)
if not results['documents'][0]:
return Answer(
type="not_found",
answer="I couldn't find information about this in the documentation."
)
# Layer 3: LLM
context = "\n\n".join(results['documents'][0])
prompt = f"""Answer the question using ONLY the context below.
CONTEXT:
{context}
QUESTION: {user_q}
RULES:
- Use only the context provided
- If not answered in context, say "I don't know"
- Be concise and accurate
ANSWER:"""
answer = await call_llm(prompt)
return Answer(
type="llm_answer",
answer=answer,
        sources=[m.get("file", "unknown") for m in results['metadatas'][0]]  # metadatas are dicts; expose source file names
)
# Run with: uvicorn app.main:app --reload
```
Testing
# Start server
uvicorn app.main:app --reload --port 8000
# Test with curl
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question": "How do I reset my password?"}'
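Beyond manual curl checks, a minimal automated test could use FastAPI's TestClient. A sketch, assuming the project layout from this guide and a populated FAQ database (the file name matches tests/test_faq.py in the folder structure above):

```python
# tests/test_faq.py
from fastapi.testclient import TestClient

from app.main import app

def test_known_question_avoids_the_llm():
    # Using the client as a context manager runs the startup event (cache + Chroma loading)
    with TestClient(app) as client:
        resp = client.post("/chat", json={"question": "How do I reset my password?"})
        assert resp.status_code == 200
        body = resp.json()
        # A question that exists in the FAQ bank should resolve in Layer 1,
        # either directly or via a "Did you mean?" clarification
        assert body["type"] in ("faq_answer", "did_you_mean")
```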
Part 5: Advanced Topics
Model Selection Guide
Open-Source Models Comparison
| Model | Size | Speed (CPU) | Quality | License | Best For |
|---|---|---|---|---|---|
| Mistral-7B-Instruct ✅ | 7B | Fast | Excellent | Permissive | FAQ chatbots |
| LLaMA-2-7B-Chat | 7B | Medium | Very Good | Meta | Long context |
| Falcon-7B-Instruct | 7B | Medium | Good | Apache 2.0 | Open license |
| Vicuna-7B | 7B | Medium | Good | LLaMA-based | Conversational |
Quantization Options
| Format | Size | Quality Loss | Speed | Recommendation |
|---|---|---|---|---|
| Q4_K_M | 4.4 GB | Minimal | Fast | ✅ Best balance |
| Q5_K_M | 5.2 GB | Negligible | Medium | Higher quality |
| Q3_K_M | 3.5 GB | Noticeable | Very fast | Low memory only |
Our choice: Mistral-7B-Instruct Q4_K_M
⚡ Performance Optimization
Layer-by-Layer Optimization
Layer 1: FAQ Bank
Optimizations:
- ✅ Load all FAQs into memory (nanosecond access)
- ✅ Precompute embeddings once
- ✅ Use indexes on normalized_question
- ✅ Cache recent queries (LRU)
Result: < 5ms per query
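"Precompute embeddings once" in practice means encoding every FAQ at startup into a single matrix, then scoring an incoming query against all FAQs with one vectorized operation instead of re-encoding them per request. A small sketch:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
faq_questions = ["How do I reset my password?", "What are your business hours?"]

# One-time cost at startup; normalized vectors make dot product == cosine similarity
faq_matrix = embedder.encode(faq_questions, normalize_embeddings=True)

def embedding_scores(user_question: str) -> np.ndarray:
    query = embedder.encode([user_question], normalize_embeddings=True)
    return (faq_matrix @ query.T).ravel()  # one similarity score per FAQ

print(embedding_scores("forgot password"))
```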
Layer 2: Vector Search
Optimizations:
- ✅ Use Chroma for simplicity (good enough)
- ✅ Keep index in memory
- ✅ Limit to top 5 results
- ✅ Filter by category first
Result: 50-150ms per query
Layer 3: LLM
Optimizations:
- ✅ Use quantized model (Q4)
- ✅ Limit concurrent calls (max 2)
- ✅ Reduce max tokens (256)
- ✅ Lower temperature (0.2)
Result: 1-5 seconds per query
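One way to apply these settings from Python without shelling out to the CLI is the llama-cpp-python bindings, shown here as an optional alternative to the subprocess approach in main.py (install with pip install llama-cpp-python; the model path matches the file downloaded in Part 4):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # Q4 quantized model
    n_ctx=2048,    # context window for prompt + answer
    n_threads=8,   # match your physical core count
)

def explain(prompt: str) -> str:
    # Capped tokens and low temperature, per the optimizations above
    out = llm(prompt, max_tokens=256, temperature=0.2)
    return out["choices"][0]["text"].strip()
```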
Memory Usage
| Component | Memory | Optimization |
|---|---|---|
| FAQ cache | 5-50 MB | Keep all (small) |
| Embeddings | 20-100 MB | Precompute once |
| Vector index | 100-500 MB | Use Chroma |
| LLM model | 4-5 GB | Q4 quantization |
| Total | ~5-6 GB | Fits in 8GB RAM |
Deployment & Scaling
Single Server (Your Laptop)
Capacity:
- 5-10 concurrent users
- 1000+ FAQ queries/sec
- 50-100 doc searches/sec
- 1-2 LLM calls/sec
Good for:
- MVP/prototype
- Small teams (< 20 users)
- Demo purposes
Scaling Up
When to scale:
- More than 20 concurrent users
- More than 100K documents
- Need < 50ms response time
- Geographic distribution
Options:
- Vertical: Bigger server (32-64 GB RAM)
- Horizontal: Multiple API servers + shared DB
- Migrate to FAISS: If vector search is bottleneck
- Add GPU: If LLM calls are bottleneck
Next Steps & Roadmap
Week 1: Foundation
- Read this guide
- Set up environment
- Install llama.cpp
- Download model
- Test inference
Week 2: Database
- Create SQLite schema
- Write 20-30 core FAQs
- Set up Chroma
- Test FAQ matching
- Test document search
Week 3: Integration
- Build FastAPI app
- Implement 3-layer logic
- Add "Did you mean?"
- Test end-to-end
- Basic UI
Week 4: Polish
- Add logging
- Error handling
- Performance tuning
- User testing
- Documentation
Future Enhancements
- Admin UI: Manage FAQs without code
- Analytics: Track which FAQs are used
- Auto-promote: Move common doc queries to FAQ
- Multi-language: Support other languages
- Voice: Add speech-to-text
- Mobile app: iOS/Android clients
Final Checklist
Before Launch
Technical:
- All dependencies installed
- Model working (test inference)
- Database populated (20+ FAQs)
- Documents indexed
- API endpoints tested
Quality:
- FAQ coverage comprehensive
- Confidence thresholds tuned
- "Did you mean?" working
- LLM answers grounded
- No hallucination
Performance:
- FAQ queries < 5ms
- Doc queries < 150ms
- LLM responses < 5s
- Memory usage acceptable
User Experience:
- Clear error messages
- Sources cited
- Mobile-friendly
- Accessible
Key Takeaways
The Five Principles
- FAQ First: 60-80% should hit cache
- Clarify Before Guessing: Use "Did you mean?"
- Context Over Creativity: LLM explains, doesn't invent
- Cache Everything: Speed comes from avoiding work
- Start Simple: Chroma → FAISS if needed
Success Metrics
- FAQ hit rate: > 60%
- "Did you mean?" resolution: > 15%
- LLM usage: < 15%
- Average latency: < 100ms
- User satisfaction: > 85%
What You've Learned
✅ How to build a three-tier intelligent system
✅ When to use FAISS vs Chroma (and why Chroma for beginners)
✅ How to organize knowledge for fast retrieval
✅ How to minimize expensive AI calls
✅ How to deploy on modest hardware
✅ How to scale when needed
Additional Resources
Documentation
Communities
- r/LocalLLaMA (Reddit)
- LlamaIndex Discord
- Chroma Discord
- FastAPI Discord
Next Learning
- Prompt engineering techniques
- Advanced RAG patterns
- Vector database optimization
- Production deployment best practices
Conclusion
You now have everything you need to build a professional FAQ chatbot that:
- Runs on your laptop
- Answers in milliseconds
- Never hallucinates
- Scales with your needs
The three-tier architecture (FAQ → Document → LLM) combined with Chroma for simplicity gives you the best balance of:
- ⚡ Speed (microseconds to seconds)
- 🎯 Accuracy (grounded in your docs)
- 💰 Cost (no cloud bills)
- Maintainability (simple architecture)
- Scalability (grow as needed)
Remember: Start with Chroma, focus on great FAQs, and let the system guide users to answers quickly.
Good luck building!
Built with ❤️ for developers who want to create intelligent systems on realistic hardware