The Complete FAQ Chatbot Builder's Guide
Build a Fast, Accurate, Hardware-Efficient FAQ System Like HDFC's Ask Eva or ITR Filing Assistant
A beginner-friendly, end-to-end guide with everything you need to know
Part 1: Understanding the System
Introduction
What You're Building
A professional FAQ chatbot that:
- ✅ Runs on your laptop (no expensive cloud costs)
- ✅ Answers in milliseconds (not seconds)
- ✅ Never hallucinates (grounded in your documents)
- ✅ Handles 5-10 concurrent users easily
- ✅ Works completely offline (after setup)
Think of systems like:
- HDFC Bank's "Ask Eva" - answers banking questions instantly
- India's ITR Filing Assistant - guides tax filing with accuracy
- Company knowledge bases - internal FAQ systems
What You'll Learn
By the end of this guide, you'll know:
- How to structure knowledge for instant retrieval
- How to build a three-tier intelligent system
- When to use FAISS vs Chroma (and why it matters)
- How to minimize expensive AI calls
- How to deploy on modest hardware
Key Philosophy
Don't ask the AI first. Ask your own knowledge first.
This gives you:
- ⚡ Speed: Microseconds vs seconds
- 🎯 Accuracy: Deterministic, not probabilistic
- 💰 Cost: Minimal hardware, no cloud bills
- Control: You decide what it knows
- Predictability: Same question = same answer
Time Investment
- Reading this guide: 1-2 hours
- Basic implementation: 2-3 days
- Production-ready system: 1-2 weeks
Let's begin!
Core Concepts: RAG Explained Simply
What is RAG?
RAG = Retrieval-Augmented Generation
In plain English:
- Retrieval: Find relevant information first
- Augmented: Add that information to your prompt
- Generation: AI explains using only that information
Why Not Just Use ChatGPT Directly?
| Approach | Problem |
|---|---|
| Direct AI | May invent facts ("hallucinate") |
| Direct AI | Expensive (API costs add up) |
| Direct AI | Slow (5-10 seconds per query) |
| Direct AI | Unpredictable (different each time) |
| Our Approach | Benefit |
|---|---|
| Find exact FAQ | Instant (< 5ms) |
| Search docs | Accurate (from your sources) |
| AI explains last | Only when needed |
| Grounded | Cannot invent facts |
The Three-Tier Strategy
User asks: "How do I reset my password?"

Tier 1 (FAQ Bank):
   ↓
Found exact match → Return answer (2ms)
DONE! 60-80% of queries stop here.

Tier 2 (If unsure):
→ Found similar questions
→ "Did you mean: How to reset password?"
→ User picks → Return answer (10ms)
DONE! 15-25% resolve here.

Tier 3 (If still not found):
→ Search documents for "password reset"
→ Find relevant sections
→ AI explains using those sections (2-5 seconds)
DONE! Remaining 5-15% of queries.

Result: 95% of queries answer in under 50ms!
Visual Flow
┌────────────────────┐
│   User Question    │
└────────────────────┘
          ↓
┌──────────────────────┐
│  Normalize & Clean   │
└──────────────────────┘
          ↓
┌─────────────────────────────┐
│ LAYER 1: FAQ BANK           │ ← 60-80% stop here
│ Check: confidence ≥ 0.85?   │
└─────────────────────────────┘
          ↓ No
┌─────────────────────────────┐
│ "Did You Mean?"             │ ← 15-25% resolve here
│ Show 3-5 suggestions        │
└─────────────────────────────┘
          ↓ None matched
┌─────────────────────────────┐
│ LAYER 2: Vector Search      │ ← Find context
│ Retrieve relevant docs      │
└─────────────────────────────┘
          ↓
┌─────────────────────────────┐
│ LAYER 3: LLM                │ ← Explain using context
│ Generate answer             │
└─────────────────────────────┘
          ↓
┌──────────┐
│  Answer  │
└──────────┘
System Architecture Overview
The Complete Picture
┌───────────────────────────────────────────────────┐
│                  USER QUESTION                    │
└───────────────────────────────────────────────────┘
                        ↓
┌───────────────────────────────────────────────────┐
│  LAYER 1: FAQ BANK (SQLite + In-Memory)           │
│  • Technology: SQLite database                    │
│  • Storage: In-memory cache + disk                │
│  • Speed: < 5ms                                   │
│  • Coverage: 60-80% of all queries                │
│  • Purpose: Known, stable answers                 │
└───────────────────────────────────────────────────┘
                        ↓ (if confidence < 0.85)
┌───────────────────────────────────────────────────┐
│  "DID YOU MEAN?" CLARIFICATION LAYER              │
│  • Technology: Same FAQ database                  │
│  • Strategy: Show 3-5 best matches                │
│  • Speed: < 10ms                                  │
│  • Coverage: 15-25% resolve here                  │
│  • Purpose: User self-correction                  │
└───────────────────────────────────────────────────┘
                        ↓ (if no match selected)
┌───────────────────────────────────────────────────┐
│  LAYER 2: DOCUMENT SEARCH (Vector Database)       │
│  • Technology: FAISS or Chroma                    │
│  • Storage: Markdown docs → embeddings            │
│  • Speed: 50-150ms                                │
│  • Purpose: Find relevant context                 │
│  • Process: Semantic similarity search            │
└───────────────────────────────────────────────────┘
                        ↓
┌───────────────────────────────────────────────────┐
│  LAYER 3: LLM INFERENCE (Last Resort)             │
│  • Technology: Mistral-7B-Instruct (GGUF)         │
│  • Engine: llama.cpp (CPU optimized)              │
│  • Speed: 1-5 seconds                             │
│  • Purpose: Explain using retrieved context       │
│  • Rule: Cannot invent facts                      │
└───────────────────────────────────────────────────┘
Component Breakdown
| Component | What It Does | Technology | Speed |
|---|---|---|---|
| FAQ Database | Stores known Q&A pairs | SQLite | Microseconds |
| Memory Cache | Keeps FAQs in RAM | Python dict | Nanoseconds |
| Vector Store | Enables semantic search | FAISS/Chroma | Milliseconds |
| Embeddings | Convert text to numbers | sentence-transformers | One-time cost |
| LLM | Explains complex answers | Mistral-7B via llama.cpp | Seconds |
Hardware Requirements
Minimum Setup (Works, but slow)
- CPU: Intel i5 (10th gen) or equivalent
- RAM: 8 GB
- Storage: 20 GB free (SSD recommended)
- OS: Ubuntu 22.04, Windows 11 (WSL2), or macOS
Experience:
- FAQ queries: Fast
- LLM responses: 5-10 seconds
- Can serve 2-3 concurrent users
Recommended Setup (Smooth experience)
- CPU: Intel i7-1255U or better (12th gen+)
- RAM: 16 GB
- Storage: 30 GB free on SSD
- OS: Ubuntu 22.04 or WSL2
Experience:
- FAQ queries: < 5ms
- LLM responses: 1-5 seconds
- Can serve 5-10 concurrent users comfortably
What You DON'T Need
- ❌ GPU: Everything runs on CPU
- ❌ Cloud services: Fully offline after setup
- ❌ Expensive hardware: Laptops work great
- ❌ Constant internet: Only needed for initial downloads
Storage Breakdown
| Item | Size |
|---|---|
| LLM model (quantized) | 4-5 GB |
| Vector index | 100-500 MB |
| SQLite database | 5-50 MB |
| Python environment | 2-3 GB |
| Documentation | 10-100 MB |
| Total | ~7-9 GB |
Part 2: The Three-Layer Intelligence System
Layer 1: FAQ Bank (The Fast Path)
Purpose
Store and retrieve known, stable answers instantly.
When to Use This Layer
✅ Use for:
- "How do I reset my password?"
- "What are your business hours?"
- "Where is my data stored?"
- "Is internet required?"
❌ Don't use for:
- Complex questions needing reasoning
- Questions requiring multiple sources
- Subjective answers
- Rapidly changing information
Technology Stack
SQLite Database (on disk)
        ↓
In-Memory Cache (Python dict)
        ↓
< 5ms response time
Example: How It Works
User asks: "How do I reset my password?"
System does:
- Normalize: "how do i reset my password"
- Check cache: cache["how do i reset my password"] - Found! Return answer
- Total time: 2ms
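In code, the fast path is little more than a dictionary lookup. A minimal, self-contained sketch (the real schema and cache-loading code appear in Part 3; the sample answer here is just the FAQ from above):

```python
import re

# Tiny stand-in for the in-memory FAQ cache loaded at startup
faq_cache = {
    "how do i reset my password":
        "Click 'Forgot Password' on the login screen, then check your email for a reset link."
}

def normalize(text: str) -> str:
    text = re.sub(r"[^\w\s]", "", text.lower())   # lowercase, strip punctuation
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

def lookup_faq(question: str):
    # Exact match on the normalized question; None means "fall through to Layer 2"
    return faq_cache.get(normalize(question))

print(lookup_faq("How do I Reset my Password?"))
```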
Response:
{
"type": "faq_answer",
"answer": "Click 'Forgot Password' on the login screen, then check your email for reset link.",
"confidence": 1.0,
"source": "FAQ #12",
"response_time_ms": 2
}
Performance Characteristics
| Metric | Value |
|---|---|
| Average latency | 2-5 ms |
| Throughput | 1000+ queries/sec |
| Accuracy | 100% (deterministic) |
| Coverage | 60-80% of user queries |
Layer 2: Document Search (The Knowledge Base)
Purpose
Find relevant information from your documentation when it's not in the FAQ bank.
Technology Choice: Vector Database
This is where you need to choose between FAISS and Chroma.
Vector Database: FAISS vs Chroma
One of the most important decisions in your architecture
Quick Decision Guide
┌───────────────────────────────────────┐
│  Are you a beginner?                  │
│  Building your first chatbot?         │
│  Need to ship fast?                   │
├───────────────────────────────────────┤
│               ↓                       │
│         Choose CHROMA ✅              │
└───────────────────────────────────────┘

┌───────────────────────────────────────┐
│  Have you built RAG systems before?   │
│  Need maximum performance?            │
│  Handling 100K+ documents?            │
├───────────────────────────────────────┤
│               ↓                       │
│         Choose FAISS ⚡               │
└───────────────────────────────────────┘
What Are Vector Databases?
Simple explanation:
When you search Google, you type words and it finds pages with those exact words. That's keyword search.
Vector databases do semantic search:
- "How to reset password" matches "Forgot my login credentials"
- "Install software" matches "Setup instructions"
- Understands meaning, not just words
How it works:
- Convert text to numbers (embeddings): "reset password" → [0.23, -0.45, 0.67, ...]
- Similar meanings = similar numbers
- Search by finding closest numbers
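A quick sketch of that idea using sentence-transformers (the same embedding model used later in this guide); the similarity between the two password phrasings comes out much higher than against an unrelated question:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

reset = embedder.encode("How to reset password")
forgot = embedder.encode("Forgot my login credentials")
hours = embedder.encode("What are your business hours?")

print(util.cos_sim(reset, forgot).item())  # high score: same meaning, different words
print(util.cos_sim(reset, hours).item())   # low score: unrelated meaning
```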
FAISS: The Speed Champion
Created by: Meta (Facebook AI Research)
Age: Since 2017 (battle-tested)
Focus: Maximum speed and efficiency
FAISS Strengths
| Feature | Rating | Why |
|---|---|---|
| Speed | ★★★★★ | Optimized C++, SIMD instructions |
| Memory | ★★★★★ | Very efficient, quantization support |
| Scale | ★★★★★ | Handles billions of vectors |
| Maturity | ★★★★★ | Production-proven at Meta scale |
FAISS Weaknesses
| Feature | Rating | Issue |
|---|---|---|
| Ease of use | ★★★ | Steeper learning curve |
| Metadata | ★★ | Manual tracking required |
| Setup | ★★★ | More complex installation |
When to Choose FAISS
✅ Choose FAISS if:
- You need maximum speed (< 1ms queries)
- Working with large scale (100K+ documents)
- Memory is limited (need quantization)
- Building for production (proven reliability)
- You have technical experience
FAISS Quick Example
import faiss
import numpy as np
# 1. Create index
dimension = 384 # embedding size
index = faiss.IndexFlatL2(dimension)
# 2. Add vectors
embeddings = get_embeddings(documents) # shape: (N, 384)
index.add(embeddings.astype('float32'))
# 3. Search
query_embedding = get_embedding("reset password")
distances, indices = index.search(query_embedding, k=5)
# 4. Save (manual persistence)
faiss.write_index(index, "vector.index")
# 5. Load
index = faiss.read_index("vector.index")
Metadata handling (the tricky part):
# You need a separate structure
metadata_store = {
0: {"text": "Doc 1", "category": "auth"},
1: {"text": "Doc 2", "category": "setup"},
# ... manual tracking
}
# After search, lookup metadata
for idx in indices[0]:
print(metadata_store[idx])
Chroma: The Developer's Choice
Created by: Chroma team
Age: Since 2022 (modern, actively developed)
Focus: Developer experience and ease of use
Chroma Strengths
| Feature | Rating | Why |
|---|---|---|
| Ease of use | ★★★★★ | Pythonic, intuitive API |
| Metadata | ★★★★★ | Built-in filtering and queries |
| Setup | ★★★★★ | pip install chromadb and go |
| Persistence | ★★★★★ | Automatic, no manual save/load |
| Features | ★★★★★ | Collections, updates, deletes |
Chroma Weaknesses
| Feature | Rating | Issue |
|---|---|---|
| Speed | ★★★★ | 3-5x slower than FAISS |
| Scale | ★★★★ | Best for < 1M vectors |
| Maturity | ★★★ | Newer (less production history) |
When to Choose Chroma
✅ Choose Chroma if:
- You're a beginner (easier learning curve)
- Building an MVP/prototype (ship faster)
- Need metadata filtering (category, date, priority)
- Want simpler code (less boilerplate)
- Working with < 100K documents (performance is fine)
Chroma Quick Example
import chromadb
# 1. Setup (automatic persistence!)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("faqs")
# 2. Add documents WITH metadata (built-in!)
collection.add(
documents=["How to reset password", "Installation guide"],
metadatas=[
{"category": "auth", "priority": "high"},
{"category": "setup", "priority": "medium"}
],
ids=["doc1", "doc2"]
)
# 3. Query with filtering (this is the magic!)
results = collection.query(
query_texts=["forgot password"],
n_results=5,
where={"category": "auth"} # Built-in metadata filter!
)
# 4. Update easily
collection.update(
ids=["doc1"],
documents=["How to reset your password (updated)"]
)
# 5. Delete easily
collection.delete(ids=["doc2"])
# NO MANUAL SAVE NEEDED - it's automatic!
Performance Comparison
Speed Benchmarks (10,000 documents, 384 dimensions)
| Operation | FAISS | Chroma | Winner |
|---|---|---|---|
| Single query | 0.5-1 ms | 2-5 ms | FAISS (5x faster) |
| Batch 100 queries | ~100 ms | ~500 ms | FAISS |
| Index creation | ~50 ms | ~200 ms | FAISS |
| Add 1000 docs | ~10 ms | ~100 ms | FAISS |
Memory Usage (100,000 documents)
| Component | FAISS | Chroma |
|---|---|---|
| Index only | ~150 MB | ~180 MB |
| With metadata | +external DB | ~200 MB (built-in) |
| Total | ~200 MB | ~250 MB |
Code Complexity
| Task | FAISS | Chroma | Winner |
|---|---|---|---|
| Basic setup | 20 lines | 5 lines | Chroma (4x simpler) |
| With metadata | 50+ lines | 10 lines | Chroma |
| Updates/deletes | Complex | 2 lines | Chroma |
Scenario-Based Decisions
Scenario 1: First-Time Builder, Small FAQ System
Requirements:
- 500-5,000 FAQs
- Need to ship in 1-2 weeks
- Team: 1-2 developers (beginners)
Recommendation: Chroma 🎯
Why?:
- 5x faster development
- Performance is plenty (2-5ms is fine)
- Metadata filtering is easy
- Less code to debug
Expected build time: 3-5 days
Scenario 2: Production System, Medium Scale
Requirements:
- 10,000-50,000 FAQs
- Need category filtering
- Expect 100+ users/day
- Budget: 1 month build time
Recommendation: Chroma 🎯
Why?:
- Built-in metadata perfect for categories
- Performance still good at this scale
- Easier maintenance
- Team velocity matters
Migration note: Can move to FAISS later if needed
Scenario 3: Large Scale, Performance-Critical
Requirements:
- 100,000+ documents
- Need < 50ms total response time
- High traffic (1000+ queries/sec)
- Team: Experienced developers
Recommendation: FAISS ⚡
Why?:
- 5x speed advantage critical at scale
- Better memory efficiency
- Proven at billion-vector scale
- Worth the complexity investment
Build time: 2-4 weeks
Scenario 4: Complex Filtering Needs
Requirements:
- Filter by multiple attributes (category AND date AND priority)
- Frequent updates to documents
- Need audit trails
Recommendation: Chroma 🎯
Why?:
# Chroma: Simple
results = collection.query(
query_texts=["password"],
where={
"$and": [
{"category": "auth"},
{"priority": {"$gte": 3}},
{"date": {"$gte": "2024-01-01"}}
]
}
)
# FAISS: Complex
# 1. Search FAISS
# 2. Load metadata from separate DB
# 3. Filter results manually
# 4. Sort and return
# = 30+ lines of code
Our Recommendation for Your FAQ Chatbot
Phase 1 (Weeks 1-4): Start with Chroma
Reasons:
- ✅ Faster to build: 3-5 days vs 1-2 weeks
- ✅ Easier to debug: Simpler code
- ✅ Built-in features: Metadata, persistence, updates
- ✅ Good enough: 2-5ms is fast for a FAQ chatbot
- ✅ Lower risk: Less to go wrong
Phase 2 (If needed): Migrate to FAISS
When to migrate:
- ⚠️ Hit 100K+ documents
- ⚠️ Need < 1ms query time
- ⚠️ Running out of memory
- ⚠️ Serving 500+ queries/second
Migration effort: 1-2 days (straightforward)
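A rough sketch of what that migration could look like: export documents, embeddings, and metadata from Chroma, build a FAISS index, and keep the metadata in a plain dict keyed by row position. This assumes the collection was created with Chroma's default embedding function (so the stored embeddings can be reused); names and paths are illustrative.

```python
import faiss
import numpy as np
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("faqs")

# Pull everything out of Chroma, including the stored embeddings
data = collection.get(include=["documents", "embeddings", "metadatas"])
embeddings = np.array(data["embeddings"], dtype="float32")

# Build and persist the FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "vector.index")

# FAISS has no metadata store, so keep it alongside the index
metadata_store = {
    i: {"id": data["ids"][i], "text": data["documents"][i], **(data["metadatas"][i] or {})}
    for i in range(len(data["ids"]))
}
```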
Installation & Setup
Chroma (Recommended for beginners)
# Installation (super simple!)
pip install chromadb
# Optional: Add server mode
pip install chromadb[server]
First code (ready in 5 minutes):
import chromadb
from sentence_transformers import SentenceTransformer
# 1. Setup
client = chromadb.PersistentClient(path="./faq_db")
collection = client.get_or_create_collection("faqs")
# 2. Add your FAQs
faqs = [
"How do I reset my password?",
"What are your business hours?",
"Where is my data stored?"
]
metadata = [
{"category": "auth", "priority": 5},
{"category": "info", "priority": 3},
{"category": "data", "priority": 4}
]
collection.add(
documents=faqs,
metadatas=metadata,
ids=[f"faq_{i}" for i in range(len(faqs))]
)
# 3. Query
results = collection.query(
query_texts=["forgot password"],
n_results=3,
where={"category": "auth"}
)
print(results)
That's it! You're up and running.
FAISS (For experienced developers)
# Installation
pip install faiss-cpu # or faiss-gpu if you have GPU
# From source (for latest features)
git clone https://github.com/facebookresearch/faiss.git
cd faiss
cmake -B build .
make -C build
First code (takes longer to set up):
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
# 1. Setup embedder
embedder = SentenceTransformer('all-MiniLM-L6-v2')
# 2. Create index
dimension = 384
index = faiss.IndexFlatL2(dimension)
# 3. Prepare data
faqs = ["How do I reset password?", "Business hours?"]
embeddings = embedder.encode(faqs).astype('float32')
index.add(embeddings)
# 4. Manual metadata tracking
metadata = {
0: {"text": faqs[0], "category": "auth"},
1: {"text": faqs[1], "category": "info"}
}
# 5. Query
query = "forgot password"
query_emb = embedder.encode([query]).astype('float32')
distances, indices = index.search(query_emb, k=3)
# 6. Retrieve with metadata
for idx in indices[0]:
print(metadata[idx])
# 7. Save manually
faiss.write_index(index, "faiss.index")
import json
with open("metadata.json", "w") as f:
json.dump(metadata, f)
Notice: More code, more manual work.
Summary: FAISS vs Chroma
| Aspect | FAISS | Chroma | For Beginners |
|---|---|---|---|
| Speed | ⚡⚡⚡⚡⚡ | ⚡⚡⚡⚡ | Chroma (fast enough) |
| Ease | ★★★ | ★★★★★ | Chroma (much easier) |
| Setup time | 2-3 hours | 30 minutes | Chroma |
| Code complexity | High | Low | Chroma |
| Metadata | Manual | Built-in | Chroma |
| Scale limit | Billions | Millions | FAISS (future-proof) |
| Best for | Production scale | MVP/Learning | Depends |
Our verdict for this guide: Start with Chroma ✅
You can always migrate to FAISS later if you need maximum performance.
Layer 3: LLM Inference (The Explainer)
Purpose
When the question isn't in your FAQ bank and requires explanation or synthesis, use the LLM.
Critical Rules
- LLM cannot search: It only explains what you give it
- Must use context: Only information from Layer 2
- Cannot invent: Strictly grounded in documents
- Last resort: Most expensive and slowest
Technology: Mistral-7B-Instruct
Why Mistral?
- ✅ Best quality-per-parameter for 7B models
- ✅ Excellent instruction following
- ✅ Runs well on CPU
- ✅ Permissive license
- ✅ Quantized versions available (~4 GB)
Via llama.cpp:
- CPU-optimized inference
- GGUF quantization support
- Fast and efficient
- No GPU needed
How It Works
User asks: "Why does login fail after password reset?"
        ↓
FAQ Bank: No exact match (confidence: 0.42)
        ↓
Document Search: Finds 3 relevant sections:
  1. "Password reset process"
  2. "Common login errors"
  3. "Cache clearing instructions"
        ↓
Build prompt:
"""
Answer using ONLY this context:

CONTEXT:
[Section 1: Password reset process...]
[Section 2: Common login errors...]
[Section 3: Cache clearing instructions...]

QUESTION: Why does login fail after password reset?

RULES:
- Use only the context above
- If not answered in context, say "I don't know"
- Be concise and accurate

ANSWER:
"""
        ↓
LLM generates: "Login may fail after password reset if
your browser cache contains old credentials. Try
clearing your browser cache and cookies, then log
in again with your new password."
        ↓
Return answer with sources cited
Performance
| Metric | Value |
|---|---|
| Latency | 1-5 seconds |
| Throughput | 1-2 requests/sec |
| Accuracy | High (when grounded) |
| Usage | 5-15% of queries |
The "Did You Mean?" Feature
Why This Is Critical
Problem: User asks "forgot my login info"
System thinks: Not confident enough (score: 0.73)
Without this feature:
- ❌ System searches documents (slower)
- ❌ Maybe invokes LLM (expensive)
- ❌ User frustrated (should've been simple)
With this feature:
- ✅ Show 3 similar FAQs
- ✅ User picks correct one
- ✅ Instant answer (10ms)
- ✅ Better experience
When to Trigger
Confidence thresholds:
- ≥ 0.85: Answer directly (high confidence)
- 0.65-0.84: Show "Did you mean?" (medium confidence)
- < 0.65: Escalate to document search (low confidence)
Example Flow
User: "forgot my login info"
        ↓
System calculates confidence: 0.73
        ↓
Trigger: "Did you mean?"
        ↓
Show user:
  1. How do I reset my password?
  2. What happens if login fails?
  3. How do I retrieve my username?
  4. None of these - search documentation
        ↓
User clicks: 1
        ↓
Return FAQ #12 answer instantly
        ↓
Total time: 8ms (instead of 2000ms with LLM!)
Implementation
def answer_question(user_question):
normalized = normalize(user_question)
matches = score_all_faqs(normalized)
best = matches[0]
# High confidence: answer directly
if best.score >= 0.85:
return {
"type": "direct_answer",
"answer": best.answer
}
# Medium confidence: ask for clarification
if best.score >= 0.65:
return {
"type": "did_you_mean",
"message": "I want to be sure I understand correctly.",
"suggestions": [
{"id": m.id, "question": m.question}
for m in matches[:3]
]
}
# Low confidence: search documents
return search_documents(user_question)
Impact
| Metric | Improvement |
|---|---|
| LLM calls | -40% to -70% |
| Average latency | -60% |
| User satisfaction | +35% |
| Hardware load | -50% |
Part 3: Technical Design
Database Schema & Indexing Strategy
FAQ Table (SQLite)
CREATE TABLE faq (
id INTEGER PRIMARY KEY AUTOINCREMENT,
question TEXT NOT NULL,
normalized_question TEXT NOT NULL, -- lowercase, no punctuation
answer TEXT NOT NULL,
category TEXT, -- 'auth', 'setup', etc.
keywords TEXT, -- 'reset,password,login'
priority INTEGER DEFAULT 1, -- 1-5 (for tie-breaking)
last_updated DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Performance indexes (CRITICAL!)
CREATE INDEX idx_faq_normalized ON faq(normalized_question);
CREATE INDEX idx_faq_category ON faq(category);
CREATE INDEX idx_faq_keywords ON faq(keywords);
Why Each Column Matters
| Column | Purpose | Example |
|---|---|---|
| normalized_question | Fast exact matching | "how do i reset my password" |
| keywords | Quick filtering | "reset,password,credentials" |
| category | Group related FAQs | "authentication" |
| priority | Resolve ties | 5 = critical, 1 = low |
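A small sketch of how the exact-match path uses that index, querying SQLite directly (in the running system the same lookup is served from the in-memory cache; file name and data follow the scripts in Part 4):

```python
import sqlite3

def lookup_exact(normalized_question: str):
    conn = sqlite3.connect("faq.db")
    conn.row_factory = sqlite3.Row
    # Equality on normalized_question is served by idx_faq_normalized
    row = conn.execute(
        "SELECT answer, category, priority FROM faq WHERE normalized_question = ?",
        (normalized_question,),
    ).fetchone()
    conn.close()
    return dict(row) if row else None

print(lookup_exact("how do i reset my password"))
```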
Normalization Function
import re
def normalize(text: str) -> str:
"""Prepare text for matching."""
text = text.lower()
text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
text = re.sub(r'\s+', ' ', text) # Collapse spaces
return text.strip()
# Examples:
normalize("How do I Reset my Password?")
# → "how do i reset my password"
normalize("What's your business hours?!")
# → "whats your business hours"
Confidence Scoring System
The Formula
final_score = (
    0.5 × exact_match_score +
    0.3 × keyword_overlap_score +
    0.2 × embedding_similarity_score
)
Component Breakdown
1. Exact Match (50% weight)
def exact_match_score(user_q, faq_q):
if normalize(user_q) == normalize(faq_q):
return 1.0
else:
return 0.0
Why 50%? Exact matches should strongly favor direct answers.
2. Keyword Overlap (30% weight)
def keyword_overlap_score(user_q, faq_keywords):
user_words = set(user_q.lower().split())
faq_words = set(faq_keywords.split(','))
common = user_words & faq_words
return len(common) / len(user_words) if user_words else 0.0
Example:
- User: "reset password login"
- FAQ keywords: "reset,password,credentials"
- Common: {reset, password}
- Score: 2/3 = 0.67
3. Embedding Similarity (20% weight)
from sentence_transformers import SentenceTransformer, util
embedder = SentenceTransformer('all-MiniLM-L6-v2')
def embedding_similarity_score(user_q, faq_q):
emb1 = embedder.encode(user_q)
emb2 = embedder.encode(faq_q)
similarity = util.cos_sim(emb1, emb2).item()
return similarity
Why 20%? Catches semantic similarity but less reliable than exact/keyword.
Complete Example
User question: "how to reset password"
FAQ question: "how do i reset my password"

Calculations:
exact_match = 0.0 (not identical)
keyword_overlap = 1.0 (all words match)
embedding_sim = 0.92 (very similar)

Final score:
= (0.5 × 0.0) + (0.3 × 1.0) + (0.2 × 0.92)
= 0.0 + 0.3 + 0.184
= 0.484

Decision: 0.484 < 0.65 → Escalate to document search
Caching Policies
What to Cache
| Item | Storage | Lifetime | Why Cache? |
|---|---|---|---|
| All FAQs | In-memory dict | App lifetime | 1000x faster than disk |
| FAQ embeddings | NumPy array | App lifetime | Avoid recomputation |
| Recent queries | LRU cache (256) | 1 hour | Repeat queries common |
| Document chunks | In-memory + disk | Until update | Balance speed/memory |
Cache Architecture
Application Startup
        ↓
┌───────────────────────────┐
│  Load SQLite → Memory     │  (5-50 MB)
└───────────────────────────┘
        ↓
┌───────────────────────────┐
│  Precompute Embeddings    │  (20-100 MB)
└───────────────────────────┘
        ↓
┌───────────────────────────┐
│  Load Vector Index        │  (100-500 MB)
└───────────────────────────┘
        ↓
Ready to serve queries in < 5ms!
Implementation
from functools import lru_cache
import sqlite3
# Global cache
faq_cache = {}
faq_embeddings = {}
def load_faq_cache():
"""Load all FAQs into memory at startup."""
    conn = sqlite3.connect('faq.db')
    conn.row_factory = sqlite3.Row  # so rows support row['column'] access
    cursor = conn.execute("SELECT * FROM faq")
for row in cursor:
key = row['normalized_question']
faq_cache[key] = {
'id': row['id'],
'question': row['question'],
'answer': row['answer'],
'category': row['category'],
'priority': row['priority']
}
print(f"Loaded {len(faq_cache)} FAQs into memory")
# LRU cache for repeat queries
@lru_cache(maxsize=256)
def get_answer(question: str):
"""Cache recent answers."""
normalized = normalize(question)
return faq_cache.get(normalized)
Cache Invalidation
| Event | Action |
|---|---|
| FAQ updated | Clear FAQ cache, reload |
| Document updated | Rebuild vector index |
| App restart | Reload all caches |
| Memory pressure | Evict LRU items |
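A minimal sketch of the "FAQ updated" row, reusing the faq_cache, load_faq_cache() and get_answer() helpers defined in the caching snippet above:

```python
def invalidate_faq_cache():
    """Call after inserting or editing rows in the faq table."""
    faq_cache.clear()         # drop stale in-memory entries
    load_faq_cache()          # reload everything from SQLite
    get_answer.cache_clear()  # flush the LRU of recently answered questions
```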
⚡ Concurrency & Throughput Design
The Challenge
Different layers have vastly different performance characteristics:
| Layer | Latency | CPU Usage | Concurrency |
|---|---|---|---|
| FAQ lookup | < 5ms | Minimal | Unlimited |
| Vector search | 50-150ms | Moderate | 10-20 |
| LLM inference | 1-5 sec | Heavy | 1-2 max |
Thread Pool Architecture
FastAPI Server (Uvicorn)
│
├── FAQ Worker Pool
│   ├── ThreadPoolExecutor(10 workers)
│   └── In-memory lookups (ultra-fast)
│
├── Vector Search Pool
│   ├── ThreadPoolExecutor(5 workers)
│   └── FAISS/Chroma queries (medium)
│
└── LLM Queue (CRITICAL!)
    ├── Semaphore(2) → MAX 2 CONCURRENT
    └── llama.cpp inference (slow, CPU-heavy)
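A sketch of how the vector-search pool could be wired: Chroma's query() is a blocking call, so it is handed to a dedicated executor instead of stalling the FastAPI event loop. The collection path and name mirror the main.py listing later in this guide.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import chromadb

chroma_collection = chromadb.PersistentClient(path="./data/chroma_db") \
    .get_or_create_collection("faq_documents")
vector_pool = ThreadPoolExecutor(max_workers=5)  # matches the diagram above

async def search_documents(question: str, n_results: int = 5):
    loop = asyncio.get_running_loop()
    # Run the blocking Chroma query in the pool so the event loop stays free
    return await loop.run_in_executor(
        vector_pool,
        lambda: chroma_collection.query(query_texts=[question], n_results=n_results),
    )
```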
LLM Gatekeeping (Most Important!)
Problem: LLM uses 100% CPU for 1-5 seconds.
Solution: Strict concurrency limit.
import asyncio
from asyncio import Semaphore
# Allow max 2 concurrent LLM calls
llm_semaphore = Semaphore(2)
llm_queue_size = 0
LLM_QUEUE_MAX = 5
async def call_llm(prompt: str):
global llm_queue_size
# Check queue size
if llm_queue_size >= LLM_QUEUE_MAX:
return {
"error": "System busy",
"message": "Too many requests. Try again in 10 seconds.",
"retry_after": 10
}
# Gate concurrency
llm_queue_size += 1
async with llm_semaphore:
try:
result = await run_llama_cpp(prompt)
return result
finally:
llm_queue_size -= 1
Expected Throughput
| Query Type | Latency | Throughput | Concurrency |
|---|---|---|---|
| FAQ hit | 2-5ms | 1000+ req/sec | Unlimited |
| Did you mean | 5-10ms | 500+ req/sec | High |
| Doc search | 50-150ms | 50-100 req/sec | 10-20 |
| LLM call | 1-5 sec | 1-2 req/sec | 2 max |
Overall system: Serves 5-10 concurrent users comfortably.
Part 4: Implementation Guide
Step-by-Step Setup
Phase 1: Environment Setup (30 minutes)
1.1 Install System Dependencies
Ubuntu/WSL2:
sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential cmake python3 python3-venv python3-pip
macOS:
brew install python3 cmake git
1.2 Create Python Environment
cd ~
python3 -m venv faq-chatbot-env
source faq-chatbot-env/bin/activate # Linux/Mac
# faq-chatbot-env\Scripts\activate # Windows
pip install --upgrade pip
Phase 2: Install llama.cpp (30 minutes)
# Clone repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Build
make
# Test
./main --help
Phase 3: Download Model (30 minutes)
Recommended: Mistral-7B-Instruct (GGUF Q4)
mkdir -p ~/models/mistral-7b
# Download from Hugging Face (TheBloke)
# Visit: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
# Download: mistral-7b-instruct-v0.2.Q4_K_M.gguf (~4.4 GB)
# Move to models directory
mv mistral-7b-instruct-v0.2.Q4_K_M.gguf ~/models/mistral-7b/
Test the model:
cd ~/llama.cpp
./main -m ~/models/mistral-7b/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
-p "Hello, how are you?" \
-n 50 \
-t 8
Phase 4: Install Python Packages (15 minutes)
source ~/faq-chatbot-env/bin/activate
# Core packages
pip install \
chromadb \
sentence-transformers \
fastapi \
uvicorn \
python-multipart
# Optional: FAISS (if you decide to use it instead)
# pip install faiss-cpu
Phase 5: Create Database (30 minutes)
5.1 Create Schema
# create_database.py
import sqlite3
conn = sqlite3.connect('faq.db')
cursor = conn.cursor()
# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS faq (
id INTEGER PRIMARY KEY AUTOINCREMENT,
question TEXT NOT NULL,
normalized_question TEXT NOT NULL,
answer TEXT NOT NULL,
category TEXT,
keywords TEXT,
priority INTEGER DEFAULT 1,
last_updated DATETIME DEFAULT CURRENT_TIMESTAMP
)
''')
# Create indexes
cursor.execute('CREATE INDEX IF NOT EXISTS idx_faq_normalized ON faq(normalized_question)')
cursor.execute('CREATE INDEX IF NOT EXISTS idx_faq_category ON faq(category)')
conn.commit()
print("Database created successfully!")
Run it:
python create_database.py
5.2 Add Sample FAQs
# populate_faqs.py
import sqlite3
import re
def normalize(text):
text = text.lower()
text = re.sub(r'[^\w\s]', '', text)
text = re.sub(r'\s+', ' ', text)
return text.strip()
faqs = [
{
"question": "How do I reset my password?",
"answer": "Click 'Forgot Password' on the login screen, then check your email for a reset link.",
"category": "authentication",
"keywords": "reset,password,forgot,login,credentials",
"priority": 5
},
{
"question": "What are your business hours?",
"answer": "We are open Monday-Friday, 9 AM to 6 PM (EST).",
"category": "information",
"keywords": "hours,open,time,schedule",
"priority": 3
},
{
"question": "Where is my data stored?",
"answer": "All data is stored locally on your device. We do not use cloud storage.",
"category": "data",
"keywords": "data,storage,privacy,local,cloud",
"priority": 4
},
]
conn = sqlite3.connect('faq.db')
cursor = conn.cursor()
for faq in faqs:
cursor.execute('''
INSERT INTO faq (question, normalized_question, answer, category, keywords, priority)
VALUES (?, ?, ?, ?, ?, ?)
''', (
faq['question'],
normalize(faq['question']),
faq['answer'],
faq['category'],
faq['keywords'],
faq['priority']
))
conn.commit()
print(f"Added {len(faqs)} FAQs to database!")
Run it:
python populate_faqs.py
Phase 6: Set Up Vector Database (30 minutes)
Using Chroma (Recommended)
# setup_vector_db.py
import chromadb
from sentence_transformers import SentenceTransformer
# Initialize
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
name="faq_documents",
metadata={"description": "FAQ chatbot documentation"}
)
# Sample documents (organize in docs/ folder as .md files)
documents = [
"To reset your password, navigate to the login page and click 'Forgot Password'. Enter your email address and you'll receive a reset link within 5 minutes.",
"Our customer service is available Monday through Friday from 9 AM to 6 PM Eastern Time. We respond to emails within 24 hours.",
"All user data is stored locally on your device using SQLite. We do not transmit your data to external servers or use cloud storage."
]
metadata = [
{"file": "auth.md", "section": "Password Reset", "category": "authentication"},
{"file": "info.md", "section": "Business Hours", "category": "information"},
{"file": "data.md", "section": "Data Storage", "category": "data"}
]
# Add to Chroma
collection.add(
documents=documents,
metadatas=metadata,
ids=[f"doc_{i}" for i in range(len(documents))]
)
print("Vector database set up successfully!")
print(f"Added {len(documents)} documents")
# Test query
results = collection.query(
query_texts=["How to reset password?"],
n_results=3
)
print("\nTest query results:")
print(results)
Run it:
python setup_vector_db.py
Document Organization Best Practices
Folder Structure
faq_docs/
├── 01_overview.md            # System introduction
├── 02_getting_started.md     # Quick start guide
├── 03_authentication.md      # Login, password, security
├── 04_installation.md        # Setup instructions
├── 05_basic_usage.md         # Common operations
├── 06_advanced_features.md   # Power user guide
├── 07_troubleshooting.md     # Common issues
├── 08_api_reference.md       # Technical details
├── 90_faq_core.md            # Dedicated FAQ file
└── 99_glossary.md            # Definitions
Writing Good Markdown
❌ Bad (Hard to Retrieve)
## Installation
Here's everything about installation including prerequisites,
step-by-step process, common errors, fixes, and advanced options...
[500 lines of mixed content]
✅ Good (Easy to Retrieve)
## Installation Prerequisites
Before installing, ensure you have:
- Python 3.10 or higher
- 20 GB free disk space
- Git installed

## Installation Steps
1. Clone the repository:
   ```bash
   git clone https://github.com/example/repo.git
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Run setup:
   ```bash
   python setup.py
   ```
Common Installation Errors
Error: Module Not Found
Symptom: ModuleNotFoundError: No module named 'xyz'
Cause: Missing dependency
Fix: Run pip install -r requirements.txt
Error: Permission Denied
Symptom: Permission denied when writing to directory
Cause: Insufficient permissions
Fix: Run with sudo or adjust directory permissions
### Key Principles
1. **One heading = one idea**
2. **Use hierarchical structure** (##, ###, ####)
3. **Keep sections under 200-500 words**
4. **Be specific and concrete**
5. **Include examples**
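Following these principles, each level-2 section becomes one retrievable chunk. A sketch of a loader that splits every file in faq_docs/ on "## " headings and indexes the chunks in Chroma (the one-heading-per-chunk rule is a simplification; paths and names are illustrative):

```python
import re
from pathlib import Path

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("faq_documents")

docs, metas, ids = [], [], []
for path in sorted(Path("faq_docs").glob("*.md")):
    text = path.read_text(encoding="utf-8")
    # Split on level-2 headings; each section becomes one chunk
    for i, section in enumerate(re.split(r"\n(?=## )", text)):
        section = section.strip()
        if not section:
            continue
        heading = section.splitlines()[0].lstrip("# ").strip()
        docs.append(section)
        metas.append({"file": path.name, "section": heading})
        ids.append(f"{path.stem}_{i}")

collection.add(documents=docs, metadatas=metas, ids=ids)
print(f"Indexed {len(docs)} chunks")
```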
---
## Production-Grade Folder Structure

faq_chatbot/
├── app/
│   ├── __init__.py
│   ├── main.py                  # Application entry
│   │
│   ├── api/                     # API layer
│   │   ├── __init__.py
│   │   ├── routes.py            # FastAPI endpoints
│   │   └── models.py            # Pydantic models
│   │
│   ├── core/                    # Business logic
│   │   ├── __init__.py
│   │   ├── faq_engine.py        # FAQ matching
│   │   ├── confidence.py        # Scoring
│   │   ├── cache.py             # Caching
│   │   └── llm_gate.py          # LLM concurrency
│   │
│   ├── retrieval/               # Document layer
│   │   ├── __init__.py
│   │   ├── embeddings.py
│   │   ├── vector_store.py      # Chroma interface
│   │   └── doc_loader.py
│   │
│   ├── prompts/                 # Prompt templates
│   │   └── system_prompt.txt
│   │
│   └── utils/
│       ├── __init__.py
│       ├── text.py              # Normalization
│       └── logger.py
│
├── data/
│   ├── faq.db                   # SQLite
│   ├── docs/                    # Markdown files
│   └── chroma_db/               # Vector store
│
├── models/
│   └── mistral-7b/
│       └── model.gguf
│
├── scripts/
│   ├── create_database.py
│   ├── populate_faqs.py
│   └── setup_vector_db.py
│
├── tests/
│   ├── test_faq.py
│   ├── test_confidence.py
│   └── test_retrieval.py
│
├── requirements.txt
├── config.yaml
└── README.md
---
## Complete Code Implementation
### main.py (Application Entry)
```python
# app/main.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import sqlite3
import chromadb
from sentence_transformers import SentenceTransformer
import subprocess
import asyncio
from asyncio import Semaphore
app = FastAPI(title="FAQ Chatbot")
# Global state
faq_cache = {}
embedder = SentenceTransformer('all-MiniLM-L6-v2')
chroma_client = None
chroma_collection = None
llm_semaphore = Semaphore(2) # Max 2 concurrent LLM calls
# Models
class Query(BaseModel):
question: str
class Answer(BaseModel):
type: str # 'faq_answer', 'did_you_mean', 'llm_answer'
answer: Optional[str] = None
suggestions: Optional[List[dict]] = None
sources: Optional[List[str]] = None
confidence: Optional[float] = None
# Startup
@app.on_event("startup")
async def startup():
global faq_cache, chroma_client, chroma_collection
# Load FAQs into memory
print("Loading FAQ database...")
conn = sqlite3.connect('data/faq.db')
conn.row_factory = sqlite3.Row
cursor = conn.execute("SELECT * FROM faq")
for row in cursor:
faq_cache[row['normalized_question']] = dict(row)
print(f"Loaded {len(faq_cache)} FAQs into memory")
# Initialize Chroma
print("Loading vector database...")
chroma_client = chromadb.PersistentClient(path="./data/chroma_db")
chroma_collection = chroma_client.get_or_create_collection("faq_documents")
print("Vector database loaded")
# Helper functions
def normalize(text: str) -> str:
import re
text = text.lower()
text = re.sub(r'[^\w\s]', '', text)
text = re.sub(r'\s+', ' ', text)
return text.strip()
def score_faq(user_q: str, faq: dict) -> float:
# Exact match
if user_q == faq['normalized_question']:
exact_score = 1.0
else:
exact_score = 0.0
# Keyword overlap
user_words = set(user_q.split())
faq_words = set(faq['keywords'].split(','))
common = user_words & faq_words
keyword_score = len(common) / len(user_words) if user_words else 0.0
# Embedding similarity
user_emb = embedder.encode(user_q)
faq_emb = embedder.encode(faq['normalized_question'])
from sentence_transformers import util
embedding_score = util.cos_sim(user_emb, faq_emb).item()
# Final score
final = 0.5 * exact_score + 0.3 * keyword_score + 0.2 * embedding_score
return final
async def call_llm(prompt: str) -> str:
async with llm_semaphore:
result = subprocess.run(
[
"./llama.cpp/main",
"-m", "models/mistral-7b/model.gguf",
"-p", prompt,
"-n", "256",
"-t", "8",
"--temp", "0.2"
],
capture_output=True,
text=True,
timeout=30
)
return result.stdout.strip()
# Main endpoint
@app.post("/chat", response_model=Answer)
async def chat(query: Query):
user_q = query.question
normalized_q = normalize(user_q)
# Layer 1: FAQ Bank
scores = []
for faq in faq_cache.values():
score = score_faq(normalized_q, faq)
scores.append((score, faq))
scores.sort(reverse=True, key=lambda x: x[0])
best_score, best_faq = scores[0]
# High confidence: direct answer
if best_score >= 0.85:
return Answer(
type="faq_answer",
answer=best_faq['answer'],
confidence=best_score,
sources=[f"FAQ #{best_faq['id']}"]
)
# Medium confidence: "Did you mean?"
if best_score >= 0.65:
suggestions = [
{"id": faq['id'], "question": faq['question'], "score": score}
for score, faq in scores[:3]
]
return Answer(
type="did_you_mean",
suggestions=suggestions
)
# Layer 2: Document search
results = chroma_collection.query(
query_texts=[user_q],
n_results=5
)
if not results['documents'][0]:
return Answer(
type="not_found",
answer="I couldn't find information about this in the documentation."
)
# Layer 3: LLM
context = "\n\n".join(results['documents'][0])
prompt = f"""Answer the question using ONLY the context below.
CONTEXT:
{context}
QUESTION: {user_q}
RULES:
- Use only the context provided
- If not answered in context, say "I don't know"
- Be concise and accurate
ANSWER:"""
answer = await call_llm(prompt)
return Answer(
type="llm_answer",
answer=answer,
        sources=[m.get("file", "unknown") for m in results['metadatas'][0]]  # metadatas are dicts; expose source file names
)
# Run with: uvicorn app.main:app --reload
```
Testing
# Start server
uvicorn app.main:app --reload --port 8000
# Test with curl
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{"question": "How do I reset my password?"}'
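Beyond manual curl checks, a minimal automated test could use FastAPI's TestClient. A sketch, assuming the project layout from this guide and a populated FAQ database (the file name matches tests/test_faq.py in the folder structure above):

```python
# tests/test_faq.py
from fastapi.testclient import TestClient

from app.main import app

def test_known_question_avoids_the_llm():
    # Using the client as a context manager runs the startup event (cache + Chroma loading)
    with TestClient(app) as client:
        resp = client.post("/chat", json={"question": "How do I reset my password?"})
        assert resp.status_code == 200
        body = resp.json()
        # A question that exists in the FAQ bank should resolve in Layer 1,
        # either directly or via a "Did you mean?" clarification
        assert body["type"] in ("faq_answer", "did_you_mean")
```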
Part 5: Advanced Topics
Model Selection Guide
Open-Source Models Comparison
| Model | Size | Speed (CPU) | Quality | License | Best For |
|---|---|---|---|---|---|
| Mistral-7B-Instruct ✅ | 7B | Fast | Excellent | Permissive | FAQ chatbots |
| LLaMA-2-7B-Chat | 7B | Medium | Very Good | Meta | Long context |
| Falcon-7B-Instruct | 7B | Medium | Good | Apache 2.0 | Open license |
| Vicuna-7B | 7B | Medium | Good | LLaMA-based | Conversational |
Quantization Options
| Format | Size | Quality Loss | Speed | Recommendation |
|---|---|---|---|---|
| Q4_K_M | 4.4 GB | Minimal | Fast | ✅ Best balance |
| Q5_K_M | 5.2 GB | Negligible | Medium | Higher quality |
| Q3_K_M | 3.5 GB | Noticeable | Very fast | Low memory only |
Our choice: Mistral-7B-Instruct Q4_K_M
⚡ Performance Optimization
Layer-by-Layer Optimization
Layer 1: FAQ Bank
Optimizations:
- ✅ Load all FAQs into memory (nanosecond access)
- ✅ Precompute embeddings once
- ✅ Use indexes on normalized_question
- ✅ Cache recent queries (LRU)
Result: < 5ms per query
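"Precompute embeddings once" in practice means encoding every FAQ at startup into a single matrix, then scoring an incoming query against all FAQs with one vectorized operation instead of re-encoding them per request. A small sketch:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
faq_questions = ["How do I reset my password?", "What are your business hours?"]

# One-time cost at startup; normalized vectors make dot product == cosine similarity
faq_matrix = embedder.encode(faq_questions, normalize_embeddings=True)

def embedding_scores(user_question: str) -> np.ndarray:
    query = embedder.encode([user_question], normalize_embeddings=True)
    return (faq_matrix @ query.T).ravel()  # one similarity score per FAQ

print(embedding_scores("forgot password"))
```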
Layer 2: Vector Search
Optimizations:
- ✅ Use Chroma for simplicity (good enough)
- ✅ Keep index in memory
- ✅ Limit to top 5 results
- ✅ Filter by category first
Result: 50-150ms per query
Layer 3: LLM
Optimizations:
- ✅ Use quantized model (Q4)
- ✅ Limit concurrent calls (max 2)
- ✅ Reduce max tokens (256)
- ✅ Lower temperature (0.2)
Result: 1-5 seconds per query
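One way to apply these settings from Python without shelling out to the CLI is the llama-cpp-python bindings, shown here as an optional alternative to the subprocess approach in main.py (install with pip install llama-cpp-python; the model path matches the file downloaded in Part 4):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # Q4 quantized model
    n_ctx=2048,    # context window for prompt + answer
    n_threads=8,   # match your physical core count
)

def explain(prompt: str) -> str:
    # Capped tokens and low temperature, per the optimizations above
    out = llm(prompt, max_tokens=256, temperature=0.2)
    return out["choices"][0]["text"].strip()
```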
Memory Usage
| Component | Memory | Optimization |
|---|---|---|
| FAQ cache | 5-50 MB | Keep all (small) |
| Embeddings | 20-100 MB | Precompute once |
| Vector index | 100-500 MB | Use Chroma |
| LLM model | 4-5 GB | Q4 quantization |
| Total | ~5-6 GB | Fits in 8GB RAM |
Deployment & Scaling
Single Server (Your Laptop)
Capacity:
- 5-10 concurrent users
- 1000+ FAQ queries/sec
- 50-100 doc searches/sec
- 1-2 LLM calls/sec
Good for:
- MVP/prototype
- Small teams (< 20 users)
- Demo purposes
Scaling Up
When to scale:
- More than 20 concurrent users
- More than 100K documents
- Need < 50ms response time
- Geographic distribution
Options:
- Vertical: Bigger server (32-64 GB RAM)
- Horizontal: Multiple API servers + shared DB
- Migrate to FAISS: If vector search is bottleneck
- Add GPU: If LLM calls are bottleneck
Next Steps & Roadmap
Week 1: Foundation
- Read this guide
- Set up environment
- Install llama.cpp
- Download model
- Test inference
Week 2: Database
- Create SQLite schema
- Write 20-30 core FAQs
- Set up Chroma
- Test FAQ matching
- Test document search
Week 3: Integration
- Build FastAPI app
- Implement 3-layer logic
- Add "Did you mean?"
- Test end-to-end
- Basic UI
Week 4: Polish
- Add logging
- Error handling
- Performance tuning
- User testing
- Documentation
Future Enhancements
- Admin UI: Manage FAQs without code
- Analytics: Track which FAQs are used
- Auto-promote: Move common doc queries to FAQ
- Multi-language: Support other languages
- Voice: Add speech-to-text
- Mobile app: iOS/Android clients
Final Checklist
Before Launch
Technical:
- All dependencies installed
- Model working (test inference)
- Database populated (20+ FAQs)
- Documents indexed
- API endpoints tested
Quality:
- FAQ coverage comprehensive
- Confidence thresholds tuned
- "Did you mean?" working
- LLM answers grounded
- No hallucination
Performance:
- FAQ queries < 5ms
- Doc queries < 150ms
- LLM responses < 5s
- Memory usage acceptable
User Experience:
- Clear error messages
- Sources cited
- Mobile-friendly
- Accessible
Key Takeaways
The Five Principles
- FAQ First: 60-80% should hit cache
- Clarify Before Guessing: Use "Did you mean?"
- Context Over Creativity: LLM explains, doesn't invent
- Cache Everything: Speed comes from avoiding work
- Start Simple: Chroma → FAISS if needed
Success Metrics
- FAQ hit rate: > 60%
- "Did you mean?" resolution: > 15%
- LLM usage: < 15%
- Average latency: < 100ms
- User satisfaction: > 85%
What You've Learned
✅ How to build a three-tier intelligent system
✅ When to use FAISS vs Chroma (and why Chroma for beginners)
✅ How to organize knowledge for fast retrieval
✅ How to minimize expensive AI calls
✅ How to deploy on modest hardware
✅ How to scale when needed
Additional Resources
Documentation
Communities
- r/LocalLLaMA (Reddit)
- LlamaIndex Discord
- Chroma Discord
- FastAPI Discord
Next Learning
- Prompt engineering techniques
- Advanced RAG patterns
- Vector database optimization
- Production deployment best practices
Conclusion
You now have everything you need to build a professional FAQ chatbot that:
- Runs on your laptop
- Answers in milliseconds
- Never hallucinates
- Scales with your needs
The three-tier architecture (FAQ → Document → LLM) combined with Chroma for simplicity gives you the best balance of:
- ⚡ Speed (microseconds to seconds)
- 🎯 Accuracy (grounded in your docs)
- 💰 Cost (no cloud bills)
- Maintainability (simple architecture)
- Scalability (grow as needed)
Remember: Start with Chroma, focus on great FAQs, and let the system guide users to answers quickly.
Good luck building!
Built with ❤️ for developers who want to create intelligent systems on realistic hardware