Documents and FAQs are often long; to generate embeddings effectively and optimize similarity search, large text needs to be split into smaller segments (chunks).
The code uses RecursiveCharacterTextSplitter:
```python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)
chunks = splitter.split_documents(docs)
```
| Parameter | Value | Meaning |
|---|---|---|
| `chunk_size` | 500 | Maximum length of each chunk, in characters |
| `chunk_overlap` | 0 | No overlap between consecutive chunks |
- Recommended chunk size: 500–1000 tokens, depending on the embedding model (note that `RecursiveCharacterTextSplitter` counts characters by default, so `chunk_size=500` means 500 characters).
- Chunk overlap: if you need continuous context across chunks, set `chunk_overlap` to 50–100 characters, as shown in the sketch below.
Trade-off: smaller chunks give more precise matches but can cut off surrounding context, while larger chunks preserve context but produce less focused embeddings and noisier retrieval.
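If continuity between chunks matters, overlap is enabled when the splitter is configured. A minimal sketch, assuming the same `docs` loaded in the full example at the end; the `chunk_overlap=100` value is illustrative, not part of the original setup:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Same splitter configuration as above, but with a 100-character overlap so each
# chunk repeats the tail of the previous one and sentences are not cut mid-thought.
splitter_with_overlap = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
)
chunks = splitter_with_overlap.split_documents(docs)

# Quick sanity check: lengths are measured in characters.
print(len(chunks), "chunks")
print(max(len(c.page_content) for c in chunks), "characters in the longest chunk")
```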
The chunks are then converted into vectors with a HuggingFace embedding model:

```python
emb = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```
Chosen model: all-MiniLM-L6-v2
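A quick way to sanity-check the model is to embed a single string; all-MiniLM-L6-v2 produces 384-dimensional vectors. A minimal sketch, not part of the original pipeline:

```python
from langchain_huggingface import HuggingFaceEmbeddings

emb = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# embed_query returns a plain list of floats; for this model it has 384 entries.
vector = emb.embed_query("How do I change my password?")
print(len(vector))  # 384
```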
The chunks and their embeddings are indexed in a FAISS vector store:

```python
faq_store = FAISS.from_documents(chunks, emb)
```

FAISS (Facebook AI Similarity Search) provides:
- Fast nearest-neighbor search over dense vectors
- An in-memory index that runs locally, with no separate database server
- Straightforward integration with LangChain's vector-store API
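Because the index lives in memory, persisting it between runs avoids re-embedding everything. A sketch using the LangChain FAISS wrapper's `save_local` / `load_local`; the `"faq_index"` folder name is illustrative:

```python
# Save the index (and its pickled docstore) to a local folder.
faq_store.save_local("faq_index")

# Later, reload it with the same embedding model.
faq_store = FAISS.load_local(
    "faq_index",
    emb,
    # Recent LangChain versions require opting in because the docstore is unpickled.
    allow_dangerous_deserialization=True,
)
```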
To answer a user question, retrieve the most similar chunks:

```python
results = faq_store.similarity_search(query, k=3)
```

Parameters:
- `query`: the user's question
- `k=3`: return the top 3 most similar chunks

If your data changes (documents are added or updated), you need to re-chunk and re-embed the new content, then either add it to the existing index or rebuild the index, as shown in the sketch below.
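A minimal sketch of both options, assuming `new_docs` is a hypothetical list of newly loaded documents:

```python
# Option 1: chunk the new documents and add them to the existing index.
new_chunks = splitter.split_documents(new_docs)
faq_store.add_documents(new_chunks)

# Option 2: rebuild the index from scratch
# (simplest when existing documents were edited or removed).
all_chunks = splitter.split_documents(docs + new_docs)
faq_store = FAISS.from_documents(all_chunks, emb)
```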
Choosing between a lightweight and a heavier embedding model is a trade-off:

| Criteria | Lightweight Model | Heavy Model |
|---|---|---|
| Speed | Fast | Slow |
| Accuracy | Good | Excellent |
| Cost | Low | High |
| Use cases | FAQ, chatbot | Research, legal, complex RAG |
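If accuracy matters more than speed, the only code change is the model name. A sketch using sentence-transformers/all-mpnet-base-v2 as one example of a heavier model (not the model used in the original setup):

```python
heavy_emb = HuggingFaceEmbeddings(
    # Larger and slower than all-MiniLM-L6-v2, but generally more accurate.
    model_name="sentence-transformers/all-mpnet-base-v2"
)
faq_store = FAISS.from_documents(chunks, heavy_emb)
```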
If you need better Vietnamese support:
- keepitreal/vietnamese-sbert
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

The complete pipeline:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# Load documents
docs = load_faq_csv()
# Chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
)
chunks = splitter.split_documents(docs)
# Embedding
emb = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# Vector Store
faq_store = FAISS.from_documents(chunks, emb)
# Query
query = "How do I change my password?"
results = faq_store.similarity_search(query, k=3)
```
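Each result is a LangChain Document; a short usage sketch for inspecting what was retrieved:

```python
for i, doc in enumerate(results, start=1):
    # page_content holds the chunk text; metadata carries whatever the loader attached.
    print(f"--- Result {i} ---")
    print(doc.page_content)
    print(doc.metadata)
```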