Best Practices - Vector Store
Best Practices for Preparing Files for Vector Store Ingestion
When building and maintaining a knowledge base for complex AI Agents, file preparation and organization are critical to ensure long-term scalability, searchability, and accuracy. Below are best practices we recommend for managing large file structures and consistently feeding high-quality data into your vector store.
1. File Naming Conventions
Use a clear and consistent naming structure so files are easy to locate, replace, or delete later.
Standard Format:
File Name = %Category%-%Contents%.md
Example:
SALES-Product_Descriptions.md
SUPPORT-Troubleshooting_Guide.md
OBJECTIONS-Pricing_Battlecards.md
Guidelines:
Keep names short, unique, and descriptive.
Use only dashes (-) or underscores (_) as separators.
Avoid special characters such as periods (other than the one in the .md or .json extension), commas, parentheses, or spaces.
Maintain uniqueness: file names may be referenced directly in prompts, so they must not collide with other files.
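The naming rules above can be checked automatically before upload. The following is a minimal sketch, assuming the category is written in uppercase and that names are compared case-insensitively for collisions. The pattern and both function names are illustrative, not part of any specific vector store API.

```python
import re

# Assumed pattern: uppercase category, a dash, then letters, digits,
# underscores, or dashes, ending in .md or .json.
FILE_NAME_RE = re.compile(r"[A-Z][A-Z0-9_]*-[A-Za-z0-9_-]+\.(md|json)")


def is_valid_name(name: str) -> bool:
    """Return True if the name matches the assumed %Category%-%Contents% format."""
    return FILE_NAME_RE.fullmatch(name) is not None


def find_collisions(names: list[str]) -> set[str]:
    """Return lowercased names that occur more than once (case-insensitive)."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for name in names:
        key = name.lower()
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes
```

For example, `is_valid_name("SALES-Product_Descriptions.md")` passes, while a name containing spaces or parentheses is rejected.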
2. File Formats
Preferred formats:
Markdown (.md) → Great for structured documentation, FAQs, sales collateral, objection handling.
JSON (.json) → Best for structured data, configurations, or mapping tables (e.g., intent taxonomies).
Avoid binary formats like .pdf or .docx unless necessary. Convert them to Markdown or JSON to preserve structure and allow clean chunking.
3. Metadata & Context
Each file should include metadata headers so the system (and humans) can quickly identify the file’s purpose.
Recommended metadata fields:
---
category: SALES
purpose: Product descriptions for sales collateral
last_updated: 2025-09-21
owner: sales-team
---
This metadata can be used for search filtering, debugging, or selective retrieval.
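To show how such a header might be read back for filtering or debugging, here is a minimal sketch. It assumes the header is a flat list of `key: value` lines between two `---` delimiter lines at the top of the file. The function names and the `REQUIRED_FIELDS` tuple are illustrative. A full YAML parser would be needed for nested values.

```python
REQUIRED_FIELDS = ("category", "purpose", "last_updated", "owner")


def parse_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, body).

    Handles only flat `key: value` lines between two `---` lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text


def missing_fields(meta: dict[str, str]) -> list[str]:
    """Return the recommended fields that are absent from the metadata."""
    return [field for field in REQUIRED_FIELDS if field not in meta]
```

Running `missing_fields` over every file before upload is one way to catch documents that were saved without a complete header.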
4. Chunking Considerations
Write content in modular sections (short paragraphs, bullet points, headings). This makes automated chunking more natural and effective.
Use descriptive headings (##) in Markdown to create semantic breakpoints for chunking.
Avoid long walls of text. Aim for 500–800 tokens per section to balance context depth and retrieval efficiency.
Where possible, add overlap naturally in your writing (e.g., re-state context in each section) so that chunks are self-contained.
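The section-size guideline above can be checked with a short script. This sketch splits a Markdown file at `## ` headings and flags sections that exceed a token limit. The token count is a rough assumption (about four tokens for every three words), since real tokenizers vary by model, and the function names are illustrative.

```python
def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, text) pairs at each `## ` heading.

    Text before the first heading is returned with an empty heading.
    """
    sections: list[tuple[str, str]] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Skip an empty preamble, but keep headed sections even if empty.
        if heading or any(line.strip() for line in buffer):
            sections.append((heading, "\n".join(buffer).strip()))

    for line in markdown.splitlines():
        if line.startswith("## "):
            flush()
            heading = line[3:].strip()
            buffer = []
        else:
            buffer.append(line)
    flush()
    return sections


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 tokens per 3 whitespace-separated words."""
    return round(len(text.split()) * 4 / 3)


def oversized_sections(markdown: str, limit: int = 800) -> list[str]:
    """Return headings of sections whose estimated size exceeds the limit."""
    return [
        h for h, body in split_by_headings(markdown)
        if estimate_tokens(body) > limit
    ]
```

Sections reported by `oversized_sections` are candidates for splitting under additional headings.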
5. File Categories & Structure
Organize files by functional category so your vector store remains intuitive and scalable:
SALES → product descriptions, case studies, pricing guides, objection battle cards.
SUPPORT → troubleshooting guides, FAQs, setup instructions.
LEGAL/COMPLIANCE → policies, disclaimers, regulatory notes.
GENERAL → company background, mission statements, leadership bios.
Store these in a logical folder structure locally (before upload) so the team can manage them easily.
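One possible local layout places each file in a folder named after its category prefix. The sketch below derives that path from the file name. The `knowledge_base` root folder, the `UNCATEGORIZED` fallback, and the function name are assumptions for illustration.

```python
from pathlib import PurePosixPath


def local_path(file_name: str, root: str = "knowledge_base") -> PurePosixPath:
    """Build root/CATEGORY/file_name from a CATEGORY-Contents file name."""
    category, sep, _ = file_name.partition("-")
    # Files without a category prefix go to a fallback folder.
    folder = category if sep else "UNCATEGORIZED"
    return PurePosixPath(root) / folder / file_name
```

With this convention, `SUPPORT-Troubleshooting_Guide.md` would live under `knowledge_base/SUPPORT/`.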
6. Version Control & Updates
Use a versioning convention in metadata, not file names (e.g., a last_updated field).
When replacing a file:
Remove the old version from the vector store.
Upload the new version with the same file name (for continuity).
Keep a changelog (a separate .md file) that tracks updates across all knowledge base documents.
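The replacement steps above can be sketched as a single helper. `InMemoryStore` is a hypothetical stand-in for whatever vector store client you use, and its `delete` and `upload` methods are assumptions rather than a real API. The changelog is modeled as a list of lines that would be written to the changelog file.

```python
from datetime import date
from typing import Optional


class InMemoryStore:
    """Hypothetical stand-in for a vector store client; not a real API."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def delete(self, name: str) -> None:
        self.files.pop(name, None)

    def upload(self, name: str, content: str) -> None:
        self.files[name] = content


def replace_file(store: InMemoryStore, name: str, content: str,
                 changelog: list[str], note: str,
                 today: Optional[date] = None) -> None:
    """Remove the old version, upload the new one under the same name, log it."""
    store.delete(name)           # 1. remove the old version
    store.upload(name, content)  # 2. upload with the same file name
    stamp = (today or date.today()).isoformat()
    changelog.append(f"- {stamp} {name}: {note}")  # 3. record the change
```

Deleting before uploading avoids a window in which both versions are retrievable at once.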
7. Content Quality Guidelines
Be explicit and factual → Avoid ambiguous language; retrieval is most accurate when each statement is clear on its own.
Use a Q&A format where possible for objection handling and FAQs.
Cross-reference within files using internal headings or bullet lists rather than links (since links may not resolve in vector retrieval).
Keep text plain — avoid embedded images, tables with excessive formatting, or non-standard characters.
8. Security & Compliance
Never include sensitive data (PII, customer details, private contracts).
Only upload approved, external-facing content for sales/collateral files.
For compliance-heavy industries, maintain a separate LEGAL/COMPLIANCE category and instruct the AI to always defer/escalate when sensitive queries arise.
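A simple pre-upload scan can help catch obvious sensitive data. The sketch below is only an assumption-laden first pass: the two regular expressions are illustrative, they miss many kinds of PII, and the phone pattern also matches other digit runs such as ISO dates. It does not replace a human review.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for each suspected PII match."""
    hits: list[tuple[str, str]] = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

Any file with a non-empty result would be held back for manual review before upload.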
9. Testing & Validation
After upload, query the vector store with sample user prompts to confirm retrieval works as expected.
Test edge cases such as:
“What’s your pricing?” → retrieves pricing file.
“Why should we choose you over Competitor X?” → retrieves objection-handling battle card.
“Tell me about troubleshooting login issues” → retrieves support guide.
Regularly audit retrieval quality and re-chunk/reformat files if necessary.
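The sample prompts above can be turned into a repeatable check. In this sketch, `search` is a hypothetical callable that wraps your vector store query and returns file names in ranked order. The pricing file name in `CASES` is also an assumption; the other two names follow the naming examples earlier in this guide.

```python
from typing import Callable

# (sample prompt, file expected among the top results)
CASES = [
    ("What's your pricing?", "SALES-Pricing_Guide.md"),  # hypothetical name
    ("Why should we choose you over Competitor X?",
     "OBJECTIONS-Pricing_Battlecards.md"),
    ("Tell me about troubleshooting login issues",
     "SUPPORT-Troubleshooting_Guide.md"),
]


def run_retrieval_checks(search: Callable[[str], list[str]],
                         cases: list[tuple[str, str]],
                         top_k: int = 3) -> list[str]:
    """Return the prompts whose expected file is missing from the top_k results."""
    failures: list[str] = []
    for prompt, expected in cases:
        results = search(prompt)[:top_k]
        if expected not in results:
            failures.append(prompt)
    return failures
```

An empty return value means every sample prompt retrieved its expected file. Rerunning the same cases after each re-chunk or reformat makes regressions easier to spot.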
10. Summary
By following these best practices, you’ll create a scalable, maintainable, and reliable knowledge base for your AI agents. Consistency in file naming, formatting, metadata, and structure ensures that retrieval is accurate, updates are easy, and objection-handling or discovery processes remain smooth.