Best Practices - Vector Store

Best Practices for Preparing Files for Vector Store Ingestion

When building and maintaining a knowledge base for complex AI agents, file preparation and organization are critical to long-term scalability, searchability, and accuracy. Below are the best practices we recommend for managing large file structures and consistently feeding high-quality data into your vector store.


1. File Naming Conventions

  • Use a clear and consistent naming structure so files are easy to locate, replace, or delete later.

  • Standard Format:

    File Name = %Category%-%Contents%.md

    Examples:

    • SALES-Product_Descriptions.md

    • SUPPORT-Troubleshooting_Guide.md

    • OBJECTIONS-Pricing_Battlecards.md

Guidelines:

  • Keep names short, unique, and descriptive.

  • Only use dashes (-) or underscores (_) as separators.

  • Avoid special characters such as commas, parentheses, spaces, or periods (other than the one before the .md or .json extension).

  • Maintain uniqueness: file names may be referenced directly in prompts, so they must not collide with other files.
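The naming rules above can be checked automatically before upload. The following is a minimal sketch, not a definitive validator: the regular expression, the uppercase-category requirement, and the helper names `is_valid_name` and `find_collisions` are assumptions drawn from the examples in this section.

```python
import re

# Assumed pattern for %Category%-%Contents%.md (or .json): an uppercase
# category, a dash, then letters, digits, underscores, or dashes, then the
# extension. Adjust it if your categories follow a different style.
FILE_NAME_RE = re.compile(r"[A-Z][A-Z0-9_]*-[A-Za-z0-9][A-Za-z0-9_-]*\.(md|json)")


def is_valid_name(name: str) -> bool:
    """Return True if the name follows the Category-Contents convention."""
    return FILE_NAME_RE.fullmatch(name) is not None


def find_collisions(names: list[str]) -> set[str]:
    """Return lowercased names that occur more than once (case-insensitive)."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for name in names:
        key = name.lower()
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return dupes


if __name__ == "__main__":
    for candidate in ["SALES-Product_Descriptions.md", "Sales Guide (v2).md"]:
        print(candidate, is_valid_name(candidate))
```

Comparing names case-insensitively is a deliberate choice here, since some file systems and stores treat `SALES-A.md` and `sales-a.md` as the same file.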


2. File Formats

  • Preferred formats:

    • Markdown (.md) → Great for structured documentation, FAQs, sales collateral, objection handling.

    • JSON (.json) → Best for structured data, configurations, or mapping tables (e.g., intent taxonomies).

  • Avoid binary formats like .pdf or .docx unless necessary. Convert them to Markdown or JSON to preserve structure and allow clean chunking.


3. Metadata & Context

  • Each file should include metadata headers so the system (and humans) can quickly identify the file’s purpose.

  • Recommended metadata fields:

    ---
    category: SALES
    purpose: Product descriptions for sales collateral
    last_updated: 2025-09-21
    owner: sales-team
    ---

  • This metadata can be used for search filtering, debugging, or selective retrieval.
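To use these headers for filtering or auditing, they first have to be read back out of each file. Below is a minimal sketch that parses the flat `key: value` header shown above and reports missing fields. It is a simplification rather than a full YAML parser, and the names `parse_front_matter`, `missing_fields`, and `REQUIRED_FIELDS` are hypothetical.

```python
# The four fields recommended above; treating all of them as required
# is an assumption made for this sketch.
REQUIRED_FIELDS = {"category", "purpose", "last_updated", "owner"}


def parse_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (metadata, body).

    Only flat `key: value` lines between two `---` lines are handled.
    If the header is absent or never closed, metadata is empty and the
    text is returned unchanged.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return {}, text


def missing_fields(meta: dict[str, str]) -> list[str]:
    """Return the recommended fields that the metadata does not define."""
    return sorted(REQUIRED_FIELDS - set(meta))
```

Running `missing_fields` over every file before upload gives a quick list of documents that still need an owner or a `last_updated` date.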


4. Chunking Considerations

  • Write content in modular sections (short paragraphs, bullet points, headings). This makes automated chunking more natural and effective.

  • Use descriptive headings (##) in Markdown to create semantic breakpoints for chunking.

  • Avoid long walls of text; aim for roughly 500–800 tokens per section to balance context depth and retrieval efficiency.

  • Where possible, add overlap naturally in your writing (e.g., re-state context in each section) so that chunks are self-contained.
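One way to check section length before upload is to split each file at its `##` headings and estimate the size of every section. The sketch below assumes a rough heuristic of about four tokens per three English words; a real tokenizer will give different counts, so treat the numbers as approximate. The function names are hypothetical.

```python
def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs at each `## ` heading.

    Text before the first heading is returned under an empty heading.
    """
    sections: list[tuple[str, str]] = []
    heading, buf = "", []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if heading or any(s.strip() for s in buf):
                sections.append((heading, "\n".join(buf).strip()))
            heading, buf = line[3:].strip(), []
        else:
            buf.append(line)
    if heading or any(s.strip() for s in buf):
        sections.append((heading, "\n".join(buf).strip()))
    return sections


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 tokens per 3 words (an assumption), rounded up."""
    words = len(text.split())
    return (words * 4 + 2) // 3


def oversized(markdown: str, limit: int = 800) -> list[str]:
    """Return headings of sections whose estimated size exceeds the limit."""
    return [h for h, body in split_sections(markdown)
            if estimate_tokens(body) > limit]
```

Sections flagged by `oversized` are candidates for splitting under additional headings.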


5. File Categories & Structure

Organize files by functional category so your vector store remains intuitive and scalable:

  • SALES → product descriptions, case studies, pricing guides, objection battle cards.

  • SUPPORT → troubleshooting guides, FAQs, setup instructions.

  • LEGAL/COMPLIANCE → policies, disclaimers, regulatory notes.

  • GENERAL → company background, mission statements, leadership bios.

Store these in a logical folder structure locally (before upload) so the team can manage them easily.


6. Version Control & Updates

  • Track versions in metadata (e.g., the last_updated field), not in file names.

  • When replacing a file:

    • Remove the old version from the vector store.

    • Upload the new version with the same file name (for continuity).

  • Keep a changelog (separate .md file) that tracks updates across all knowledge base documents.
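The replace-and-log routine above can be sketched as follows. `InMemoryStore` is a hypothetical stand-in for whatever vector store client you use; its `delete` and `upload` methods are assumptions for illustration, not the API of any specific product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class InMemoryStore:
    """Stand-in for a real vector store client (hypothetical interface)."""
    files: dict[str, str] = field(default_factory=dict)

    def delete(self, name: str) -> None:
        self.files.pop(name, None)

    def upload(self, name: str, content: str) -> None:
        self.files[name] = content


def replace_file(store: InMemoryStore, name: str, content: str,
                 changelog: list[str], note: str,
                 today: Optional[date] = None) -> None:
    """Remove the old version, upload the new one under the same name,
    and append a dated entry to the changelog."""
    store.delete(name)
    store.upload(name, content)
    stamp = (today or date.today()).isoformat()
    changelog.append(f"- {stamp} {name}: {note}")
```

The changelog entries are plain Markdown list items, so they can be written straight into the separate changelog file.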


7. Content Quality Guidelines

  • Be explicit and factual → Avoid ambiguous language; retrieval works best when content is clear.

  • Use a Q&A format where possible, especially for objection handling and FAQs.

  • Cross-reference within files using internal headings or bullet lists rather than links (since links may not resolve in vector retrieval).

  • Keep text plain: avoid embedded images, tables with excessive formatting, and non-standard characters.


8. Security & Compliance

  • Never include sensitive data (PII, customer details, private contracts).

  • Only upload approved, external-facing content for sales/collateral files.

  • For compliance-heavy industries, maintain a separate LEGAL/COMPLIANCE category and instruct the AI to always defer/escalate when sensitive queries arise.
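A simple pre-upload scan can catch some obvious sensitive data before it reaches the vector store. The sketch below is illustrative only: the two patterns (email addresses and US-style phone numbers) are assumptions, cover a small fraction of real PII, and do not replace a human review.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # US-style 3-3-4 phone numbers; chosen so ISO dates are not flagged.
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_for_pii(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, kind, matched_text) for each suspicious match."""
    hits: list[tuple[int, str, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PII_PATTERNS.items():
            for m in pattern.finditer(line):
                hits.append((lineno, kind, m.group()))
    return hits
```

Any file with hits should be held back and reviewed manually before it is uploaded.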


9. Testing & Validation

  • After upload, query the vector store with sample user prompts to confirm retrieval works as expected.

  • Test representative and edge-case prompts such as:

    • “What’s your pricing?” → retrieves pricing file.

    • “Why should we choose you over Competitor X?” → retrieves objection-handling battle card.

    • “Tell me about troubleshooting login issues” → retrieves support guide.

  • Regularly audit retrieval quality and re-chunk/reformat files if necessary.
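These checks can be turned into a repeatable regression test. In the sketch below, `search` is a placeholder for your own retrieval call and is assumed to return file names in ranked order. The file name `SALES-Pricing_Guide.md` is hypothetical; the other two come from the naming examples in section 1.

```python
from typing import Callable

# Expected source file for each sample prompt (mapping is an assumption).
SAMPLE_CASES = {
    "What's your pricing?": "SALES-Pricing_Guide.md",
    "Why should we choose you over Competitor X?": "OBJECTIONS-Pricing_Battlecards.md",
    "Tell me about troubleshooting login issues": "SUPPORT-Troubleshooting_Guide.md",
}


def run_retrieval_checks(search: Callable[[str], list[str]],
                         cases: dict[str, str],
                         top_k: int = 3) -> list[str]:
    """Return the prompts whose expected file is missing from the top_k results."""
    failures: list[str] = []
    for prompt, expected in cases.items():
        results = search(prompt)[:top_k]
        if expected not in results:
            failures.append(prompt)
    return failures
```

An empty list means every prompt retrieved its expected file. Rerunning the same cases after each upload or re-chunking makes regressions in retrieval quality easier to spot.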


10. Summary

By following these best practices, you’ll create a scalable, maintainable, and reliable knowledge base for your AI agents. Consistency in file naming, formatting, metadata, and structure ensures that retrieval is accurate, updates are easy, and objection-handling or discovery processes remain smooth.
