Part 8: Data Preparation & Quality Standards
The intelligence and effectiveness of your AI agents are directly proportional to the quality of the data they are trained on. An agent is composed of two core elements: its system instructions (its "job description") and its training data (its "knowledge base"). Both must be of high quality to ensure reliable performance. This section provides a framework for preparing and managing your data.
The Two Components of an Agent's Brain
System Instructions: This is a detailed prompt that defines the agent's role, responsibilities, boundaries, and communication style. It is the foundational context that governs all of the agent's actions.
Training Data: This is the body of information the agent uses to answer questions and perform tasks. For a customer support agent, this would be your knowledge base, product guides, and historical support tickets. For a code agent, it would be your entire codebase and technical documentation.
The Golden Rule: Convert to AI-Friendly Formats
To ensure your data is well-formed and easily understood by Large Language Models (LLMs), you must convert all documents and data into AI-friendly formats before training.
For Unstructured Data (e.g., Word docs, PDFs, web pages): Convert to Markdown. Markdown preserves essential formatting like headings, lists, and tables while removing proprietary file structures that can confuse the AI.
For Structured Data (e.g., spreadsheets, database exports): Convert to JSON (JavaScript Object Notation). JSON preserves the structured nature of the data, ensuring that rows and columns are not misinterpreted.
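For the structured case, a minimal conversion sketch using only Python's standard library is shown below; `products.csv` and `products.json` are placeholder file names, and multi-sheet spreadsheets would need a library such as openpyxl or pandas instead.

```python
import csv
import json

def csv_to_json(csv_path: str, json_path: str) -> None:
    """Convert a spreadsheet export (CSV) into a list of JSON records."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # DictReader keys each row by its column header, preserving structure
        rows = list(csv.DictReader(f))
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2, ensure_ascii=False)

csv_to_json("products.csv", "products.json")  # placeholder file names
```

Each spreadsheet row becomes a JSON object keyed by column name, so the model never has to guess which value belongs to which column.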
Assessing Data Quality: A Practical Guide
Not all data is created equal. Before you begin, you must assess the quality of your source material. Use the following table as a guide:
| Data Source Example | Quality Level | Rationale |
| --- | --- | --- |
| Public-facing Knowledge Base | High | Already curated, customer-facing, and the current source of truth for answers. |
| Internal Technical Documentation | High | Typically well-structured and factually accurate, written by domain experts. |
| Recent Marketing Materials | Medium | Generally accurate, but may contain promotional language that needs to be toned down. |
| Old PowerPoint Presentations | Low | Often contain more images than text, may be outdated, and lack detailed context. |
| Raw Email Inboxes | Low | Unstructured, conversational, and full of noise and irrelevant information. |
Start with the high-quality sources and work your way down the list, using data-cleaning techniques to improve the lower-quality ones.
Best Practice: Use AI to Clean and Structure Data
For medium- and low-quality data sources, a powerful best practice is to use AI to clean and structure the data before creating embeddings and uploading it to your vector store. This process, often called "data pre-processing," can dramatically improve the performance of your agent.
Example Workflow: Cleaning a PowerPoint Presentation
Extract Raw Text: Pull all text content from the presentation, ignoring images.
Use an AI Prompt: Pass the raw text to an LLM with a prompt like: "Review the following text extracted from a PowerPoint presentation. Summarize the key points, remove any marketing language, and format the output as a structured Markdown document. If the information is outdated, flag it."
Review and Upload: Review the AI-generated Markdown file for accuracy and then upload it to your vector store.
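One way to script steps 1 and 2 (a sketch, not a definitive implementation) is shown below. It assumes the python-pptx package for text extraction and the OpenAI Python client for the cleaning step; the model name and file names are placeholders to swap for your own.

```python
from pptx import Presentation  # pip install python-pptx
from openai import OpenAI      # pip install openai

CLEANUP_PROMPT = (
    "Review the following text extracted from a PowerPoint presentation. "
    "Summarize the key points, remove any marketing language, and format the "
    "output as a structured Markdown document. If the information is outdated, flag it."
)

def extract_raw_text(pptx_path: str) -> str:
    """Step 1: pull all text from every slide, ignoring images and other shapes."""
    prs = Presentation(pptx_path)
    chunks = []
    for slide in prs.slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                chunks.append(shape.text_frame.text)
    return "\n\n".join(chunks)

def clean_with_llm(raw_text: str) -> str:
    """Step 2: ask an LLM to restructure the raw slide text as clean Markdown."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever model your platform provides
        messages=[
            {"role": "system", "content": CLEANUP_PROMPT},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content

# Write the result to disk so step 3 (human review) can happen before upload.
with open("old_deck.md", "w", encoding="utf-8") as f:
    f.write(clean_with_llm(extract_raw_text("old_deck.pptx")))
```

Note that step 3 stays manual by design: the script only writes the Markdown to disk so you can verify its accuracy before it enters your vector store.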
Handling Large Volumes of Files: Merging and Metadata
When dealing with a large number of files, such as an entire codebase or a library of thousands of documents, it is a best practice to merge the files and add metadata headers. This makes it easier for the agent to locate relevant information during vector queries.
YAML Headers: A common method is to add a small YAML (YAML Ain't Markup Language) header at the top of each file's content before merging. This header contains metadata about the file.
Example: Merging Code Files
When uploading an entire codebase, you could prepend each file with a YAML header like this:
```
---
file_path: /src/components/buttons/PrimaryButton.js
last_modified: 2026-01-02
---
// ... content of PrimaryButton.js ...
```
By merging all files into a single document with these headers, the agent can more easily pinpoint the exact source of a piece of code when answering a question, leading to more accurate and context-aware responses.
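A minimal merging script under these assumptions might look like the following; the root directory, file extensions, and output file name are illustrative placeholders.

```python
import os
from datetime import date

def merge_with_headers(root_dir: str, output_path: str,
                       extensions: tuple = (".js", ".jsx", ".ts")) -> None:
    """Merge source files into one document, prepending a YAML header to each."""
    with open(output_path, "w", encoding="utf-8") as out:
        for dirpath, _, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                if not name.endswith(extensions):
                    continue
                path = os.path.join(dirpath, name)
                modified = date.fromtimestamp(os.path.getmtime(path))
                with open(path, encoding="utf-8") as src:
                    content = src.read()
                # The YAML header tells the agent exactly where this chunk came from
                out.write(f"---\nfile_path: {path}\nlast_modified: {modified}\n---\n")
                out.write(content + "\n\n")

merge_with_headers("./src", "codebase_merged.md")  # placeholder paths
```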
Summary of Data Preparation Workflow
Identify Data Sources: Inventory all potential data sources for your agent.
Assess Quality: Categorize each source as high, medium, or low quality.
Convert to AI-Friendly Formats: Convert all data to Markdown or JSON.
Clean and Structure: Use AI-powered prompts to clean and structure medium- and low-quality data.
Add Metadata: For large file sets, add YAML headers to provide context.
Upload and Vectorize: Upload the prepared data into your platform's vector store, as sketched below.
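If your platform expects pre-computed embeddings rather than raw files, the final step might look like the sketch below. It assumes the OpenAI embeddings API; the model name is a placeholder, the text is shown pre-chunked, and the actual upsert call depends entirely on your vector store.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Create one embedding vector per prepared text chunk."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # placeholder; use your platform's model
        input=chunks,
    )
    return [item.embedding for item in response.data]

chunks = ["## Returns policy\n...", "## Shipping times\n..."]  # prepared Markdown chunks
vectors = embed_chunks(chunks)
# Upsert `vectors` alongside their source text and YAML metadata into your
# vector store; the exact call is platform-specific.
```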
By following this structured approach to data preparation, you will build a strong foundation for a highly effective and reliable AI agent.