Module 3: Document Preparation with raia Academy
Before a document can be added to a vector store, it needs to be in a clean, structured format that the AI can easily understand. This is where raia Academy comes in. It is a tool within the raiaAI platform that transforms documents from various formats into AI-ready Markdown or JSON.
Why Document Preparation is Crucial
The principle of "garbage in, garbage out" applies directly to AI training. If you upload poorly formatted or unstructured documents, the resulting vectors will be noisy, and searches against them will return irrelevant or fragmented passages, leading to poor agent performance. Proper document preparation ensures that the information is clean, coherent, and optimized for semantic search.
Common Source Formats
Clients will often provide information in a variety of formats, such as:
Microsoft Word (.docx)
PDF (.pdf)
PowerPoint (.pptx)
Web pages (HTML)
Plain text (.txt)
Raia Academy is designed to handle these formats and convert them into a standardized structure.
The Goal: Clean Markdown or JSON
For most use cases, the ideal format for AI training is Markdown (.md). Markdown is a lightweight markup language that allows you to add structure to a document using simple syntax. It is easy for both humans and machines to read.
In some advanced cases, particularly for Analyst Agents that need to process structured data, JSON (JavaScript Object Notation) may be the preferred format. JSON is a text-based format for representing structured data.
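To make the difference between the two formats concrete, here is a minimal Python sketch that renders the same records both ways. The product records and the to_markdown_table helper are hypothetical and only for illustration; they are not part of raia Academy. The table uses pipe syntax, a common Markdown extension.

```python
import json

# Hypothetical product records, invented for this example.
records = [
    {"sku": "A-100", "name": "Widget", "price_usd": 19.99, "in_stock": True},
    {"sku": "B-200", "name": "Gadget", "price_usd": 34.50, "in_stock": False},
]

def to_markdown_table(rows):
    """Render a list of flat dicts as a pipe-style Markdown table."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)

markdown_view = to_markdown_table(records)   # readable, but every cell is text
json_view = json.dumps(records, indent=2)    # keeps numbers and booleans typed

# Parsing the JSON back yields the original values with their types intact.
restored = json.loads(json_view)

print(markdown_view)
print(json_view)
```

The Markdown view is easier to read, while the JSON view preserves field names and data types, which is why it tends to suit agents that compute over structured data.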
Using raia Academy for Transformation
Raia Academy provides a user-friendly interface for uploading and transforming documents.
The Transformation Process:
Upload: Upload your source document (e.g., a .pdf or .docx file) to raia Academy.
Analyze: Raia Academy will analyze the document and extract its content, including text, headings, tables, and lists.
Convert: The tool will automatically convert the extracted content into clean Markdown, preserving the original structure of the document as much as possible.
Review and Edit: You can then review the generated Markdown and make any necessary edits. This is an important step to ensure that the final document is clean and well-structured.
Export: Once you are satisfied with the result, you can export the clean Markdown file, ready to be uploaded to your agent's knowledge base.
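To get a feel for what the Analyze and Convert steps involve, the following Python sketch converts a small HTML fragment to Markdown using only the standard library. It is an illustration under simplifying assumptions, not how raia Academy works internally: the MarkdownSketch class and html_to_markdown function are hypothetical names, and only headings (h1 to h3), paragraphs, and list items are handled.

```python
from html.parser import HTMLParser

BLOCK_TAGS = ("h1", "h2", "h3", "p", "li")

class MarkdownSketch(HTMLParser):
    """Tiny HTML-to-Markdown converter covering h1-h3, p, and li only."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished Markdown lines
        self.prefix = None  # Markdown prefix of the element being read
        self.buffer = []    # text fragments collected for that element

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "- "
        elif tag == "p":
            self.prefix = ""
        else:
            return  # inline tags such as <b>: keep collecting text
        self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.prefix is not None:
            # Collapse runs of whitespace left over from the source layout.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.prefix = None

def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    out = []
    for line in parser.lines:
        # Keep consecutive list items together; put a blank line between other blocks.
        if out and not (line.startswith("- ") and out[-1].startswith("- ")):
            out.append("")
        out.append(line)
    return "\n".join(out)

sample = (
    "<h1>Refund  Policy</h1>"
    "<p>Refunds are issued within <b>30 days</b>.</p>"
    "<ul><li>Keep the receipt</li><li>Contact support</li></ul>"
)
print(html_to_markdown(sample))
```

Even this small sketch drops inline styling and tables, which is one reason the Review and Edit step matters: automated conversion rarely captures every structure in the source.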
Key Benefits of Using raia Academy:
Efficiency: Automates the time-consuming process of manually cleaning and formatting documents.
Consistency: Ensures that all your training documents are in a standardized format.
Quality: Removes unnecessary formatting and artifacts that can confuse the AI.
Structure Preservation: Maintains the hierarchical structure of your documents (headings, lists, etc.), which is important for semantic understanding.
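One way to see why preserved headings matter for semantic search is to split a Markdown document into chunks that each carry their heading path. The Python sketch below is a simplified illustration and an assumption about how a downstream pipeline might chunk content; chunk_by_headings is a hypothetical helper, not a raia Academy function, and it does not account for "#" lines inside code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headings(markdown):
    """Split Markdown into chunks, each tagged with its heading path."""
    chunks = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before adding the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks

doc = """# Handbook

## Refunds
Refunds are issued within 30 days.

## Shipping
Orders ship in 2 days.

### International
Allow 10 days.
"""

for chunk in chunk_by_headings(doc):
    print(chunk["path"], "|", chunk["text"])
```

With the heading path attached, a short passage such as "Allow 10 days." keeps the context (Handbook > Shipping > International) that it would lose if the headings had been stripped during conversion.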
By mastering raia Academy, you will be able to prepare high-quality training data efficiently, which is a critical step in building effective AI agents.