Lesson 5.1 — Creating Ground Truth QA Sets
Introduction: The Foundation of Objective Evaluation
In the world of AI, you can't improve what you can't measure. And to measure the performance of our AI agent, we need a reliable, objective standard to compare it against. This standard is known as ground truth. A ground truth dataset is a collection of data that has been verified to be accurate and is considered the "gold standard" for a particular task.
For our purposes, we will focus on creating Ground Truth Question-Answer (QA) Sets: curated lists of questions representative of what our users will actually ask, each paired with the ideal, factually correct answer. This QA set will be the bedrock of our entire evaluation framework, allowing us to systematically and objectively measure the performance of our agent.
This lesson will guide you through the process of creating a high-quality, comprehensive, and maintainable ground truth QA set. This is not just a technical exercise; it is a strategic investment in the long-term quality and reliability of your AI agent.
The Anatomy of a High-Quality QA Set
A good QA set is more than just a list of questions and answers. It is a structured dataset that captures the full complexity of the evaluation task. Each entry should include the following fields:
Question: A clear, unambiguous question that a user might ask. Example: "What is the warranty period for the X-1000 model?"
Ideal Answer: The precise, factually correct answer to the question. Example: "The warranty period for the X-1000 model is two years from the date of purchase."
Context/Source: The specific document or data source where the answer can be found. Example: product_manual.pdf, page 12
Intent: The corresponding intent from our taxonomy. Example: product.warranty_info
Keywords: Key terms or phrases that must be present in the answer. Example: "two years", "2 years"
Methods for Creating Ground Truth QA Sets
There are several methods for creating a ground truth QA set, and the best results usually come from combining them.
1. Manual Curation
This is the most labor-intensive method, but it produces the highest-quality results. It involves subject matter experts (SMEs) writing questions and answers by hand, based on their deep knowledge of the domain. This is particularly important for capturing nuanced or complex information.
2. Generation from Existing Documents
As we saw in our research, tools like IBM's watsonx Orchestrate can automatically generate ground truth datasets from user stories and tool definitions [2]. This can be a powerful way to quickly create a large number of QA pairs from your existing documentation.
3. Recording and Annotation of Live Conversations
This method involves capturing real user conversations with the agent (or with human agents) and then having experts annotate them with the correct answers and intents. This is an excellent way to ensure that your QA set reflects how real users actually communicate.
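To illustrate, here is a sketch of how an annotated conversation turn might be converted into a QA entry. The record structure, field names, and IDs are assumptions for illustration; a real annotation tool will impose its own schema.

```python
# Hypothetical record produced when an SME reviews a captured conversation turn.
annotated_turn = {
    "conversation_id": "conv-0042",               # assumed ID format
    "user_utterance": "Hey, how long is the X-1000 covered for?",
    "annotated_intent": "product.warranty_info",  # label assigned by the reviewer
    "correct_answer": "The warranty period for the X-1000 model is two years from the date of purchase.",
    "source": "product_manual.pdf, page 12",
    "reviewer": "sme_jdoe",                       # who verified the annotation
}

# Converting the annotation into a QA entry preserves the user's real phrasing,
# which often differs from how an SME would word the same question.
qa_entry = {
    "question": annotated_turn["user_utterance"],
    "ideal_answer": annotated_turn["correct_answer"],
    "source": annotated_turn["source"],
    "intent": annotated_turn["annotated_intent"],
    "keywords": ["two years", "2 years"],
}
```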
4. Synthetic Data Generation
Large Language Models can be used to generate a wide variety of questions from a given document. This can be a cost-effective way to expand your QA set, but it is crucial to have a human-in-the-loop to verify the quality and accuracy of the generated pairs.
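As a sketch of the human-in-the-loop workflow, the snippet below asks an LLM to propose candidate QA pairs from a document excerpt and marks each one as unverified until a reviewer approves it. It assumes the OpenAI Python client and the gpt-4o-mini model purely for illustration; any provider works, and the prompt is only a starting point.

```python
import json

from openai import OpenAI  # assumed client; substitute whichever LLM provider you use

client = OpenAI()

PROMPT_TEMPLATE = (
    "Read the document excerpt below and write three question-answer pairs that a "
    "customer might realistically ask. Answer strictly from the excerpt. Return a "
    'JSON array of objects with "question" and "answer" keys.\n\nExcerpt:\n{excerpt}'
)


def generate_candidates(excerpt: str) -> list[dict]:
    """Ask the LLM for candidate QA pairs grounded in a single document excerpt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whatever your team has access to
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(excerpt=excerpt)}],
    )
    # In practice the model may wrap the JSON in markdown fences; strip them if needed.
    candidates = json.loads(response.choices[0].message.content)
    # Every generated pair starts unverified; a human reviewer must approve it
    # before it is allowed into the ground truth set.
    return [{**pair, "verified": False, "source_excerpt": excerpt} for pair in candidates]
```

Run over your documentation section by section, this produces a review queue; only reviewer-approved pairs graduate into the ground truth set.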
Best Practices for Building Your QA Set
Diversity is Key: Your QA set should cover a wide range of topics, question types, and levels of complexity.
Focus on the Important Stuff: Prioritize creating QA pairs for the most common and most critical user intents.
Keep it Current: Your QA set is a living document. It must be updated whenever your products, policies, or data sources change.
Version Control: Use a version control system (like Git) to track changes to your QA set over time; a storage sketch follows this list.
Collaboration is Crucial: Involve a diverse team of SMEs, developers, and business stakeholders in the creation and maintenance of your QA set.
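One way to make version control practical, sketched under the assumption that the QA set is stored in the repository as a JSON Lines file (the path and helper below are illustrative): one entry per line produces clean diffs, so a reviewer can see exactly which questions or answers changed in a given commit.

```python
import json
from pathlib import Path

# Assumed location inside the repository; one JSON object per line diffs cleanly
# under Git, so reviewers can see exactly which entries changed.
QA_SET_PATH = Path("eval/ground_truth_qa.jsonl")


def save_qa_set(qa_pairs: list[dict]) -> None:
    """Write the QA set as JSON Lines, sorted so diffs stay stable between commits."""
    QA_SET_PATH.parent.mkdir(parents=True, exist_ok=True)
    with QA_SET_PATH.open("w", encoding="utf-8") as f:
        for pair in sorted(qa_pairs, key=lambda p: p["question"]):
            f.write(json.dumps(pair, ensure_ascii=False, sort_keys=True) + "\n")
```

Committing this file alongside the agent's code also means every evaluation run can be traced back to the exact version of the QA set it was scored against.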
Conclusion: Your Single Source of Truth
A high-quality ground truth QA set is the single most important asset you will create in the evaluation process. It is your single source of truth: the benchmark against which every improvement is measured. The effort you invest in creating a comprehensive and accurate QA set will pay dividends throughout the entire lifecycle of your AI agent, enabling you to build a system that is not only intelligent but also trustworthy.
In the next lesson, we will explore how to use this QA set to measure the core performance metrics of our agent: accuracy, consistency, and confidence.