Lesson 5.2 — Measuring Accuracy, Consistency & Confidence

Introduction: From Subjective to Objective

With our high-quality ground truth QA set in hand, we are now ready to move from the subjective feeling of "I think the agent is working well" to the objective, data-driven statement of "The agent is X% accurate." This is the core of the evaluation process: using our ground truth as a benchmark to systematically measure the performance of our agent.

In this lesson, we will focus on three of the most critical metrics for any AI agent: accuracy, consistency, and confidence. These three pillars of performance will give us a comprehensive understanding not only of whether our agent is getting the right answer, but also of how reliably it produces that answer and how well it gauges its own certainty.

Accuracy: The Core Performance Metric

Accuracy is the most fundamental measure of an AI agent's performance. It answers the simple question: "How often does the agent provide the correct answer?"

Measuring Accuracy

To measure accuracy, we simply run all of the questions from our ground truth QA set through the agent and compare the agent's generated answer to the "ideal answer" in our dataset.

There are several ways to perform this comparison:

  • Exact Match: This is the strictest method, where the agent's answer must be identical to the ground truth answer. This is often too rigid for generative AI applications.

  • Keyword Match: A more flexible approach where we check if the agent's answer contains the key terms and phrases specified in our QA set.

  • Semantic Similarity: Using another LLM or a sentence-transformer model, we can measure the semantic similarity between the agent's answer and the ground truth answer. This is often the most effective method for evaluating generative AI (a minimal sketch follows this list).

  • Human Evaluation: A human evaluator scores the agent's answer on a scale (e.g., 1-5) based on its correctness and completeness.
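
To make the automated comparisons concrete, here is a minimal sketch of the keyword-match and semantic-similarity checks, assuming the sentence-transformers package is available. The model name, similarity threshold, and helper names are illustrative choices, not fixed requirements.

```python
# Minimal sketch of two answer-comparison methods (keyword match, semantic similarity).
# Assumes the sentence-transformers package; the model and threshold are example choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def keyword_match(agent_answer: str, required_keywords: list[str]) -> bool:
    """Pass if every required key term appears in the agent's answer."""
    answer = agent_answer.lower()
    return all(keyword.lower() in answer for keyword in required_keywords)

def semantic_match(agent_answer: str, ideal_answer: str, threshold: float = 0.8) -> bool:
    """Pass if the cosine similarity between the two answers clears a threshold."""
    embeddings = model.encode([agent_answer, ideal_answer])
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold

print(keyword_match("Refunds are processed within 5 business days.", ["refund", "5 business days"]))
print(semantic_match("Refunds take about five business days.", "Refunds are processed within 5 business days."))
```

The threshold is something to tune against your own QA set: too low and near-misses pass, too high and correct paraphrases fail.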

The Accuracy Score

Once we have evaluated all of the answers, we can calculate our overall accuracy score:

Accuracy = (Number of Correct Answers / Total Number of Questions) * 100

This single number is our headline metric, the key performance indicator (KPI) that we will use to track our progress over time.
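
A small evaluation loop makes the calculation explicit. In this sketch, `agent.answer()` and the QA-set field names are hypothetical placeholders, and `is_correct` stands for whichever comparison method you chose above.

```python
# Minimal sketch of the accuracy calculation over a ground truth QA set.
# `agent`, `qa_set`, and the field names are hypothetical placeholders.
def evaluate_accuracy(qa_set, agent, is_correct):
    correct = 0
    failures = []
    for item in qa_set:
        answer = agent.answer(item["question"])
        if is_correct(answer, item["ideal_answer"]):
            correct += 1
        else:
            failures.append({"question": item["question"], "got": answer})
    accuracy = correct / len(qa_set) * 100  # headline accuracy, in percent
    return accuracy, failures
```

Keeping the failure cases alongside the score is deliberate: the single number tracks progress, but the failure list is what tells you where to improve.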

Consistency: The Reliability Metric

Consistency, or reliability, measures whether the agent provides the same answer to the same (or very similar) questions over time. An agent that gives different answers to the same question is not trustworthy.

Measuring Consistency

To measure consistency, we can:

  1. Run the same question through the agent multiple times and check whether the answers agree with one another.

  2. Create multiple variations of the same question in our QA set and check whether the agent provides an equivalent answer to all of them.

Consistency can be measured using the same comparison methods as accuracy (exact match, keyword match, or semantic similarity).
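
Here is a minimal sketch of the repeated-run approach using semantic similarity, again assuming the sentence-transformers package. `agent.answer()` is a hypothetical stand-in for however your agent is invoked.

```python
# Minimal sketch of a repeated-run consistency check.
# Assumes sentence-transformers; `agent.answer()` is a hypothetical placeholder.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_score(agent, question: str, runs: int = 5, threshold: float = 0.8) -> float:
    """Ask the same question several times; return the fraction of answer
    pairs whose semantic similarity clears the threshold."""
    answers = [agent.answer(question) for _ in range(runs)]
    embeddings = model.encode(answers)
    pairs = list(combinations(range(runs), 2))
    consistent = sum(
        1 for i, j in pairs
        if util.cos_sim(embeddings[i], embeddings[j]).item() >= threshold
    )
    return consistent / len(pairs)
```

A score of 1.0 means every pair of answers agreed; anything much lower flags a question the agent answers unreliably.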

Confidence: The Self-Awareness Metric

As we discussed in Module 4, a sophisticated agent should not only provide an answer but also have a sense of its own confidence in that answer. The confidence score is a measure of the agent's self-awareness.

Measuring Confidence

Most intent classification systems, and many RAG pipelines, expose a confidence score (or a probability or similarity score that can serve as one) alongside their output. When evaluating our agent, we should look not only at the correctness of each answer but also at the confidence score associated with it.

This allows us to ask more sophisticated questions about our agent's performance:

  • When the agent is correct, is it also confident?

  • When the agent is wrong, is it aware of its own uncertainty (i.e., does it have a low confidence score)?

An agent that is confidently wrong is a dangerous agent. An agent that is aware of its own limitations is a much more reliable and trustworthy partner.
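
A simple way to operationalize this is to average the agent's reported confidence separately for correct and incorrect answers, and to count the confidently wrong cases. The record format below (boolean `correct`, float `confidence`) is a hypothetical example; adapt the field names and the "confidently wrong" threshold to your own evaluation harness.

```python
# Minimal sketch of the confidence analysis described above.
# The `results` record format and the 0.8 threshold are illustrative assumptions.
from statistics import mean

def confidence_breakdown(results):
    """results: list of dicts with boolean 'correct' and float 'confidence'."""
    correct = [r["confidence"] for r in results if r["correct"]]
    incorrect = [r["confidence"] for r in results if not r["correct"]]
    return {
        "avg_confidence_when_correct": mean(correct) if correct else None,
        "avg_confidence_when_incorrect": mean(incorrect) if incorrect else None,
        # Confidently wrong answers are the most dangerous failure mode.
        "confidently_wrong": sum(
            1 for r in results if not r["correct"] and r["confidence"] >= 0.8
        ),
    }
```

Ideally the gap between the two averages is large and the confidently wrong count is close to zero; a small gap means the confidence score is not telling you much.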

The Evaluation Dashboard

To effectively track these metrics, it is essential to create an evaluation dashboard. This dashboard should display:

  • The overall accuracy score over time.

  • Accuracy scores broken down by intent.

  • The average confidence score for correct and incorrect answers.

  • A list of the most common failure cases.

This dashboard will be our central hub for understanding the performance of our agent and for identifying areas for improvement.
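
The aggregation behind such a dashboard can be quite small. This sketch assumes each evaluation record carries an `intent` label, a `correct` flag, and a `confidence` value (a hypothetical format); in practice you would feed these numbers into whatever dashboarding or tracking tool you already use.

```python
# Minimal sketch of the aggregation behind an evaluation dashboard.
# The record format (intent, correct, confidence) is a hypothetical example.
from collections import defaultdict

def dashboard_summary(results):
    by_intent = defaultdict(lambda: {"correct": 0, "total": 0})
    for r in results:
        by_intent[r["intent"]]["total"] += 1
        by_intent[r["intent"]]["correct"] += int(r["correct"])

    overall = sum(v["correct"] for v in by_intent.values()) / len(results) * 100
    per_intent = {
        intent: v["correct"] / v["total"] * 100 for intent, v in by_intent.items()
    }
    failures = [r for r in results if not r["correct"]]
    return {
        "overall_accuracy_pct": overall,
        "accuracy_by_intent_pct": per_intent,
        "failure_cases": failures,
    }
```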

Conclusion: The Three Pillars of Performance

Accuracy, consistency, and confidence are the three essential pillars of a high-performing AI agent. By systematically measuring and tracking these metrics, we can move beyond subjective impressions and build a data-driven culture of continuous improvement. We now have a clear, objective understanding of our agent's performance, and we are ready to explore some of the more advanced and adversarial methods of testing.

In the next lesson, we will dive into the world of red teaming and adversarial prompt testing, where we will learn how to proactively search for our agent's weaknesses before our users do.
