Agent Evaluation in Microsoft Copilot Studio
Agent Evaluation provides a standardized mechanism to measure, manage, and improve the performance and reliability of AI agents, moving them from “promising prototypes” to trustworthy, production-ready tools.
Real-time User Journey
The user journey for a “Maker” (someone building the agent) follows a continuous feedback loop:
- Defining the Goal: The maker identifies a scenario (e.g., an HR agent answering leave questions).
- Inputting Realistic Data: Instead of perfect prompts, the maker uploads datasets reflecting messy, real-world user questions (vague phrasing, mixed intents).
- Simulated Execution: Copilot Studio runs the agent against these prompts in a simulated environment using a specific User Identity (e.g., testing if a contractor accidentally sees full-time employee benefits).
- Automated Grading: The system applies “Graders” to evaluate the responses based on Quality (completeness), Classification (behavior alignment), and Capability (using the right tool/topic); a minimal grader sketch follows this list.
- Analysis & Refinement: The maker reviews aggregated trends to see high-level performance and drills down into specific failures to understand why the agent missed the mark.
- Comparison: After making tweaks to instructions or data, the maker runs a new eval and compares it to the previous one to prove the agent is actually getting better.
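To make the data and grading ideas concrete, here is a minimal Python sketch of what a “messy” test dataset and two simple graders could look like. It is purely illustrative: the TestCase fields, the hypothetical HR prompts, and the quality_grader / capability_grader functions are assumptions for this example, not Copilot Studio’s actual schema or grading logic.

```python
# Conceptual sketch only: NOT the Copilot Studio data format or API.
# It illustrates pairing messy, real-world prompts with an explicit
# definition of "good", then scoring responses with simple graders.

from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str            # vague, real-world phrasing
    expected_topic: str    # topic/tool the agent should route to
    must_mention: list[str]  # facts a complete answer should contain

# Hypothetical HR-agent examples (vague phrasing, mixed intents)
dataset = [
    TestCase("how many days off do i get??", "LeavePolicy", ["annual leave", "accrual"]),
    TestCase("sick tomorrow also when is payday", "SickLeave", ["notify your manager"]),
]

def quality_grader(case: TestCase, response: str) -> bool:
    """Pass only if the response covers every expected fact (a crude completeness check)."""
    return all(fact.lower() in response.lower() for fact in case.must_mention)

def capability_grader(case: TestCase, topic_used: str) -> bool:
    """Pass only if the agent routed the prompt to the expected topic/tool."""
    return topic_used == case.expected_topic
```

In practice, Copilot Studio applies its built-in graders for you; the point of the sketch is simply that each test case pairs a realistic prompt with an explicit statement of what a passing response must do.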
Step-by-Step: How to Enable
Agent Evaluation is a built-in feature of Microsoft Copilot Studio. Here is how to set it up:
- Step 1: Access the Evaluation Tab: Open your agent in Copilot Studio and navigate to the Evaluation section.
- Step 2: Create a New Evaluation: Click to start a new evaluation run and give it a descriptive name.
- Step 3: Upload Test Data: Import a dataset or manually enter a set of “Expected User Prompts.” You can also use AI-assisted generation to broaden your test coverage.
- Step 4: Configure Graders: Select from ready-to-use logic (e.g., General Quality, Capability, or Correctness). You can combine multiple graders for one run.
- Step 5: Set User Context: Select the user profile/identity under which the agent should be tested to validate permission-based data access.
- Step 6: Run & Analyze: Execute the evaluation. Once finished, view the Dashboard for aggregated pass/fail rates and the Details tab for step-by-step logs (a sketch of this aggregation follows these steps).
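As a rough illustration of what the Dashboard’s aggregated view summarizes, the sketch below computes per-grader pass rates from a batch of graded test cases. The results list, grader names, and pass_rates helper are hypothetical; Copilot Studio performs this aggregation for you inside the product.

```python
# Conceptual sketch only, not a Copilot Studio API: turning per-case
# grader results into the kind of pass/fail summary a dashboard shows.

from collections import defaultdict

# Hypothetical per-case results: which graders passed for each test case.
results = [
    {"case": "leave balance",       "grades": {"Quality": True,  "Capability": True}},
    {"case": "sick leave",          "grades": {"Quality": False, "Capability": True}},
    {"case": "contractor benefits", "grades": {"Quality": True,  "Capability": False}},
]

def pass_rates(run):
    """Return the pass rate per grader across all cases in a run."""
    totals, passes = defaultdict(int), defaultdict(int)
    for case in run:
        for grader, passed in case["grades"].items():
            totals[grader] += 1
            passes[grader] += int(passed)
    return {grader: passes[grader] / totals[grader] for grader in totals}

print(pass_rates(results))  # roughly {'Quality': 0.67, 'Capability': 0.67}
```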
Infographic: The 8-Step Confidence Loop
This visual summary represents the lifecycle of evaluating an AI agent:
| Phase | Step | Action |
| --- | --- | --- |
| Setup | 1. Scenario | Define what you are testing. |
| Setup | 2. Data | Use “messy” real-world prompts. |
| Setup | 3. Logic | Choose your Graders (Quality, Capability). |
| Setup | 4. Identity | Set the user context (Permissions). |
| Execution | 5. Run | Simulate prompts and generate responses. |
| Analysis | 6. Aggregate | Look at the “Big Picture” trends. |
| Analysis | 7. Drill-Down | Investigate individual failures. |
| Iteration | 8. Compare | Validate that updates improved the agent. |
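To make step 8 (Compare) concrete, this last sketch diffs the per-grader pass rates of two runs: a positive delta suggests the tweaks helped, a negative one signals a regression. The compare_runs helper and the baseline/latest numbers are invented for illustration; in Copilot Studio you compare runs directly in the Evaluation experience.

```python
# Conceptual sketch only: comparing two evaluation runs to confirm that
# instruction or data tweaks actually improved the agent.

def compare_runs(baseline: dict, latest: dict) -> dict:
    """Per-grader change in pass rate between a previous run and a new one."""
    return {grader: round(latest[grader] - baseline[grader], 2) for grader in baseline}

baseline = {"Quality": 0.62, "Capability": 0.80}  # hypothetical earlier run
latest   = {"Quality": 0.78, "Capability": 0.85}  # hypothetical run after tweaks

print(compare_runs(baseline, latest))  # {'Quality': 0.16, 'Capability': 0.05}
```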