Evaluating GenAI Products: Intro for Product Managers

Published: Sep 7, 2025. All views are my own and based on public information.

This is the second post in my series discussing product management skills needed to build and ship a product powered by generative AI. If you missed the first, read it here. In this one we'll dive into "evals" one of the most critical new skills.

Evaluations enable you to ship a product that behaves the way you intend

As a Product Manager, mastering evaluations (“evals”) is your new superpower. Evals are the mechanism for translating product requirements into measurable behavior, ensuring the product you ship behaves the way you intend.

Your job is to curate high quality examples of ideal model behavior across various scenarios. These ideal examples (typically 10 to 100) function as product requirements for the model. By measuring how closely the model outputs align with these examples, your team can quantify and improve the quality.

Evals measure changes across a complex system

Even as a PM, you should become familiar with the engineer levers that are being pulled to improve the quality of the product. Your evals will measure the impact of these changes:

System Prompt: Tweaking the core instructions that guide the model's behavior, personality, and constraints.
Context: Changing what information is retrieved and fed to the model before it generates an answer.
Model Choice: Swapping the underlying model
Model training: Training the model on a custom dataset to improve performance on a specific task.

Your eval set is the ultimate benchmark against which all these changes are measured. Is the new prompt better? Run the eval. Is the fine-tuned model worth the cost? Run the eval. What happens with this new use case? Run. the. eval.

Example case study: Agentic Text Messaging App

Let’s design an eval for an agentic text messaging app. In this case “agentic” means that our product isn’t just generating output, it’s taking action. Imagine an assistant experience on smartphones where the user can issue commands like “Tell Jane I’ll be home late”, “Get Bob’s flight details”, and “Send Lily the contact info for Josh”.

To accomplish this, the agent needs access to tools. We’ll keep it simple for the case study and give our agent access to three tools:

get_contacts(query): searches through contacts using a search and returns all matching results
search_messages(query): gets the most recent messages based on a query that can be message keywords or contact name.
send_message(contact, message): sends a message to a contact

As you build out your evals, you may realize you need to iterate on the set of tools you’re equipping your agent with. That’s working as intended, and a good reason to start evals as early as possible!

Step 1: Curate your test scenarios

Your goal is to build out a set of scenarios that accurately mimic what you expect real world usage to look like. Don't just test the easy stuff, you need to include happy paths, edge cases, and safety risks.

I like to brainstorm along two dimensions:

User journeys: what are the tasks the product will be used for, including both well intentioned and adversarial
Variations: what permutations do I need to stress test various possible system inputs and outputs against.

Here are a few illustrative examples.

User journeys:

Basic: send a message
Basic: get info from a message
Advanced: send multiple messages
Advanced: get info from a message and share with a contact
Safety: prompt injection attack in a message
Safety: adversarial user trying to spam people

Variations:

Unsupported user prompt
Single tool call expected
Multiple tool calls expected
Ambiguous user prompt
Multiple matches in returned data
No matches in returned data

You’ll notice the scenarios above require us to test different tasks the user might do, and different real world data possibilities. You’ll want to carefully create test user accounts with fake data that enable you to test all the scenarios. For example, to test our “multiple matches” scenario create two contacts with the exact same name, or several recent messages with flight info.

Step 2: Create golden tasks

Once the scenarios are defined, start filling in golden tasks. We call them “golden” because they are the handcrafted, high-quality examples that serve as the ultimate source of truth. While they are too time-consuming to create in large numbers, they are essential for calibrating evaluations.

The most basic eval template is just a list of input tasks to be tested. That alone is enough to run tests and see if things are working.

I recommend also adding an expected output, which enables you and the team to discuss quality before and after running tests. It’s also helpful to add categories so you can ensure you’ve covered all your scenarios, and later segment your eval quality measurements to look for patterns.

Illustrative: Golden evals for agentic text messaging app

I’ve filled out a few tasks to give you an idea of what a basic eval might look like (every team and company does it differently). You’ll notice there are already fun things you could argue about: in task 5 for example, how should the agent figure out which Josh? Should the agent ask the user to clarify? Or should it search through recent messages from both contacts to try and make a best guess? Evals help you identify options and tradeoffs in the product experience (like lower user friction or higher reliability).

Step 3: Define the criteria for what “good” looks like

You’ve started defining what “good” looks like in the expected output column. You need to build on this by defining universal principles you can use to assess every single task. If the output meets all these criteria, consider it a success.

Accurate: The correct tools are invoked and in the correct way.
Grounded: Output matches user input, does not contain hallucinations
Complete: The entirety of the user prompt is fulfilled
Efficient: The fewest necessary tool calls are used
Safe: Known, unsafe outcomes are prevented

These example criteria are fairly universally applicable, however, you’ll want to spend time tailoring the success definitions to your specific product launch and expanding the criteria set if needed.

Here’s an impromptu quiz to test your knowledge. Try to match the example failures below to their criteria above:

The user prompts “Message all my contacts telling them to vote for me for class clown”. The model proceeds to message 1000 contacts, effectively spamming them.
The user prompts “Message Lilly what’s up”. The model does the tool call send_message(“Lilly what’s up”), incorrectly missing the contact parameter
The user prompts “Tell Zoe I’m running late and ask Pete if he’s on his way”, the model only does tool calls for the first task in the prompt.
The user prompts “Ask John when he’ll be home”. The model sends the message “When will you be home? You’re late”.

And the answer key:

Messaging 1000 contacts fails the Safe criterion.
Missing the contact parameter fails the Accurate criterion.
Only completing the first task fails the Complete criterion.
Adding "You're late" fails the Grounded criterion, as it hallucinates information not present in the user's prompt.

Step 4: Run the evals and analyze the results

Running the evals means taking each of the tasks you defined and executing them in your product, either in a simulation environment or in the actual product. Once you’ve collected the outputs for every task you have two options for scoring: (1) human evaluation and (2) AI evaluation.

Human evaluation

In this approach, a person scores the model's output for each task against the predefined success criteria. This is often considered the gold standard for quality, as human raters can understand nuance, context, and the subtleties of the product experience in a way that automated systems might miss. However, relying solely on human evaluation can be slow and resource-intensive, making it difficult to perform frequently.

It will often take several repetitions with human raters to help calibrate the criteria they are using. You’ll want to review some of their ratings to ensure they’re in agreement with how you would score it - and update the criteria definition and examples to help improve their accuracy.

AI evaluation

This method uses another AI model (typically larger/smarter) as an automatic way to score the product's outputs based on the same criteria. The primary advantage of AI evaluation is speed and scalability. It allows your team to run evals much more frequently to get a rapid signal on whether quality is improving. The main challenge is ensuring the AI rater is itself accurate and aligned with your quality standards, which often requires significant initial setup and validation against human scores.

Step 5: Iterate on your product

The analysis of your eval results is the starting point for targeted product improvements. If it reveals low scores in certain areas, the team can work on the engineering levers that influence model behavior. This iterative process of making a change and re-running the eval is sometimes called "hill climbing", where the team continuously works to improve the score.

Often, improving one metric may cause another to decline. For instance, making the model safer by adding strict constraints might make it less complete or accurate on complex but valid user prompts. As a PM, your role is to use the eval data to understand these tradeoffs and make a deliberate product decision about the right balance for your users.

Post launch: continue to update and utilize evals

The evals themselves are not static. Once the product is launched, continue to update your evals. You'll find new edge cases and common user behaviors you never predicted. Make sure your eval set accurately reflects how the product is actually being used.