AI product discovery and strategy

Unit 09 of 10

Unit 9: Experimentation and iteration for probabilistic features

Learning objectives

  1. Design experiments for AI features that produce variable outputs.
  2. Set appropriate success thresholds for probabilistic features.
  3. Build iteration rhythms that account for model behavior evolution.

Video script

Reading material

The AI experimentation playbook

Phase 1: Internal quality testing. Before any user sees the feature, test the AI's output quality internally. Have team members review a sample of outputs and rate them. This catches obvious quality issues cheaply.
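The rating step in Phase 1 can be sketched as a small script. This is a minimal example, not a prescribed tool: the output IDs, ratings, and the 3.5 pass threshold are all illustrative assumptions.

```python
from statistics import mean

# Hypothetical reviewer ratings (1-5 scale) for a sample of AI outputs.
# Each output is rated by several team members.
ratings = {
    "output_001": [4, 5, 4],
    "output_002": [2, 3, 2],
    "output_003": [5, 5, 4],
}

PASS_THRESHOLD = 3.5  # assumed minimum average rating to count an output as acceptable

def quality_summary(ratings):
    """Return per-output average ratings and the overall pass rate."""
    averages = {oid: mean(rs) for oid, rs in ratings.items()}
    pass_rate = sum(avg >= PASS_THRESHOLD for avg in averages.values()) / len(averages)
    return averages, pass_rate

averages, pass_rate = quality_summary(ratings)
print(f"pass rate: {pass_rate:.0%}")
```

Even a rough script like this makes the review repeatable across model versions, so regressions show up as a drop in pass rate rather than a vague impression.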

Phase 2: Dogfooding. Use the AI feature within your own team for real work. If you wouldn't trust it for your own tasks, users won't either. This phase also generates ideas for improvement that pure testing misses.

Phase 3: Limited beta. Release to a small group of users who have agreed to provide feedback. Watch them closely. This group should be representative of your target users, not just enthusiastic early adopters.

Phase 4: Controlled rollout. A/B test with proper controls. Measure across all four metric layers. Run the test for at least two weeks, and longer if trust dynamics are important.
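For the sample-size question in Phase 4, the standard normal-approximation formula for a two-proportion test gives a rough per-arm estimate. A sketch, assuming a two-sided test with equal-sized arms; the 20%-to-25% acceptance-rate example is illustrative:

```python
import math
from statistics import NormalDist

def samples_per_arm(p_baseline, p_expected, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect a change from
    p_baseline to p_expected in a conversion-style metric
    (normal approximation, two-sided test, equal arms)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p_baseline + p_expected) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_baseline * (1 - p_baseline)
                                      + p_expected * (1 - p_expected))) ** 2
    return math.ceil(numerator / (p_expected - p_baseline) ** 2)

# Example: detecting a lift in suggestion-acceptance rate from 20% to 25%
n = samples_per_arm(0.20, 0.25)
print(f"users per arm: {n}")
```

Note how sensitive the estimate is to the expected lift: halving the detectable effect roughly quadruples the required sample, which is often what forces the "at least two weeks" minimum in practice.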

Phase 5: Full rollout with monitoring. Launch to all users with active monitoring. The experiment doesn't end at launch. Continue monitoring trust metrics and quality metrics because AI performance can shift as usage patterns evolve.
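Post-launch monitoring in Phase 5 can start as simply as a rolling quality-metric check against a baseline. A minimal sketch, assuming you log per-interaction accept/reject events; the baseline, tolerance, and window values are placeholders to be set from your own Phase 4 data:

```python
from collections import deque

class QualityMonitor:
    """Flag when the rolling acceptance rate of an AI feature
    drifts below an established baseline."""

    def __init__(self, baseline=0.80, tolerance=0.05, window=100):
        self.baseline = baseline            # acceptance rate observed at launch
        self.tolerance = tolerance          # allowed drop before alerting
        self.events = deque(maxlen=window)  # most recent accept/reject outcomes

    def record(self, accepted):
        """Record one interaction; return True if an alert should fire."""
        self.events.append(bool(accepted))
        if len(self.events) < self.events.maxlen:
            return False  # not enough data for a stable estimate yet
        rate = sum(self.events) / len(self.events)
        return rate < self.baseline - self.tolerance
```

A windowed check like this catches gradual drift as usage patterns shift, not just hard failures, which is the point of keeping the experiment running after launch.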

Iteration patterns for AI features

AI features rarely get to "done." They evolve through cycles of monitoring, learning, and improving.

Quality iteration. The model isn't performing well enough for certain use cases. The fix might be additional training data, model tuning, or switching to a different model. This is a collaboration with the ML team.

UX iteration. The model is performing well, but users aren't adopting or trusting the feature. The fix is in the product design: better explanations, easier overrides, different presentation, or better integration into the workflow. This is a collaboration with design.

Scope iteration. The feature works for some use cases but not others. The fix is to narrow or expand the feature's scope based on where it actually adds value. Sometimes a feature that tries to do everything should be narrowed to the handful of use cases where it works reliably.

Practical exercise

Exercise: Design an AI feature experiment

Choose an AI feature concept from a previous exercise. Design a full experiment plan.

  1. Define the hypothesis: "We believe [this AI feature] will [change this behavior/metric] for [this user segment] because [this reason]."
  2. Design the experiment: What are the variants? How will you segment? What's the sample size? How long will you run it?
  3. Define success criteria: What metric needs to move, and by how much, for you to consider the experiment successful?
  4. Plan for interpretation: What will you do if the results are ambiguous? What qualitative data will you collect alongside the quantitative metrics?
  5. Plan the next iteration: if the experiment succeeds, what's next? If it fails, what are the three most likely reasons and how would you test each one?

Write up your experiment plan as a one-page document.