At Cro Metrics, we have run tens of thousands of experiments over the years. That depth of data is a huge advantage. It also creates a very real problem that most growth teams eventually face: the more experiments you have, the harder it becomes to actually use what you have learned.
Patterns get buried. Context fades. Teams rely on memory, tribal knowledge, and whoever happens to still be around. When people leave, a lot of that understanding leaves with them. What remains is often locked behind primitive search tools, inconsistent documentation, or presentations that only answer very limited questions and surface few insights.
The answers are usually in the data. The problem has always been finding them.
Ask Iris started as a way to unlock that institutional knowledge.
We needed a system that could reason across our full experimentation history in a way that was accurate, fast, and secure. Simply connecting an LLM to a primitive RAG data store was not going to cut it.
We are building Ask Iris in the open. This post walks through the architectural choices and lessons that helped us get there.
Part 1: From Verbose Experiment Data to Useful Retrieval
The Problem: Too Much Detail Becomes Noise
Experiment data is naturally verbose. A single test often includes:
- A long hypothesis and background
- Detailed specs and designs
- Notes from multiple stakeholders
- Results and post-test analysis
If you take all of that and push it directly into a vector database, retrieval quality suffers. You do not get better answers. You get vague ones.
The Solution: AI-Driven Pre-Processing
Instead of syncing raw experiment records straight from MySQL into a vector store, we built a processing loop designed specifically for experimentation data.
1. AI Summarization as a First-Class Step
Each experiment is first passed through an LLM that extracts the core of the test:
- What was the hypothesis?
- What changed between control and variant?
- What happened as a result?
- How should the experiment be classified? (This becomes the metadata used for filtering.)
This produces a clean, consistent semantic summary that improves retrieval accuracy, while the original record is preserved for reference.
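As a concrete sketch, this pass can be a single structured-output call using the AI SDK's `generateObject`. The `RawExperiment` shape and schema fields below are illustrative, not our production definitions:

```typescript
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Hypothetical shape of a raw experiment record pulled from MySQL.
type RawExperiment = { id: string; body: string };

const summarySchema = z.object({
  hypothesis: z.string().describe("The core hypothesis, in one or two sentences"),
  change: z.string().describe("What differed between control and variant"),
  outcome: z.string().describe("What happened, including the headline metric"),
  metadata: z.object({
    client: z.string(),
    industry: z.string(),
    testType: z.string(),
  }),
});

// Distill one verbose experiment record into a compact semantic summary.
export async function summarizeExperiment(raw: RawExperiment) {
  const { object } = await generateObject({
    model: openai("gpt-4.1"),
    schema: summarySchema,
    prompt: `Summarize this A/B test record:\n\n${raw.body}`,
  });
  return object;
}
```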
2. One Experiment, One Chunk
Many RAG systems split documents into arbitrary chunks. Because our data is already summarized, we can enforce a simple rule: one experiment equals one chunk.
This improves precision and avoids answers that incorrectly blend multiple tests together.
3. Why We Chose Qdrant
We use Qdrant as our vector database, and it has been a great fit for this problem.
Qdrant offers a set of features that were particularly important for our use case:
- 100% API-driven platform
- Structured payloads stored alongside each vector, not just raw documents
- Precise metadata filtering by fields like date, client, industry, or test type
That combination lets Ask Iris retrieve extremely accurate answers, not just something loosely related.
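Put together, indexing and retrieval look roughly like the sketch below: one point per experiment, with the structured fields as a filterable payload. The collection name, payload fields, and embedding variables are placeholders:

```typescript
import { QdrantClient } from "@qdrant/js-client-rest";

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });

declare const summaryEmbedding: number[]; // embedding of the AI-generated summary
declare const queryEmbedding: number[];   // embedding of the user's question

// One experiment becomes exactly one point: the summary's embedding as the
// vector, plus structured metadata as a filterable payload.
await qdrant.upsert("experiments", {
  points: [
    {
      id: 1042, // Qdrant point IDs are integers or UUIDs
      vector: summaryEmbedding,
      payload: {
        client: "acme",
        industry: "ecommerce",
        testType: "checkout",
        launchedAt: "2024-09-12",
      },
    },
  ],
});

// Retrieval combines semantic similarity with exact metadata filters.
const hits = await qdrant.search("experiments", {
  vector: queryEmbedding,
  filter: {
    must: [{ key: "industry", match: { value: "ecommerce" } }],
  },
  limit: 10,
});
```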
Part 2: Designing the Brain of Ask Iris
We needed a stack that let us move quickly while still giving us deep visibility into how the system behaved.
Frontend and Orchestration
We did not want to spend time building a chat UI from scratch, so we used assistant-ui. It is an open-source, enterprise-grade tool that covers the table-stakes UX, allowing us to focus our time on the intelligence and tooling layers.
For orchestration, we leaned on Vercel’s AI SDK. It made streaming responses, tool calling, and model coordination in React/TypeScript much easier than rolling our own solutions.
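A stripped-down route handler shows the shape of that layer. The tool name and `findSimilarExperiments` are placeholders, and option names vary slightly across AI SDK versions:

```typescript
import { streamText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

declare function findSimilarExperiments(query: string): Promise<unknown>;

// The AI SDK handles streaming and tool dispatch; assistant-ui renders
// the resulting stream on the client.
export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-4.1"),
    messages,
    tools: {
      searchExperiments: tool({
        description: "Semantic search over summarized experiments",
        parameters: z.object({ query: z.string() }),
        execute: async ({ query }) => findSimilarExperiments(query),
      }),
    },
  });

  return result.toDataStreamResponse();
}
```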
Letting Subject Matter Experts Tune the System
Any product manager who has built AI-enabled tools knows that prompt tuning can be one of the hardest parts of a project.
In past projects, we learned that engineers are not always the right people to tune product-integrated prompts. Subject matter experts are much better positioned to do this work.
LangSmith changed how we worked:
- SMEs could manage, version, and test prompts safely in staging
- We could compare different prompts, models, and parameters side by side
- We gained full visibility into reasoning steps, tool calls, and intermediate outputs
From a product perspective, this was critical. If you cannot iterate on prompts quickly or inspect how an AI system behaves, progress slows and blind spots multiply. LangSmith made that level of visibility and iteration possible.
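For anyone wiring up something similar, LangSmith’s documented Vercel AI SDK integration is roughly a two-line addition: an OpenTelemetry exporter plus a telemetry setting on each call. A minimal sketch:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { AISDKExporter } from "langsmith/vercel";
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// Ship every AI SDK span (prompts, tool calls, intermediate outputs)
// to LangSmith via OpenTelemetry.
const sdk = new NodeSDK({ traceExporter: new AISDKExporter() });
sdk.start();

const result = streamText({
  model: openai("gpt-4.1"),
  prompt: "Which checkout tests moved conversion for retail clients?",
  experimental_telemetry: AISDKExporter.getSettings(),
});
```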
Choosing the Right Model
We tested several high-reasoning models. While impressive, many took 30 seconds or more to respond. That does not work for a conversational product.
We landed on GPT-4.1 because it offers a strong balance of intelligence, speed, and predictability. It also handles tool use particularly well, even when those tools have a large number of parameters and configuration options. This is an area where GPT-4o consistently struggled for our use case.
The main limitation we have encountered so far with GPT-4.1 is with long, multi-step workflows, where it can start to lose the thread. We addressed this by delegating narrower tasks to specialized sub-agents, which keeps the primary interaction fast and responsive.
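In sketch form, a sub-agent is just a tool whose implementation is its own model call. The report-drafting example and `loadExperimentDetails` below are hypothetical:

```typescript
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

declare function loadExperimentDetails(id: string): Promise<string>;

// The primary agent stays fast by handing a narrow, multi-step job to a
// focused sub-agent exposed as a single tool.
const draftTestReport = tool({
  description: "Draft an end-of-test report for one experiment",
  parameters: z.object({ experimentId: z.string() }),
  execute: async ({ experimentId }) => {
    const { text } = await generateText({
      model: openai("gpt-4.1"),
      system: "You write concise end-of-test reports for strategists.",
      prompt: await loadExperimentDetails(experimentId),
    });
    return text;
  },
});
```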
Speed matters to users. It directly affects perceived value, engagement, and whether a tool becomes part of someone’s daily workflow. That reality played a major role in why we landed here.
What Ask Iris Can Actually Do
An agent is only as useful as the tools it can call. Ask Iris is not just answering questions. It is doing work.
Semantic and Deterministic Search
- Semantic search to find similar experiments, themes, or ideas
- Deterministic search to count tests, filter by attributes, or retrieve exact records
This hybrid approach avoids answers that sound plausible but are not precise.
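The deterministic side skips embeddings entirely. A question like “how many checkout tests have we run for retail clients?” can resolve to an exact count against the metadata payloads. Field names here are illustrative:

```typescript
import { QdrantClient } from "@qdrant/js-client-rest";

const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });

// No vector involved: an exact, filter-based count over the payloads.
const { count } = await qdrant.count("experiments", {
  filter: {
    must: [
      { key: "industry", match: { value: "retail" } },
      { key: "testType", match: { value: "checkout" } },
    ],
  },
  exact: true,
});
```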
Agentic Workflows
Test Analysis Wizard
Ask Iris can review test details, designs, and raw results, then draft an end-of-test report that strategists can refine.
ImpactLens Prioritization
Ask Iris connects directly to ImpactLens, our prioritization system powered by predictive modeling, to score new ideas and support smarter roadmap decisions.
Visual Context
AI performs better with context, so we gave Iris the ability to see.
Through a screenshotting API built on headless browser tooling, Iris can take any URL a user passes in and:
- See the current webpage experience
- Suggest test ideas and analyze test results
- Identify UX or conversion opportunities
This has had an outsized impact on AI-powered A/B test ideation.
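We won’t detail the exact screenshotting stack here, but the pattern is simple. Below is a minimal version with Playwright standing in for the headless browser layer:

```typescript
import { chromium } from "playwright";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Capture the live page, then pass the pixels to the model as an image part.
async function critiquePage(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const screenshot = await page.screenshot({ fullPage: true });
  await browser.close();

  const { text } = await generateText({
    model: openai("gpt-4.1"),
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Suggest A/B test ideas for this page." },
          { type: "image", image: screenshot },
        ],
      },
    ],
  });
  return text;
}
```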
Security by Design
Security was a design constraint from day one.
Ask Iris uses JWT-based authentication across the entire stack, enforced at the infrastructure level. Access control happens before a prompt ever reaches a model.
That means:
- No data leakage
- No prompt injection shortcuts
- No reliance on polite model behavior for security
If a user should not see something, the system simply can’t retrieve it. Plain, simple, and secure.
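In simplified form, the pattern looks like this sketch: verify the JWT, then derive the retrieval filter from its claims, so scoping happens in infrastructure rather than in the prompt. The `clientIds` claim is hypothetical:

```typescript
import { jwtVerify } from "jose";

// Runs before any model call: an invalid token never reaches retrieval.
async function scopedFilter(token: string) {
  const { payload } = await jwtVerify(
    token,
    new TextEncoder().encode(process.env.JWT_SECRET!)
  );
  const clientIds = payload.clientIds as string[]; // hypothetical claim

  // Every Qdrant query gets this filter attached, so out-of-scope
  // experiments are unreachable regardless of what the prompt says.
  return { must: [{ key: "client", match: { any: clientIds } }] };
}
```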
What’s Next
Ask Iris is evolving from a data assistant into a true agentic partner for experimentation, supporting:
- Research
- Ideation
- Roadmap planning
- Spec writing
- Post-test analysis
The goal is straightforward: reduce busywork so growth teams can focus on higher-leverage thinking and make better decisions.
Ask Iris is only three months old, but we are excited about where it is heading.
I hope this peek under the hood helps others who are building RAG or agentic chat systems. Please reach out to me if you have questions.