A Case Study Prepared for Stable 4/13/26

Rebuilding
experimentation
at Apollo.


How I replaced a six-year-old in-house A/B tool and taught the org to run hypothesis-backed experiments with real rigor for the first time.


By Peter Kincaid · Interview Deck Vol. I · No. 01
01
How it started

A question in a
revenue war room.

Execs were pulled into a room over a multi-million-dollar miss on the month. My manager brought me in for exposure. A data analyst was walking through an A/B test on one of our pricing upsells. I asked him if we knew whether users were seeing the treatment. He said, "I'm not sure."

Do we know if users are seeing the treatment?

I cross-referenced Amplitude assignment data against our raw HTTP logs. A team had tried to run a nested split test and ended up hiding our credit top-up modal from the cohort entirely. Top-ups were one of our biggest revenue drivers, a three-click, in-flow upsell. Dark for weeks, and nobody knew. Our system had no exposure tracking. If it had, we would have caught the drop immediately.
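For illustration only, a minimal sketch of that cross-reference, assuming two hypothetical exports: Amplitude's assignment records and a user-level pull from the raw HTTP logs. The file and column names are made up; the point is the set difference between "assigned" and "exposed."

```python
# Hypothetical sketch: find users assigned to the treatment who never
# actually saw it. File names and columns are illustrative, not our schema.
import csv

def load_user_ids(path, column):
    with open(path, newline="") as f:
        return {row[column] for row in csv.DictReader(f)}

# Users Amplitude recorded as assigned to the treatment variant.
assigned = load_user_ids("amplitude_assignments.csv", "user_id")

# Users whose HTTP logs include a request that renders the top-up modal.
exposed = load_user_ids("topup_modal_requests.csv", "user_id")

never_exposed = assigned - exposed
print(f"{len(never_exposed)} of {len(assigned)} assigned users never saw the treatment")
```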

My manager to me: "You should find us an A/B testing tool ASAP."
02
The state of play

An in-house A/B tool paired with an organization that doesn't practice experimentation.

  • No dependent split tests, no cohort experiments, no statistical rigor.
  • People used it interchangeably with feature flags. Nobody on the team agreed on what it was for.
  • Hypotheses were vibes. Decisions were vibes.
  • No exposure tracking. We ran analysis on users assigned to a variant, not users who ever saw it.
  • Losing experiments shipped anyway, because "we didn't want to throw away hard work."
  • We'd push major changes to 100% of users without any test at all.

One month, an untested change to the onboarding hub quietly cost us millions.

03
The bet

Why Amplitude,
on purpose.

Procurement

I'd been asking for a real experimentation tool since my first month at Apollo. Not a priority until the war room. The go-ahead came in a Slack message the same afternoon.

I shortlisted Amplitude and Statsig. Our head of data had heard "Amplitude wasn't great" from friends, but couldn't name a reason. I ran a technical comparison and showed that Statsig would add roughly two extra months of integration before our first experiment, and force a retrain of a product org that already lived in Amplitude every day.

We timed the deal to a favorable Amplitude renewal and saved about $60K on the contract.

Build

One month from green light to first experiment on the new system. I owned frontend, backend, docs, and training, doing about eighty percent of the work myself.

I pulled in a staff backend engineer for the hardest piece: a team-scoped 24-hour Redis cache layer. Our CTO had a hard mandate that app load stay under 200ms, and the experimentation SDK couldn't put that at risk.
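A minimal sketch of that caching idea, not the production code: it assumes a hypothetical fetch_assignments() call into the experimentation SDK as the slow path, with results cached per team in Redis for 24 hours so page loads stay inside the latency budget.

```python
# Illustrative sketch only. fetch_assignments() stands in for the real
# SDK call; key names and the 24-hour TTL mirror the design described above.
import json
import redis

r = redis.Redis()
TTL_SECONDS = 24 * 60 * 60  # 24-hour, team-scoped cache

def fetch_assignments(team_id: str) -> dict:
    # Placeholder for the slow path: the experimentation SDK / network call.
    return {"pricing_upsell_test": "treatment"}

def get_assignments(team_id: str) -> dict:
    key = f"experiment:assignments:{team_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no SDK round trip on page load
    assignments = fetch_assignments(team_id)
    r.set(key, json.dumps(assignments), ex=TTL_SECONDS)
    return assignments
```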

Shortlist: Amplitude, Statsig
Time to first experiment: ~30 days
Contract savings: ~$60K
04
Change management

The tool was the
easy part.

I wrote docs, built Slack bots, ran all-hands talks, and published playbooks. Teams still ran experiments wrong. They still shipped losing variants because they didn't want to throw away two weeks of work.

So I asked to embed a week at a time inside each product team. Teach the fundamentals, run one real experiment end-to-end together, stay through the readout. For a full quarter, half of my time went to teaching.

The hardest lesson I had to teach: it is okay to delete two weeks of your own code when the experiment loses. Better than letting it fester in the codebase and drag on users.

Case in point

The redesign that tanked.

One team wanted to ship a full feature redesign because "the new design is more intuitive" and a couple of high-ticket clients had asked for it. I was embedded with them the week before launch and asked how they planned to measure whether it worked. They hadn't planned to.

We ran it as an experiment. Usage tanked. The team fought the kill decision, until a VP who backed experimentation walked them through the same reasoning I had. We went back to the high-ticket accounts, pulled the useful feedback from the test, and built a middle ground instead.

05
What shifted

Experimentation became
how decisions got made.

Knowing every feature had to survive at least a week-long experiment before shipping, teams got more deliberate about what they built. A failed test meant deleting the code or going back, learning from the result, and rerunning. Either way, the stakes were real. So people stopped pitching features they couldn't defend.

My mentee's first ticket: users couldn't add a contact to a sequence from the prospecting page. She wrote a 30-line fix. Adoption jumped. We only saw the impact because we were finally instrumenting the right thing.

The quality of what shipped went up because the bar for starting work went up.

06
In hindsight

What I'd do
differently.

I'd fight for a top-down mandate on day one.

Leadership wanted voluntary adoption so we wouldn't impede team roadmaps. That consensus path kept the legacy tool on life support for an extra six months, and let old habits persist longer than they needed to. The software was the lever I pulled well. The organizational mandate was the lever I under-pulled.