Choosing an Experimentation Platform: A Retrospective

Immediate Engineering Isn’t Sufficient: How 4 Bricks of Context Engineering Cease RAG Hallucinations

Robotically Assign a Class to Uncategorized Rows in Energy Question and DAX

, in each firm that wishes to ship merchandise individuals love, when “we must always experiment extra” turns into “we can not hold experimenting like this.” Hand-tuned holdouts; traffic-allocation tickets bouncing between PMs and engineers; analyst calendars booked weeks out. The want to be data-driven kind of outgrows the equipment that was purported to make it so.

That was the place we sat at ManyChat final 12 months. We selected Eppo, however that call is the smallest a part of the story, and the half you’ll be able to least transplant to your organization. What I wish to share as an alternative is the method I walked by way of to get there, what I acquired mistaken alongside the best way, and what stunned me on the opposite facet of the contract (yep, docs hate me for this trick).

A notice on timing. We picked Eppo at an unusually thrilling second within the trade, as the seller map was shifting beneath us mid-evaluation. Eppo itself had been acquired by Datadog some months earlier than. Statsig had just lately been acquired by OpenAI, and would later be bought on to Amplitude. I don’t assume any of what I describe under is determined by that specific information cycle, however I wish to acknowledge that a few of it formed our temper whereas we have been deciding.

I break what follows into three acts: earlier than the choice, throughout it (making the choice), and after.

Earlier than

Let me get you within the temper we have been in at the beginning occurred. As I onboarded to the corporate, an engineer instructed me that if there have been two simultaneous alternatives to run experiments, his staff would merely postpone the second concept to a later dash as a result of the technical headache of configuring the 2 allocations. The danger of getting it mistaken ultimately outweighed the thrill to check. That is fairly actually: anti-velocity at finest; no experiment at worst. And for that one experiment that might be configured, copy-pasting boilerplate allocation logic was their bread and butter.

An analyst on the opposite facet of that very same pipe described herself as a “human microservice”; she meant the holdout teams, outlined by hand, refreshed by hand, handed on to the engineer, and so forth … an thrilling alternative to expertise your complete circulate in first-person POV, certainly. However, irony apart, that was the second the case for a platform stopped being summary.

I had seen variations of this room earlier than. At Marktplaats, some years earlier, I had written the in-house Python libraries that attempt to take in this type of ache, and we noticed time-to-insight go down from days to hours, within the tail instances.

I watched the identical build-or-buy debate play out once more at Adevinta, globally, at a bigger scale, the place it landed on constructing relatively than shopping for. Fortunate for us at Manychat, by the top of 2025 the platform choices had matured sufficient that, for a corporation our measurement and at that second, shopping for was the plain transfer.

We wished the instrument that might give us one of the best shot at getting our experimentation program the place we would like it: cutting-edge statistics, sure, however extra importantly a instrument that nudges its customers towards conclusive experiments by default; product managers included.

Two issues stood between us and the selection. The primary was easy: we had named the ache, nevertheless it was solely anecdotal thus far. Management had a (excellent) notion of what was damaged, and I had heard devs and product managers grumble concerning the present stack once I first met them. However none of that was the identical sort of object as a vendor necessities record. Till we might put the 2 facet by facet, we couldn’t inform which capabilities have been nice-to-haves and which have been the purpose.

The second was more durable. The choice carried numerous weight as regardless of how you set it, there’s all the time a lock-in aspect to any platform; culturally, if not technically. And sources are finite: we couldn’t POC each platform available on the market. Not to mention the chance price of getting to reverse the choice and begin over once more. Selecting one to guess on, in a single sitting, with no probability to course-correct, would have been asking to be mistaken. And with the choices being so related in most methods, discovering one of the best one for us was a matter of precision. We wanted a solution to break a single high-stakes choice into smaller, lower-stakes ones that constructed on one another.

Interviews, and de-risking the choice

I began with interviews. PMs, product analysts, engineers, entrepreneurs. The purpose was to transform anecdote into one thing we might maintain up towards a vendor’s characteristic record. The engineer’s calendar story, the analyst’s “human microservice”, the PM who had given up on operating atomic experiments and was bundling modifications into larger releases as an alternative, suspending a few of them completely: these turned the job description for the instrument. I can not overstate how a lot this paid me again later. Each time the method drifted, and it drifted, the interviews have been the anchor we got here again to. They have been additionally what made the entire effort credible contained in the group: telling my CPO why we have been spinning up a POC was a distinct dialog once I might quote a particular friction again to her.

For the single-shot downside, we phased the invention into three layers, every specializing in the following stage of depth within the analysis:

Desk analysis. Learn the seller docs, sketch an extended record. Most platforms self-eliminated right here, earlier than we ever opened a gross sales funnel. Loads of Claude Code at this step, too.
Demos. A targeted dialog with every shortlisted vendor. A bit of gross sales pitch, positive, however principally us probing the areas we had determined mattered most.
POC. Arms on the platforms, with actual information and actual evaluators, just for the 2 finalists.

Every layer narrowed the sector and purchased us info at a “value” we might afford. By the point we reached the POC we have been down to 2, and the choice in entrance of us had shrunk to one thing we might truly maintain. Statsig, or Eppo?

There may be one a part of this I’d repeat on day one among any future platform choice, in any class: the interviews outline these ache factors. They have been the only largest unlock of the entire stage. Working shut behind them, sponsorship. And I don’t imply simply from my director, who requested to push it ahead. I saved friends and stakeholders who must again / undertake the choice within the loop the entire approach by way of. By the point the POC ended, the choice stunned nobody.

On the finish of “earlier than” we had a shortlist of two, and the self-discipline of how we had narrowed to them. We knew what labored for us. The more durable query was nonetheless ready: between two platforms that each cleared our bar, which was truly higher for us? How would we outline “higher” conceptually, and the way would we agree on it virtually?

Throughout

It was the debrief, after the POC, and the analysts on the panel have been taking turns speaking. Two of them, who knew our stack finest, completed their abstract with a sentence just like:

“As a product analyst, I’d be actually blissful to maneuver ahead with both of them.”

I sat with that for a second. The consolidated scores agreed with them: the 2 platforms got here in at 4.36 and 4.47 on a five-point scale, throughout greater than twenty weighted standards. By any cheap learn, it was a tie. I had spent weeks constructing a course of that might level clearly at one platform, and the method had simply instructed me, within the voice of the friends I trusted most to identify a significant distinction, that there was no significant distinction from his seat.

What I discovered in that second, and wouldn’t have discovered with out the panel, is that analyst-grade rigor has turn out to be desk stakes. The marginal worth of selecting one fashionable experimentation platform over one other doesn’t accrue to your scorecard; it accrues some place else. The place, precisely, was the query I now needed to reply.

So I wanted a choice I might defend; to myself first, then to my information director and CPO, then to the groups who would inherit it. Coin flips and private preferences are dangerous foundations for a multi-year contract. And the tie meant the tiebreaker couldn’t be invented after the very fact; it needed to replicate what we truly wished from the following few years of experimentation at ManyChat.

Particularly, we weren’t selecting between two snapshots; we have been selecting between two trajectories. Eppo’s guess was on guided, opinionated, PM-shaped *cough * proof *cough * workflows; Statsig’s was on power-user flexibility. Each have been defensible for positive. However we had stated, recall:

We wished the instrument that might give us one of the best shot at getting our experimentation program the place we would like it: cutting-edge statistics, sure, however extra importantly a instrument that nudges its customers towards conclusive experiments by default (…)

I observed what didn’t occur. The POC plan known as for PMs to trial each platforms and feed scores again into the matrix. They principally didn’t due to bandwidth. One head of selling operations and one PM gave me unprompted impressions, and the remainder of the PM-side proof and enter stayed skinny. The absence of PM suggestions did one thing counterintuitive: it elevated the load I gave to PM-facing UX / workflows, and governance, within the closing name. The logic is uneven. Analysts are adaptable, power-users if you’ll; they are going to work their approach by way of no matter interface you hand them. PM onboarding shouldn’t be adaptable in the identical approach. If the platform our analysts rated equally can also be the one which lowers the barrier for our PMs, that may be a choice; the reverse, choosing the analyst-equivalent platform our PMs would have struggled with, would have been quiet self-sabotage.

In brief, we might lastly say: all the pieces else near-equal, the usability for non-technical of us is what units the 2 platforms aside.

So we picked Eppo. The trajectory query is what tipped it: on an extended horizon, Eppo lined up higher with the place we wished experimentation to dwell; nearer to experimenting groups, and past simply the analyst. Information administration as a first-class object. Reporting that doesn’t want a deck rebuilt round it. Statsig had its benefits too; CUPED (a variance-reduction approach) inside its energy calculator, a standalone metrics explorer, a extra versatile evaluation floor; and we accepted these as 12 months 1 gaps to work round, whereas Eppo was being revambed inside Datadog, and buying these options too.

Wanting again, the lesson I take away from it’s double-edged. The choice wanted extra rigour than intuition wished, after which much less religion in that rigour than I anticipated. The scorecard mattered as a result of it pressured everybody to be particular, and to create a way of belief and credibility within the end result. It gave me 360-degree protection, however the name got here from the moments inside it: the analyst tie, and the imaginative and prescient query. Six months after signing, a curious colleague would ask me how we had picked, and I might stroll them by way of the panel, the scorecard, the corrections, and the imaginative and prescient/framing query. That’s a win for me.

After

I feel I anticipated, someplace I’d not admit aloud, that signing the contract was the end line. I had spent weeks constructing a reputable choice system, a course of, and had spent a few hours of vendor calls. The week we signed I had a quiet day. I sat down at my desk and began a working doc about what would occur subsequent. Legend has it that I’m nonetheless writing it.

The clean-water metaphor I had used within the proposal saved coming again to me. We had laid the pipes; that was the SDK integration, the information plumbing, the warehouse connections. The platform itself too, if you’ll. Pipes get you circulate, however not clear water. Within the worst case, pipes contaminate it as an alternative (extra crap output, quicker). Clear water is what comes out of pipes when the remainder of the system (the supply, the therapy, the individuals who preserve it) does its job. Experiments work the identical approach: a platform will get you the circulate, however the reliable outcomes come from governance and course of, from individuals, and from how significantly the group treats the distinction between testing an concept and launching a characteristic.

The instrument is prepared; the group shouldn’t be but prepared for the instrument.

Until that time I used to be deep in the price of the contract, however not the price of bridging the hole between the instrument is current now and the group is able to use it.

I had instructed colleagues, within the weeks main as much as signing, {that a} chunk of the analytics staff’s capability would slowly ramp as much as a brand new equilibrium as soon as Eppo was dwell. As of writing, I’m nonetheless hopeful that may materialise 1 / 4 or two from now; however not earlier than we get some issues in place first. Velocity, the mere act of experimenting extra in a given interval, additionally has to attend.

Signing didn’t purchase time again but, nor did it convey us extra experiments instantly. The work that began the day after signing, forming a cross-functional integration group, drafting the experiment lifecycle, configuring Eppo protocols (a part of its governance framework), certifying our first success metrics and guardrails, migrating a data base, designing a coaching curriculum, all needed to occur earlier than the platform might ship the rate potential we knew it had. En breve, what was forward was not a instrument downside. Somewhat, a governance, course of, and folks one.

Three legs of a stool

For experiments to truly be reliable at Manychat, three issues need to be current on the similar time: the tooling, and engineering integration so experiments can circulate by way of the platform, course of and governance so the experiments that circulate by way of are correctly designed and determined, and individuals and abilities so one of the best practises are adopted in observe and never solely on paper. Drop any one of many three and the entire thing leans.

We had the instrument and the connections now. Course of and governance was totally on the information science staff: a five-stage experiment lifecycle (Suggest, Design, Run, Analyse, Resolve); an authorized set of success and guardrail metrics; all of it encoded into the platform’s personal protocol templates in order that the rails weren’t a Notion web page however a characteristic of the instrument. Individuals and abilities are to be materialised in advert hoc Eppo-delivered instrument quick-starts, and an Experimentation 101 and 102 curriculum in the long run. An ongoing argument for a graduated autonomy mannequin, PMs paired with analysts at first, extra independence over time; that’s the dot on the horizon.

The opposite factor

A milder lesson: signing Eppo was the place my job description modified. I had walked into the venture because the Workers chargeable for choosing a instrument. I walked out doing change administration; onboarding groups, educating, leaning on PMs about lifecycle compliance, spending credibility I had banked for different issues. It was completely price it for me, although.

Closing notes

If I needed to compress all of this, these can be the few traces I’d match it in:

A reputable choice is the deliverable, not the platform. The platform is an artifact. The choice is what your group will dwell inside for years.

In the identical spirit, pipes will not be water. A instrument is critical infrastructure for reliable experimentation, however not enough. The work begins, not ends, on the day the contract is signed.

I’m writing all of this understanding the experimentation instruments market is in movement; the seller churn I flagged up prime has not stopped. Regardless of the map seems like by the point you learn this, the bits of course of that survived for me are in all probability the bits price borrowing: the interviews, the phased discovery, the imaginative and prescient framing, and the sincere budgeting for what comes after.

If you wish to dive into the small print over a web based cup of espresso, be happy to ping me on LinkedIn! I’d be blissful to share concepts with you.

Additionally take a look at my private web page for extra piece like this.