Most A/B testing tutorials teach you to fetch a flag and switch a view:
if featureFlags.isEnabled("new_checkout") {
    showNewCheckout()
} else {
    showOldCheckout()
}
That is the trivial 5%. It is also the only part most write-ups cover. The other 95% — the part that decides whether the number you ship is real or noise — is the entire reason A/B testing is hard. This post is the 95%.
The frame to hold the whole way through: a result is only worth acting on if it has three properties.
- Consistency — the same user sees the same thing, deterministically, offline, across sessions and platforms.
- Attribution — you can divide the people who succeeded by the people who were actually shown the thing, and not by some other denominator.
- Inference — the difference you observed is distinguishable from chance, under rules you fixed before you looked.
Break any one and the experiment is decorative. Everything below is in service of those three.
Why mobile makes this sharper
Feature flags are not an iOS invention. They predate mobile by decades — conditional code paths toggled by configuration instead of a deploy. Universally applicable.
But iOS makes them load-bearing in a way the web does not, for one reason: App Store review lag. You cannot hotfix. You cannot gradually roll out by pushing new code. Submission-to-availability is days, not minutes, and it is a dependency you do not control. So the flag becomes your only mechanism for changing behavior on a binary that already shipped — your kill switch, your rollout dial, your rollback. On the web a bad experiment is a redeploy away from dead. On iOS, a server-controlled flag is the only thing standing between you and a week-long bad release.
Two more constraints fall out of the platform:
- Offline. The variant decision cannot require a network round-trip at the call site. The user opens the app on a train; the experiment still has to resolve. This forces deterministic local evaluation, which turns out to be the cleanest design anyway.
- Session stability. A variant must not flip mid-session. If the UI changes under the user between two screens, you have contaminated their behavior and your data.
Flag vs. test: the flag is the primitive
A feature flag is an on/off switch. No split, no measurement, no statistics. An A/B test is a feature flag with three things bolted on:
feature flag = show this to these users, on/off
A/B test     = feature flag
             + deterministic bucketing (who gets which variant)
             + exposure logging        (who actually saw it)
             + statistics              (did it matter)
So the test is the flag plus who, when, and did it matter. Keep that decomposition; it tells you exactly which subsystem owns which failure.
The pipeline
Three of these stages are where experiments actually die — assignment, exposure logging, and analysis. The rest is plumbing.
Assignment — get the three properties or get nothing
Assignment decides which variant a user is in. It has to be stable, consistent, and sticky, and those words mean specific things.
Stable — the same user always resolves to the same variant. The mechanism is deterministic hashing:
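// hash must be a stable function (FNV-1a, murmur, ...), not Swift's Hasher, which is seeded per process
let bucket = hash(userID + experimentID) % 100 // 0...99
let variant = bucket < splitThreshold ? .control : .treatment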
Same input, same bucket, every time, on every device, with no server call. That is what makes it stable and offline-resolvable in one stroke.
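A minimal sketch of that bucketing, under assumptions of mine: FNV-1a stands in for whatever stable hash you pick, and the ":" separator and the function names are illustrative, not taken from any particular SDK.

```swift
/// FNV-1a (64-bit): a small, stable hash. Unlike Swift's `Hasher`, which is
/// seeded per process, it returns the same value on every launch and device.
func fnv1a(_ text: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325      // FNV offset basis
    for byte in text.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3           // FNV prime, wrapping multiply
    }
    return hash
}

enum Variant { case control, treatment }

/// Maps (user, experiment) to a bucket in 0...99.
/// The ":" separator (an assumption) keeps ("ab", "c") and ("a", "bc") from colliding.
func bucket(userID: String, experimentID: String) -> Int {
    Int(fnv1a(userID + ":" + experimentID) % 100)
}

/// Same orientation as the snippet above: buckets below the threshold are control.
func variant(userID: String, experimentID: String, splitThreshold: Int) -> Variant {
    bucket(userID: userID, experimentID: experimentID) < splitThreshold ? .control : .treatment
}

// Resolving twice, with no network and no stored state, gives the same bucket.
let first = bucket(userID: "user-42", experimentID: "new_checkout")
let second = bucket(userID: "user-42", experimentID: "new_checkout")
```

The point of writing the hash out by hand is that nothing in it depends on process state, so the web client can implement the same few lines and land on the same bucket.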
Consistency — hash the pair, not the user. Note the input is userID + experimentID, not userID alone. If you hash on userID only, every experiment produces the identical partition: the same users land in treatment in every test, so one experiment's effect leaks into every other experiment's comparison. Concatenating the experiment ID decorrelates experiments from each other, so each one gets its own independent, stable partition of users. Assign once, stick forever, per experiment.
Consistent across platforms — a logged-in user on iOS and web must land in the same variant. Two ways to guarantee it: compute assignment on the server and have the client fetch it, or agree on an identical hash function on both clients. The mature systems (LaunchDarkly, Statsig) deliver the ruleset from the server and evaluate it on the client — server-authored rules, client-side deterministic evaluation. You get control and offline resolution at once.
Sticky — the split can only add users, never move them. Mid-experiment you will ramp: 1% → 10% → 50%. The invariant is that already-assigned users never change variant when the split widens. New traffic flows into the experiment; existing assignments are frozen. The mechanism that enforces this is simple — check for a saved assignment before you hash:
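A minimal sketch of that check, assuming a hypothetical `AssignmentStore` seam (on device it could wrap UserDefaults or the Keychain) and the same "below the threshold is control" orientation as the bucketing snippet earlier. The names are illustrative.

```swift
enum Variant { case control, treatment }

/// Hypothetical persistence seam; the in-memory version below is for illustration only.
protocol AssignmentStore {
    func savedVariant(for experimentID: String) -> Variant?
    func save(_ variant: Variant, for experimentID: String)
}

final class InMemoryStore: AssignmentStore {
    private var saved: [String: Variant] = [:]
    func savedVariant(for experimentID: String) -> Variant? { saved[experimentID] }
    func save(_ variant: Variant, for experimentID: String) { saved[experimentID] = variant }
}

/// `bucket` is the user's stable 0...99 bucket for this experiment.
func resolveVariant(experimentID: String, bucket: Int, splitThreshold: Int,
                    store: AssignmentStore) -> Variant {
    // 1. A saved assignment always wins, whatever the current split says.
    if let saved = store.savedVariant(for: experimentID) {
        return saved
    }
    // 2. Otherwise derive it from the bucket and freeze it.
    let fresh: Variant = bucket < splitThreshold ? .control : .treatment
    store.save(fresh, for: experimentID)
    return fresh
}

// A user in bucket 60: control while treatment is 10% (threshold 90)...
let store = InMemoryStore()
let beforeRamp = resolveVariant(experimentID: "new_checkout", bucket: 60,
                                splitThreshold: 90, store: store)
// ...and still control after the ramp to 50%, because the saved value is read first.
let afterRamp = resolveVariant(experimentID: "new_checkout", bucket: 60,
                               splitThreshold: 50, store: store)
// The same bucket with nothing saved would resolve to treatment at threshold 50.
let withoutSave = resolveVariant(experimentID: "new_checkout", bucket: 60,
                                 splitThreshold: 50, store: InMemoryStore())
```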
The saved-assignment check is the whole ballgame for stickiness. If you re-hash on every launch, you are fine as long as the split never changes — but the moment it ramps, an unsaved user near the boundary can cross it. The persisted assignment is what makes ramping safe.
Two edge cases that separate real systems from toy ones
Reinstall. Local storage is wiped, so the saved assignment is gone. Does the user flip variants? No: if the hash is deterministic on a stable ID, re-hashing reproduces the same bucket and therefore, as long as the split has not moved, the same variant. Determinism is what makes reinstall safe. The catch is the ID: UserDefaults does not survive reinstall; Keychain items generally do. For pre-login / anonymous experiments you need a device ID persisted in the Keychain, or every reinstall is a new user and your denominators rot.
Non-random user IDs. If your user IDs are sequential or structured, taking id % 100 directly buckets them non-uniformly: all the even IDs, or all the IDs ending in the same region code, cluster into the same arms. The fix is the hash function itself: run the ID through MD5 / murmur first so the output is uniformly distributed, then take the modulo. You are not bucketing the ID; you are bucketing its hash. This is why the mechanism is "hash, then modulo" and not just "modulo."
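To make the skew concrete, here is a small sketch with made-up structured IDs (every ID a multiple of 100, as if the last two digits were a reserved code). FNV-1a is my stand-in for the MD5 / murmur step; any stable, well-mixed hash serves the same purpose.

```swift
/// FNV-1a (64-bit), used here only as an example of a stable, well-mixed hash.
func fnv1a(_ text: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in text.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3
    }
    return hash
}

// Hypothetical structured IDs: 100, 200, ..., 100000.
let userIDs = (1...1000).map { $0 * 100 }

// Bare modulo: every one of these IDs lands in bucket 0.
let rawBuckets = Set(userIDs.map { $0 % 100 })

// Hash first, then modulo: the same IDs spread across the bucket range.
let hashedBuckets = Set(userIDs.map { Int(fnv1a(String($0)) % 100) })
```

With bare modulo, a 50/50 split on these IDs would put every user in one arm. After hashing, the structure of the ID no longer decides the bucket.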
Fetch ≠ activate
You fetch the experiment config in the background — on launch, on a timer, on a push. You do not apply it the instant it arrives. Applying mid-session is exactly the variant-flip-under-the-user failure. So the config has two moments:
- Fetch — pull the latest ruleset, cache it. Background, invisible, can be stale.
- Activate — promote the cached ruleset to live at a controlled boundary (next cold launch, next session start). This is when the user's resolved variants are allowed to change.
Stale config is almost always acceptable for experiments — you are not settling payments. The thing you are buying with the fetch/activate split is the guarantee that a variant is fixed for the duration of a session.
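The two moments can be sketched as a store with a pending slot and an active slot. The `Ruleset` shape and every name here are hypothetical; real SDKs expose their own versions of this split.

```swift
/// Hypothetical ruleset: just a version and per-experiment thresholds.
struct Ruleset: Equatable {
    let version: Int
    let splitThresholds: [String: Int]   // experimentID -> threshold
}

final class ExperimentConfigStore {
    /// What variant resolution reads. Fixed for the whole session.
    private(set) var active: Ruleset
    /// The latest fetched ruleset, cached but not yet live.
    private var pending: Ruleset?

    init(bundledDefault: Ruleset) {
        active = bundledDefault
    }

    /// Fetch: called from background refresh. Only caches; never touches `active`.
    func didFetch(_ ruleset: Ruleset) {
        pending = ruleset
    }

    /// Activate: call only at a controlled boundary (cold launch, session start).
    func activatePending() {
        if let next = pending {
            active = next
            pending = nil
        }
    }
}

let configStore = ExperimentConfigStore(
    bundledDefault: Ruleset(version: 1, splitThresholds: ["new_checkout": 90]))

// A new ruleset arrives mid-session; the live one does not move.
configStore.didFetch(Ruleset(version: 2, splitThresholds: ["new_checkout": 50]))
let duringSession = configStore.active.version

// At the next session boundary it is promoted.
configStore.activatePending()
let nextSession = configStore.active.version
```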
Exposure ≠ conversion — the most misunderstood step
This is the one that quietly destroys more experiments than any bug. Three distinct events, frequently collapsed into one by people who then wonder why their numbers are wrong:
- Assignment — the system decided which variant the user is in. Happens early, possibly long before they see anything.
- Exposure — the user actually encountered the variant in the UI. Fires exactly once, at the moment of render.
- Conversion — the user did the measured thing (signed up, purchased, completed checkout).
You can be assigned to a new-checkout variant and never be exposed, because you never reached checkout. Assignment is not exposure.
The rate to report is conversions / exposures. Not conversions / assignments, and not any other denominator. The classic mistake is logging only the conversions and reconstructing the denominator from assignment counts. The instant exposure rates differ between arms (which they will, because variants change how many people reach the surface at all), your rates are computed against the wrong base and every comparison is biased. Log the exposure event the instant the variant renders. That event carries experimentID, variant, userID, timestamp, and nothing else clever.
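A sketch of an exposure logger that fires once per (experiment, variant, user) and a rate computed against exposures. The type names are mine, the dedup here only lasts for the process lifetime, and a real logger would hand events to your analytics pipeline instead of an array.

```swift
struct ExposureEvent: Hashable {
    let experimentID: String
    let variant: String
    let userID: String
}

final class ExposureLogger {
    private var seen = Set<ExposureEvent>()
    /// Events that would be sent upstream, with the time of first render.
    private(set) var sent: [(event: ExposureEvent, timestamp: Double)] = []

    /// Call from the render path. Repeat renders of the same variant are dropped.
    func logExposure(_ event: ExposureEvent, at timestamp: Double) {
        guard seen.insert(event).inserted else { return }
        sent.append((event: event, timestamp: timestamp))
    }
}

/// The only rate worth reporting: conversions over exposures.
func conversionRate(conversions: Int, exposures: Int) -> Double? {
    exposures > 0 ? Double(conversions) / Double(exposures) : nil
}

let logger = ExposureLogger()
let event = ExposureEvent(experimentID: "new_checkout", variant: "treatment", userID: "user-42")
logger.logExposure(event, at: 1_700_000_000)
logger.logExposure(event, at: 1_700_000_005)   // re-render: ignored

// Made-up arm: 1000 assigned, 400 reached checkout, 40 converted.
let againstExposures = conversionRate(conversions: 40, exposures: 400)     // 10% of those who saw it
let againstAssignments = conversionRate(conversions: 40, exposures: 1000)  // 4%, the wrong base
```

The gap between the two numbers is the bias: it grows or shrinks with how many assigned users ever reach the surface, which is exactly the thing a variant can change.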
The experiment lifecycle
The client renders per state; the server drives the transitions. The non-obvious rule lives between Shipped and Cleanup: do not delete the losing branch the moment a winner is declared. Wait until the config says 100% of users are on the winner and you have cut a release that bakes that in. Delete earlier and a user on an old binary, still being served the now-deleted variant, falls through to undefined behavior.
When experiments touch the same surface, isolate them with mutex groups so their effects do not interact:
struct ExperimentConfig {
    let mutexGroup: String? // "search_surface", "checkout_surface"
}
A user is in at most one experiment per mutex group. Without this, a search-ranking test and a price-display test running on the same screen give you results you cannot attribute — did ranking move the metric, or price, or the interaction of the two?
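One way to enforce "at most one per group" is to let a stable hash of the user and the group pick a single member. This is a sketch under my own assumptions: the `id` field, the FNV-1a hash, and the equal-share allocation are illustrative, and production systems usually carve groups into weighted slices instead.

```swift
struct ExperimentConfig {
    let id: String            // hypothetical field, added for the example
    let mutexGroup: String?   // "search_surface", "checkout_surface"
}

/// FNV-1a (64-bit): stands in for any stable hash.
func fnv1a(_ text: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in text.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3
    }
    return hash
}

/// Experiments this user may enter: every ungrouped one, plus exactly one per mutex group.
func eligibleExperiments(for userID: String, from all: [ExperimentConfig]) -> [ExperimentConfig] {
    var result: [ExperimentConfig] = []
    var byGroup: [String: [ExperimentConfig]] = [:]
    for experiment in all {
        if let group = experiment.mutexGroup {
            byGroup[group, default: []].append(experiment)
        } else {
            result.append(experiment)
        }
    }
    for (group, members) in byGroup.sorted(by: { $0.key < $1.key }) {
        // Sort by id so the pick does not depend on the order the config arrived in.
        let ordered = members.sorted { $0.id < $1.id }
        let slot = Int(fnv1a(userID + ":" + group) % UInt64(ordered.count))
        result.append(ordered[slot])
    }
    return result
}

let experiments = [
    ExperimentConfig(id: "search_ranking_v2", mutexGroup: "search_surface"),
    ExperimentConfig(id: "price_display", mutexGroup: "search_surface"),
    ExperimentConfig(id: "onboarding_copy", mutexGroup: nil),
]
let picked = eligibleExperiments(for: "user-42", from: experiments)
```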
The call site, and the tech debt it breeds
Resolve the variant cleanly and you still have a problem: branches at every call site.
if featureFlag("new_search_ranking") {
    newRankingAlgorithm()
} else {
    oldRankingAlgorithm()
}
One of these is fine. Two hundred of them, accumulated over a year of experiments, is a codebase where every method has a fork in it and nobody remembers which forks are still live. Two mitigations worth knowing:
Model each feature as its own type, not a shared enum. The mature iOS pattern (Kiwi.com's, for one) gives every experiment a dedicated type and surfaces it through a Swift property wrapper, instead of one giant Experiment enum that every consumer switches over. The type system then tells you what each branch is for, and a removed experiment is a removed type — a compile error at every stale call site, not a silent dead path.
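A rough sketch of the shape this can take. It is not Kiwi.com's actual API: `Experiment`, `Resolved`, `VariantResolving`, and `FixedResolver` are hypothetical names I made up to show the idea of one type per experiment behind a property wrapper.

```swift
enum Variant { case control, treatment }

/// Hypothetical seam to whatever resolves variants (hashing, saved assignment, config).
protocol VariantResolving {
    func variant(forExperimentID id: String) -> Variant
}

/// Every experiment is its own type, not a case in a shared enum.
protocol Experiment {
    static var id: String { get }
}

enum NewSearchRanking: Experiment {
    static let id = "new_search_ranking"
}

@propertyWrapper
struct Resolved<E: Experiment> {
    let resolver: VariantResolving
    init(resolver: VariantResolving) { self.resolver = resolver }
    /// Read-only: call sites can read the variant, never set it.
    var wrappedValue: Variant { resolver.variant(forExperimentID: E.id) }
}

/// Test double: fixed assignments, control by default.
struct FixedResolver: VariantResolving {
    let assignments: [String: Variant]
    func variant(forExperimentID id: String) -> Variant { assignments[id] ?? .control }
}

struct SearchViewModel {
    // Delete the NewSearchRanking type and this line stops compiling.
    @Resolved<NewSearchRanking> var ranking: Variant

    init(resolver: VariantResolving) {
        _ranking = Resolved(resolver: resolver)
    }

    func algorithmName() -> String {
        ranking == .treatment ? "new" : "old"
    }
}

let treated = SearchViewModel(resolver: FixedResolver(assignments: ["new_search_ranking": .treatment]))
let untreated = SearchViewModel(resolver: FixedResolver(assignments: [:]))
```

The design choice worth noticing is that the experiment's identity lives in the type parameter, so the compiler, not a grep, finds every consumer when the experiment is retired.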
Automate the cleanup. Dead experiment code is the worst kind of dead code: it does not look dead. It has conditions and branches that look intentional, so engineers leave it alone out of fear. Agoda uses Piranha (the open-source Polyglot Piranha) for exactly this — an AST-based tool that, given the winning variant, traces a flag through the codebase and deletes the losing branches automatically, inlining the winner:
// before
if featureFlag("new_search_ranking") {
    newRankingAlgorithm()
} else {
    oldRankingAlgorithm() // losing branch: dead, but looks intentional
}

// after Piranha
newRankingAlgorithm() // flag gone, winner inlined
Teams that run this at scale report it reclaiming a serious amount of engineering time — flag debt is a recurring tax, and paying it down by hand is exactly the kind of work nobody volunteers for. One caveat worth stating: Piranha is harder to apply to Swift than to backend languages, because Swift's type-driven flag patterns are less uniform than the string-keyed branches Piranha was built to chew through.
The statistics that decide whether you learned anything
Here is the trap that the whole post has been circling. You show A to half, B to half. B gets more signups. Can you conclude B is better?
No. Not yet. More signups in a sample is exactly what you would sometimes see even if A and B were identical, because users are noisy and your split is finite. The entire job of the statistics layer is to separate "B is genuinely better" from "B got luckier this week." Skip this layer and you will ship noise as signal, repeatedly, with confidence.
Pre-register the sample size. Before launch, fix your minimum detectable effect (the smallest lift worth caring about), your power (usually 80%), and your significance level (usually α = 0.05). Those three numbers compute the sample size you need. That number tells you when you are allowed to look. You decide it up front; you do not reverse-engineer it after seeing the data.
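For a conversion-rate metric, the usual normal-approximation formula turns those three numbers into a per-arm sample size. The sketch below assumes a two-sided test, an MDE expressed in absolute percentage points, and equal-sized arms; the function name and example rates are mine.

```swift
/// Per-arm sample size for comparing two conversion rates (normal approximation).
/// - zAlpha: 1.96 is the two-sided critical value for alpha = 0.05.
/// - zBeta: 0.8416 corresponds to 80% power.
func sampleSizePerArm(baselineRate p1: Double,
                      minimumDetectableEffect mde: Double,
                      zAlpha: Double = 1.96,
                      zBeta: Double = 0.8416) -> Int {
    let p2 = p1 + mde                                  // the rate you hope treatment reaches
    let variance = p1 * (1 - p1) + p2 * (1 - p2)       // sum of the two Bernoulli variances
    let z = zAlpha + zBeta
    let n = z * z * variance / (mde * mde)
    return Int(n.rounded(.up))
}

// Made-up inputs: 10% baseline conversion, smallest lift worth shipping is +1 point.
let needed = sampleSizePerArm(baselineRate: 0.10, minimumDetectableEffect: 0.01)

// Halving the MDE roughly quadruples the requirement.
let neededForHalfMDE = sampleSizePerArm(baselineRate: 0.10, minimumDetectableEffect: 0.005)
```

With these inputs the answer comes out a little under 15,000 exposed users per arm. The quadratic dependence on the MDE is the part to internalize: small effects are expensive to detect, which is why you pick the MDE before launch and not after.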
Do not peek. Stopping the moment p < 0.05 flickers into view massively inflates your false-positive rate: given enough peeks, almost any null experiment will cross the line by chance at some point, and you will "discover" effects that do not exist. Two legitimate options: fix the sample size in advance and look exactly once, or use a method designed for continuous monitoring.

| Approach | Answers | Tradeoff |
|---|---|---|
| Frequentist (t / z-test) | "If there were no difference, how surprising is this data?" | Fixed horizon; peeking breaks it; the p-value is easy to misread |
| Sequential / always-valid | Same question, but valid to monitor continuously | Needs more data for the same confidence; more machinery |
| Bayesian | "P(B beats A) = 96%, expected lift 3.2%" | Needs priors; but it answers the question PMs actually ask |
The numbers that actually gate a decision:
- Effect size and direction — not just "significant," but how much and which way.
- Confidence / credible interval — a point estimate with no interval is not a result. If the interval crosses zero, you have nothing.
- Guardrail metrics — the primary metric improved, but a guardrail (crash rate, latency, refund rate) degraded slightly. This is where judgment lives, and it is the question a good interviewer asks: what do you do when the thing you wanted went up but something you promised not to break went down? There is no formula. That is the point.
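The interval check from the list above can be sketched for a conversion-rate metric as a plain normal-approximation (Wald) interval on the difference. This assumes a fixed-horizon analysis looked at once; the type names and the example counts are made up.

```swift
struct Arm {
    let exposures: Int
    let conversions: Int
    var rate: Double { Double(conversions) / Double(exposures) }
}

struct LiftEstimate {
    let difference: Double   // treatment rate minus control rate
    let lower: Double
    let upper: Double
    /// True when the whole interval sits on one side of zero.
    var excludesZero: Bool { lower > 0 || upper < 0 }
}

/// 95% interval by default (z = 1.96), using the unpooled standard error.
func liftInterval(control: Arm, treatment: Arm, z: Double = 1.96) -> LiftEstimate {
    let difference = treatment.rate - control.rate
    let standardError = (control.rate * (1 - control.rate) / Double(control.exposures)
        + treatment.rate * (1 - treatment.rate) / Double(treatment.exposures)).squareRoot()
    return LiftEstimate(difference: difference,
                        lower: difference - z * standardError,
                        upper: difference + z * standardError)
}

// Same observed rates (10% vs 11%), very different amounts of evidence.
let large = liftInterval(control: Arm(exposures: 10_000, conversions: 1_000),
                         treatment: Arm(exposures: 10_000, conversions: 1_100))
let small = liftInterval(control: Arm(exposures: 1_000, conversions: 100),
                         treatment: Arm(exposures: 1_000, conversions: 110))
```

Both runs show the same one-point lift. Only the larger one has an interval that stays above zero; the smaller one spans negative and positive effects, which is the "you have nothing" case.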
Worth knowing that not everyone runs the textbook α = 0.05. Some high-traffic shops deliberately loosen the significance bar, trading a higher false-positive rate for experiment velocity — at enough volume, the cost of an occasional wrong call is outweighed by the throughput of deciding faster. It's a deliberate business tradeoff, not a mistake, but it's a choice you make with eyes open, not a default.
Culture is the actual hard part
The tooling is straightforward. The discipline is not. The failure modes are human: engineers who want to ship features without experiments, PMs who declare winners before significance, executives who override a clean statistical result with a gut call, teams that run so many overlapping experiments that each one is too diluted to conclude. None of that is a code problem. The system above is necessary and nowhere near sufficient — it gives you the ability to learn the truth; it cannot make anyone act on it.