Detecting anomalies sounds simple – until real client traffic enters the picture. I’ll share how we learned to spot volume and conversion dips that matter, tune out the noise, and turn detection into trust.
When we first started looking for anomalies in our payment metrics data, the goal was simple: catch issues before clients did. The reality, of course, wasn’t that simple.
Our first version looked only for “volume zero” – complete outages. It worked for big events, but we quickly realised most of the pain lived in the subtle dips: the 10% drop at lunchtime, the steady weekend decline, the provider that failed for only one payment type or even one device type. None of that showed up in the old alerts.
So we built something smarter. We started calculating a predicted line for each client, based on the data we hold – recognising that every one of them has their own traffic rhythm. Then we added anomaly lines above and below it to capture what “too high” or “too low” really means in their context. It wasn’t perfect, but with every iteration, the framework got better at noticing the right things, not just everything.
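As a rough illustration of the idea, here is a minimal sketch of a per-client predicted line with anomaly lines above and below it. It is not the framework described in this talk: the hour-of-week slots, the median as the prediction, the median absolute deviation as the spread, and the names `build_baseline`, `classify` and `k` are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def build_baseline(history, k=3.0):
    """Build a predicted line and anomaly lines for ONE client.

    history: iterable of (slot, volume) pairs, where slot is a hypothetical
    hour-of-week index (0..167), so each client keeps its own weekly rhythm.
    Returns {slot: (predicted, lower, upper)}.
    """
    by_slot = defaultdict(list)
    for slot, volume in history:
        by_slot[slot].append(volume)

    baseline = {}
    for slot, values in by_slot.items():
        predicted = median(values)
        # Median absolute deviation: a spread estimate that a single
        # past outage or spike does not distort much.
        mad = median([abs(v - predicted) for v in values])
        lower = max(0.0, predicted - k * mad)  # volume cannot go below zero
        upper = predicted + k * mad
        baseline[slot] = (predicted, lower, upper)
    return baseline


def classify(baseline, slot, observed):
    """Label an observation relative to this client's own anomaly lines."""
    _, lower, upper = baseline[slot]
    if observed < lower:
        return "too_low"
    if observed > upper:
        return "too_high"
    return "normal"


# Five past weeks of volume for the same hour-of-week slot.
history = [(12, 100), (12, 104), (12, 96), (12, 102), (12, 98)]
baseline = build_baseline(history)
print(classify(baseline, 12, 90))
```

The point of the sketch is that “too low” is defined per client and per time slot, so a quiet weekend hour for one client is not judged against a busy weekday hour for another.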
This talk is about how we evolved the way we use our data: how we went from catching total outages to anticipating small dips, from arguing over false alarms to trusting our signals. I’ll share what worked, what didn’t, and what this journey taught us about technical patterns.
Key takeaways
- Start simple, then evolve with what your data actually teaches you.
- Predicting traffic wasn’t the hardest part; agreeing on what’s normal, and on which anomalies deserve an alert, was.
- We learned to separate signal from noise by measuring burn rate and treating false positives as feedback, not failure.
- Collaboration between engineering and CSMs was key – alerts only work if the people closest to clients can understand and act on them.
- Building detection wasn’t just about models; it was about creating shared trust in what “an anomaly” really means.
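The burn rate mentioned in the takeaways is not defined in this abstract. The sketch below assumes the common SRE reading: the observed failure rate divided by the error budget allowed by a target. The 99% target, the threshold of 10, the two-window check and the function names are hypothetical choices for illustration only.

```python
def burn_rate(failed, total, slo_target=0.99):
    """How fast the error budget is being consumed.

    A value of 1.0 means failures arrive exactly at the rate the target
    allows; higher values mean the budget is being used up faster.
    """
    if total == 0:
        return 0.0  # no traffic in the window, nothing to measure
    error_budget = 1.0 - slo_target
    return (failed / total) / error_budget


def should_alert(short_window, long_window, threshold=10.0, slo_target=0.99):
    """Alert only when BOTH windows burn fast.

    Each window is a (failed, total) pair. Requiring agreement between a
    short and a long window filters out brief blips (noise) while still
    catching sustained dips (signal).
    """
    return all(
        burn_rate(failed, total, slo_target) >= threshold
        for failed, total in (short_window, long_window)
    )


# A short spike that the longer window does not confirm: no alert.
print(should_alert((30, 100), (20, 1000)))
# A dip visible in both windows: alert.
print(should_alert((30, 100), (200, 1000)))
```

Under these assumptions, an alert that fires and is later judged a false positive becomes a concrete input for adjusting the threshold or the window lengths.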