Real-time fraud detection at 18,000 TPS for a digital bank

Industry: Digital Banking / Fintech

Disciplines: Data Engineering · ML · Cloud · Apps

Duration: 6 months to GA · ongoing model ops

Region: India · 28 million customers

The problem

The bank’s legacy fraud engine was a 2,000-rule decision tree maintained by a 6-person team. The latency budget was 400ms, but the engine routinely missed it, and rule changes took 3 weeks to deploy. Mule-account networks were detected only after the money had moved 2-3 hops out of reach. Annual fraud loss was running at 0.18% of GMV, twice the industry benchmark.

The solution

Three components: a streaming feature store on Apache Flink that materialises 240+ behavioural and graph features per customer in real time, a gradient-boosted scoring model with sub-50ms inference, and a live transaction graph in ScyllaDB used to detect mule clusters as they form. Every decline is paired with a SHAP-based explanation served to the customer-care console.

What we built

Sub-80ms decisions at 18K TPS peak

Apache Flink streaming feature store + LightGBM in ONNX runtime. P99 round-trip 78ms across the entire pipeline.
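To make the idea of a continuously updated behavioural feature concrete, here is a minimal pure-Python sketch of a per-customer sliding-window transaction count and sum. It is not Flink code and not the bank's implementation; the class name `VelocityWindow`, the 60-second window, and the feature names are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class VelocityWindow:
    """Per-customer count and sum of transactions in the last `window_s` seconds.

    Hypothetical illustration only; assumes events arrive in timestamp order.
    """

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._events = defaultdict(deque)  # customer_id -> deque of (ts, amount)
        self._sums = defaultdict(float)    # customer_id -> running sum in window

    def update(self, customer_id: str, ts: float, amount: float) -> dict:
        q = self._events[customer_id]
        q.append((ts, amount))
        self._sums[customer_id] += amount
        # Evict events that have fallen out of the window.
        while q and q[0][0] <= ts - self.window_s:
            _, old_amount = q.popleft()
            self._sums[customer_id] -= old_amount
        return {"txn_count": len(q), "txn_sum": self._sums[customer_id]}


window = VelocityWindow(window_s=60.0)
window.update("c1", ts=0.0, amount=100.0)
window.update("c1", ts=30.0, amount=50.0)
latest = window.update("c1", ts=61.0, amount=25.0)  # the ts=0 event is evicted
```

A stream processor would keep equivalent state per key and handle out-of-order events with watermarks, which this sketch deliberately ignores.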

Graph-based mule detection

Live transaction graph in ScyllaDB. Detects mule clusters as soon as the second hop happens, not after the seventh. Cut mule loss 71%.
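The case study does not describe the detection logic itself. As a rough sketch of what "catching the second hop" could look like, the example below flags a common pass-through pattern: an account receives funds and forwards most of them onward within a short time. The `Transfer` type, the 10-minute gap, and the 80% ratio are all assumptions, and the sketch uses Python for illustration although the production traversal is written in Go.

```python
from collections import defaultdict
from typing import NamedTuple


class Transfer(NamedTuple):
    src: str
    dst: str
    amount: float
    ts: float  # seconds since some epoch


def second_hop_alerts(transfers, max_gap_s=600.0, min_ratio=0.8):
    """Return (first_hop, second_hop) pairs where the receiver quickly forwards funds.

    Hypothetical heuristic: thresholds are illustrative, not tuned values.
    """
    inbound = defaultdict(list)  # account -> transfers received so far
    alerts = []
    for t in sorted(transfers, key=lambda x: x.ts):
        for prev in inbound[t.src]:
            quick = 0 <= t.ts - prev.ts <= max_gap_s
            passthrough = t.amount >= min_ratio * prev.amount
            # Ignore simple refunds back to the original sender.
            if quick and passthrough and t.dst != prev.src:
                alerts.append((prev, t))
        inbound[t.dst].append(t)
    return alerts


transfers = [
    Transfer("victim", "m1", 1000.0, 0.0),
    Transfer("m1", "m2", 950.0, 120.0),   # forwarded within 2 minutes
    Transfer("x", "y", 50.0, 130.0),      # unrelated payment
]
alerts = second_hop_alerts(transfers)
```

Here `alerts` holds a single pair linking the `victim -> m1` transfer to the `m1 -> m2` transfer, so the chain is visible at hop 2 rather than after the funds have dispersed further.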

Every decline explainable

Per-decision SHAP values surfaced to the care team. Customer service can explain a decline in under 30 seconds, which cut fraud-decline complaints by 54%.
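As a small illustration of how per-feature attributions might be turned into something a care agent can read, the sketch below picks the features that pushed a score up the most. It assumes the attribution values have already been computed elsewhere; the function name `top_reasons`, the feature names, and the numbers are hypothetical.

```python
def top_reasons(contributions: dict, k: int = 3) -> list:
    """Return the k features that pushed the fraud score up the most.

    `contributions` maps feature name -> attribution value (e.g. a SHAP value).
    Negative values lowered the score and are not shown as decline reasons.
    """
    positive = [(name, value) for name, value in contributions.items() if value > 0]
    positive.sort(key=lambda pair: pair[1], reverse=True)
    return [f"{name} (+{value:.2f})" for name, value in positive[:k]]


# Hypothetical attributions for one declined transaction.
example = {
    "new_payee": 0.31,
    "amount_vs_avg": 0.42,
    "tenure_days": -0.10,
    "night_txn": 0.05,
}
reasons = top_reasons(example, k=2)
```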

Adaptive thresholds per cohort

Risk thresholds vary by customer tenure, channel, geography, and transaction history. Less friction for trusted customers, tighter guardrails for new ones.
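One simple way to picture cohort-specific thresholds is a lookup table keyed by cohort with a default fallback. The sketch below uses only tenure and channel; the cohort boundaries, channel names, and threshold values are invented for illustration and are not the bank's settings.

```python
DEFAULT_THRESHOLD = 0.50  # hypothetical fallback for cohorts without an override

# Hypothetical per-cohort decline thresholds: (tenure band, channel) -> threshold.
COHORT_THRESHOLDS = {
    ("new", "upi"): 0.35,
    ("new", "card"): 0.40,
    ("established", "upi"): 0.65,
}


def tenure_band(tenure_days: int) -> str:
    """Bucket customers by tenure; the 90-day cut-off is an assumption."""
    return "new" if tenure_days < 90 else "established"


def decide(score: float, tenure_days: int, channel: str) -> str:
    """Decline when the model score reaches the cohort's threshold."""
    cohort = (tenure_band(tenure_days), channel)
    threshold = COHORT_THRESHOLDS.get(cohort, DEFAULT_THRESHOLD)
    return "decline" if score >= threshold else "approve"


new_customer = decide(0.5, tenure_days=10, channel="upi")
trusted_customer = decide(0.5, tenure_days=400, channel="upi")
```

With the same score of 0.5, the new customer is declined and the long-tenured customer is approved, which is the "less friction for trusted customers" behaviour described above.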

Daily model refresh

Production model retrained nightly on the previous day’s adjudicated outcomes. Champion-challenger evaluation every Monday. No more 3-week rule deploys.
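A champion-challenger comparison can be sketched as scoring the same adjudicated transactions with both models and checking which one catches more fraud within a fixed review budget. The metric, the `min_gain` promotion rule, and the function names below are assumptions; the case study does not say which criteria are used.

```python
def fraud_caught_at_budget(scores, labels, budget):
    """Count confirmed frauds among the `budget` highest-scored transactions."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:budget])


def pick_winner(champion_scores, challenger_scores, labels, budget, min_gain=1):
    """Promote the challenger only if it catches at least `min_gain` more frauds."""
    champ = fraud_caught_at_budget(champion_scores, labels, budget)
    chall = fraud_caught_at_budget(challenger_scores, labels, budget)
    return "challenger" if chall - champ >= min_gain else "champion"


# Hypothetical adjudicated outcomes: 1 = confirmed fraud, 0 = legitimate.
labels = [1, 0, 1, 0, 0]
champion = [0.9, 0.8, 0.2, 0.1, 0.3]
challenger = [0.9, 0.3, 0.8, 0.1, 0.2]
winner = pick_winner(champion, challenger, labels, budget=2)
```

Requiring a minimum gain before promotion is one way to avoid swapping models on noise; a real evaluation would also look at false-positive rates and statistical significance.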

Counterfactual explainer for analysts

Fraud analysts can ask “what would have changed if amount were 30% lower?” and see model behaviour shift live. Helps tune policy without deploying.
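The "what if the amount were 30% lower?" question boils down to re-scoring a copy of the transaction with one feature changed. The sketch below shows that mechanic with a hand-written stand-in scorer; `toy_score`, its coefficients, and the feature names are hypothetical and bear no relation to the production LightGBM model.

```python
import math


def toy_score(features: dict) -> float:
    """Stand-in for the production model: a hand-written logistic score."""
    z = 0.002 * features["amount"] + 1.5 * features["new_payee"] - 3.0
    return 1.0 / (1.0 + math.exp(-z))


def counterfactual(score_fn, features: dict, name: str, factor: float) -> dict:
    """Re-score with one feature scaled by `factor`, leaving the input untouched."""
    changed = dict(features)
    changed[name] = features[name] * factor
    before = score_fn(features)
    after = score_fn(changed)
    return {"before": before, "after": after, "delta": after - before}


txn = {"amount": 1000.0, "new_payee": 1}
# "What would have changed if amount were 30% lower?"
result = counterfactual(toy_score, txn, "amount", 0.7)
```

Because the scorer is passed in as a function, the same helper could wrap any model that maps a feature dictionary to a score.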

How it’s built

Stream Processing: Apache Flink 1.18 · Kafka · 240+ behavioural features

Feature Store: Feast · Redis hot · ScyllaDB warm

ML Models: LightGBM · ONNX runtime · SHAP · daily retraining

Graph DB: ScyllaDB · custom graph traversal in Go

Hosting: AWS multi-AZ · MSK · auto-scaling EKS · 99.99% SLA

Observability: Datadog · custom drift dashboards · per-feature freshness

Compliance: PCI-DSS · RBI Master Direction · DPDP · ISO 27001

Care Console: React · WebSocket · per-decision explainer

The numbers

18K

Peak transactions per second sustained

<80ms

P99 decision latency end-to-end

-64%

Fraud loss in the first 6 months post-GA

-71%

Mule-network loss, with clusters caught at hop 2

-54%

Fraud-decline complaints (explainability)

99.99%

Platform availability across 6 months

“Going from a 2,000-rule engine to a streaming model was the single most impactful technical change we made in 18 months. Loss is down, false-positives are down, and the fraud team finally sleeps.”
— CTO, Digital Bank (28M customers)

Have a project that looks like this?

If your engagement combines 3 or more disciplines, we’d like to hear about it. Tell us the constraint, the deadline, and the outcome that matters — we’ll come back with a scoped proposal.
