Kafka at OpenAI
Before:
– Kafka served as a central backbone for data pipelines and transactional workloads
– 37 Kafka clusters with different configurations for different needs
– Each cluster was a single point of failure: an outage directly affected every service that depended on it
– High friction when onboarding a new service: which cluster has the needed topic? How do I get network access? Which credentials should I use?
– Clusters were overloaded with a huge number of client connections
Goal: decouple producers and consumers from Kafka clusters.
Solution:
– Producer load balancer (Prism proxy); see the producer-routing sketch after this list
– Consumer load balancer (uForwarder from Uber); see the push-delivery sketch after this list
– Kafka clusters are grouped. All clusters within a group contain the same Kafka topics. Each topic belongs to exactly one cluster group.
– A Control Plane was introduced to manage a cluster-of-clusters Kafka platform
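The core idea behind the producer path is that a client only ever names a topic; the proxy and the Control Plane resolve which cluster group owns that topic and which cluster in the group receives the write. Below is a minimal Python sketch of that routing; the topic, group, and cluster names are made up, and the actual Prism and Control Plane internals are not covered in this summary.

```python
# Sketch of topic -> cluster-group routing behind a producer proxy.
# Group, cluster, and topic names below are hypothetical.
import random

# Control-plane state: each topic lives in exactly one cluster group,
# and every cluster inside a group hosts the same set of topics.
TOPIC_TO_GROUP = {
    "chat-events": "group-a",
    "billing-events": "group-b",
}
GROUP_TO_CLUSTERS = {
    "group-a": ["kafka-a1.internal:9092", "kafka-a2.internal:9092"],
    "group-b": ["kafka-b1.internal:9092", "kafka-b2.internal:9092"],
}

def resolve_clusters(topic: str) -> list[str]:
    """Return every cluster that can accept writes for this topic."""
    group = TOPIC_TO_GROUP[topic]      # exactly one group per topic
    return GROUP_TO_CLUSTERS[group]    # any cluster in the group will do

def produce(topic: str, payload: bytes) -> str:
    """What a Prism-like proxy might do: pick any available cluster in the
    topic's group and forward the record there; the client only ever talks
    to the proxy and never sees brokers, credentials, or cluster names."""
    cluster = random.choice(resolve_clusters(topic))
    # A real proxy would hand the record to a Kafka producer here; the
    # sketch just reports the routing decision to stay self-contained.
    return f"routed {len(payload)} bytes for '{topic}' to {cluster}"

if __name__ == "__main__":
    print(produce("chat-events", b'{"user": "u1", "type": "message"}'))
```

Because every cluster in a group carries the same topics, losing one cluster only removes a candidate from resolve_clusters() instead of taking a topic offline.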
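On the consume side, a uForwarder-style proxy reads from Kafka itself and pushes each record to the consumer service, which only acknowledges or rejects the delivery. A minimal sketch, assuming a hypothetical HTTP callback and JSON payload; Uber's actual proxy speaks its own RPC protocol and adds retries and dead-lettering on top.

```python
# Sketch of a push-style consumer endpoint behind a uForwarder-like proxy.
# The path, payload shape, and ack convention are assumptions for
# illustration, not the proxy's actual wire protocol.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class RecordHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", "0"))
        try:
            record = json.loads(self.rfile.read(length))
            handle(record)            # business logic lives in the service
            self.send_response(200)   # ack: the proxy commits the offset
        except Exception:
            self.send_response(500)   # nack: the proxy retries or dead-letters
        self.end_headers()

def handle(record: dict) -> None:
    print("processed", record.get("key"), record.get("value"))

if __name__ == "__main__":
    # The consumer service never talks to Kafka directly; the proxy owns
    # partitions, offsets, retries, and parallelism.
    HTTPServer(("0.0.0.0", 8080), RecordHandler).serve_forever()
```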
Pros:
– More efficient scalability
– Higher reliability
– Much simpler development and operations for producers and consumers
– Freedom to change Kafka infrastructure without breaking clients
Cons:
– Partitioning issues
– No event ordering guarantees
– No exactly-once processing
Each drawback can be mitigated with compensating solutions outside of the Kafka platform itself.
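For example, since the platform no longer guarantees exactly-once processing, consumers typically compensate with idempotent handling keyed on a stable event id. A minimal sketch; the id field and the in-memory set are illustrative assumptions, and a real service would use a durable store (a database unique constraint or Redis) instead.

```python
# Sketch of a consumer-side compensation for at-least-once delivery:
# deduplicate by a stable event id so reprocessing a record is harmless.
processed_ids: set[str] = set()

def process_once(event: dict) -> bool:
    """Apply the event at most once per event id; return True if applied."""
    event_id = event["id"]          # producers attach a unique id
    if event_id in processed_ids:
        return False                # duplicate delivery, safely ignored
    apply_side_effects(event)
    processed_ids.add(event_id)
    return True

def apply_side_effects(event: dict) -> None:
    print("applied", event["id"], event["payload"])

if __name__ == "__main__":
    e = {"id": "evt-42", "payload": "charge $5"}
    print(process_once(e))   # True: first delivery is applied
    print(process_once(e))   # False: redelivery is deduplicated
```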
After:
– 6 cluster groups
– Zero-downtime migration to the new architecture
– 99.999% availability of the new Kafka platform
– Kafka is no longer a single point of failure: the infrastructure survived full regional outages with zero impact on producers and minimal impact on consumers.
– During the migration, Kafka usage inside OpenAI grew 10x and overall throughput increased 20x.
Charts and deep technical details are available on Boosty


