Kafka at OpenAI
Before:
– Kafka served as a central backbone for data pipelines and transactional workloads
– 37 Kafka clusters with different configurations for different needs
– Each cluster was a single point of failure: an outage directly affected every service that depended on it
– High friction when onboarding a new service: which cluster has the needed topic? How do I get network access? Which credentials should I use?
– Clusters were overloaded with a huge number of client connections
Goal: decouple producers and consumers from Kafka clusters.
Solution:
– Producer load balancer (Prism proxy); see the producer-routing sketch after this list
– Consumer load balancer (uForwarder from Uber); see the push-delivery sketch after this list
– Kafka clusters are grouped. All clusters within a group contain the same Kafka topics. Each topic belongs to exactly one cluster group.
– A Control Plane was introduced to manage a cluster-of-clusters Kafka platform
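The core idea behind the producer path is that a client only ever names a topic; the proxy and the Control Plane resolve which cluster group owns that topic and which cluster in the group receives the write. Below is a minimal Python sketch of that routing; the topic, group, and cluster names are made up, and the actual Prism and Control Plane internals are not covered in this summary.

```python
# Sketch of topic -> cluster-group routing behind a producer proxy.
# Group, cluster, and topic names below are hypothetical.
import random

# Control-plane state: each topic lives in exactly one cluster group,
# and every cluster inside a group hosts the same set of topics.
TOPIC_TO_GROUP = {
    "chat-events": "group-a",
    "billing-events": "group-b",
}
GROUP_TO_CLUSTERS = {
    "group-a": ["kafka-a1.internal:9092", "kafka-a2.internal:9092"],
    "group-b": ["kafka-b1.internal:9092", "kafka-b2.internal:9092"],
}

def resolve_clusters(topic: str) -> list[str]:
    """Return every cluster that can accept writes for this topic."""
    group = TOPIC_TO_GROUP[topic]      # exactly one group per topic
    return GROUP_TO_CLUSTERS[group]    # any cluster in the group will do

def produce(topic: str, payload: bytes) -> str:
    """What a Prism-like proxy might do: pick any available cluster in the
    topic's group and forward the record there; the client only ever talks
    to the proxy and never sees brokers, credentials, or cluster names."""
    cluster = random.choice(resolve_clusters(topic))
    # A real proxy would hand the record to a Kafka producer here; the
    # sketch just reports the routing decision to stay self-contained.
    return f"routed {len(payload)} bytes for '{topic}' to {cluster}"

if __name__ == "__main__":
    print(produce("chat-events", b'{"user": "u1", "type": "message"}'))
```

Because every cluster in a group carries the same topics, losing one cluster only removes a candidate from resolve_clusters() instead of taking a topic offline.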
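On the consume side, a uForwarder-style proxy reads from Kafka itself and pushes each record to the consumer service, which only acknowledges or rejects the delivery. A minimal sketch, assuming a hypothetical HTTP callback and JSON payload; Uber's actual proxy speaks its own RPC protocol and adds retries and dead-lettering on top.

```python
# Sketch of a push-style consumer endpoint behind a uForwarder-like proxy.
# The path, payload shape, and ack convention are assumptions for
# illustration, not the proxy's actual wire protocol.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class RecordHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", "0"))
        try:
            record = json.loads(self.rfile.read(length))
            handle(record)            # business logic lives in the service
            self.send_response(200)   # ack: the proxy commits the offset
        except Exception:
            self.send_response(500)   # nack: the proxy retries or dead-letters
        self.end_headers()

def handle(record: dict) -> None:
    print("processed", record.get("key"), record.get("value"))

if __name__ == "__main__":
    # The consumer service never talks to Kafka directly; the proxy owns
    # partitions, offsets, retries, and parallelism.
    HTTPServer(("0.0.0.0", 8080), RecordHandler).serve_forever()
```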
Pros:
– More efficient scalability
– Higher reliability
– Much simpler development and operations for producers and consumers
– Freedom to change Kafka infrastructure without breaking clients
Cons:
– Partitioning issues
– No event ordering guarantees
– No exactly-once processing
Each drawback can be mitigated with compensating solutions outside of the Kafka platform itself.
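For example, since the platform no longer guarantees exactly-once processing, consumers typically compensate with idempotent handling keyed on a stable event id. A minimal sketch; the id field and the in-memory set are illustrative assumptions, and a real service would use a durable store (a database unique constraint or Redis) instead.

```python
# Sketch of a consumer-side compensation for at-least-once delivery:
# deduplicate by a stable event id so reprocessing a record is harmless.
processed_ids: set[str] = set()

def process_once(event: dict) -> bool:
    """Apply the event at most once per event id; return True if applied."""
    event_id = event["id"]          # producers attach a unique id
    if event_id in processed_ids:
        return False                # duplicate delivery, safely ignored
    apply_side_effects(event)
    processed_ids.add(event_id)
    return True

def apply_side_effects(event: dict) -> None:
    print("applied", event["id"], event["payload"])

if __name__ == "__main__":
    e = {"id": "evt-42", "payload": "charge $5"}
    print(process_once(e))   # True: first delivery is applied
    print(process_once(e))   # False: redelivery is deduplicated
```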
After:
– 6 cluster groups
– Zero-downtime migration to the new architecture
– 99.999% availability of the new Kafka platform
– Kafka is no longer a single point of failure: the infrastructure survived full regional outages with zero impact on producers and minimal impact on consumers.
– During the migration, Kafka usage inside OpenAI grew 10x and overall throughput increased 20x.
Charts and deep technical details are available on Boosty


