The Reddit Engineering Team completed one of the most demanding infrastructure migrations in the company’s history. It moved its entire Apache Kafka fleet, comprising over 500 brokers and more than a petabyte of live data, from Amazon EC2 virtual machines onto Kubernetes. The migration was done with zero downtime and without asking a single client application to change how it connected to Kafka.

In this article, we will break down the migration, the challenges the engineering team faced, and how the team overcame them.

Disclaimer: This post is based on publicly shared details from the Reddit Engineering Team. Please comment if you notice any inaccuracies.

The Role of Kafka at Reddit
To put things into perspective, let us first understand what Apache Kafka is. Apache Kafka is an open-source event streaming platform. Applications called producers write messages to Kafka topics, which are split into partitions, and other applications called consumers read those messages out. Kafka sits in the middle and stores those messages reliably, even if the producer and consumer are running at completely different times. A single Kafka server is called a broker, whereas a collection of brokers working together forms a cluster.

At Reddit, Apache Kafka is not a peripheral tool.
It sits underneath hundreds of business-critical services, processing tens of millions of messages every second. If Kafka went down, large portions of Reddit would break.

Why Reddit Wanted to Move Away from EC2
Before the migration, Reddit managed its Kafka brokers on Amazon EC2 instances using a combination of Terraform, Puppet, and custom scripts. Operators handled upgrades, configuration changes, and machine replacements by running commands directly from their laptops. This worked while the fleet was small. As the fleet grew, however, the process became increasingly slow, error-prone, and expensive.

Reddit needed a more scalable and reliable way to operate Kafka. Kubernetes, paired with a tool called Strimzi, offered that path. Kubernetes is an open-source platform for running and managing containerized applications. Instead of manually provisioning and maintaining individual servers, developers describe what should be running, and Kubernetes handles deployment, scaling, and recovery automatically.

Strimzi is a Cloud Native Computing Foundation project built specifically for running Kafka on Kubernetes. It provides a declarative way to manage Kafka clusters: developers describe the cluster they want in a configuration file, and Strimzi handles deployment, upgrades, and maintenance. This promised fewer manual interventions and more predictable operations.

The Four Constraints That Shaped the Migration
Reddit did not jump straight into moving brokers. Before writing a single line of migration code, Reddit identified four hard constraints that ruled out entire categories of approaches. The constraints are as follows:
Phase 1: Taking Control of the Naming Layer
The first phase of the migration did not touch Kafka at all. Reddit introduced a DNS facade, which is a set of DNS records that act as an intermediate layer between client applications and the actual Kafka brokers. DNS is the system that translates human-readable names into the addresses of servers. Because the new, infrastructure-controlled DNS names initially pointed to the same EC2 brokers, switching to them changed nothing about where client traffic ended up.

Reddit then rolled out these new connection strings across more than 250 services using automated tooling that generated batch pull requests to update configuration files. Once all clients were talking through this DNS layer, Reddit could change where those names pointed, from EC2 to Kubernetes, without modifying any client code.

Phase 2: Making Room for New Brokers
Each Kafka broker is identified by a unique numeric ID. Strimzi assigns broker IDs starting at 0 by default. However, Reddit’s existing EC2 brokers already occupied those low numbers. To free up that ID space, Reddit doubled the cluster size by adding new EC2 brokers with higher IDs, shifted all data onto those higher-numbered brokers, and then terminated the original low-numbered brokers. This opened up IDs 0, 1, 2, and so on for Strimzi-managed brokers to use.
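
To make the ID arithmetic in Phase 2 concrete, here is a minimal sketch in Python. It is not Reddit's tooling: the `IdShiftPlan` type, the `plan_id_shift` function, and the `strimzi_broker_count` parameter are hypothetical names introduced only for illustration. The sketch assumes one replacement broker per existing broker (the "doubling" described above) and only plans the IDs. The data movement itself, which in Kafka requires partition reassignment, is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdShiftPlan:
    """Hypothetical result type: which IDs to add, retire, and hand to Strimzi."""
    add: list[int]          # IDs for the temporary, higher-numbered EC2 brokers
    retire: list[int]       # original low-numbered EC2 broker IDs
    strimzi_ids: list[int]  # low IDs left free for Strimzi-managed brokers


def plan_id_shift(existing_ids: list[int], strimzi_broker_count: int) -> IdShiftPlan:
    """Plan a broker-ID shift that frees IDs 0..strimzi_broker_count-1.

    Assumption: the cluster is doubled, so each existing broker gets exactly
    one replacement with a higher ID.
    """
    if not existing_ids:
        raise ValueError("need at least one existing broker")
    if len(set(existing_ids)) != len(existing_ids):
        raise ValueError("broker IDs must be unique")
    if strimzi_broker_count < 1:
        raise ValueError("need at least one Strimzi broker")

    old = sorted(existing_ids)
    # New EC2 IDs must sit above every existing ID and above the range
    # that Strimzi will later claim (it counts up from 0).
    first_new_id = max(old[-1] + 1, strimzi_broker_count)
    new = list(range(first_new_id, first_new_id + len(old)))
    strimzi_ids = list(range(strimzi_broker_count))

    # Sanity check: once the old brokers are retired, nothing collides
    # with the IDs Strimzi will assign.
    assert not set(new) & set(strimzi_ids)
    return IdShiftPlan(add=new, retire=old, strimzi_ids=strimzi_ids)


if __name__ == "__main__":
    plan = plan_id_shift(existing_ids=[0, 1, 2], strimzi_broker_count=3)
    print("add EC2 brokers:", plan.add)
    print("retire EC2 brokers:", plan.retire)
    print("free for Strimzi:", plan.strimzi_ids)
```

The one design point worth noting is that the new EC2 IDs start above both the current maximum ID and the range reserved for Strimzi, so the temporary brokers never collide with the brokers that replace them.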