Disclaimer: The details in this post have been derived from the information shared online by the Databricks Engineering Team. All credit for the technical details goes to the Databricks Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Kubernetes has become the standard platform for running modern microservices. It simplifies how services talk to each other through built-in networking components such as ClusterIP services, CoreDNS, and kube-proxy. These primitives work well for many workloads, but they start to show their limitations when traffic becomes high-volume, persistent, and latency-sensitive.

Databricks faced exactly this challenge. Many of their internal services rely on gRPC, which runs over HTTP/2 and keeps long-lived TCP connections between clients and servers. Under Kubernetes’ default model, this leads to uneven traffic distribution, unpredictable scaling behavior, and higher tail latencies.

By default, Kubernetes uses ClusterIP services, CoreDNS, and kube-proxy (iptables/IPVS/eBPF) to route traffic:

- The client resolves the service name through CoreDNS, which returns the service’s ClusterIP (a stable virtual IP).
- When the client opens a TCP connection to that ClusterIP, kube-proxy rewrites it (via iptables, IPVS, or eBPF rules) to one of the backend pods, so the backend is selected once per connection.
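To make the per-connection behavior concrete, here is a minimal Go sketch of a gRPC client dialing a ClusterIP service the default way. This is not Databricks’ code, and the service name and port are hypothetical; it only illustrates that the backend pod is chosen once, when the connection is opened.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The service DNS name resolves to a single ClusterIP, and kube-proxy picks
	// one backend pod when this TCP connection is established. Every RPC made on
	// this long-lived HTTP/2 connection then lands on that same pod.
	conn, err := grpc.DialContext(ctx,
		"my-service.my-namespace.svc.cluster.local:50051", // hypothetical ClusterIP service
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	// All stubs created from conn reuse the same connection, and thus the same pod.
	defer conn.Close()
}
```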
Since the selection happens only once per TCP connection, the same backend pod keeps receiving traffic for the lifetime of that connection. For short-lived HTTP/1 connections, this is usually fine. However, for persistent HTTP/2 connections, the result is traffic skew: a few pods get overloaded while others stay idle. For Databricks, this created several operational issues:

- Uneven traffic distribution, with a handful of hot pods while others sat nearly idle
- Unpredictable scaling behavior, since newly added pods received little traffic until clients reconnected
- Higher tail latencies on the overloaded pods
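The toy simulation below (not Databricks code; the pod and client counts are made up) shows how per-connection selection concentrates load: with three long-lived connections to a ten-pod service, only three pods ever receive requests, no matter how many RPCs flow over those connections.

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const pods = 10
	const clients = 3
	requestsPerPod := make([]int, pods)

	for c := 0; c < clients; c++ {
		// Backend selection happens once, at connection setup.
		pinnedPod := rand.Intn(pods)
		// Every subsequent request reuses the same connection, and thus the same pod.
		for r := 0; r < 1000; r++ {
			requestsPerPod[pinnedPod]++
		}
	}
	fmt.Println("requests per pod:", requestsPerPod)
}
```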
The Databricks Engineering Team needed something smarter: a Layer 7, request-level load balancer that could react dynamically to real service conditions instead of relying on connection-level routing decisions. In this article, we will learn how they built such a system and the challenges they faced along the way.

The Core Solution

To overcome the limitations of the default Kubernetes routing, the Databricks Engineering Team shifted the load-balancing responsibility from the infrastructure layer to the client itself. Instead of depending on kube-proxy and DNS to make connection-level routing decisions, they built a client-side load balancing system supported by a lightweight control plane that provides real-time service discovery.

This means the application client no longer waits for DNS to resolve a service or for kube-proxy to pick a backend pod. Instead, it already knows which pods are healthy and available. When a request is made, the client can choose the best backend at that moment based on up-to-date information.

Here’s a table that shows the difference between the default Kubernetes LB and Databricks’ client-side LB:

| Aspect | Default Kubernetes LB | Databricks client-side LB |
|---|---|---|
| Routing decision | Once per TCP connection | Per request |
| Decision maker | kube-proxy (iptables/IPVS/eBPF) | The client itself |
| Endpoint discovery | DNS (CoreDNS) in the critical path | Control plane provides live endpoint data |
| View of backends | Static for the life of a connection | Real-time and health-aware |

By removing DNS from the critical path, the system gives each client a direct and current view of available endpoints. This allows smarter, per-request routing decisions instead of static, per-connection routing. The result is more even traffic distribution, lower latency, and better use of resources across pods. This approach also gives Databricks greater flexibility to fine-tune how traffic flows between services, something that is difficult to achieve with the default Kubernetes model.

Custom Control Plane - Endpoint Discovery Service

A key part of the intelligent load balancing system is its custom control plane. This component is responsible for keeping an accurate, real-time view of the services running inside the Kubernetes cluster. Instead of depending on DNS lookups or static routing, the control plane continuously monitors the cluster and provides live endpoint information to clients. See the diagram below.
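Before looking at that architecture, here is a minimal sketch of the client-side idea in Go. It is hypothetical and not Databricks’ implementation; the type names, methods, and pod addresses are invented. The client keeps a control-plane-fed list of healthy endpoints and picks a backend per request rather than per connection.

```go
package main

import (
	"fmt"
	"sync"
)

// EndpointSet holds the current set of healthy backend addresses for a
// service, as pushed by the endpoint discovery service (control plane).
type EndpointSet struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

// Update replaces the endpoint list. In practice this would be driven by a
// watch or streaming update from the control plane, not a manual call.
func (e *EndpointSet) Update(addrs []string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.addrs = addrs
	e.next = 0
}

// Pick chooses a backend for a single request (simple round-robin here;
// real systems might use least-loaded or weighted schemes instead).
func (e *EndpointSet) Pick() (string, bool) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if len(e.addrs) == 0 {
		return "", false
	}
	addr := e.addrs[e.next%len(e.addrs)]
	e.next++
	return addr, true
}

func main() {
	eps := &EndpointSet{}
	// Hypothetical pod IPs pushed by the control plane.
	eps.Update([]string{"10.0.1.5:50051", "10.0.1.6:50051", "10.0.1.7:50051"})

	// Each request gets its own routing decision, so load spreads across pods
	// even though the underlying connections may be long-lived.
	for i := 0; i < 6; i++ {
		if addr, ok := eps.Pick(); ok {
			fmt.Println("request", i, "->", addr)
		}
	}
}
```

The key design choice this sketch illustrates is that routing moves from the infrastructure (kube-proxy, once per connection) into the client (once per request), with the control plane responsible only for keeping the endpoint list fresh.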