Disclaimer: The details in this post have been derived from the details shared online by the Datadog Engineering Team and the P99 Conference Organizers. All credit for the technical details goes to the Datadog Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

In the world of cloud monitoring, Datadog operates at a massive scale. The company’s platforms must ingest billions of data points every second from millions of hosts around the globe. This constant flow of information creates an immense engineering challenge: how do you store, manage, and query this data in a way that is both fast and cost-effective?

For the Datadog Engineering Team, the answer was to build their own solution from the ground up. In this article, we look at how the team built Monocle, a custom time series storage engine that powers their real-time metrics platform, and we analyze the technical decisions and optimizations behind it.
High-Level Metrics Platform Architecture

Before diving into the custom database engine, it is important to understand where it fits. Datadog’s custom engine, named Monocle, is just one specialized component within a much larger “Metrics Platform.” This platform is the entire system responsible for collecting, processing, storing, and serving all customer metrics.

The journey of a single data point begins at the “Metrics Edge.” This component acts as the front door, receiving the flood of data from millions of customer systems. From there, the data is passed to a “Storage Router.” As the name suggests, this router’s main job is to process the incoming data and decide where it needs to be stored.

This is where Datadog’s first major design decision becomes clear. The Datadog Engineering Team recognized that not all data queries are the same. An engineer asking for a performance report from last year has very different needs than an automated alert checking for a failure in the last 30 seconds. To serve both, they split their storage into two massive, specialized systems.
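To make the idea of two specialized systems concrete, here is a minimal Python sketch of how a query could be directed to one store or the other based on how far back it reaches. This is only an illustration, not Datadog's actual routing logic: the store names, the `choose_store` function, and the 24-hour cutoff are all assumptions introduced for the example.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Store(Enum):
    # Hypothetical labels for the two specialized storage systems.
    REAL_TIME = "real_time"
    LONG_TERM = "long_term"


# Assumed cutoff: the source does not say where the boundary lies.
RECENT_WINDOW = timedelta(hours=24)


def choose_store(query_start: datetime, now: datetime) -> Store:
    """Pick a store based on how far back the query reaches."""
    if now - query_start <= RECENT_WINDOW:
        return Store.REAL_TIME
    return Store.LONG_TERM


now = datetime(2025, 1, 1, tzinfo=timezone.utc)

# An automated alert checking the last 30 seconds.
alert_store = choose_store(now - timedelta(seconds=30), now)

# A performance report reaching back a full year.
report_store = choose_store(now - timedelta(days=365), now)
```

Under these assumptions, the alert query resolves to the real-time store and the yearly report to the long-term store, which mirrors the two kinds of needs described above.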
A time series data point has two parts:
|