Prometheus is an awesome tool to consume raw data from your application stacks and extract real data about your users’ experience and the state of your fleet. But many teams end up throwing more and more data at Prometheus, or other metrics vendors, and just assume that the tool can make sense of the chaos. Well, garbage in makes for garbage out. We will look at the common methodologies for instrumenting resources and HTTP calls and how these can be applied to produce consistency across the environment. Especially in a micro-services style environment, consistency in instrumentation is key for creating debuggable systems.
On this journey we discover that Prometheus has problematic support for percentiles, but percentiles are key for understanding our applications. As application stacks grow, metric cardinality and high error rates in percentiles can blind a whole organization. How can we use Prometheus to build consistent 30 day service level objectives across our applications? We will tackle accuracy in percentile calculations, tactics for handling cardinality, and methods to handle long service level object periods.