Reading Up on Observability and Monitoring

Title: Charity Majors on Observability and Understanding the Operational Ramifications of a System

Can you explain a little about how operational and infrastructure monitoring has evolved over the last five years? How have cloud, containers, and new (old) modular architectures impacted monitoring?

What’s interesting is that monitoring hasn’t really changed. Not in the past … 20 years.

You’ve still got metrics, dashboards, and logs. You’ve got much better ones! But monitoring is a very stable set of tools and techniques, with well known edge cases and best practices, all geared around monitoring and making sure the system is still in a known good state.

However, I would argue that the health of the system no longer matters. We’ve entered an era where what matters is the health of each individual event, or each individual user’s experience, or each shopping cart’s experience (or other high cardinality dimensions). With distributed systems you don’t care about the health of the system, you care about the health of the event or the slice.
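To make the "health of the slice" point concrete, here is a minimal sketch in Python. The event shape (`user_id`, `ok`) and the sample data are assumptions made up for illustration, not taken from the interview. It shows how an aggregate error rate can look healthy while one user's experience is completely broken.

```python
from collections import defaultdict

# Hypothetical request events: 990 successes spread over 99 users,
# plus 10 failures that all belong to a single user.
events = (
    [{"user_id": f"user-{i % 99}", "ok": True} for i in range(990)]
    + [{"user_id": "user-unlucky", "ok": False} for _ in range(10)]
)


def error_rate(evts):
    """Fraction of events that failed."""
    return sum(1 for e in evts if not e["ok"]) / len(evts)


def error_rate_by(evts, key):
    """Error rate per distinct value of a (possibly high cardinality) field."""
    groups = defaultdict(list)
    for e in evts:
        groups[e[key]].append(e)
    return {value: error_rate(group) for value, group in groups.items()}


overall = error_rate(events)                 # system-level view
per_user = error_rate_by(events, "user_id")  # slice-level view
worst_user = max(per_user, key=per_user.get)

print(f"overall error rate: {overall:.1%}")
print(f"worst slice: {worst_user} at {per_user[worst_user]:.0%}")
```

A dashboard showing only `overall` would report a system in good shape; grouping by `user_id` is what surfaces the one user for whom every request fails.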

This is why you’re seeing people talk about observability instead of monitoring, about unknown-unknowns instead of known-unknowns, and about distributed tracing, Honeycomb, and other event-level tools aimed at describing the internal state of the system to external observers.

“Monitor everything”. Dude, you can’t. You *can’t*. People waste so much time doing this that they lose track of the critical path, and their important alerts drown in fluff and cruft. In the chaotic future we’re all hurtling toward, you actually have to have the discipline to have radically *fewer* paging alerts … not more. Request rate, latency, error rate, saturation. Maybe some end-to-end checks that stress critical Key Performance Indicator (KPI) code paths.
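As a rough illustration of that short list, the sketch below derives request rate, latency, error rate, and saturation from one window of requests and pages only on a handful of thresholds. Everything named here (`Request`, `paging_signals`, `THRESHOLDS`, the threshold values, and the choice of a 99th percentile for latency) is a hypothetical assumption for the example, not something the interview prescribes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP-style status code


def paging_signals(requests: List[Request], window_s: float,
                   in_flight: int, capacity: int) -> Dict[str, float]:
    """Compute the four signals for one time window."""
    n = len(requests)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile; integer math avoids float rounding.
    rank = (99 * n + 99) // 100
    p99 = durations[rank - 1] if n else 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "request_rate": n / window_s,
        "latency_p99_ms": p99,
        "error_rate": errors / n if n else 0.0,
        "saturation": in_flight / capacity,
    }


# Hypothetical paging thresholds; request rate is reported but not paged on.
THRESHOLDS = {"latency_p99_ms": 500.0, "error_rate": 0.02, "saturation": 0.9}


def breached(signals: Dict[str, float]) -> List[str]:
    """Names of the signals that exceed their paging threshold."""
    return sorted(name for name, limit in THRESHOLDS.items()
                  if signals[name] > limit)


# Sample window: 100 requests over 10 seconds, 5 of them server errors.
window = [Request(duration_ms=float(i), status=500 if i <= 5 else 200)
          for i in range(1, 101)]
signals = paging_signals(window, window_s=10.0, in_flight=40, capacity=50)
alerts = breached(signals)
print(signals, alerts)
```

The point of keeping `THRESHOLDS` to three entries is the discipline described above: a small, fixed set of pages, with everything else left for exploration rather than alerting.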

Title: Monitoring and Observability

…because devs don’t like to do “monitoring”… oh really?
Yes, yes the snark gets a high rating.
If I were to cry out of pity for people, this is one of the things I’d cry about for everyone.

Title: Observability vs. Monitoring, is it about Active vs. Passive or Dev vs. Ops?

For Observability, the system, code, developers, etc. are taking steps to make things available to make the system more observable. This often starts with increasingly rich and structured logs, plus events or markers, JMX data points, and Etsy-style emitted metrics. Loved and tended to by Developers and the most modern Ops.
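A minimal sketch of what "rich and structured logs" can look like, using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, the `fields` key passed through `extra`, and the sample field names are all assumptions chosen for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Capture output in memory here; a real service would write to stdout or a file.
buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info(
    "checkout completed",
    extra={"fields": {"user_id": "u-123", "cart_id": "c-456", "duration_ms": 182}},
)

event = json.loads(buffer.getvalue())
print(event)
```

Because each line is a self-describing event with fields such as `user_id` and `cart_id`, it can later be filtered or grouped by those fields instead of being searched as free text.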

Observability elements, on the other hand, are often much more detailed, more diverse, and used more for debugging, complex troubleshooting, performance analyses, and generally going ‘deeper’.


