You remember that time you passive-aggressively completed a story in the most useless way possible to check that checkbox? That’s most monitoring systems.
Take a look at your project’s compilation warnings. If you’re using npm, you’ll see a mile-long list of impossible-to-resolve deprecation warnings and quickly realize how often people ignore issues. Still, something has everyone convinced that people actually want to fix things. What leads to this massive disconnect? Bad monitoring. Let’s go over the traits of good and bad systems.
Good: Active Alerts, Passive Support
Any good monitoring system should swiftly deliver a notification to the person with the power to make a change, in a way that gets them to review it immediately.
Bad: Passive Dashboards, Active Support
Dashboards require people to actively use them. This is a problem for a number of reasons. The first is that it delays the time between an issue occurring and someone being notified. This leads to the second, which is that people have to ACTUALLY USE the dashboards periodically, and motivating people to use dashboards is like pulling teeth. They won’t. I’ve never seen someone actively use a dashboard they didn’t create themselves. And of course they don’t want to use it. They would have to take time out of their busy schedule to look at a dashboard more relevant to someone else.
Don’t believe me? I once worked for a monitoring service and we eliminated our customers’ dashboards. A few people complained, but it didn’t impact our numbers even slightly.
Good: Accuracy over Precision
This is a topic that confuses a lot of people. Accuracy is how truthful something is. Precision is how specific something is. You want every single alert you receive to identify an issue correctly, even if that means it’s only saying “I know something is broken.”
Bad: Precision over Accuracy
This is one of the most painful uphill battles you will fight in monitoring. People naturally think they want to know as much as possible about an issue upfront. With experience, I can tell you that’s a disaster. Specific alerting introduces complexity. With complexity come bugs. With bugs you lose trust. Trust in your monitoring is the most critical aspect. You’re better off having no monitoring than monitoring you don’t trust. Untrusted monitoring won’t just lead to missed issues (and catching issues is the entire purpose of monitoring); it will also waste time. In monitoring, false negatives are better than false positives. If false negatives are an issue, reduce your precision to broaden the number of real issues your system captures.
Good: HTTP 500
Any HTTP 500 is something you want to know about, and there is never an excuse for ignoring one. A team complaining that their errors are inflated because of invalid HTTP 500s is a team that’s lying to your face.
Bad: HTTP 400
On the other hand, HTTP 400-level errors mean the client screwed up. Maybe they just had a bookmarked link to a page that doesn’t exist anymore. HTTP 400-level errors on their own do not indicate a problem.
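Here’s a minimal sketch of that rule in Python, assuming you already have a list of response status codes to look at. The names `should_alert` and `summarize` are made up for illustration; they aren’t from any particular monitoring tool.

```python
from collections import Counter


def should_alert(status_code: int) -> bool:
    """Alert only on server errors (5xx). Client errors (4xx) alone are not a problem."""
    return 500 <= status_code <= 599


def summarize(status_codes):
    """Bucket status codes into alert-worthy, client-error, and ok counts."""
    counts = Counter()
    for code in status_codes:
        if should_alert(code):
            counts["alert"] += 1
        elif 400 <= code <= 499:
            # Counted for context only; these never trigger an alert by themselves.
            counts["client_error"] += 1
        else:
            counts["ok"] += 1
    return counts


summary = summarize([200, 404, 500, 200, 400, 503])
print(summary["alert"], summary["client_error"], summary["ok"])  # prints: 2 2 2
```

The point of keeping the rule this dumb is that there’s nothing in it to argue about or to break.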
Good: Low Maintenance
If your system doesn’t require a plugin to automatically detect the issues you care about, don’t install the plugin. Why would you add the extra complexity and maintenance? In addition, your monitors should be opt-in/opt-out rather than driven by complex configuration. Make a way to identify all the things you care about generically, then set a monitor around that.
Bad: Specific Monitors
Specific monitors increase maintenance. If services A, B, and C are all lambdas on your team but services D and E aren’t, don’t make a monitor that says “Keep an eye on A AND B AND C.” Group services by tagging A, B, and C with your team name and make a monitor that says “Keep an eye on services with Team == MY_TEAM.” Eventually, you will add service F and no one will remember to go into the monitoring service and add it, but they will see the tags on the other lambdas and tag the new one correctly.
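To make the difference concrete, here’s a rough Python sketch of a tag-based selector. The `Service` class and `monitored_services` function are hypothetical stand-ins I made up for this example; your monitoring service will have its own way of filtering on tags.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical stand-in for a deployed service and its tags."""
    name: str
    tags: dict = field(default_factory=dict)


def monitored_services(services, team):
    """Select services by tag instead of keeping a hard-coded list of names."""
    return [s.name for s in services if s.tags.get("Team") == team]


services = [
    Service("A", {"Team": "MY_TEAM"}),
    Service("B", {"Team": "MY_TEAM"}),
    Service("C", {"Team": "MY_TEAM"}),
    Service("D", {"Team": "OTHER_TEAM"}),
    Service("E"),  # untagged, so no team monitor picks it up
]
print(monitored_services(services, "MY_TEAM"))  # prints: ['A', 'B', 'C']

# Later, someone adds service F and copies the tags from its neighbors.
# The monitor picks it up with no change to the monitor itself.
services.append(Service("F", {"Team": "MY_TEAM"}))
print(monitored_services(services, "MY_TEAM"))  # prints: ['A', 'B', 'C', 'F']
```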
Good: Audience Cares
When we rolled out a new monitoring system to reduce the number of errors in our department, all the teams were tasked with integrating it. After an audit, we found only half the teams had actually set up their systems so that they received the new alerts. A member of one of the teams that succeeded told me they initially received more alerts, but those alerts resulted in bug fixes and now their number of alerts is much lower. That team currently has the highest successful-vs-total-requests ratio in the department.
Bad: Audience Apathy
Conversely, before the rollout, a member of the team with the lowest successful-vs-total-requests ratio told me no one on his team was passionate about that tool. If your audience doesn’t care about the errors, they’re not going to care about the alerting. To go even further, if the engineers care about their errors, they will find their own way of alerting.
Good: Power
If you have the power to fix a bug, you’re much more likely to fix it. This sounds obvious except it’s not because…
Bad: Moral Hazard
I started with a system where another team would publish data to us in an eventing architecture and would frequently publish corrupt data. It was my team’s responsibility to address any time data was not ingested correctly into our system. As a result, we had floods of errors in our system. We tried asking them to stop and they said no. We tried asking management if we could stop making successful ingestion our responsibility and they said no. We had no way to distinguish between errors caused by their testing (which would happen hourly) and bad data they accidentally sent us because of a bug. I have no idea how many issues were missed because people stopped reading our alerts. Eventually, we killed that system and now every alert is within our control and addressed immediately, but that still haunts me to this day.