
We are building a (Zabbix-based) monitoring system for our applications; however, I'm having difficulties defining what to monitor.

I have so far come up with the following general categories:
- middleware data: performance/health of MySQL instances, Tomcat instances, JVMs, etc.
- logical or application data: the current status/health of the system
- dashboard: quick overview of the system (e.g. number of active users, page requests, etc.)

Are there any other fundamental categories to monitor? Or is there another category system to use?

The goals of the monitoring system are to:
- see if the system functions correctly (at a high level, e.g. microservices are running or not)
- see if there are any indicators that the system is likely to crash (e.g. historical data predicts that we will run out of disk space)
- if any of these occur, send a warning to the appropriate staff

UPDATE: the complexity of our system does not demand an extra application for reporting (e.g. monitoring KPIs). Also, we are running in local/local cloud infrastructure, so the cost of the application is not (that) relevant, but it might be someday :-)

I'm surprised nobody mentioned the four golden signals explicitly as an answer, so I'll add it. Lifted directly from Google's SRE Book chapter on monitoring distributed systems, it is suggested to at least collect metrics on the "Four Golden Signals":

Latency: It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

Traffic: A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
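
To make the latency and traffic signals a bit more concrete, here is a minimal sketch of my own (not from the SRE book). The `Request` record, the two function names, and the choice to count only 5xx responses as failures are all assumptions for illustration; in practice the data would come from access logs or whatever your monitoring agent collects.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One served request (hypothetical record, e.g. parsed from an access log)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request
    kind: str          # "static" or "dynamic"


def latency_by_outcome(requests):
    """Median latency in ms, kept separate for successful and failed requests.

    Assumption: only 5xx responses count as failures.
    """
    buckets = defaultdict(list)
    for r in requests:
        outcome = "error" if r.status >= 500 else "success"
        buckets[outcome].append(r.latency_ms)
    return {outcome: median(values) for outcome, values in buckets.items()}


def traffic_per_second(requests, window_s):
    """Requests per second over a window of window_s seconds, per request kind."""
    counts = defaultdict(int)
    for r in requests:
        counts[r.kind] += 1
    return {kind: n / window_s for kind, n in counts.items()}


sample = [
    Request(200, 120.0, "dynamic"),
    Request(200, 80.0, "static"),
    Request(500, 5.0, "dynamic"),     # fast error, e.g. backend connection lost
    Request(500, 3000.0, "dynamic"),  # slow error, worse than a fast one
    Request(200, 100.0, "dynamic"),
]

print(latency_by_outcome(sample))
print(traffic_per_second(sample, window_s=2.0))
```

Keeping the error bucket separate means the fast 500 does not drag the success latency down, while the slow 500 still shows up instead of being filtered out.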