2.1.6 Infrastructure and Application Monitoring: Causal Analysis
When building a dynamic infrastructure, end-user experience is almost always the most important KPI. For example, if CPU utilization on our virtual machines is constantly around 90%, we are simply making efficient use of the resources we pay for. As long as end-user experience is where it should be, high CPU or memory usage is not a problem in itself. This is the target in our control model: we want to drive resource consumption on each virtual machine as high as possible, without adding virtual machines, for as long as end-user experience stays where it should.
When the KPI falls outside the range we consider acceptable, we need to investigate whether a scaling event is warranted. A degraded KPI does not by itself mean that we should scale out: the degradation could be caused by an actual fault rather than by a capacity shortage.
We need to perform a causal analysis on the environment to determine whether we should scale, or whether we should trigger a fault alert and have someone intervene.
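The decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed implementation: the `Snapshot` fields, the `decide` function, and every threshold are hypothetical, and a real causal analysis would usually correlate many more signals. The rule assumed here is that a degraded KPI is treated as a capacity issue only when the machines are saturated, load is above its usual level, and the error rate is still normal; anything else is handed to a person.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    SCALE_OUT = "scale_out"
    FAULT_ALERT = "fault_alert"


@dataclass
class Snapshot:
    """Hypothetical set of signals sampled from the environment."""
    response_time_ms: float           # end-user experience KPI
    cpu_utilization: float            # 0.0 to 1.0, averaged across the VMs
    error_rate: float                 # fraction of failed requests
    requests_per_sec: float           # current load
    baseline_requests_per_sec: float  # typical load for this time window


def decide(s: Snapshot,
           kpi_limit_ms: float = 500.0,    # assumed acceptable response time
           saturation: float = 0.85,       # assumed CPU saturation level
           max_error_rate: float = 0.02,   # assumed normal error ceiling
           load_growth: float = 1.2) -> Action:
    """Return the action to take for one snapshot of the environment."""
    # KPI is healthy: high resource usage alone is not a reason to act.
    if s.response_time_ms <= kpi_limit_ms:
        return Action.NONE

    saturated = s.cpu_utilization >= saturation
    load_up = s.requests_per_sec >= load_growth * s.baseline_requests_per_sec
    errors_normal = s.error_rate <= max_error_rate

    # Degradation explained by demand exceeding capacity: scale out.
    if saturated and load_up and errors_normal:
        return Action.SCALE_OUT

    # Degradation not explained by capacity: raise a fault alert instead.
    return Action.FAULT_ALERT
```

For example, `decide(Snapshot(200, 0.90, 0.0, 100, 100))` returns `Action.NONE` even at 90% CPU, because the KPI is still within its limit, while a slow response time on lightly loaded machines produces `Action.FAULT_ALERT` rather than a scale-out.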