2.1.6 Infrastructure and Application Monitoring: Causal Analysis
When you create a dynamic infrastructure, end-user experience is almost always the most important KPI. For example, if CPU utilization on the virtual machines is a constant 90%, it means that paid resources are being used efficiently. As long as end-user experience is where it should be, high CPU or memory usage is not a critical metric. This is the target of the control model: drive resource consumption on each virtual machine as high as possible, without requiring additional virtual machines, as long as end-user experience does not degrade.
When the KPI falls outside its acceptable range, a scaling event must be investigated. This does not necessarily mean that the incident requires scaling out. The KPI degradation could be caused by an actual fault rather than by a capacity shortfall.
You must perform a causal analysis on the environment to determine whether to scale out, or whether the event should trigger a fault alert that requires intervention.
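The decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed implementation: the metric names (response_time_ms, cpu_utilization, error_rate), the threshold values, and the rule that an elevated error rate or idle resources point to a fault are all hypothetical. A real causal analysis would draw on many more signals.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"                # KPI is healthy, leave the environment alone
    SCALE_OUT = "scale_out"      # capacity issue, add virtual machines
    FAULT_ALERT = "fault_alert"  # real problem, requires intervention


@dataclass
class Snapshot:
    """Hypothetical metrics sample; field names are assumptions."""
    response_time_ms: float  # end-user experience KPI
    cpu_utilization: float   # 0.0-1.0, averaged across the virtual machines
    error_rate: float        # fraction of failed requests, 0.0-1.0


def decide(snapshot: Snapshot,
           kpi_limit_ms: float = 500.0,    # assumed acceptable KPI range
           saturation: float = 0.85,       # assumed "resources exhausted" level
           max_error_rate: float = 0.02) -> Action:
    # KPI within range: high CPU alone is not a reason to act.
    if snapshot.response_time_ms <= kpi_limit_ms:
        return Action.NONE
    # KPI degraded with elevated errors: treat as a fault, not a capacity issue.
    if snapshot.error_rate > max_error_rate:
        return Action.FAULT_ALERT
    # KPI degraded, resources saturated, no errors: capacity is the likely cause.
    if snapshot.cpu_utilization >= saturation:
        return Action.SCALE_OUT
    # KPI degraded while resources are idle: more machines would not help.
    return Action.FAULT_ALERT


if __name__ == "__main__":
    print(decide(Snapshot(200.0, 0.90, 0.0)))  # healthy KPI at 90% CPU
    print(decide(Snapshot(900.0, 0.95, 0.0)))  # degraded KPI, saturated CPU
    print(decide(Snapshot(900.0, 0.30, 0.0)))  # degraded KPI, idle CPU
```

Note that the first check mirrors the control model: while the KPI is within range, no action is taken regardless of utilization.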