8.4.3 Process Automation and Tool Alignment/Integration
The Event, Incident and Problem Management processes for vCloud depend on tooling. Without the appropriate tools in place, it is difficult to manage and operate the environment while sustaining the required service levels. Traditionally, Event, Incident and Problem Management has relied heavily on tooling, and in a vCloud the scope of the required tools increases. This is due to additional vCloud requirements, such as a greater need for early warning of impending incidents and a higher level of automation. For early warning, increased tool functionality (for example, smart alerts, dynamic thresholds, and intelligent analytics) helps fulfill this requirement. For a higher level of automation, additional tools, such as vCenter Orchestrator, are required.
To realize the vCloud benefits of reliability and lower OPEX, it is not sufficient only to interpret events to highlight incidents and problems. It is also necessary to establish how incidents can be identified more efficiently, how remediation can be put in place more quickly, and how the root cause can be identified to prevent the problem from recurring.
As the vCloud resources and services supplied to vCloud customers are based on underlying vSphere resources, it is possible to use tools that manage and monitor at the vSphere level.
vCenter Operations Manager can be used to provide an up-to-date understanding of the health of the vSphere environment as it relates to the vCloud provider virtual datacenters (Figure 23).
Figure 23. vCenter Operations Manager Event and Incident Management
 
The Health badge shows a score that indicates the overall health of the selected object. The object can be a vCenter instance, a vSphere datacenter, cluster, host, or datastore. This monitoring mechanism provides proactive analysis of the performance of the environment and identifies when the health of the object reaches a level that indicates an incident may be about to occur. To support effective management, the vCloud NOC can be provided with a dashboard that shows the key metrics that indicate the health of the environment.
The score shown for the Health badge is calculated from a number of sub-badges:
*Workload, which provides a view of how hard the selected object is working.
*Anomalies, which provides an understanding of metrics that are outside of their expected range.
*Faults, which provides detail of any infrastructure events that may impact the selected object's availability.
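The idea behind the Anomalies sub-badge, a metric falling outside its expected range, can be illustrated with a simple dynamic threshold. The following is a minimal sketch only, not the analytics that vCenter Operations Manager actually uses: it assumes the expected range is the mean of recent samples plus or minus k standard deviations, and the names dynamic_bounds, is_anomalous, and cpu_history are hypothetical.

```python
from statistics import mean, stdev


def dynamic_bounds(history, k=3.0):
    """Return the (low, high) expected range derived from recent samples.

    Assumption for illustration: the range is mean +/- k sample standard
    deviations. This is not the product's algorithm.
    """
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate a range")
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """Return True when value falls outside the range learned from history."""
    low, high = dynamic_bounds(history, k)
    return value < low or value > high


# Hypothetical CPU usage samples (percent) for one host.
cpu_history = [40.0, 42.0, 41.0, 39.0, 43.0, 40.0, 41.0, 42.0]

print(is_anomalous(41.5, cpu_history))  # within the learned range
print(is_anomalous(95.0, cpu_history))  # far outside the learned range
```

Unlike a fixed threshold, the range here adapts to what is normal for each metric, which is what allows a deviation to be flagged before a hard limit is reached.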
For faults, active vCenter events or alerts are used. These can include host hardware events, virtual machine FT and HA issues, vCenter health issues, cluster HA issues, and so on. These vCenter alerts are supplied through the vSphere adapter into vCenter Operations Manager, and can be used to identify root cause. Additionally, alerts are generated by vCenter Operations Manager if a sub-badge score hits a predefined value.
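The alert-on-threshold behavior described above can be sketched as follows. This is an illustrative model under stated assumptions, not the vCenter Operations Manager implementation or API: the threshold values, the Alert type, and the evaluate_sub_badges function are hypothetical, and the sketch assumes that a higher sub-badge score is worse.

```python
from dataclasses import dataclass

# Hypothetical predefined values; real thresholds are configured in the product.
ALERT_THRESHOLDS = {"workload": 80, "anomalies": 70, "faults": 50}


@dataclass
class Alert:
    object_name: str
    sub_badge: str
    score: int
    threshold: int


def evaluate_sub_badges(object_name, scores, thresholds=ALERT_THRESHOLDS):
    """Return one Alert for each sub-badge whose score reaches its threshold.

    Assumption: higher scores are worse, so an alert is raised when
    score >= threshold. Sub-badges without a threshold are ignored.
    """
    alerts = []
    for badge, score in scores.items():
        threshold = thresholds.get(badge)
        if threshold is not None and score >= threshold:
            alerts.append(Alert(object_name, badge, score, threshold))
    return alerts


# Hypothetical scores for a single host.
alerts = evaluate_sub_badges(
    "esx-host-01", {"workload": 85, "anomalies": 20, "faults": 50}
)
for alert in alerts:
    print(f"{alert.object_name}: {alert.sub_badge} score {alert.score} "
          f"reached threshold {alert.threshold}")
```

In this example the workload and faults scores reach their thresholds and produce alerts, while the anomalies score does not.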
The events or alerts appear as faults as shown in Figure 24.
Figure 24. vCenter Operations Manager Faults
Any fault can be selected to gain further information. In this example, the event is associated with a host and indicates that an uplink has been lost (see the following figure).
Figure 25. vCenter Operations Alert
 
In addition to using vCenter Operations Manager for vSphere metrics and events, VMware vFabric Hyperic can be used to provide operating system and application metrics. Providing these metrics to vCenter Operations Manager further enhances the Incident Management toolset.