8.3.1 Performance Management Process Definition and Components
The high-level event, incident, and problem processes for performance management shown in Figure 19 apply to both vCloud providers and tenants. These processes look the same as any traditional Performance Management process. However, the dynamic nature of the vCloud and the drive to reduce OpEx mean that the process has to be more agile and rely less on manual intervention. Manual performance management may be appropriate for the physical infrastructure world and the early stages of virtualization adoption, but only tooling and automation can provide the level of performance management required for the vCloud.
At a high level, the objectives of event, incident, and problem processes for performance management are to automate as much as possible and maximize the number of tasks that can be performed by level 1 operators, rather than level 2 administrators or level 3 subject matter experts (SMEs). The following are possible ways to handle events, incidents, or problems, listed in order of preference:
1. Automated Workflows – These workflows are totally automatic and can be initiated by predefined events or support personnel.
2. Interactive Workflows – These workflows require human interaction and can be initiated by predefined events or support personnel.
3. Level 1 support – Operators monitor systems for events. They are expected to follow runbook procedures for reacting to events, which might include executing predefined workflows.
4. Level 2 support – Administrators with basic technology expertise handle most routine tasks and execute predefined workflows.
5. Level 3 support – SMEs for the various technologies handle the most difficult issues, and are also responsible for defining the workflows and runbook entries that allow Level 1 operators and Level 2 administrators to handle more events and incidents. This is described in more detail in Section 8.4, Event, Incident, and Problem Management.
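The order of preference above can be sketched as a simple selection rule: among the options able to handle a given event, pick the one highest on the list. The following Python sketch is illustrative only. The event type names and the `CAPABILITIES` map are hypothetical and do not come from any VMware product.

```python
from enum import IntEnum


class Handler(IntEnum):
    """Handling options in the order of preference listed above (lower is preferred)."""
    AUTOMATED_WORKFLOW = 1
    INTERACTIVE_WORKFLOW = 2
    LEVEL_1_SUPPORT = 3
    LEVEL_2_SUPPORT = 4
    LEVEL_3_SUPPORT = 5


# Hypothetical map: event type -> options that are able to handle it.
CAPABILITIES = {
    "datastore_latency_high": {Handler.AUTOMATED_WORKFLOW, Handler.LEVEL_2_SUPPORT},
    "vm_cpu_contention": {Handler.INTERACTIVE_WORKFLOW, Handler.LEVEL_1_SUPPORT},
    "storage_firmware_fault": {Handler.LEVEL_3_SUPPORT},
}


def preferred_handler(event_type):
    """Return the most preferred option that can handle the given event type."""
    options = CAPABILITIES.get(event_type)
    if not options:
        raise KeyError(f"no handler defined for event type: {event_type}")
    # IntEnum members compare by value, so min() picks the most preferred option.
    return min(options)
```

As Level 3 SMEs define new workflows and runbook entries, entries in a map like this would gain more preferred options, which is how work shifts toward automation and Level 1 operators.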
8.3.1.1. Event Management Process for Performance Management
As the following figure shows, there are multiple ways for performance events to be generated.
*vCenter Operations Manager Early Warning Smart Alerts – These alerts are the result of multiple metrics showing a change in behavior. They are typically reviewed by Level 2 Administrators in an attempt to determine if an incident has occurred.
*vCenter Operations Manager Key Performance Indicator (KPI) Smart Alerts – These alerts are the result of anomalous behavior from pre-defined KPIs or Super Metrics. Because these alerts are more specific, they are more readily automated with workflows.
*Service Desk receives a call from a user to report a performance issue.
*The Level 1 Operator receives an alert from the monitoring system regarding a performance issue.
Figure 19. High-Level Event Management Process for Performance
 
If a performance event is identified as a known issue, it might trigger a predefined action such as an automated workflow, interactive workflow, or runbook procedure. If the event does not have a definition, it becomes an incident that a Level 2 Administrator or Level 3 SME must resolve.
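The triage rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `KNOWN_ISSUES` catalog, the issue signatures, and the action names are all hypothetical, and a real deployment would take them from its own workflow and runbook definitions.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str     # for example "kpi_smart_alert" or "service_desk_call"
    signature: str  # hypothetical key identifying the kind of issue


# Hypothetical catalog: known issue signature -> (action kind, action name).
KNOWN_ISSUES = {
    "vm_memory_pressure": ("automated_workflow", "rebalance_cluster"),
    "datastore_latency": ("interactive_workflow", "storage_migration"),
    "host_cpu_saturation": ("runbook_procedure", "host-cpu-runbook"),
}


def triage(event):
    """Map a performance event to a predefined action or open an incident."""
    action = KNOWN_ISSUES.get(event.signature)
    if action is not None:
        kind, name = action
        return {"outcome": "predefined_action", "kind": kind, "name": name}
    # No definition exists: the event becomes an incident for Level 2 or Level 3.
    return {"outcome": "incident", "assigned_to": "level_2_or_level_3"}
```

For example, `triage(Event("kpi_smart_alert", "datastore_latency"))` selects the interactive workflow, while an event with an unrecognized signature is returned as an incident.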
8.3.1.2. Incident Management Process for Performance Management
As Figure 20 shows, there are different ways to resolve performance incidents, depending on how they are generated.
*Lack of tenant capacity – When a tenant’s capacity is fully used, events can be triggered depending on how the tenant’s lease is defined. If the tenant purchased a bursting ability, resources in excess of the tenant’s base usage can be added at a premium cost. If bursting has not been purchased or is not available, the tenant should be notified that their capacity is fully used.
*Lack of provider capacity – This should never happen if the design guidance for proactive capacity management is established and effective. If capacity is fully used, the service provider must either add more capacity or move capacity around to address the issue. This condition should be reported to Capacity Management and may result in SLA breaches for tenants.
*Hardware or software failure – Performance issues can be the result of software or hardware errors such as host failures, configuration errors, bad software updates, or other repairable issues. If insufficient redundancy is built into the overall vCloud, these types of errors can also result in SLA breaches for tenants.
Figure 20. High-Level Incident Management Process for Performance
If the incident is high priority or a chronic issue, turn it over to Problem Management for further analysis.
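The incident paths described in this section can be summarized as a small decision function. The Python sketch below is an illustration, not a definitive implementation: the cause labels, parameter names, and step descriptions are hypothetical and simply restate the guidance above.

```python
def resolve_incident(cause, *, bursting_purchased=False, bursting_available=False,
                     high_priority=False, chronic=False):
    """Return the ordered list of steps for a performance incident."""
    if cause == "tenant_capacity":
        if bursting_purchased and bursting_available:
            steps = ["add burst resources at premium cost"]
        else:
            steps = ["notify tenant that capacity is fully used"]
    elif cause == "provider_capacity":
        steps = ["add or move provider capacity",
                 "report condition to Capacity Management"]
    elif cause == "hardware_or_software_failure":
        steps = ["repair the failed component or configuration"]
    else:
        raise ValueError(f"unknown incident cause: {cause}")

    # High-priority or chronic incidents are handed to Problem Management.
    if high_priority or chronic:
        steps.append("turn over to Problem Management")
    return steps
```

Keeping the Problem Management handoff as a final, cause-independent check mirrors the text: any incident type can be escalated if it is high priority or recurs.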
8.3.1.3. Problem Management Process for Performance Management
As shown in the following figure, the primary goal of Problem Management is to identify the root cause of a problem. After the root cause is identified, develop and implement an action plan to avoid the problem in the future.
*The preferred method is to fix the root cause so that the problem never occurs again.
*If the problem cannot be eliminated, workflows and runbook entries must be defined so that the problem can be quickly resolved if it occurs again. KPIs and Super Metrics can be defined to help identify an issue before it becomes a problem.
Figure 21. High-Level Problem Management Process for Performance