5.3 VM Metric Database
Beginning with vCloud Director 5.6, virtual machine performance and resource consumption metrics are collected, and historical data is available for up to two weeks.
Table 3. Virtual Machine Performance and Resource Consumption Metrics
Metric Name | Type | Unit | Description
--- | --- | --- | ---
cpu.usage.average | Rate | Percent | Host view of average actively used CPU as a percentage of total available
cpu.usagemhz.average | Rate | Megahertz | Host view of actively used CPU as a raw measurement
cpu.usage.maximum | Rate | Percent | Host view of maximum actively used CPU as a percentage of total available
mem.usage.average | Absolute | Percent | Memory usage as a percentage of total configured or available memory
disk.provisioned.latest | Absolute | Kilobytes | Storage space potentially used
disk.used.latest | Absolute | Kilobytes | Storage space actually used
disk.read.average | Rate | Kilobytes per second | Read rate aggregated across all datastores
disk.write.average | Rate | Kilobytes per second | Write rate aggregated across all datastores
Both current and historical metrics can be retrieved through the vCloud API. Current metrics are read directly from the vCenter Server database through the Performance Manager API. Historical metrics are collected every 5 minutes (with 20-second granularity) by the StatsFeeder process running on the cell that holds the vCenter Server proxy, and are pushed to persistent storage: a Cassandra NoSQL database cluster with the KairosDB database schema and API.
Note: The use of KairosDB will be deprecated in future vCloud Director releases.
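For example, both retrieval paths can be exercised with curl. This is a sketch only; the cell address, credentials, session token, and VM identifier are placeholders to replace with real values:

# Log in and capture the session token from the x-vcloud-authorization header
curl -sk -D - -X POST -u 'administrator@System:password' \
    -H 'Accept: application/*+xml;version=5.6' \
    https://vcloud.example.com/api/sessions | grep -i x-vcloud-authorization

# Current metrics, read live from vCenter Server
curl -sk -H 'Accept: application/*+xml;version=5.6' \
    -H 'x-vcloud-authorization: <token>' \
    https://vcloud.example.com/api/vApp/vm-<id>/metrics/current

# Historical metrics, served from the Cassandra/KairosDB store
curl -sk -H 'Accept: application/*+xml;version=5.6' \
    -H 'x-vcloud-authorization: <token>' \
    https://vcloud.example.com/api/vApp/vm-<id>/metrics/historic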
The following figure depicts the recommended VM metric database design. Multiple Cassandra nodes are deployed on the same network, and a KairosDB instance runs on each node, providing an API endpoint for vCloud Director cells to store and retrieve data. For high availability, load balance all KairosDB instances behind a single virtual IP address, which is configured as the VM metric endpoint with the cell management tool.
Figure 8. VM Metric Database Design
 
The following are VM metric database design considerations:
Currently, only KairosDB 0.9.1 and Cassandra 1.2.x/2.0.x are supported.
The minimum cluster size is three nodes (the number of nodes must be equal to or greater than the replication factor). Use a scale-out rather than a scale-up approach, because Cassandra performance scales linearly with the number of nodes.
Estimate I/O and storage requirements based on the expected number of VMs, and size the Cassandra cluster and its storage accordingly:
n – Expected number of VMs
m – Number of metrics per VM (currently 8)
t – Retention (days)
r – Replication factor
Write I/O per second = n × m × r / 10
Storage = n × m × t × r × 114 KB
For 30,000 VMs, the estimates are 72,000 write IOPS and 3,288 GB of storage (a worst-case scenario with 6 weeks of data retention and a replication factor of 3).
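The calculation is easy to script when comparing retention or replication settings. The following sketch simply evaluates the two formulas above with the example inputs:

#!/bin/bash
# Sizing sketch for the formulas above; inputs are the 30,000 VM example
n=30000    # expected number of VMs
m=8        # metrics per VM
t=42       # retention in days (6 weeks)
r=3        # replication factor

echo "Write IOPS:   $(( n * m * r / 10 ))"
# 114 KB per metric per VM per retention day, converted to GB
echo "Storage (GB): $(( n * m * t * r * 114 / 1024 / 1024 ))"

With these inputs, the script prints 72000 and 3287 (integer arithmetic truncates the 3,287.7 GB result that rounds to the 3,288 GB cited above).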
Enable Leveled Compaction Strategy (LCS) on the Cassandra cluster to improve read performance.
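For example, assuming the default KairosDB 0.9.1 schema, the main column family can be switched to LCS through cqlsh (the keyspace and table names below assume a default installation and may differ in your deployment):

# Switch the KairosDB data column family to Leveled Compaction Strategy
echo "ALTER TABLE kairosdb.data_points
      WITH compaction = { 'class' : 'LeveledCompactionStrategy' };" | cqlsh <cassandra-node>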
Install JNA (Java Native Access) version 3.2.7 or later on each node, because it improves Cassandra memory usage (no JVM swapping).
For heavy read utilization (many tenants collecting performance statistics) and availability, VMware recommends increasing the replication factor to 3.
Recommended size of one Cassandra node: 8 vCPUs (more CPUs improve write performance), 16 GB RAM (more memory improves read performance), and 2 TB of storage (each node backed by separate LUNs/disks with high IOPS performance).
KairosDB does not enforce a data retention policy, so old metric data must be cleared regularly with a script. The following example deletes one month of data:
#!/bin/bash
# Arrays and [[ ]] are bash features, so the script must run under bash.

if [ "$#" -ne 4 ]; then
    echo "Usage: $0 <kairosdb-vip> <port> <month> <year>"
    exit 1
fi

# Refuse to delete anything newer than the 6-week retention window
DAYS=$(( ( $(date -ud 'now' +'%s') - $(date -ud "${4}-${3}-01 00:00:00" +'%s') ) / 60 / 60 / 24 ))
if [ "$DAYS" -lt 42 ]; then
    echo "Month to delete must be at least 6 weeks in the past"
    exit 1
fi

# Collect all VM metric names (cpu, mem, disk, net, sys) from the KairosDB API
METRICS=( $(curl -s -k http://$1:$2/api/v1/metricnames -X GET | sed -e 's/[{}]//g' \
    | awk '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}' \
    | tr -d '[":]' | sed 's/results//g' | grep -w "cpu\|mem\|disk\|net\|sys") )
echo "${METRICS[@]}"

for var in "${METRICS[@]}"; do
    for day in $(seq 1 31); do
        # Skip days that do not exist in this month (for example, 31 April)
        if ! date -d "$3/$day/$4" > /dev/null 2>&1; then
            continue
        fi
        # Day boundaries as millisecond epoch timestamps, as KairosDB expects
        STARTDAY=$(( $(date -d "$3/$day/$4" +%s%N) / 1000000 ))
        ENDDAY=$(( $(date -d "$3/$day/$4 + 1 day" +%s%N) / 1000000 ))
        echo "Deleting $var for $3/$day/$4"
        # Build the KairosDB delete query for one metric over one day
        cat > /tmp/metricsquery <<EOF
{
    "metrics": [
        {
            "tags": {},
            "name": "${var}"
        }
    ],
    "cache_time": 0,
    "start_absolute": "${STARTDAY}",
    "end_absolute": "${ENDDAY}"
}
EOF
        curl http://$1:$2/api/v1/datapoints/delete -X POST -d @/tmp/metricsquery
    done
done
rm -f /tmp/metricsquery > /dev/null 2>&1
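For example, to purge the March 2014 data through the load-balanced KairosDB virtual IP (the script name, address, and port are placeholders):

# Delete all VM metric data points for March 2014
./purge-vm-metrics.sh 192.0.2.10 8080 03 2014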
Note: The space gains are not seen until data compaction occurs and the delete marker columns (tombstones) expire (by default after 10 days). This period can be changed by adjusting the gc_grace_seconds property of the KairosDB column families.
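For example, the grace period on the main KairosDB column family can be shortened with cqlsh; the names below assume a default KairosDB installation, and the same change applies to its other column families:

# Reduce the tombstone grace period from 10 days (864000 s) to 5 days
echo "ALTER TABLE kairosdb.data_points
      WITH gc_grace_seconds = 432000;" | cqlsh <cassandra-node>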
KairosDB v0.9.1 uses the Quorum consistency level for both reads and writes. Quorum is calculated as (replication factor / 2) + 1, rounded down, and the quorum number of replica nodes must be available for both reads and writes. Data is assigned to nodes through a hash algorithm, and every replica has equal importance. For example, with a replication factor of 3, the quorum is 2, so the cluster can serve reads and writes with one node down. The following table provides guidance on replication factor and cluster size configurations.
Table 4. Cassandra Configuration Guidance
Replication Factor | Cluster Size | Amount of Data per Node | Quorum | Availability
--- | --- | --- | --- | ---
1 | 1 | 100% | 1 | Does not tolerate any node loss
1 | 2 | 50% | 1 | Does not tolerate any node loss
1 | 3 | 33% | 1 | Does not tolerate any node loss
2 | 2 | 100% | 2 | Does not tolerate any node loss
2 | 3 | 67% | 2 | Does not tolerate any node loss
2 | 4 | 50% | 2 | Does not tolerate any node loss
3 | 3 | 100% | 2 | Tolerates loss of one node
3 | 4 | 75% | 2 | Tolerates loss of one node
3 | 5 | 60% | 2 | Tolerates loss of one node
4 | 4 | 100% | 3 | Tolerates loss of one node
4 | 5 | 80% | 3 | Tolerates loss of one node
5 | 5 | 100% | 3 | Tolerates loss of two nodes
5 | 6 | 83% | 3 | Tolerates loss of two nodes
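To move a running cluster to one of the replication factors in Table 4 (for example, the recommended factor of 3), the keyspace replication settings can be altered and the data re-replicated; the keyspace name below assumes a default KairosDB installation:

# Raise the replication factor of the KairosDB keyspace to 3
echo "ALTER KEYSPACE kairosdb
      WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };" | cqlsh <cassandra-node>

# Then run a repair on every node so existing data is re-replicated
nodetool repair kairosdb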