5.3 VM Metric Database
Beginning with vCloud Director 5.6, virtual machine performance and resource consumption metrics are collected, and historical data is available for up to two weeks.
Table 3. Virtual Machine Performance and Resource Consumption Metrics
Metric Name | Type | Unit | Description
--- | --- | --- | ---
cpu.usage.average | Rate | Percent | Host view of average actively used CPU as a percentage of total available
cpu.usagemhz.average | Rate | Megahertz | Host view of actively used CPU as a raw measurement
cpu.usage.maximum | Rate | Percent | Host view of maximum actively used CPU as a percentage of total available
mem.usage.average | Absolute | Percent | Memory usage as a percentage of total configured or available memory
disk.provisioned.latest | Absolute | Kilobytes | Storage space potentially used
disk.used.latest | Absolute | Kilobytes | Storage space actually used
disk.read.average | Rate | Kilobytes per second | Read rate aggregated across all datastores
disk.write.average | Rate | Kilobytes per second | Write rate aggregated across all datastores
Both current and historical metrics can be retrieved through the vCloud API. Current metrics are read directly from the vCenter Server database through the Performance Manager API. Historical metrics are collected every 5 minutes (with 20-second granularity) by the StatsFeeder process running on the cell that holds the vCenter Server proxy, and are pushed to persistent storage: a Cassandra NoSQL database cluster with the KairosDB database schema and API.
Note: The use of KairosDB will be deprecated in future vCloud Director releases.
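For example, both retrieval paths can be exercised with curl. This is a sketch only; the cell address, credentials, session token, and VM identifier are placeholders to replace with real values:

# Log in and capture the session token from the x-vcloud-authorization header
curl -sk -D - -X POST -u 'administrator@System:password' \
    -H 'Accept: application/*+xml;version=5.6' \
    https://vcloud.example.com/api/sessions | grep -i x-vcloud-authorization

# Current metrics, read live from vCenter Server
curl -sk -H 'Accept: application/*+xml;version=5.6' \
    -H 'x-vcloud-authorization: <token>' \
    https://vcloud.example.com/api/vApp/vm-<id>/metrics/current

# Historical metrics, served from the Cassandra/KairosDB store
curl -sk -H 'Accept: application/*+xml;version=5.6' \
    -H 'x-vcloud-authorization: <token>' \
    https://vcloud.example.com/api/vApp/vm-<id>/metrics/historic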
The following figure depicts the recommended VM metric database design. Multiple Cassandra nodes are deployed on the same network, and a KairosDB instance runs on each node, providing an API endpoint for vCloud Director cells to store and retrieve data. For high availability, load balance all KairosDB instances behind a single virtual IP address, which is configured as the VM metric endpoint with the cell management tool.
Figure 8. VM Metric Database Design
 
The following are VM metric database design considerations:
Currently, only KairosDB 0.9.1 and Cassandra 1.2.x/2.0.x are supported.
The minimum cluster size is three nodes (the number of nodes must be equal to or greater than the replication factor). Use a scale-out rather than a scale-up approach, because Cassandra performance scales linearly with the number of nodes.
Estimate I/O and storage requirements based on the expected number of VMs, and size the Cassandra cluster and its storage accordingly:
n – Expected number of VMs
m – Number of metrics per VM (currently 8)
t – Retention (days)
r – Replication factor
Write I/O per second = n × m × r / 10
Storage = n × m × t × r × 114 KB
For 30,000 VMs, the estimates are 72,000 write IOPS and 3,288 GB of storage (a worst-case scenario with 6 weeks of data retention and a replication factor of 3).
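The calculation is easy to script when comparing retention or replication settings. The following sketch simply evaluates the two formulas above with the example inputs:

#!/bin/bash
# Sizing sketch for the formulas above; inputs are the 30,000 VM example
n=30000    # expected number of VMs
m=8        # metrics per VM
t=42       # retention in days (6 weeks)
r=3        # replication factor

echo "Write IOPS:   $(( n * m * r / 10 ))"
# 114 KB per metric per VM per retention day, converted to GB
echo "Storage (GB): $(( n * m * t * r * 114 / 1024 / 1024 ))"

With these inputs, the script prints 72000 and 3287 (integer arithmetic truncates the 3,287.7 GB result that rounds to the 3,288 GB cited above).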
Enable Leveled Compaction Strategy (LCS) on the Cassandra cluster to improve read performance.
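For example, assuming the default KairosDB 0.9.1 schema, the main column family can be switched to LCS through cqlsh (the keyspace and table names below assume a default installation and may differ in your deployment):

# Switch the KairosDB data column family to Leveled Compaction Strategy
echo "ALTER TABLE kairosdb.data_points
      WITH compaction = { 'class' : 'LeveledCompactionStrategy' };" | cqlsh <cassandra-node>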
Install JNA (Java Native Access) version 3.2.7 or later on each node, because it improves Cassandra memory usage (no JVM swapping).
For heavy read utilization (many tenants collecting performance statistics) and availability, VMware recommends increasing the replication factor to 3.
Recommended size of one Cassandra node: 8 vCPUs (more CPUs improve write performance), 16 GB RAM (more memory improves read performance), and 2 TB of storage (each node backed by separate LUNs/disks with high IOPS performance).
KairosDB does not enforce a data retention policy, so old metric data must be cleared regularly with a script. The following example deletes one month of data:
#!/bin/bash
# Arrays and [[ ]] are bash features, so the script must run under bash.

if [ "$#" -ne 4 ]; then
    echo "Usage: $0 <kairosdb-vip> <port> <month> <year>"
    exit 1
fi

# Refuse to delete anything newer than the 6-week retention window
DAYS=$(( ( $(date -ud 'now' +'%s') - $(date -ud "${4}-${3}-01 00:00:00" +'%s') ) / 60 / 60 / 24 ))
if [ "$DAYS" -lt 42 ]; then
    echo "Month to delete must be at least 6 weeks in the past"
    exit 1
fi

# Collect all VM metric names (cpu, mem, disk, net, sys) from the KairosDB API
METRICS=( $(curl -s -k http://$1:$2/api/v1/metricnames -X GET | sed -e 's/[{}]//g' \
    | awk '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}' \
    | tr -d '[":]' | sed 's/results//g' | grep -w "cpu\|mem\|disk\|net\|sys") )
echo "${METRICS[@]}"

for var in "${METRICS[@]}"; do
    for day in $(seq 1 31); do
        # Skip days that do not exist in this month (for example, 31 April)
        if ! date -d "$3/$day/$4" > /dev/null 2>&1; then
            continue
        fi
        # Day boundaries as millisecond epoch timestamps, as KairosDB expects
        STARTDAY=$(( $(date -d "$3/$day/$4" +%s%N) / 1000000 ))
        ENDDAY=$(( $(date -d "$3/$day/$4 + 1 day" +%s%N) / 1000000 ))
        echo "Deleting $var for $3/$day/$4"
        # Build the KairosDB delete query for one metric over one day
        cat > /tmp/metricsquery <<EOF
{
    "metrics": [
        {
            "tags": {},
            "name": "${var}"
        }
    ],
    "cache_time": 0,
    "start_absolute": "${STARTDAY}",
    "end_absolute": "${ENDDAY}"
}
EOF
        curl http://$1:$2/api/v1/datapoints/delete -X POST -d @/tmp/metricsquery
    done
done
rm -f /tmp/metricsquery > /dev/null 2>&1
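For example, to purge the March 2014 data through the load-balanced KairosDB virtual IP (the script name, address, and port are placeholders):

# Delete all VM metric data points for March 2014
./purge-vm-metrics.sh 192.0.2.10 8080 03 2014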
Note: The space gains are not seen until data compaction occurs and the delete marker columns (tombstones) expire (by default after 10 days). This period can be changed by adjusting the gc_grace_seconds property of the KairosDB column families.
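For example, the grace period on the main KairosDB column family can be shortened with cqlsh; the names below assume a default KairosDB installation, and the same change applies to its other column families:

# Reduce the tombstone grace period from 10 days (864000 s) to 5 days
echo "ALTER TABLE kairosdb.data_points
      WITH gc_grace_seconds = 432000;" | cqlsh <cassandra-node>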
KairosDB v0.9.1 uses the Quorum consistency level for both reads and writes. Quorum is calculated as (replication factor / 2) + 1, rounded down, and the quorum number of replica nodes must be available for both reads and writes. Data is assigned to nodes through a hash algorithm, and every replica has equal importance. For example, with a replication factor of 3, the quorum is 2, so the cluster can serve reads and writes with one node down. The following table provides guidance on replication factor and cluster size configurations.
Table 4. Cassandra Configuration Guidance
Replication Factor | Cluster Size | Amount of Data per Node | Quorum | Availability
--- | --- | --- | --- | ---
1 | 1 | 100% | 1 | Does not tolerate any node loss
1 | 2 | 50% | 1 | Does not tolerate any node loss
1 | 3 | 33% | 1 | Does not tolerate any node loss
2 | 2 | 100% | 2 | Does not tolerate any node loss
2 | 3 | 67% | 2 | Does not tolerate any node loss
2 | 4 | 50% | 2 | Does not tolerate any node loss
3 | 3 | 100% | 2 | Tolerates loss of one node
3 | 4 | 75% | 2 | Tolerates loss of one node
3 | 5 | 60% | 2 | Tolerates loss of one node
4 | 4 | 100% | 3 | Tolerates loss of one node
4 | 5 | 80% | 3 | Tolerates loss of one node
5 | 5 | 100% | 3 | Tolerates loss of two nodes
5 | 6 | 83% | 3 | Tolerates loss of two nodes
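To move a running cluster to one of the replication factors in Table 4 (for example, the recommended factor of 3), the keyspace replication settings can be altered and the data re-replicated; the keyspace name below assumes a default KairosDB installation:

# Raise the replication factor of the KairosDB keyspace to 3
echo "ALTER KEYSPACE kairosdb
      WITH replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };" | cqlsh <cassandra-node>

# Then run a repair on every node so existing data is re-replicated
nodetool repair kairosdb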