SaaS Infrastructure metrics' thresholds
Metric List
1.1 CPU thresholds
To measure CPU usage, we recommend analyzing CPU user time and CPU steal time. The sar tool can also be used to get CPU information.
Note: processor_vcpus is the number of virtual processors (vCPUs) available on the host.
cpu_warning = 85
cpu_critical = 90
cpu_warning_stealtime = 10
cpu_critical_stealtime = 25
1.1.2 CPU thresholds for Redis instances
Redis (version 6.0.6) is a single-threaded service, so it uses only one CPU. The CPU threshold therefore has to be scaled by the number of vCPUs in order to check the CPU behaviour correctly.
cpu_warning = (80 / processor_vcpus)
cpu_critical = (90 / processor_vcpus)
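The scaling follows from the fact that a process pinned to one core can account for at most about 100 / processor_vcpus percent of total CPU. A minimal Python sketch of the two formulas above; the function name redis_cpu_thresholds is hypothetical and used only for illustration:

```python
def redis_cpu_thresholds(processor_vcpus):
    """Return (cpu_warning, cpu_critical) as percentages of total CPU.

    Applies the formulas above: 80 / processor_vcpus and 90 / processor_vcpus.
    """
    if processor_vcpus < 1:
        raise ValueError("processor_vcpus must be at least 1")
    return 80 / processor_vcpus, 90 / processor_vcpus


# Example: on a host with 4 vCPUs the warning level drops to 20% of total CPU.
warning, critical = redis_cpu_thresholds(4)
print(warning, critical)
```

On a single-vCPU host the result is simply 80 and 90.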
1.2 Load-average
To read the system load average, we use the icinga2 plugin and monitor the one-, five-, and fifteen-minute load averages. The threshold values are scaled by the number of CPUs.
load_wload1 = processor_vcpus * 1.5
load_cload1 = processor_vcpus * 2.5
load_wload5 = processor_vcpus * 1.25
load_cload5 = processor_vcpus * 2.25
load_wload15 = processor_vcpus
load_cload15 = processor_vcpus * 2
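As a rough illustration of how these scaled thresholds could be evaluated, the Python sketch below applies the multipliers listed above to a set of load averages. It is not the icinga2 plugin itself; the names LOAD_MULTIPLIERS, load_thresholds and check_load are hypothetical.

```python
# (warning, critical) multipliers per window in minutes, from the list above.
LOAD_MULTIPLIERS = {1: (1.5, 2.5), 5: (1.25, 2.25), 15: (1.0, 2.0)}


def load_thresholds(processor_vcpus):
    """Return {window: (warning, critical)} scaled by the number of vCPUs."""
    return {
        window: (processor_vcpus * warn, processor_vcpus * crit)
        for window, (warn, crit) in LOAD_MULTIPLIERS.items()
    }


def check_load(loads, processor_vcpus):
    """Classify a (load1, load5, load15) tuple as OK, WARNING or CRITICAL."""
    thresholds = load_thresholds(processor_vcpus)
    state = "OK"
    for window, value in zip((1, 5, 15), loads):
        warn, crit = thresholds[window]
        if value >= crit:
            return "CRITICAL"
        if value >= warn:
            state = "WARNING"
    return state
```

On a Unix host the inputs could come from os.getloadavg() and os.cpu_count(), for example check_load(os.getloadavg(), os.cpu_count() or 1).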
1.3 Free RAM
To read the freeable memory, we use the file /proc/meminfo and apply these thresholds.
memory_warning_threshold = 10
memory_critical_threshold = 5
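The check could be sketched in Python as follows. Two assumptions are made here that the text does not state: the thresholds are percentages of total memory, and "freeable memory" is taken from the MemAvailable field. All function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a {field: integer value} dict."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def available_memory_percent(meminfo):
    """Assumption: freeable memory is MemAvailable as a share of MemTotal."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def memory_state(percent_free, warning=10, critical=5):
    """Lower is worse: compare the free percentage with the thresholds."""
    if percent_free <= critical:
        return "CRITICAL"
    if percent_free <= warning:
        return "WARNING"
    return "OK"
```

On a Linux host, the input text would be the content of /proc/meminfo, for example open("/proc/meminfo").read().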
1.4 Free space in HD
To read the free disk space, we use the icinga2 plugin and apply these thresholds. The zequenze system is installed in /opt/zequenze, so this directory must be monitored.
memory_warning_threshold = 10
memory_critical_threshold = 5
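For a quick manual check outside icinga2, the free space can be read with Python's standard library. This is a stand-in sketch, not the plugin's implementation; it assumes the thresholds are percentages of free space, with 10 as the warning level and 5 as the critical level. The function names are hypothetical.

```python
import shutil


def free_space_percent(path):
    """Return free space on the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_state(percent_free, warning=10, critical=5):
    """Lower is worse: compare the free percentage with the thresholds."""
    if percent_free <= critical:
        return "CRITICAL"
    if percent_free <= warning:
        return "WARNING"
    return "OK"
```

For the installation directory this would be called as disk_state(free_space_percent("/opt/zequenze")).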
2. Cache Instances
2.1 NGINX
NGINX status can be monitored with the following curl request.
curl http://127.0.0.1/nginx_status/
Active connections: 30
server accepts handled requests
2590 2590 15578
Reading: 0 Writing: 1 Waiting: 29
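The output above is the plain-text format produced by the NGINX stub_status module. A minimal Python sketch for turning it into named counters, so that a check can compare them with thresholds; the function name parse_nginx_status and the dictionary keys are hypothetical:

```python
import re


def parse_nginx_status(text):
    """Parse NGINX stub_status output into a dict of integer counters."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The counters line holds three integers: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unexpected nginx status format")
    accepts, handled, requests = (int(n) for n in counters.groups())
    reading, writing, waiting = (int(n) for n in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

In the sample output, accepts equals handled (2590), meaning no connections were dropped, and Reading, Writing and Waiting add up to the 30 active connections.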
3.1 Application Status
Monitoring
Application status can be monitored with the following HTTP request, which returns information about ping, cache, database, and locmem.
curl 127.0.0.1:8000/status/
{
"ping":{
"code":200,
"status":"Ok",
"time":0
},
"caches":{
"default":{
"code":200,
"status":"Ok",
"time":0.003637,
"time_set":0.001012,
"time_get":0.00225,
"time_del":0.000375
},
"locmem":{
"code":200,
"status":"Ok",
"time":0.000135,
"time_set":5.5e-05,
"time_get":7e-05,
"time_del":1e-05
}
},
"database":{
"default":{
"code":200,
"status":"Ok",
"time":0.003241
},
"replica1":{
"code":200,
"status":"Ok",
"time":0.003045
}
}
}
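A monitoring script could walk this response and report every component whose code is not 200. The Python sketch below assumes the response always has the shape shown above, that is, nested objects whose leaves carry a "code" field; the function name failing_checks is hypothetical.

```python
def failing_checks(status, prefix=""):
    """Return dotted paths of all components whose "code" is not 200."""
    failures = []
    for name, value in status.items():
        if not isinstance(value, dict):
            continue
        path = prefix + name
        if "code" in value:
            # Leaf component such as "ping" or "database.replica1".
            if value["code"] != 200:
                failures.append(path)
        else:
            # Grouping object such as "caches" or "database".
            failures.extend(failing_checks(value, prefix=path + "."))
    return failures
```

The input would be the decoded JSON body of the /status/ response, for example the result of json.loads on the curl output. An empty list means every component reported code 200.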
3.2 Nagios format
Application status can also be monitored with the following HTTP request, which formats the output with the nagios format plugin.
curl "http://127.0.0.1:8000/status/?tests=threads&format=nagios"
{
"threads":{
"code":200,
"status":"Ok",
"count":8
}
}
4. AWS - RDS Metrics
The AWS CloudWatch API can be used to gather information on RDS. The metrics and default thresholds for each RDS database instance follow.
4.1 CPU Utilization
warning = "75"
critical = "85"
4.2 Freeable Memory
(Needs review: this should be a percentage of the total, not an absolute value, e.g. warning 15% and critical 7%.)
warning = "500000000"
critical = "100000000"
4.3 Database Connections
(Needs review: this should be a percentage of the total, not an absolute value.)
warning = "1800"
critical = "2200"
4.4 Queries
(Needs review: this should be a percentage of the total, not an absolute value.)
warning = "2900"
critical = "3500"