SaaS infrastructure metric thresholds
Metric List
- 1 Hardware
- CPU
- Load Average
- RAM
- HardDisk
- 2 NGINX
- Number of connections
- 3 Application
- Status
- Nagios format
- 4 AWS RDS
- CPU Utilization
- Freeable Memory
- Database Connections
- Queries
1.1 CPU thresholds
To measure CPU usage, we recommend analyzing CPU user time (%user) and CPU steal time (%steal). The sar tool can also be used to obtain CPU information.
Note: processor_vcpus is the number of virtual processors (vCPUs) on the host.
cpu_warning = 85
cpu_critical = 90
cpu_warning_stealtime = 10
cpu_critical_stealtime = 25
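A minimal sketch of how these thresholds could be applied, assuming the user and steal values are percentages such as those reported by sar. The function names, the state strings, and the choice of "greater than or equal" as the comparison are illustrative assumptions, not part of the monitoring setup.

```python
# Thresholds from this section (assumed to be percentages).
CPU_WARNING = 85
CPU_CRITICAL = 90
CPU_WARNING_STEALTIME = 10
CPU_CRITICAL_STEALTIME = 25

# States ordered from best to worst.
STATES = ["OK", "WARNING", "CRITICAL"]


def classify(value, warning, critical):
    """Classify a metric where a higher value is worse.

    Assumption: a value equal to a threshold already triggers that state.
    """
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


def classify_cpu(user_pct, steal_pct):
    """Classify user and steal time; the worse of the two states wins."""
    states = [
        classify(user_pct, CPU_WARNING, CPU_CRITICAL),
        classify(steal_pct, CPU_WARNING_STEALTIME, CPU_CRITICAL_STEALTIME),
    ]
    return max(states, key=STATES.index)
```

Steal time is checked separately because a host can look idle in user time while still being starved by the hypervisor.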
1.1.2 CPU thresholds for Redis instances
Redis (version 6.0.6) is a single-threaded service, so it uses only one CPU. We therefore scale the CPU thresholds by the number of vCPUs so that saturation of a single core is detected correctly.
cpu_warning = (80 / processor_vcpus)
cpu_critical = (90 / processor_vcpus)
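As a worked example of the scaling above: on a host with 4 vCPUs, a single fully busy core shows up as roughly 25% total CPU, so the warning threshold has to sit below that. The sketch below only evaluates the two formulas; the function name is hypothetical.

```python
def redis_cpu_thresholds(processor_vcpus):
    """Scale the Redis CPU thresholds by the vCPU count (formulas from this section)."""
    if processor_vcpus < 1:
        raise ValueError("processor_vcpus must be at least 1")
    return {
        "cpu_warning": 80 / processor_vcpus,
        "cpu_critical": 90 / processor_vcpus,
    }


# With 4 vCPUs the thresholds become 20.0 (warning) and 22.5 (critical).
thresholds = redis_cpu_thresholds(4)
```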
1.2 Load-average
To read the system load average, we use the icinga2 plugin and monitor the one-, five-, and fifteen-minute load averages. The threshold values are scaled by the number of vCPUs.
load_wload1 = processor_vcpus * 1.5
load_cload1 = processor_vcpus * 2.5
load_wload5 = processor_vcpus * 1.25
load_cload5 = processor_vcpus * 2.25
load_wload15 = processor_vcpus
load_cload15 = processor_vcpus * 2
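The formulas above can be sketched as follows. This is only an illustration of the arithmetic, not the icinga2 plugin itself; the function names and the "greater than or equal" comparison are assumptions. The three load values could come, for example, from os.getloadavg() on a Unix host.

```python
def load_thresholds(processor_vcpus):
    """Return the warning/critical load thresholds for a given vCPU count."""
    return {
        "load_wload1": processor_vcpus * 1.5,
        "load_cload1": processor_vcpus * 2.5,
        "load_wload5": processor_vcpus * 1.25,
        "load_cload5": processor_vcpus * 2.25,
        "load_wload15": processor_vcpus,
        "load_cload15": processor_vcpus * 2,
    }


def load_state(loads, processor_vcpus):
    """Classify a (load1, load5, load15) tuple; the worst window wins."""
    t = load_thresholds(processor_vcpus)
    checks = [
        (loads[0], t["load_wload1"], t["load_cload1"]),
        (loads[1], t["load_wload5"], t["load_cload5"]),
        (loads[2], t["load_wload15"], t["load_cload15"]),
    ]
    if any(value >= critical for value, _, critical in checks):
        return "CRITICAL"
    if any(value >= warning for value, warning, _ in checks):
        return "WARNING"
    return "OK"
```

For a 4-vCPU host this gives a one-minute warning at 6.0 and a fifteen-minute critical at 8, so short spikes are tolerated more than sustained load.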
1.3 Free RAM
To read freeable memory, we use the file /proc/meminfo and apply these thresholds.
memory_warning_threshold = 10
memory_critical_threshold = 5
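A minimal sketch of reading /proc/meminfo, under two assumptions that the text does not state: that the thresholds are percentages of total memory, and that "freeable" memory corresponds to the MemAvailable field. The function names are hypothetical.

```python
MEMORY_WARNING_THRESHOLD = 10
MEMORY_CRITICAL_THRESHOLD = 5


def free_memory_percent(meminfo_text):
    """Return MemAvailable as a percentage of MemTotal (assumed definition)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # first token is the numeric value
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def memory_state(free_pct):
    """Classify free memory; here a lower value is worse."""
    if free_pct <= MEMORY_CRITICAL_THRESHOLD:
        return "CRITICAL"
    if free_pct <= MEMORY_WARNING_THRESHOLD:
        return "WARNING"
    return "OK"
```

On a Linux host the input would be the contents of /proc/meminfo, for example `free_memory_percent(open("/proc/meminfo").read())`.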
1.4 Free space in HD
To read free disk space, we use the icinga2 plugin and apply these thresholds. The zequenze system is installed in /opt/zequenze, so this directory must be monitored.
memory_warning_threshold = 10
memory_critical_threshold = 5
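A minimal sketch of the same check without the icinga2 plugin, assuming that the values 10 and 5 are the warning and critical percentages of free space. The constant and function names are hypothetical; shutil.disk_usage is part of the Python standard library.

```python
import shutil

# Assumed meaning of the thresholds: percentage of free disk space.
DISK_WARNING = 10
DISK_CRITICAL = 5


def free_disk_percent(path):
    """Return the free space of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_state(free_pct):
    """Classify free disk space; here a lower value is worse."""
    if free_pct <= DISK_CRITICAL:
        return "CRITICAL"
    if free_pct <= DISK_WARNING:
        return "WARNING"
    return "OK"
```

For the installation described here the call would be `disk_state(free_disk_percent("/opt/zequenze"))`.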
2.1 NGINX
NGINX status can be monitored with the following curl request. Sample output is shown below the command.
curl http://127.0.0.1/nginx_status/
Active connections: 30
server accepts handled requests
2590 2590 15578
Reading: 0 Writing: 1 Waiting: 29
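To turn this output into numbers that a check can compare, the text can be parsed as sketched below. The function name and the dictionary keys are illustrative; no connection thresholds are assumed, since this section does not define any.

```python
import re


def parse_nginx_status(text):
    """Parse the status page text shown above into a dict of integers."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The counters line holds exactly three integers: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unexpected nginx status format")
    accepts, handled, requests = map(int, counters.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

A difference between accepts and handled indicates dropped connections, which is usually worth alerting on in addition to the active count.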
3.1 Application Status
Application status can be monitored with the following HTTP request, which returns information about ping, caches (default and locmem), and databases.
curl 127.0.0.1:8000/status/
{
"ping":{
"code":200,
"status":"Ok",
"time":0
},
"caches":{
"default":{
"code":200,
"status":"Ok",
"time":0.003637,
"time_set":0.001012,
"time_get":0.00225,
"time_del":0.000375
},
"locmem":{
"code":200,
"status":"Ok",
"time":0.000135,
"time_set":5.5e-05,
"time_get":7e-05,
"time_del":1e-05
}
},
"database":{
"default":{
"code":200,
"status":"Ok",
"time":0.003241
},
"replica1":{
"code":200,
"status":"Ok",
"time":0.003045
}
}
}
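A minimal sketch of how a check could consume this response, assuming that a "code" of 200 means the component is healthy and that any other value is a failure. The function name and the failing sample in the usage note are hypothetical.

```python
import json


def failing_checks(status, prefix=""):
    """Return dotted paths of all components whose "code" is not 200.

    A dict containing a "code" key is treated as one component result;
    any other dict is treated as a group and searched recursively.
    """
    failures = []
    for name, value in status.items():
        path = f"{prefix}.{name}" if prefix else name
        if not isinstance(value, dict):
            continue
        if "code" in value:
            if value["code"] != 200:
                failures.append(path)
        else:
            failures.extend(failing_checks(value, path))
    return failures
```

For example, a response in which only the default cache reports a non-200 code would yield `["caches.default"]`, and the response shown above would yield an empty list.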
3.2 Nagios format
Application status can be monitored with the following HTTP request; the format=nagios parameter requests output in the Nagios plugin format. The URL is quoted so that the shell does not interpret the & character.
curl "http://127.0.0.1:8000/status/?tests=threads&format=nagios"
{
"threads":{
"code":200,
"status":"Ok",
"count":8
}
}
4. AWS - RDS Metrics
The AWS CloudWatch API can be used to gather information on RDS. The metrics and default thresholds for each RDS instance follow.
4.1 CPU Utilization
warning = "75"
critical = "85"
4.2 Freeable Memory
(NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)
warning = "500000000"
critical = "100000000"
4.3 Database Connections
(NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)
warning = "1800"
critical = "2200"
4.4 Queries
(NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)
warning = "2900"
critical = "3500"
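The thresholds in this section can be evaluated as sketched below. The dictionary keys are assumed to correspond to the CloudWatch metric names, and the function name is hypothetical. The direction of each check is inferred from the values themselves: Freeable Memory has a critical value below its warning value, so for that metric a lower reading is worse.

```python
# Default thresholds from this section, kept as strings as in the source.
RDS_THRESHOLDS = {
    "CPUUtilization": {"warning": "75", "critical": "85"},
    "FreeableMemory": {"warning": "500000000", "critical": "100000000"},
    "DatabaseConnections": {"warning": "1800", "critical": "2200"},
    "Queries": {"warning": "2900", "critical": "3500"},
}


def rds_state(metric, value):
    """Classify one RDS datapoint against its warning/critical thresholds."""
    warning = float(RDS_THRESHOLDS[metric]["warning"])
    critical = float(RDS_THRESHOLDS[metric]["critical"])
    if critical < warning:
        # Lower is worse (for example, freeable memory in bytes).
        if value <= critical:
            return "CRITICAL"
        if value <= warning:
            return "WARNING"
        return "OK"
    # Higher is worse (CPU, connections, queries).
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"
```

Inferring the direction from the threshold order keeps the table as the single place to edit once the absolute values flagged for review are replaced.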