SaaS Infrastructure metrics' thresholds

Metric List

1 Hardware
- CPU
- Load Average
- RAM
- HardDisk
2 NGINX
- Number of connections
3 Application
- Status
- Nagios format
4 AWS RDS
- CPU Utilization
- Freeable Memory
- Database Connections
- Queries

1.1 CPU thresholds

To measure CPU usage, we recommend analyzing CPU USER and CPU STEAL. The SAR tool can also be used to get CPU information.

Note: processor_vcpus is the number of virtual processors (vCPUs) available on the host.

  cpu_warning = 85
  cpu_critical = 90

  cpu_warning_stealtime = 10
  cpu_critical_stealtime = 25
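
As an illustration of how the USER and STEAL percentages could be derived and compared against the thresholds above, here is a minimal Python sketch. It assumes two samples of the aggregate "cpu" line from /proc/stat taken some seconds apart, and it assumes that a value equal to a threshold already triggers that state. The function names are hypothetical and not part of the existing tooling.

```python
# Thresholds from this section (percent).
CPU_WARNING, CPU_CRITICAL = 85, 90
STEAL_WARNING, STEAL_CRITICAL = 10, 25


def cpu_percentages(before: str, after: str) -> dict:
    """Return user and steal CPU percentages between two 'cpu ...' lines.

    Both arguments are the aggregate 'cpu' line of /proc/stat, sampled at
    two different moments. Assumes the steal field (8th counter) is present.
    """
    # Counters after the 'cpu' label:
    # user nice system idle iowait irq softirq steal
    first = [int(x) for x in before.split()[1:9]]
    second = [int(x) for x in after.split()[1:9]]
    delta = [b - a for a, b in zip(first, second)]
    total = sum(delta)
    if total == 0:
        return {"user": 0.0, "steal": 0.0}
    return {
        "user": 100.0 * delta[0] / total,
        "steal": 100.0 * delta[7] / total,
    }


def classify(value: float, warning: float, critical: float) -> str:
    """Map a percentage to OK / WARNING / CRITICAL (higher is worse)."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"
```

For example, classify(usage["user"], CPU_WARNING, CPU_CRITICAL) and classify(usage["steal"], STEAL_WARNING, STEAL_CRITICAL) would give the two states for one sampling interval.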

1.1.2 CPU thresholds for Redis instances

Redis (version 6.0.6) is a single-threaded service, so it only uses one CPU. That means the CPU thresholds have to be scaled by the number of vCPUs to evaluate its CPU behaviour correctly.

cpu_warning = (80 / processor_vcpus)
cpu_critical = (90 / processor_vcpus)
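
The scaling above can be sketched in Python as follows. This is only an illustration of the arithmetic; the function name is hypothetical.

```python
def redis_cpu_thresholds(processor_vcpus: int) -> tuple:
    """Return (cpu_warning, cpu_critical) for a Redis host.

    Redis saturating its single CPU shows up as only 100 / processor_vcpus
    percent of host-wide CPU, so the base values 80 and 90 are divided by
    the vCPU count.
    """
    if processor_vcpus < 1:
        raise ValueError("processor_vcpus must be at least 1")
    return 80 / processor_vcpus, 90 / processor_vcpus
```

On a host with 4 vCPUs this gives a warning at 20% and a critical at 22.5% of total CPU, which corresponds to 80% and 90% of the single CPU that Redis can use.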

1.2 Load-average

To read the system load average, we use the icinga2 plugin and monitor the one-, five-, and fifteen-minute load averages. The threshold values are scaled by the number of CPUs.

  load_wload1  = processor_vcpus * 1.5
  load_cload1  = processor_vcpus * 2.5

  load_wload5  = processor_vcpus * 1.25
  load_cload5  = processor_vcpus * 2.25


  load_wload15 = processor_vcpus
  load_cload15 = processor_vcpus * 2
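
A small Python sketch of the same formulas, useful for checking what the thresholds come out to for a given host size. The function names are hypothetical, and treating a value equal to a threshold as already triggering it is an assumption.

```python
def load_thresholds(processor_vcpus: int) -> dict:
    """Return {period: (warning, critical)} using the multipliers above."""
    return {
        "load1": (processor_vcpus * 1.5, processor_vcpus * 2.5),
        "load5": (processor_vcpus * 1.25, processor_vcpus * 2.25),
        "load15": (processor_vcpus * 1.0, processor_vcpus * 2.0),
    }


def check_load(loads, processor_vcpus: int) -> str:
    """Return the worst state across (load1, load5, load15)."""
    limits = load_thresholds(processor_vcpus)
    state = "OK"
    for value, key in zip(loads, ("load1", "load5", "load15")):
        warning, critical = limits[key]
        if value >= critical:
            return "CRITICAL"
        if value >= warning:
            state = "WARNING"
    return state
```

On a Unix host, check_load(os.getloadavg(), os.cpu_count()) would evaluate the live values, assuming os.cpu_count() matches processor_vcpus.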

1.3 Free RAM

To read free memory we use the file /proc/meminfo and apply these thresholds.


memory_warning_threshold = 10
memory_critical_threshold = 5
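
The following Python sketch shows one way the check could work. It rests on two assumptions that this document does not confirm: that the thresholds are percentages of total memory, and that "free" means the MemAvailable field of /proc/meminfo. The function names are hypothetical.

```python
def free_memory_percent(meminfo_text: str) -> float:
    """Return available memory as a percentage of total memory.

    meminfo_text is the content of /proc/meminfo. Assumes 'free' means
    MemAvailable; MemFree could be substituted if that is the intent.
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # first token is the number
    return 100.0 * values["MemAvailable"] / values["MemTotal"]


def classify_free_memory(percent: float, warning: float = 10,
                         critical: float = 5) -> str:
    """Lower is worse: at or below a threshold triggers that state."""
    if percent <= critical:
        return "CRITICAL"
    if percent <= warning:
        return "WARNING"
    return "OK"
```

A live check would pass the result of open("/proc/meminfo").read() to free_memory_percent.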

1.4 Free space in HD

To read free disk space we use the icinga2 plugin and apply these thresholds. The zequenze system is installed in /opt/zequenze, so this directory must be monitored.


memory_warning_threshold = 10
memory_critical_threshold = 5
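
For reference, a minimal Python sketch of a free-space check using only the standard library. It assumes the thresholds are percentages of free space on the filesystem that holds the given path; the function names are hypothetical and this is not the icinga2 plugin itself.

```python
import shutil


def free_disk_percent(path: str) -> float:
    """Return free space as a percentage of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def classify_free_space(percent: float, warning: float = 10,
                        critical: float = 5) -> str:
    """Lower is worse: at or below a threshold triggers that state."""
    if percent <= critical:
        return "CRITICAL"
    if percent <= warning:
        return "WARNING"
    return "OK"
```

For the installation directory the call would be classify_free_space(free_disk_percent("/opt/zequenze")).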

2.1 NGINX

NGINX status can be monitored with the following curl request:

curl http://127.0.0.1/nginx_status/

Active connections: 30 
server accepts handled requests
 2590 2590 15578 
Reading: 0 Writing: 1 Waiting: 29
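
To turn the output above into numbers that a check can compare, a parser along these lines could be used. This is a sketch that assumes the output always has the layout shown above; the function name and the dictionary keys are hypothetical.

```python
import re


def parse_nginx_status(text: str) -> dict:
    """Parse the nginx status page text into a dictionary of integers."""
    active = re.search(r"Active connections:[ \t]*(\d+)", text)
    # The counters line holds three integers: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:[ \t]*(\d+)[ \t]+Writing:[ \t]*(\d+)[ \t]+Waiting:[ \t]*(\d+)",
        text,
    )
    if not (active and counters and states):
        raise ValueError("unexpected nginx status output")
    return {
        "active": int(active.group(1)),
        "accepts": int(counters.group(1)),
        "handled": int(counters.group(2)),
        "requests": int(counters.group(3)),
        "reading": int(states.group(1)),
        "writing": int(states.group(2)),
        "waiting": int(states.group(3)),
    }
```

The "active" value is the number of connections listed in the metric list; this document does not define a threshold for it.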

3.1 Application Status

Application status can be monitored with the following HTTP request, which returns information about ping, cache, database and locmem.


curl 127.0.0.1:8000/status/

{
   "ping":{
      "code":200,
      "status":"Ok",
      "time":0
   },
   "caches":{
      "default":{
         "code":200,
         "status":"Ok",
         "time":0.003637,
         "time_set":0.001012,
         "time_get":0.00225,
         "time_del":0.000375
      },
      "locmem":{
         "code":200,
         "status":"Ok",
         "time":0.000135,
         "time_set":5.5e-05,
         "time_get":7e-05,
         "time_del":1e-05
      }
   },
   "database":{
      "default":{
         "code":200,
         "status":"Ok",
         "time":0.003241
      },
      "replica1":{
         "code":200,
         "status":"Ok",
         "time":0.003045
      }
   }
}
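
A response like the one above can be reduced to a list of unhealthy components. The sketch below assumes that any "code" other than 200 means the component is failing, which the sample output suggests but this document does not state. The function names are hypothetical.

```python
import json


def failing_components(status: dict, prefix: str = "") -> list:
    """Walk the nested status document and return paths whose code != 200."""
    failures = []
    for name, value in status.items():
        if not isinstance(value, dict):
            continue
        path = f"{prefix}{name}"
        if "code" in value:
            # Leaf component such as "ping" or "caches.default".
            if value["code"] != 200:
                failures.append(path)
        else:
            # Grouping level such as "caches" or "database".
            failures.extend(failing_components(value, prefix=path + "."))
    return failures


def check_payload(payload: str) -> list:
    """Parse the raw response body and return the failing component paths."""
    return failing_components(json.loads(payload))
```

An empty list means every component reported code 200.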

3.2 Nagios format

Application status can be monitored with the following HTTP request, with the output formatted by the Nagios format plugin. The URL is quoted so that the shell does not interpret the & character.


curl "http://127.0.0.1:8000/status/?tests=threads&format=nagios"

{
   "threads":{
      "code":200,
      "status":"Ok",
      "count":8
   }
}
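
If the response needs to be wrapped as a Nagios-style check result, something like the following Python sketch could be used. It relies on the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the "text | label=value" performance data layout. Mapping a non-200 code to CRITICAL is an assumption, and the function name is hypothetical.

```python
import json

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def threads_check(payload: str) -> tuple:
    """Return (exit_code, message) for the threads test response body."""
    try:
        threads = json.loads(payload)["threads"]
        code, count = threads["code"], int(threads["count"])
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON as well as missing or malformed fields.
        return UNKNOWN, "UNKNOWN - could not parse status response"
    if code != 200:
        message = f"CRITICAL - threads check returned code {code} | threads={count}"
        return CRITICAL, message
    return OK, f"OK - {count} threads | threads={count}"
```

A wrapper script would print the message and exit with the returned code. No thread-count threshold is applied because this document does not define one.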

4. AWS - RDS Metrics

The AWS CloudWatch API can be used to gather information on RDS. The metrics and default thresholds for each RDS instance are listed below.

4.1 CPU Utilization

warning = "75"
critical = "85"

4.2 Freeable Memory

(NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)

Values are in bytes, the unit CloudWatch reports for FreeableMemory.

warning = "500000000"
critical = "100000000"

4.3 Database Connections

(NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)

warning = "1800"
critical = "2200"

4.4 Queries

(NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)

warning = "2900"
critical = "3500"
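
The defaults above can be collected into one table and evaluated as in the following Python sketch. The dictionary keys are assumed to match the CloudWatch metric names, and the function names are hypothetical. Freeable Memory is treated as "lower is worse", the other metrics as "higher is worse". The as_percent helper illustrates the conversion that the review notes ask for; the total it needs (instance memory, connection limit) is not given in this document and would have to be supplied.

```python
# Default thresholds from sections 4.1 to 4.4 (absolute values).
RDS_THRESHOLDS = {
    "CPUUtilization": {"warning": 75, "critical": 85, "lower_is_worse": False},
    "FreeableMemory": {"warning": 500000000, "critical": 100000000,
                       "lower_is_worse": True},
    "DatabaseConnections": {"warning": 1800, "critical": 2200,
                            "lower_is_worse": False},
    "Queries": {"warning": 2900, "critical": 3500, "lower_is_worse": False},
}


def classify_rds(metric: str, value: float) -> str:
    """Return OK / WARNING / CRITICAL for one RDS metric datapoint."""
    limits = RDS_THRESHOLDS[metric]
    if limits["lower_is_worse"]:
        if value <= limits["critical"]:
            return "CRITICAL"
        if value <= limits["warning"]:
            return "WARNING"
        return "OK"
    if value >= limits["critical"]:
        return "CRITICAL"
    if value >= limits["warning"]:
        return "WARNING"
    return "OK"


def as_percent(value: float, total: float) -> float:
    """Express an absolute value as a percentage of a known total."""
    return 100.0 * value / total
```

For instance, as_percent(500000000, 8000000000) shows that the current Freeable Memory warning corresponds to 6.25% on an instance with 8 GB of memory, which is the kind of figure a percentage-based threshold would replace it with.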