SaaS Infrastructure metrics' thresholds

Metric List

1. Application Instances

CPU, Load-average, Free RAM, Free space in HD

2. Cache Instances

CPU thresholds for Redis instances

3. Application

Application Status

4. Database

CPU Utilization, Freeable Memory, Database Connections, Queries

1.1 CPU thresholds

To measure CPU usage, we recommend analyzing CPU user time and CPU steal time. The sar tool can also be used to obtain CPU information.

Note: processor_vcpus is the number of virtual processors (vCPUs) available on the instance.

  cpu_warning = 85
  cpu_critical = 90

  cpu_warning_stealtime = 10
  cpu_critical_stealtime = 25
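
As an illustration of how user and steal time could be checked against these thresholds, here is a minimal Python sketch. It assumes the thresholds are percentages and that the readings come from two samples of the aggregate "cpu" line in /proc/stat; the function names and the sample counter values are hypothetical, not part of any plugin.

```python
def cpu_percentages(before, after):
    """Return (user %, steal %) between two aggregate "cpu" lines of /proc/stat."""
    # Field order after the "cpu" label: user nice system idle iowait irq softirq steal
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    if total == 0:
        return 0.0, 0.0
    return 100.0 * delta[0] / total, 100.0 * delta[7] / total


def classify(value, warning, critical):
    """Higher is worse: compare a percentage with warning/critical thresholds."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Hypothetical samples taken a few seconds apart.
sample_before = "cpu 1000 0 500 8000 100 0 50 50"
sample_after = "cpu 1880 0 560 8030 100 0 50 80"

user, steal = cpu_percentages(sample_before, sample_after)
print(user, classify(user, 85, 90))    # thresholds: cpu_warning, cpu_critical
print(steal, classify(steal, 10, 25))  # thresholds: cpu_*_stealtime
```

With these sample counters, user time is 88% (above cpu_warning, below cpu_critical) and steal time is 3% (below both steal thresholds).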

1.2 Load-average

To read the system load average, we use the icinga2 plugin and monitor the one-, five-, and fifteen-minute load averages. The threshold values are scaled by the number of CPUs.

  load_wload1  = processor_vcpus * 1.5
  load_cload1  = processor_vcpus * 2.5

  load_wload5  = processor_vcpus * 1.25
  load_cload5  = processor_vcpus * 2.25


  load_wload15 = processor_vcpus
  load_cload15 = processor_vcpus * 2
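
The scaling above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the readings are fixed sample values rather than live data (on a Unix-like host they could come from `os.getloadavg()`).

```python
def load_thresholds(processor_vcpus):
    """Return (warning, critical) pairs scaled by the number of vCPUs."""
    return {
        "load1": (processor_vcpus * 1.5, processor_vcpus * 2.5),
        "load5": (processor_vcpus * 1.25, processor_vcpus * 2.25),
        "load15": (processor_vcpus * 1.0, processor_vcpus * 2.0),
    }


def check_load(loads, processor_vcpus):
    """Compare (load1, load5, load15) readings with the scaled thresholds."""
    states = {}
    thresholds = load_thresholds(processor_vcpus)
    for (name, (warn, crit)), value in zip(thresholds.items(), loads):
        if value >= crit:
            states[name] = "CRITICAL"
        elif value >= warn:
            states[name] = "WARNING"
        else:
            states[name] = "OK"
    return states


# Hypothetical readings on a 4-vCPU instance.
print(check_load((6.5, 4.0, 3.9), 4))
```

For 4 vCPUs the one-minute thresholds become 6.0 and 10.0, so a one-minute load of 6.5 is a warning while the other two readings stay below their thresholds.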

1.3 Free RAM

To read free memory, we use the file /proc/meminfo and apply these thresholds.


memory_warning_threshold = 10
memory_critical_threshold = 5
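
A minimal Python sketch of this check follows. It rests on two assumptions that the text does not state: that the thresholds are percentages of total memory, and that the MemAvailable field is the value of interest. The function names and the sample text are hypothetical.

```python
def free_memory_percent(meminfo_text):
    """Return MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # values are reported in kB
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


def memory_state(percent_free, warning=10, critical=5):
    """Lower free memory is worse, so the comparisons are inverted."""
    if percent_free <= critical:
        return "CRITICAL"
    if percent_free <= warning:
        return "WARNING"
    return "OK"


# Hypothetical excerpt; on a real host, read the text from /proc/meminfo.
sample = """MemTotal:       8000000 kB
MemFree:         300000 kB
MemAvailable:    640000 kB
"""
percent = free_memory_percent(sample)
print(percent, memory_state(percent))
```

In this sample 8% of memory is available, which falls between the critical (5) and warning (10) thresholds.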

1.4 Free space in HD

To read free disk space, we use the icinga2 plugin and apply these thresholds. The zequenze system is installed in /opt/zequenze, so this directory must be monitored.


memory_warning_threshold = 10
memory_critical_threshold = 5
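
For illustration, free space on the filesystem holding a directory can be read with Python's standard `shutil.disk_usage`. This sketch assumes the thresholds are percentages of free space; the function names are hypothetical and this is not the icinga2 plugin itself.

```python
import shutil


def free_space_percent(path):
    """Return the free space of the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_state(percent_free, warning=10, critical=5):
    """Lower free space is worse, so the comparisons are inverted."""
    if percent_free <= critical:
        return "CRITICAL"
    if percent_free <= warning:
        return "WARNING"
    return "OK"


# On a host where the directory exists, the check would be:
#   disk_state(free_space_percent("/opt/zequenze"))
print(disk_state(free_space_percent(".")))
```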

2. Cache Instances

2.1 CPU thresholds for Redis instances

Redis is a single-threaded service (version 6.0.6), so it uses only one CPU. This means the CPU threshold has to be rescaled to check CPU behaviour correctly.

cpu_warning = (80 / processor_vcpus)
cpu_critical = (90 / processor_vcpus)
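
To make the rescaling concrete: on a 4-vCPU instance, one fully busy core shows up as only 25% of total CPU, so the thresholds shrink accordingly. A small Python sketch (the function name is hypothetical):

```python
def redis_cpu_thresholds(processor_vcpus):
    """Scale the single-core thresholds (80/90) to a whole-instance percentage."""
    if processor_vcpus < 1:
        raise ValueError("processor_vcpus must be at least 1")
    return 80 / processor_vcpus, 90 / processor_vcpus


warning, critical = redis_cpu_thresholds(4)
print(warning, critical)  # 20.0 22.5
```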

3.1 Application Status

Monitoring

Application status can be monitored with the following HTTP request, which returns information about ping, the caches (default and locmem), and the databases.


curl 127.0.0.1:8000/status/

{
   "ping":{
      "code":200,
      "status":"Ok",
      "time":0
   },
   "caches":{
      "default":{
         "code":200,
         "status":"Ok",
         "time":0.003637,
         "time_set":0.001012,
         "time_get":0.00225,
         "time_del":0.000375
      },
      "locmem":{
         "code":200,
         "status":"Ok",
         "time":0.000135,
         "time_set":5.5e-05,
         "time_get":7e-05,
         "time_del":1e-05
      }
   },
   "database":{
      "default":{
         "code":200,
         "status":"Ok",
         "time":0.003241
      },
      "replica1":{
         "code":200,
         "status":"Ok",
         "time":0.003045
      }
   }
}
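
A monitoring script could walk this response and report every component whose code is not 200. The following Python sketch assumes the JSON layout shown above; the function name is hypothetical, and the failing replica in the sample body is invented purely for the example.

```python
import json


def failing_checks(status, prefix=""):
    """Return dotted names of every component whose "code" is not 200."""
    failures = []
    for name, value in status.items():
        if not isinstance(value, dict):
            continue
        path = prefix + name
        if "code" in value:
            if value["code"] != 200:
                failures.append(path)
        else:
            # Group such as "caches" or "database": descend into its members.
            failures.extend(failing_checks(value, path + "."))
    return failures


# Hypothetical response body with one failing database replica.
body = """{"ping": {"code": 200, "status": "Ok", "time": 0},
 "caches": {"default": {"code": 200, "status": "Ok", "time": 0.003637}},
 "database": {"default": {"code": 200, "status": "Ok", "time": 0.003241},
              "replica1": {"code": 500, "status": "Error", "time": 0.0}}}"""

print(failing_checks(json.loads(body)))  # ['database.replica1']
```

An empty list means every reported component answered with code 200.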

4. Database

The following are the metrics and default thresholds for each database instance.

4.1 CPU Utilization

warning = "75"
critical = "85"

4.2 Freeable Memory

warning = "15%"
critical = "7%"

4.3 Database Connections

warning = "1800"
critical = "2200"

4.4 Queries

warning = "2900"
critical = "3500"
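
The thresholds in this section are strings, and a check that consumes them has to handle both percentage and absolute values. A hedged Python sketch of one way to normalize them follows; the function name and the `total` parameter are hypothetical, and the total in the example is an arbitrary figure, not a real instance size.

```python
def parse_threshold(raw, total=None):
    """Convert a threshold string such as "15%" or "1800" to an absolute number."""
    raw = raw.strip()
    if raw.endswith("%"):
        if total is None:
            raise ValueError("a percentage threshold needs the total value")
        return float(raw[:-1]) * total / 100
    return float(raw)


# Percentage of a hypothetical total of 1000 units.
print(parse_threshold("15%", total=1000))  # 150.0
# Absolute value, used as-is.
print(parse_threshold("1800"))             # 1800.0
```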