SaaS Infrastructure metrics' thresholds

Metric List

    1 Hardware
        CPU
        Load Average
        RAM
        HardDisk
    2 NGINX
        Number of connections
    3 Application
        Status
        Nagios format
    4 AWS RDS
        CPU Utilization
        Freeable Memory
        Database Connections
        Queries
        Instances

            1.1 CPU thresholds

            To measure CPU usage, we recommend analyzing CPU USER and CPU STEAL. Additionally, you can use the SAR tool to get CPU information.

            Note: processor_vcpus is the number of virtual processors (vCPUs) on each host.

              cpu_warning = 85
              cpu_critical = 90
            
              cpu_warning_stealtime = 10
              cpu_critical_stealtime = 25
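As a minimal sketch of how CPU USER and CPU STEAL could be measured and compared with the thresholds above: the code below reads two samples of the aggregate `cpu` line from /proc/stat instead of using SAR, which is a swapped-in technique and not necessarily what the monitoring plugin does. It assumes the thresholds are percentages of total CPU time and that cpu_warning and cpu_critical apply to CPU USER. The function names are hypothetical.

```python
# Thresholds copied from the section above, assumed to be percentages.
CPU_WARNING, CPU_CRITICAL = 85, 90
STEAL_WARNING, STEAL_CRITICAL = 10, 25


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def cpu_percentages(before, after):
    """Return user and steal time as a percentage of all ticks between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Columns: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest columns are skipped because the kernel already counts them in user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {"user": 100.0 * deltas[0] / total, "steal": 100.0 * deltas[7] / total}


def classify(value, warning, critical):
    """Return the state for a metric where higher values are worse."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


def check_cpu(before, after):
    """Classify CPU USER and CPU STEAL separately."""
    pct = cpu_percentages(before, after)
    return {
        "user": classify(pct["user"], CPU_WARNING, CPU_CRITICAL),
        "steal": classify(pct["steal"], STEAL_WARNING, STEAL_CRITICAL),
    }
```

In practice the two samples would be the first line of /proc/stat read a few seconds apart.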
            
            

            1.1.2 CPU thresholds for Redis instances

            Redis is a single-threaded service (version 6.0.6), so it only uses one CPU. That means we have to scale the CPU threshold by the number of vCPUs to check the CPU behaviour correctly.

            cpu_warning = (80 / processor_vcpus)
            cpu_critical = (90 / processor_vcpus)
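To make the scaling concrete, here is a small sketch under the assumption that the monitored CPU value is the host-wide average across all vCPUs. On such a host, one fully busy core on a 4 vCPU machine reads as only 25%, so the thresholds have to shrink accordingly. The function name is hypothetical.

```python
def redis_cpu_thresholds(processor_vcpus):
    """Return (warning, critical) host-wide CPU percentages for a Redis host."""
    if processor_vcpus < 1:
        raise ValueError("processor_vcpus must be at least 1")
    # Same formulas as above: 80% and 90% of a single core,
    # expressed as a share of the whole host.
    return 80 / processor_vcpus, 90 / processor_vcpus
```

For example, `redis_cpu_thresholds(4)` gives a warning threshold of 20.0 and a critical threshold of 22.5, which correspond to 80% and 90% of a single core.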
            
            

            1.2 Load-average

            To read the system load average, we use the icinga2 plugin and monitor the load average over one, five, and fifteen minutes. We also scale the threshold values by the number of CPUs.

              load_wload1  = processor_vcpus * 1.5
              load_cload1  = processor_vcpus * 2.5
            
              load_wload5  = processor_vcpus * 1.25
              load_cload5  = processor_vcpus * 2.25
            
            
              load_wload15 = processor_vcpus
              load_cload15 = processor_vcpus * 2
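The formulas above can be sketched as follows. This is an illustration only, not the icinga2 plugin itself. The function names are hypothetical, and the comparison assumes that a value equal to a threshold already triggers the alert.

```python
def load_thresholds(processor_vcpus):
    """Return (warning, critical) pairs for the 1, 5 and 15 minute load averages."""
    return {
        "load1": (processor_vcpus * 1.5, processor_vcpus * 2.5),
        "load5": (processor_vcpus * 1.25, processor_vcpus * 2.25),
        "load15": (processor_vcpus * 1.0, processor_vcpus * 2.0),
    }


def check_load(loads, processor_vcpus):
    """Classify a (load1, load5, load15) tuple against the scaled thresholds."""
    thresholds = load_thresholds(processor_vcpus)
    result = {}
    for name, value in zip(("load1", "load5", "load15"), loads):
        warning, critical = thresholds[name]
        if value >= critical:
            result[name] = "CRITICAL"
        elif value >= warning:
            result[name] = "WARNING"
        else:
            result[name] = "OK"
    return result
```

On a Unix-like host the current values can be passed in with `check_load(os.getloadavg(), os.cpu_count() or 1)` after importing `os`.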
            

            1.3 Free RAM

            To read the free memory, we use the file /proc/meminfo and apply these thresholds.

            
            memory_warning_threshold = 10
            memory_critical_threshold = 5
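A minimal sketch of reading /proc/meminfo follows. The source does not say which fields are compared or in what unit, so this assumes the thresholds are the percentage of MemAvailable relative to MemTotal. The function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        parts = rest.split()
        if separator and parts:
            # Most fields are reported in kB; the unit suffix is ignored here.
            values[name.strip()] = int(parts[0])
    return values


def available_memory_percent(meminfo):
    """Return available memory as a percentage of total memory (assumed metric)."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def check_memory(available_pct, warning=10, critical=5):
    """Classify free memory; lower values are worse."""
    if available_pct <= critical:
        return "CRITICAL"
    if available_pct <= warning:
        return "WARNING"
    return "OK"
```

On a Linux host the input would come from `open("/proc/meminfo").read()`.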
            
            

            1.4 Free space in HD

            To read the free disk space, we use the icinga2 plugin and apply these thresholds. The zequenze system is installed in /opt/zequenze, so this directory must be monitored.

            
            memory_warning_threshold = 10
            memory_critical_threshold = 5
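As an illustration of the disk check, the sketch below uses Python's standard `shutil.disk_usage` rather than the icinga2 plugin. It assumes the thresholds 10 and 5 are percentages of free space on the filesystem that holds the monitored directory. The function names are hypothetical.

```python
import shutil


def free_space_percent(path):
    """Return free space on the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def check_disk(free_pct, warning=10, critical=5):
    """Classify free disk space; lower values are worse."""
    if free_pct <= critical:
        return "CRITICAL"
    if free_pct <= warning:
        return "WARNING"
    return "OK"
```

For the installation directory this would be called as `check_disk(free_space_percent("/opt/zequenze"))`.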
            
            

            2. NGINX

            2.1 Number of connections

            NGINX status can be monitored with the following curl request.

            curl http://127.0.0.1/nginx_status/
            
            Active connections: 30 
            server accepts handled requests
             2590 2590 15578 
            Reading: 0 Writing: 1 Waiting: 29 
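The status output above can be turned into numbers for monitoring with a short parser. This is a sketch that assumes the page always has the layout shown in the sample, which looks like the output of NGINX's stub_status module. The function name and the dictionary keys are hypothetical.

```python
import re


def parse_stub_status(text):
    """Parse the plain-text NGINX status page into a dict of integers."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The counters line holds three integers: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unexpected status page format")
    accepts, handled, requests = (int(n) for n in counters.groups())
    reading, writing, waiting = (int(n) for n in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

The `active` value is the number of connections that a threshold would be applied to.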
            

            3.1 Application Status

             Monitoring

            Application status can be monitored with the following HTTP request, which returns information about ping, caches, database and locmem.

            
            curl 127.0.0.1:8000/status/
            
             {
               "ping":{
                  "code":200,
                  "status":"Ok",
                  "time":0
               },
               "caches":{
                  "default":{
                     "code":200,
                     "status":"Ok",
                     "time":0.003637,
                     "time_set":0.001012,
                     "time_get":0.00225,
                     "time_del":0.000375
                  },
                  "locmem":{
                     "code":200,
                     "status":"Ok",
                     "time":0.000135,
                     "time_set":5.5e-05,
                     "time_get":7e-05,
                     "time_del":1e-05
                  }
               },
               "database":{
                  "default":{
                     "code":200,
                     "status":"Ok",
                     "time":0.003241
                  },
                  "replica1":{
                     "code":200,
                     "status":"Ok",
                     "time":0.003045
                  }
               }
            }
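A response like the one above could be checked automatically by walking the JSON and collecting every component whose code is not 200. This is a sketch: treating 200 as the only healthy code is an assumption inferred from the sample, and the function name is hypothetical.

```python
import json


def failed_checks(status, prefix=""):
    """Return dotted paths of all components whose "code" is not 200."""
    failures = []
    for name, value in status.items():
        if not isinstance(value, dict):
            continue
        path = prefix + name
        if "code" in value:
            # A dict with a "code" key is a single check result.
            if value["code"] != 200:
                failures.append(path)
        else:
            # Otherwise it is a group such as "caches" or "database".
            failures.extend(failed_checks(value, path + "."))
    return failures


def failed_checks_from_text(body):
    """Convenience wrapper for a raw JSON response body."""
    return failed_checks(json.loads(body))
```

An empty list means every reported component answered with code 200.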
            

            3.2 Nagios format

            Application status can also be monitored with the following HTTP request, with the output formatted by the nagios format plugin.

            
            curl "http://127.0.0.1:8000/status/?tests=threads&format=nagios"
            
            {
               "threads":{
                  "code":200,
                  "status":"Ok",
                  "count":8
               }
            }
            

            4. AWS - RDS Metrics

            The AWS CloudWatch API can be used to gather information on RDS. The metrics and default thresholds for each RDS instance follow.

            4.1 CPU Utilization

            warning = "75"
            critical = "85"
            

            4.2 Freeable Memory

            (NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)

            warning = "15%"
            critical = "7%"
            

            4.3 Database Connections

            (NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)

            warning = "1800"
            critical = "2200"
            

            4.4 Queries

            (NEED TO REVIEW, IT SHOULD BE A % OF THE TOTAL NOT AN ABSOLUTE VALUE)

            warning = "2900"
            critical = "3500"
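The following sketch shows one way the CloudWatch values might be fetched and compared with the thresholds from sections 4.1, 4.3 and 4.4. It assumes a boto3 CloudWatch client is passed in (for example `boto3.client("cloudwatch")`), that the sections map to the CloudWatch metric names CPUUtilization, DatabaseConnections and Queries, and that a one-minute average is the statistic of interest. Freeable Memory is left out because its thresholds are still marked for review. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# (warning, critical) pairs from sections 4.1, 4.3 and 4.4; higher is worse.
# The metric names are assumed CloudWatch equivalents of those sections.
THRESHOLDS = {
    "CPUUtilization": (75, 85),
    "DatabaseConnections": (1800, 2200),
    "Queries": (2900, 3500),
}


def latest_average(cloudwatch, instance_id, metric_name, minutes=10):
    """Return the most recent one-minute average of an RDS metric, or None."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to be ordered, so sort by timestamp.
    datapoints = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None


def classify(metric_name, value):
    """Compare a metric value with its thresholds."""
    warning, critical = THRESHOLDS[metric_name]
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"
```

Passing the client as a parameter keeps the functions easy to exercise with a stub object when no AWS credentials are available.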