Awesome Prometheus Toolkit

Basic resource monitoring

28 rules

Prometheus job missing, Prometheus target missing, Prometheus all targets missing, Prometheus target missing with warmup time, Prometheus configuration reload failure, Prometheus too many restarts, Prometheus AlertManager job missing, Prometheus AlertManager configuration reload failure, Prometheus AlertManager config not synced, Prometheus AlertManager E2E dead man switch, Prometheus not connected to alertmanager, Prometheus rule evaluation failures, Prometheus template text expansion failures, Prometheus rule evaluation slow, Prometheus notifications backlog, Prometheus AlertManager notification failing, Prometheus target empty, Prometheus target scraping slow, Prometheus large scrape, Prometheus target scrape duplicate, Prometheus TSDB checkpoint creation failures, Prometheus TSDB checkpoint deletion failures, Prometheus TSDB compactions failed, Prometheus TSDB head truncations failed, Prometheus TSDB reload failures, Prometheus TSDB WAL corruptions, Prometheus TSDB WAL truncations failed, Prometheus timeseries cardinality

38 rules

Host out of memory, Host memory under memory pressure, Host Memory is underutilized, Host unusual network throughput in, Host unusual network throughput out, Host unusual disk read rate, Host unusual disk write rate, Host out of disk space, Host disk will fill in 24 hours, Host out of inodes, Host filesystem device error, Host inodes will fill in 24 hours, Host unusual disk read latency, Host unusual disk write latency, Host high CPU load, Host CPU is underutilized, Host CPU steal noisy neighbor, Host CPU high iowait, Host unusual disk IO, Host context switching high, Host swap is filling up, Host systemd service crashed, Host physical component too hot, Host node overtemperature alarm, Host RAID array got inactive, Host RAID disk failure, Host kernel version deviations, Host OOM kill detected, Host EDAC Correctable Errors detected, Host EDAC Uncorrectable Errors detected, Host Network Receive Errors, Host Network Transmit Errors, Host Network Interface Saturated, Host Network Bond Degraded, Host conntrack limit, Host clock skew, Host clock not synchronising, Host requires reboot

5 rules

Smart device temperature warning, Smart device temperature critical, Smart critical warning, Smart media errors, Smart NVME Wearout Indicator

8 rules

Container killed, Container absent, Container High CPU utilization, Container High Memory usage, Container Volume usage, Container high throttle rate, Container Low CPU utilization, Container Low Memory usage

9 rules

Blackbox probe failed, Blackbox configuration reload failure, Blackbox slow probe, Blackbox probe HTTP failure, Blackbox SSL certificate will expire soon, Blackbox SSL certificate will expire soon, Blackbox SSL certificate expired, Blackbox probe slow HTTP, Blackbox probe slow ping

5 rules

Windows Server collector Error, Windows Server service Status, Windows Server CPU Usage, Windows Server memory Usage, Windows Server disk Space Usage

4 rules

Virtual Machine Memory Warning, Virtual Machine Memory Critical, High Number of Snapshots, Outdated Snapshots

9 rules

Netdata high cpu usage, Host CPU steal noisy neighbor, Netdata high memory usage, Netdata low disk space, Netdata predicted disk full, Netdata MD mismatch cnt unsynchronized blocks, Netdata disk reallocated sectors, Netdata disk current pending sector, Netdata reported uncorrectable disk sectors

Databases and brokers

10 rules

MySQL down, MySQL too many connections (> 80%), MySQL high prepared statements utilization (> 80%), MySQL high threads running, MySQL Slave IO thread not running, MySQL Slave SQL thread not running, MySQL Slave replication lag, MySQL slow queries, MySQL InnoDB log waits, MySQL restarted

21 rules

Postgresql down, Postgresql restarted, Postgresql exporter error, Postgresql table not auto vacuumed, Postgresql table not auto analyzed, Postgresql too many connections, Postgresql not enough connections, Postgresql dead locks, Postgresql high rollback rate, Postgresql commit rate low, Postgresql low XID consumption, Postgresql high rate statement timeout, Postgresql high rate deadlock, Postgresql unused replication slot, Postgresql too many dead tuples, Postgresql configuration changed, Postgresql SSL compression active, Postgresql too many locks acquired, Postgresql bloat index high (> 80%), Postgresql bloat table high (> 80%), Postgresql invalid index

2 rules

SQL Server down, SQL Server deadlock

1 rule

Patroni has no Leader

3 rules

PGBouncer active connections, PGBouncer errors, PGBouncer max connections

12 rules

Redis down, Redis missing master, Redis too many masters, Redis disconnected slaves, Redis replication broken, Redis cluster flapping, Redis missing backup, Redis out of system memory, Redis out of configured maxmemory, Redis too many connections, Redis not enough connections, Redis rejected connections

18 rules

MongoDB Down, MongoDB replica member unhealthy, MongoDB replication lag, MongoDB replication headroom, MongoDB number cursors open, MongoDB cursors timeouts, MongoDB too many connections
MongoDB replication lag, MongoDB replication Status 3, MongoDB replication Status 6, MongoDB replication Status 8, MongoDB replication Status 9, MongoDB replication Status 10, MongoDB number cursors open, MongoDB cursors timeouts, MongoDB too many connections, MongoDB virtual memory usage
Mgob backup failed

20 rules

RabbitMQ node down, RabbitMQ node not distributed, RabbitMQ instances different versions, RabbitMQ memory high, RabbitMQ file descriptors usage, RabbitMQ too many unack messages, RabbitMQ too many connections, RabbitMQ no queue consumer, RabbitMQ unroutable messages
RabbitMQ down, RabbitMQ cluster down, RabbitMQ cluster partition, RabbitMQ out of memory, RabbitMQ too many connections, RabbitMQ dead letter queue filling up, RabbitMQ too many messages in queue, RabbitMQ slow queue consuming, RabbitMQ no consumer, RabbitMQ too many consumers, RabbitMQ unactive exchange

19 rules

Elasticsearch Heap Usage Too High, Elasticsearch Heap Usage warning, Elasticsearch disk out of space, Elasticsearch disk space low, Elasticsearch Cluster Red, Elasticsearch Cluster Yellow, Elasticsearch Healthy Nodes, Elasticsearch Healthy Data Nodes, Elasticsearch relocating shards, Elasticsearch relocating shards too long, Elasticsearch initializing shards, Elasticsearch initializing shards too long, Elasticsearch unassigned shards, Elasticsearch pending tasks, Elasticsearch no new documents, Elasticsearch High Indexing Latency, Elasticsearch High Indexing Rate, Elasticsearch High Query Rate, Elasticsearch High Query Latency

2 rules

Meilisearch index is empty, Meilisearch http response time

30 rules

Cassandra Node is unavailable, Cassandra many compaction tasks are pending, Cassandra commitlog pending tasks, Cassandra compaction executor blocked tasks, Cassandra flush writer blocked tasks, Cassandra connection timeouts total, Cassandra storage exceptions, Cassandra tombstone dump, Cassandra client request unavailable write, Cassandra client request unavailable read, Cassandra client request write failure, Cassandra client request read failure
Cassandra hints count, Cassandra compaction task pending, Cassandra viewwrite latency, Cassandra bad hacker, Cassandra node down, Cassandra commitlog pending tasks, Cassandra compaction executor blocked tasks, Cassandra flush writer blocked tasks, Cassandra repair pending tasks, Cassandra repair blocked tasks, Cassandra connection timeouts total, Cassandra storage exceptions, Cassandra tombstone dump, Cassandra client request unavailable write, Cassandra client request unavailable read, Cassandra client request write failure, Cassandra client request read failure, Cassandra cache hit rate key cache

14 rules

ClickHouse Memory Usage Critical, ClickHouse Memory Usage Warning, ClickHouse Disk Space Low on Default, ClickHouse Disk Space Critical on Default, ClickHouse Disk Space Low on Backups, ClickHouse Replica Errors, ClickHouse No Available Replicas, ClickHouse No Live Replicas, ClickHouse High Network Traffic, ClickHouse High TCP Connections, ClickHouse Interserver Connection Issues, ClickHouse ZooKeeper Connection Issues, ClickHouse Authentication Failures, ClickHouse Access Denied Errors

4 rules

Zookeeper Down, Zookeeper missing leader, Zookeeper Too Many Leaders, Zookeeper Not Ok

4 rules

Kafka topics replicas, Kafka consumers group
Kafka topic offset decreased, Kafka consumer lag

10 rules

Pulsar subscription high number of backlog entries, Pulsar subscription very high number of backlog entries, Pulsar topic large backlog storage size, Pulsar topic very large backlog storage size, Pulsar high write latency, Pulsar large message payload, Pulsar high ledger disk usage, Pulsar read only bookies, Pulsar high number of function errors, Pulsar high number of sink errors

20 rules

Nats high connection count, Nats high pending bytes, Nats high subscriptions count, Nats high routes count, Nats high memory usage, Nats slow consumers, Nats server down, Nats high CPU usage, Nats high number of connections, Nats high JetStream store usage, Nats high JetStream memory usage, Nats high number of subscriptions, Nats high pending bytes, Nats too many errors, Nats JetStream consumers exceeded, Nats frequent authentication timeouts, Nats max payload size exceeded, Nats leaf node connection issue, Nats max ping operations exceeded, Nats write deadline exceeded

4 rules

Solr update errors, Solr query errors, Solr replication errors, Solr low live node count

10 rules

Hadoop Name Node Down, Hadoop Resource Manager Down, Hadoop Data Node Out Of Service, Hadoop HDFS Disk Space Low, Hadoop Map Reduce Task Failures, Hadoop Resource Manager Memory High, Hadoop YARN Container Allocation Failures, Hadoop HBase Region Count High, Hadoop HBase Region Server Heap Low, Hadoop HBase Write Requests Latency High

Reverse proxies and load balancers

3 rules

Nginx high HTTP 4xx error rate, Nginx high HTTP 5xx error rate, Nginx latency high

3 rules

Apache down, Apache workers load, Apache restart

30 rules

HAProxy high HTTP 4xx error rate backend, HAProxy high HTTP 5xx error rate backend, HAProxy high HTTP 4xx error rate server, HAProxy high HTTP 5xx error rate server, HAProxy server response errors, HAProxy backend connection errors, HAProxy server connection errors, HAProxy backend max active session > 80%, HAProxy pending requests, HAProxy HTTP slowing down, HAProxy retry high, HAProxy has no alive backends, HAProxy frontend security blocked requests, HAProxy server healthcheck failure
HAProxy down, HAProxy high HTTP 4xx error rate backend, HAProxy high HTTP 5xx error rate backend, HAProxy high HTTP 4xx error rate server, HAProxy high HTTP 5xx error rate server, HAProxy server response errors, HAProxy backend connection errors, HAProxy server connection errors, HAProxy backend max active session, HAProxy pending requests, HAProxy HTTP slowing down, HAProxy retry high, HAProxy backend down, HAProxy server down, HAProxy frontend security blocked requests, HAProxy server healthcheck failure

6 rules

Traefik service down, Traefik high HTTP 4xx error rate service, Traefik high HTTP 5xx error rate service
Traefik backend down, Traefik high HTTP 4xx error rate backend, Traefik high HTTP 5xx error rate backend

Runtimes

1 rule

PHP-FPM max-children reached

1 rule

JVM memory filling up

2 rules

Sidekiq queue size, Sidekiq scheduling latency too high

Orchestrators

34 rules

Kubernetes Node not ready, Kubernetes Node memory pressure, Kubernetes Node disk pressure, Kubernetes Node network unavailable, Kubernetes Node out of pod capacity, Kubernetes Container oom killer, Kubernetes Job failed, Kubernetes CronJob suspended, Kubernetes PersistentVolumeClaim pending, Kubernetes Volume out of disk space, Kubernetes Volume full in four days, Kubernetes PersistentVolume error, Kubernetes StatefulSet down, Kubernetes HPA scale inability, Kubernetes HPA metrics unavailability, Kubernetes HPA scale maximum, Kubernetes HPA underutilized, Kubernetes Pod not healthy, Kubernetes pod crash looping, Kubernetes ReplicaSet replicas mismatch, Kubernetes Deployment replicas mismatch, Kubernetes StatefulSet replicas mismatch, Kubernetes Deployment generation mismatch, Kubernetes StatefulSet generation mismatch, Kubernetes StatefulSet update not rolled out, Kubernetes DaemonSet rollout stuck, Kubernetes DaemonSet misscheduled, Kubernetes CronJob too long, Kubernetes Job slow completion, Kubernetes API server errors, Kubernetes API client errors, Kubernetes client certificate expires next week, Kubernetes client certificate expires soon, Kubernetes API server latency

4 rules

Nomad job failed, Nomad job lost, Nomad job queued, Nomad blocked evaluation

3 rules

Consul service healthcheck failed, Consul missing master node, Consul agent unhealthy

13 rules

Etcd insufficient Members, Etcd no Leader, Etcd high number of leader changes, Etcd high number of failed GRPC requests, Etcd high number of failed GRPC requests, Etcd GRPC requests slow, Etcd high number of failed HTTP requests, Etcd high number of failed HTTP requests, Etcd HTTP requests slow, Etcd member communication slow, Etcd high number of failed proposals, Etcd high fsync durations, Etcd high commit durations

1 rule

Linkerd high error rate

10 rules

Istio Kubernetes gateway availability drop, Istio Pilot high total request rate, Istio Mixer Prometheus dispatches low, Istio high total request rate, Istio low total request rate, Istio high 4xx error rate, Istio high 5xx error rate, Istio high request latency, Istio latency 99 percentile, Istio Pilot Duplicate Entry

2 rules

ArgoCD service not synced, ArgoCD service unhealthy

Network, security and storage

13 rules

Ceph State, Ceph monitor clock skew, Ceph monitor low space, Ceph OSD Down, Ceph high OSD latency, Ceph OSD low space, Ceph OSD reweighted, Ceph PG down, Ceph PG incomplete, Ceph PG inconsistent, Ceph PG activation long, Ceph PG backfill full, Ceph PG unavailable

2 rules

SpeedTest Slow Internet Download, SpeedTest Slow Internet Upload

4 rules

ZFS offline pool
ZFS pool out of space, ZFS pool unhealthy, ZFS collector failed

1 rule

OpenEBS used pool capacity

3 rules

Minio cluster disk offline, Minio node disk offline, Minio disk space usage

4 rules

SSL certificate probe failed, SSL certificate OCSP status unknown, SSL certificate revoked, SSL certificate expiry (< 7 days)

3 rules

Juniper switch down, Juniper high Bandwidth Usage 1GiB, Juniper high Bandwidth Usage 1GiB

1 rule

CoreDNS Panic Count

3 rules

Freeswitch down, Freeswitch Sessions Warning, Freeswitch Sessions Critical

4 rules

Vault sealed, Vault too many pending tokens, Vault too many infinity tokens, Vault cluster health

2 rules

Cloudflare http 4xx error rate, Cloudflare http 5xx error rate

Other

45 rules

Thanos Compactor Multiple Running, Thanos Compactor Halted, Thanos Compactor High Compaction Failures, Thanos Compact Bucket High Operation Failures, Thanos Compact Has Not Run
Thanos Query Http Request Query Error Rate High, Thanos Query Http Request Query Range Error Rate High, Thanos Query Grpc Server Error Rate, Thanos Query Grpc Client Error Rate, Thanos Query High DNS Failures, Thanos Query Instant Latency High, Thanos Query Range Latency High, Thanos Query Overload
Thanos Receive Http Request Error Rate High, Thanos Receive Http Request Latency High, Thanos Receive High Replication Failures, Thanos Receive High Forward Request Failures, Thanos Receive High Hashring File Refresh Failures, Thanos Receive Config Reload Failure, Thanos Receive No Upload
Thanos Sidecar Bucket Operations Failed, Thanos Sidecar No Connection To Started Prometheus
Thanos Store Grpc Error Rate, Thanos Store Series Gate Latency High, Thanos Store Bucket High Operation Failures, Thanos Store Objstore Operation Latency High
Thanos Rule Queue Is Dropping Alerts, Thanos Rule Sender Is Failing Alerts, Thanos Rule High Rule Evaluation Failures, Thanos Rule High Rule Evaluation Warnings, Thanos Rule Rule Evaluation Latency High, Thanos Rule Grpc Error Rate, Thanos Rule Config Reload Failure, Thanos Rule Query High DNS Failures, Thanos Rule Alertmanager High DNS Failures, Thanos Rule No Evaluation For 10 Intervals, Thanos No Rule Evaluations
Thanos Bucket Replicate Error Rate, Thanos Bucket Replicate Run Latency
Thanos Compact Is Down, Thanos Query Is Down, Thanos Receive Is Down, Thanos Rule Is Down, Thanos Sidecar Is Down, Thanos Store Is Down

4 rules

Loki process too many restarts, Loki request errors, Loki request panic, Loki request latency

2 rules

Promtail request errors, Promtail request latency

6 rules

Cortex ruler configuration reload failure, Cortex not connected to Alertmanager, Cortex notifications are being dropped, Cortex notification error, Cortex ingester unhealthy, Cortex frontend queries stuck

7 rules

Jenkins offline, Jenkins healthcheck, Jenkins outdated plugins, Jenkins builds health score, Jenkins run failure total, Jenkins build tests failing, Jenkins last build failed

6 rules

APC UPS Battery nearly empty, APC UPS Less than 15 Minutes of battery time remaining, APC UPS AC input outage, APC UPS low battery voltage, APC UPS high temperature, APC UPS high load

6 rules

Provider failed because net_version failed, Provider failed because get genesis failed, Provider failed because net_version timeout, Provider failed because get genesis timeout, Store connection is too slow, Store connection is too slow