AIStore Observability: Prometheus

AIStore (AIS) exposes metrics in Prometheus format via HTTP endpoints.

This integration enables comprehensive monitoring of AIS clusters, performance tracking, capacity planning, and long-term trend analysis.

Table of Contents

Overview
Monitoring Stack
Prometheus Integration
Node Alerts
Best Practices
References
Related Documentation

Overview

AIS tracks a comprehensive set of performance metrics including:

Operation counters (GET/PUT/DELETE/etc.)
Resource utilization (CPU, memory, disk)
Latencies and throughput
Network and peer-to-peer streaming statistics
Extended actions (xactions)
Error counters and node state

AIS supports observability through several complementary tools:

Node logs (fine-grained operational events)
CLI for interactive monitoring (e.g., ais show cluster stats)
Monitoring backends:
- Prometheus (recommended)
- Grafana for dashboards & alerting

For load testing and benchmarking metrics, see AIS Load Generator and How To Benchmark AIStore.

A complete catalog of AIS metrics is available at: Monitoring Metrics Reference

Monitoring Stack

Typical Prometheus deployment:

┌────────────────┐       ┌────────────────┐
│                │ scrape│                │
│   Prometheus   │◄──────┤  AIStore Node  │
│                │       │   /metrics     │
└────────────────┘       └────────────────┘
       │
       │ query
       ▼
┌────────────────┐
│                │
│     Grafana    │
│                │
└────────────────┘

This stack provides:

Direct metric collection from AIS nodes
Centralized metric retention
Grafana visualization & alerting
Long-term performance & cost analysis

Prometheus Integration

Native Exporter

AIS acts as a first-class Prometheus exporter. Every node automatically:

Registers all metrics at startup
Exposes /metrics for Prometheus to scrape
Uses Prometheus native formatting and metadata
Works with both HTTP and HTTPS clusters

No configuration is required to enable the exporter; it is always on.

Internal AIS metric names (put.size, get.ns, etc.) are exported under Prometheus-style names, prefixed and labeled as follows:

ais_target_<metric_name>{node_id="T1"} <value>

This document primarily uses the exported Prometheus names.

Viewing Raw Metrics

View metrics directly:

$ curl http://<node>:<port>/metrics
# or
$ curl https://<node>:<port>/metrics

Example:

# HELP ais_target_put_bytes total bytes served via PUT
# TYPE ais_target_put_bytes counter
ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10

# HELP ais_target_put_ns_total total PUT latency (nanoseconds)
# TYPE ais_target_put_ns_total counter
ais_target_put_ns_total{node_id="ClCt8081"} 9.44367232e+09

# HELP ais_target_state_flags node state and alert flags
# TYPE ais_target_state_flags gauge
ais_target_state_flags{node_id="ClCt8081"} 6
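
For quick scripting without a Prometheus server, output like the example above can be parsed directly. The following is a minimal sketch, not an official AIS client: `parse_metrics` and `EXAMPLE` are hypothetical names, and the regular expressions assume simple label values (no escaped quotes or braces) and no trailing timestamps.

```python
import re

# Matches sample lines such as:
#   ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        m = SAMPLE_RE.match(line)
        if m is None:  # ignore anything that is not a plain sample line
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


# A shortened copy of the example output shown above.
EXAMPLE = """\
# HELP ais_target_put_bytes total bytes served via PUT
# TYPE ais_target_put_bytes counter
ais_target_put_bytes{node_id="ClCt8081"} 1.721761792e+10
# TYPE ais_target_state_flags gauge
ais_target_state_flags{node_id="ClCt8081"} 6
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels.get("node_id"), value)
```

For anything beyond ad hoc inspection, a maintained exposition-format parser is a safer choice than hand-written regular expressions.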

To watch the GET counter without Prometheus:

# Print the GET counter once per second; stop with Ctrl-C.
while true; do
  curl -s http://hostname:8081/metrics | grep "ais_target_get_count"
  sleep 1
done
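
The loop above prints raw counter values, not rates. Turning two readings into a per-second rate is simple arithmetic, sketched below under the assumption that the counter only resets to zero (for example, on node restart). `counter_rate` is a hypothetical helper; it mirrors the idea behind PromQL's `rate()` but omits its extrapolation.

```python
def counter_rate(prev, curr, seconds):
    """Per-second increase of a monotonically increasing counter.

    If the counter went backwards (assumed to mean a reset to zero),
    the current value is treated as the increase since the reset.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr - prev if curr >= prev else curr
    return delta / seconds


# Two hypothetical ais_target_get_count readings taken 5 seconds apart.
print(counter_rate(1000.0, 1250.0, 5.0))
```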

Key Metric Groups

AIS organizes metrics into four major groups, reflected in the codebase and Prometheus exporter:

Group	Description	Examples
1. Datapath	GET/PUT counters, sizes, latencies, rate-limiting, I/O errors	`ais_target_get_count`, `ais_target_put_bps`, `ais_target_ratelim_retry_get_n`
2. Metadata (in-memory)	Lcache activity (evictions, collisions)	`ais_target_lcache_evicted_count`
3. Extended Actions (xactions)	Background & multi-object jobs: LRU, EC, rebalance, ETL, Download, DSort, GetBatch	`ais_target_lru_evict_n`, `ais_target_getbatch_n`
4. Streams	Long-lived peer-to-peer (SharedDM) streaming channels	`ais_target_streams_out_obj_n`

For GetBatch observability, see Monitoring GetBatch.

Metric Labels

AIS exposes labels for filtering and aggregation:

Label	Usage
`node_id`	Node identity (target or gateway)
`disk`	Disk name for per-disk metrics
`bucket`	Source/destination bucket
`xaction`	Xaction UUID for multi-object jobs
`slice`	For erasure coding slice metrics
`archpath`	For per-file shard extraction (GetBatch)

Labels enable PromQL queries such as:

sum by (node_id)(rate(ais_target_put_bytes[5m]))
sum by (disk)(ais_target_disk_util)
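
To make the grouping behavior of `sum by (...)` concrete, here is a small sketch of the same aggregation over already-parsed samples. The node and disk names and the values are made up for illustration, and `sum_by` is a hypothetical helper rather than part of any AIS or Prometheus library.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values grouped by one label, like PromQL's `sum by (label)`.

    Samples that lack the label are grouped under the empty string.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical per-disk samples from two targets.
samples = [
    ({"node_id": "t1", "disk": "nvme0n1"}, 40.0),
    ({"node_id": "t1", "disk": "nvme1n1"}, 20.0),
    ({"node_id": "t2", "disk": "nvme0n1"}, 10.0),
]

print(sum_by(samples, "node_id"))
print(sum_by(samples, "disk"))
```

Grouping by `disk` merges same-named disks across nodes, so group by both `node_id` and `disk` when per-device values are needed.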

Essential PromQL Queries

GET operations per second

sum(rate(ais_target_get_count[5m]))

Average GET latency (ms)

sum(rate(ais_target_get_ns_total[5m]))
/ sum(rate(ais_target_get_count[5m]))
/ 1e6   # convert ns → ms
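
The query above divides the growth in total latency by the growth in request count. The same arithmetic on two raw readings of each counter looks roughly like this; `avg_latency_ms` is a hypothetical helper and the numbers are illustrative, not measured.

```python
def avg_latency_ms(ns_start, ns_end, count_start, count_end):
    """Average per-request latency in milliseconds over a window.

    Takes the start and end readings of a cumulative-nanoseconds counter
    and of the matching request counter.
    """
    requests = count_end - count_start
    if requests <= 0:
        return None  # no requests in the window, so the average is undefined
    return (ns_end - ns_start) / requests / 1e6  # ns -> ms


# Hypothetical readings: 3e9 ns of added latency across 1500 new GETs.
print(avg_latency_ms(2.0e9, 5.0e9, 1000, 2500))
```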

Disk utilization

ais_target_disk_util{disk="nvme0n1"}

GET error percentage

sum(rate(ais_target_err_get_count[5m]))
/ sum(rate(ais_target_get_count[5m])) * 100

Total cluster capacity usage

sum(ais_target_capacity_used)
/
sum(ais_target_capacity_total)
* 100

GetBatch (x-moss) Queries

GetBatch is AIStore’s high-performance multi-object retrieval pipeline. Metrics describe throughput, composition (objects vs files), Rx stalls, throttling, and error behavior.

Work items per second

sum(rate(ais_target_getbatch_n[5m]))

Logical payload throughput

sum(rate(ais_target_getbatch_obj_size[5m]))
+
sum(rate(ais_target_getbatch_file_size[5m]))

Stall breakdown (RxWait vs Throttle)

sum(rate(ais_target_getbatch_rxwait_ns[5m]))
/
(
  sum(rate(ais_target_getbatch_rxwait_ns[5m])) +
  sum(rate(ais_target_getbatch_throttle_ns[5m]))
)

Soft vs Hard Error Rates

# soft errors
rate(ais_target_err_soft_getbatch_n[5m])

# hard errors
rate(ais_target_err_getbatch_n[5m])

For full details and operational guidance, see Monitoring GetBatch.

Node Alerts

AIS nodes expose operational alerts and states via ais_target_state_flags. Flags indicate:

Red (critical)

OOS — Out of space
OOM — Out of memory
DiskFault — Disk failures
NumGoroutines — excessive goroutines
CertificateExpired — TLS expiry
KeepAliveErrors — peer connectivity issues

Warning

Rebalancing, Resilvering
RebalanceInterrupted, ResilverInterrupted
LowCapacity, LowMemory, LowCPU
NodeRestarted
MaintenanceMode
CertWillSoonExpire

Informational

ClusterStarted
NodeStarted
VoteInProgress

CLI Monitoring

$ ais show cluster

Prometheus Queries

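PromQL has no bitwise operators, so each flag is tested arithmetically: divide by the flag's mask, floor the result, and check whether it is odd.

Critical:

floor(ais_target_state_flags / 8192) % 2 == 1       # OOS
or floor(ais_target_state_flags / 16384) % 2 == 1   # OOM
or floor(ais_target_state_flags / 65536) % 2 == 1   # DiskFault

Warnings:

floor(ais_target_state_flags / 4096) % 2 == 1       # LowCapacity
or floor(ais_target_state_flags / 8192) % 2 == 1    # LowMemory

Mask values depend on each flag's bit position in the AIS release you run. The masks shown reuse 8192 for both OOS and LowMemory, so confirm them against the node state definitions before alerting on them.

Grafana Alert Example

floor(ais_target_state_flags{node_id=~"$node"} / 8192) % 2 == 1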
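
Outside Prometheus, a scraped `ais_target_state_flags` value can be decoded into flag names with a small lookup. This is a sketch only: the mask values below are copied from the queries in this section and are an assumption to verify against your AIS release, and `decode_flags` and `has_flag` are hypothetical helpers.

```python
# Mask values copied from the queries in this section (assumption:
# verify them against the node state definitions of your AIS release).
MASKS = {
    "OOS": 8192,
    "OOM": 16384,
    "DiskFault": 65536,
    "LowCapacity": 4096,
}


def decode_flags(value, masks=MASKS):
    """Return the sorted names of all flags set in a state_flags value."""
    bits = int(value)  # the gauge is scraped as a float
    return sorted(name for name, mask in masks.items() if bits & mask)


def has_flag(value, mask):
    """Bit test via integer division, for a power-of-two mask.

    Equivalent to `int(value) & mask != 0`; useful where bitwise
    operators are unavailable.
    """
    return (int(value) // mask) % 2 == 1


print(decode_flags(8192 + 4096))
print(has_flag(8192 + 4096, 16384))
```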

Best Practices

Prometheus retention. Plan retention around performance analysis needs (14–30 days recommended).
Dashboard segmentation. Maintain dashboards for:
- Cluster overview
- Node-level performance
- Resource utilization
- Error monitoring
- Extended actions (GetBatch, rebalance, ETL)
Alerts on critical states. Monitor:
- Node state flags
- Error spikes
- Disk utilization
- High throttle or rxwait stalls (GetBatch)
Scrape frequency. Use 5–15 seconds for critical workloads and 30 seconds or more for low-traffic clusters.

Related Documentation

Document	Description
Overview	AIS observability introduction
CLI	CLI monitoring and commands
Logs	Log-based observability
Metrics Reference	Full AIS metric catalog
Grafana	Grafana dashboards
Kubernetes	K8s deployment monitoring
GetBatch Monitoring	Multi-object retrieval metrics and analysis
