MONITORING-GET-BATCH
Overview
GetBatch (a.k.a. get-batch or x-moss) is the high-performance multi-object retrieval subsystem.
It streams objects and archived files in strict user-specified order, assembling them on the designated target (DT) and serving them as a TAR archive (buffered or streaming).
TAR is the default output format; the other serialization formats (TAR.GZ, ZIP, and TAR.LZ4) are also fully supported.
See also:
- GetBatch: Multi-Object Retrieval API for an overview, capabilities and APIs, operational guidance, usage, Go and Python examples, and more.
- Release Notes 4.0 - GetBatch introduction.
Unlike ordinary GET requests, get-batch:
- pulls objects concurrently from many targets,
- relies on intra-cluster streaming via SharedDM (SDM),
- performs archive extraction (for shards),
- and obeys load-based throttling and soft-error recovery.
This page documents how to observe and monitor get-batch jobs at scale.
Key Metrics
All metrics below are per-target Prometheus counters/totals.
Use rate() or increase() over a window for meaningful rates.
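To make the rate()/increase() advice concrete, the sketch below shows roughly what those functions derive from raw counter samples. It is a simplified illustration, not Prometheus's implementation: the real functions also extrapolate to the window boundaries. The function names and sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across consecutive counter samples.

    A drop between two samples is treated as a counter reset
    (e.g., a target restart): the counter restarted from zero,
    so the entire current value counts as new increase.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def counter_rate(samples, window_seconds):
    """Average per-second rate over the window (no extrapolation)."""
    return counter_increase(samples) / window_seconds


# Hypothetical scrapes of getbatch.n; the target restarts between 150 and 30.
scrapes = [100, 150, 30, 60]
print(counter_increase(scrapes))
print(counter_rate(scrapes, window_seconds=110))
```

Reading a raw counter value directly would hide the restart; working from deltas is what makes the numbers comparable across targets and over time.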
Workload Volume & Mix
| Metric | Description |
|---|---|
| getbatch.n | Total number of get-batch work items processed (successful or failed). |
| getbatch.obj.n | Number of whole objects retrieved and delivered in the output TAR. |
| getbatch.file.n | Number of files extracted from shard archives. |
| getbatch.obj.size | Cumulative size (bytes) of whole objects retrieved. |
| getbatch.file.size | Cumulative size (bytes) of shard-extracted files. |
These represent actual payload delivered, not including error placeholders.
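As an illustration of how the volume counters combine, the following sketch derives the workload mix from counter increases over one window. The function name, the returned field names, and the example numbers are assumptions made for this example; only the four counters themselves come from the table above.

```python
def workload_mix(d_obj_n, d_obj_size, d_file_n, d_file_size):
    """Summarize one window of get-batch traffic.

    Arguments are counter increases over the same window for
    getbatch.obj.n, getbatch.obj.size, getbatch.file.n, getbatch.file.size.
    """
    entries = d_obj_n + d_file_n
    return {
        # average payload size of whole objects
        "avg_obj_bytes": d_obj_size / d_obj_n if d_obj_n else 0.0,
        # average payload size of shard-extracted files
        "avg_file_bytes": d_file_size / d_file_n if d_file_n else 0.0,
        # share of delivered entries that came out of shards
        "file_fraction": d_file_n / entries if entries else 0.0,
        "total_bytes": d_obj_size + d_file_size,
    }


# Hypothetical window: 10 whole objects (1000 B) and 30 extracted files (600 B).
print(workload_mix(10, 1000, 30, 600))
```

A high file fraction with small average file size suggests the job is dominated by archive extraction rather than bulk transfer.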
Latency, Throttling & Backpressure
| Metric | Description |
|---|---|
| getbatch.rxwait.ns | Total nanoseconds the DT spent waiting to receive missing/out-of-order entries from peer targets. Reflects SDM/peer/network-induced stalls. |
| getbatch.throttle.ns | Total nanoseconds slept due to load-based throttling (memory/CPU pressure). Intentional back-pressure applied before serving the next get-batch request. |
Interpretation:
- High rxwait → look at cluster-level causes: peer-to-peer streaming, SDM performance, or transient disconnects.
- High throttle → look at DT-level resource pressure (memory load, CPU load, configured Advice).
Together, these two metrics help answer “Why is my job slower?”
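The interpretation above can be sketched as a small classifier. Dividing a nanosecond counter's increase by the window length gives "seconds spent waiting per second of wall time", which is easier to reason about than raw nanoseconds. The function names and the 5% threshold are hypothetical choices for this example, not values taken from AIStore; the ratio is also not clamped, on the assumption that concurrent requests could push it above 1.0.

```python
NS_PER_SEC = 1_000_000_000


def stall_seconds_per_second(delta_ns, window_seconds):
    """Convert a nanosecond-counter increase into waiting seconds per second."""
    return delta_ns / NS_PER_SEC / window_seconds


def dominant_stall(rxwait_delta_ns, throttle_delta_ns, window_seconds,
                   min_ratio=0.05):
    """Return 'rxwait', 'throttle', or 'none' for one DT over one window.

    min_ratio is an illustrative cutoff: below it, neither kind of
    stall is considered significant.
    """
    rx = stall_seconds_per_second(rxwait_delta_ns, window_seconds)
    th = stall_seconds_per_second(throttle_delta_ns, window_seconds)
    if max(rx, th) < min_ratio:
        return "none"
    return "rxwait" if rx >= th else "throttle"


# Hypothetical 5-minute window: 30 s waiting on peers, 3 s throttled.
print(dominant_stall(30 * NS_PER_SEC, 3 * NS_PER_SEC, window_seconds=300))
```

A result of "rxwait" points toward peers and the network; "throttle" points toward the DT's own memory or CPU load.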
Soft vs Hard Errors
| Metric | Description |
|---|---|
| err_soft.getbatch.n | Number of soft error events (GFN recoveries and missing-path insertions). A single work item (WI) may contribute multiple increments. |
| err_getbatch.n | Number of hard errors (unrecoverable WIs and other failures, including 429 rejections). |
Soft errors reflect recoverable situations; hard errors reflect request-level failure or 429 (“too-many-requests”) rejection.
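One way to put the two error counters in context is to normalize them by the number of work items in the same window. The sketch below does that; the function name and example numbers are hypothetical. Because a single work item may increment the soft-error counter more than once, the soft figure is an events-per-work-item average and can exceed 1.0. Whether rejected (429) requests are also counted in getbatch.n is not stated here, so treat the hard-error ratio as approximate.

```python
def error_ratios(d_work_items, d_soft, d_hard):
    """Return (soft events per work item, hard errors per work item).

    Arguments are counter increases over the same window for
    getbatch.n, err_soft.getbatch.n, and err_getbatch.n.
    """
    if d_work_items == 0:
        return (0.0, 0.0)
    return (d_soft / d_work_items, d_hard / d_work_items)


# Hypothetical window: 200 work items, 50 soft events, 4 hard errors.
soft, hard = error_ratios(200, 50, 4)
print(f"soft/WI={soft:.2f} hard/WI={hard:.2f}")
```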
PromQL examples
For general PromQL examples, refer to the Observability: Prometheus document.
This section highlights only a few less obvious queries that help analyze GetBatch latency and, in particular, distinguish peer-induced stalls from resource-induced throttling.
Rx Stall Rate (peer-induced waits)
rate(ais_target_getbatch_rxwait_ns[5m])
Throttle Stall Rate (load-induced waits)
rate(ais_target_getbatch_throttle_ns[5m])
Error Behavior
Soft Error Rate
rate(ais_target_err_soft_getbatch_n[5m])
Hard Error Rate
rate(ais_target_err_getbatch_n[5m])
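The queries above follow a visible naming pattern: the metric name from the tables, with dots replaced by underscores and an `ais_target_` prefix. The sketch below builds such query strings programmatically. The pattern is inferred from the examples on this page rather than from a naming specification, and the helper names are hypothetical.

```python
def prom_name(metric, prefix="ais_target_"):
    """Map a table name such as 'getbatch.rxwait.ns' to its Prometheus form.

    Assumes the dot-to-underscore plus prefix pattern seen in the
    queries on this page.
    """
    return prefix + metric.replace(".", "_")


def rate_query(metric, window="5m", cluster_wide=False):
    """Build a per-target rate() query, optionally summed over all targets."""
    query = f"rate({prom_name(metric)}[{window}])"
    return f"sum({query})" if cluster_wide else query


for name in ("getbatch.rxwait.ns", "getbatch.throttle.ns",
             "err_soft.getbatch.n", "err_getbatch.n"):
    print(rate_query(name))
```

Passing `cluster_wide=True` wraps the expression in `sum(...)`, which is convenient for a cluster-level dashboard panel alongside the per-target view.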