Metrics

BanyanDB exposes metrics for monitoring and analysis.

Scrape source — the FODC proxy. In a cluster deployment, metrics are collected by scraping the FODC proxy /metrics endpoint (see FODC overview), which is the single Prometheus scrape target. The proxy aggregates every BanyanDB node’s metrics and adds per-node identity labels, so all the PromQL below is written for that scheme:

  • $job — the Prometheus scrape job for the FODC proxy.
  • $role — the node role, matched via the container_name label (liaison or data).
  • $pod — a BanyanDB node, matched via the pod_name label (the full node identity, e.g. banyandb-data-hot-0).

Because the proxy is the only target, the Prometheus-synthesized instance, job, and up labels describe the proxy, not individual BanyanDB nodes — use $role / $pod to scope a query to a node. Original BanyanDB labels (group, kind, method, service, path, operation, remote_node, …) are preserved on every sample. $__rate_interval is the Grafana rate-interval variable. Every expression below carries the {job=~"$job", container_name=~"$role", pod_name=~"$pod"} selector; a few liaison-only metrics (the write queue and its sync loop) are pinned with the literal container_name="liaison" instead.

(If you are not running the FODC proxy, BanyanDB also exposes its own metrics on port 2121; scrape each pod directly and substitute the Kubernetes pod/instance target labels for $pod below.)
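Outside Grafana the $job, $role, $pod and $__rate_interval placeholders are not expanded, so an expression has to be rendered before it can be sent to Prometheus. The sketch below assumes plain string substitution is sufficient; the function name render and the example values (fodc-proxy, 5m) are hypothetical and not part of BanyanDB or Grafana.

```python
def render(expr: str, job: str = ".*", role: str = ".*", pod: str = ".*",
           rate_interval: str = "5m") -> str:
    """Replace the dashboard variables in a PromQL template with concrete values."""
    substitutions = {
        "$__rate_interval": rate_interval,  # Grafana computes this; here it is a fixed guess
        "$job": job,
        "$role": role,
        "$pod": pod,
    }
    for placeholder, value in substitutions.items():
        expr = expr.replace(placeholder, value)
    return expr


# Template copied from the Reporting Nodes panel.
template = 'count(banyandb_system_up_time{job=~"$job", container_name=~"$role", pod_name=~"$pod"})'
query = render(template, job="fodc-proxy", role="data", pod="banyandb-data-hot-.*")
print(query)
```

Because the selectors use regex matchers (=~), the default ".*" leaves a dimension unscoped, which mirrors selecting "All" for a dashboard variable.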

Dashboards

The metrics are presented through two complementary Grafana dashboards, split by aggregation dimension (see Metrics Providers):

  • BanyanDB Cluster — Nodes — node/pod-level health and resources, aggregated by pod_name: Fleet Overview, Per-node Health, Resources, Disk by Path, and Go Runtime.
  • BanyanDB Cluster — Workload — business/data-level throughput and latency, aggregated by group: Cluster Workload Summary, Liaison: Ingestion, Query & Publish, Data: Storage, Data: Inverted Index, and Data: Internal Queue.

The sections below mirror the two dashboards row-for-row; each metric entry corresponds to one panel and uses that panel’s expression. A standalone Internal queue metrics reference at the end documents the queue_sub / queue_pub model in depth.


Nodes dashboard

Node and process health, aggregated per pod_name. Use this dashboard to find an overloaded, low-on-disk, or restarting node.

Fleet Overview

Reporting Nodes

The number of BanyanDB nodes currently reporting through the FODC proxy. (The Prometheus up metric reflects the proxy target, not individual nodes, so node liveness is derived from the per-node banyandb_system_up_time gauge instead.)

Expression: count(banyandb_system_up_time{job=~"$job", container_name=~"$role", pod_name=~"$pod"})

Nodes by Role

The reporting-node count broken down by role (liaison / data), via the container_name label.

Expression: count(banyandb_system_up_time{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (container_name)

Total CPU Cores

The total number of CPU cores across the selected nodes.

Expression: sum(banyandb_system_cpu_num{job=~"$job", container_name=~"$role", pod_name=~"$pod"})

Total Memory Used

The total resident memory used across the selected nodes, in bytes.

Expression: sum(banyandb_system_memory_state{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="used"})

Total Disk Used

The total disk space used across the selected nodes, in bytes (summed over all storage paths). See Resources → Disk Usage % for the used/total percentage and Disk by Path for the per-path breakdown.

Expression: sum(banyandb_system_disk{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="used"})

Node Uptime

The per-node uptime gauge. A value dropping to ~0 indicates a restart; a series disappearing indicates a node that stopped reporting.

Expression: banyandb_system_up_time{job=~"$job", container_name=~"$role", pod_name=~"$pod"} (one series per pod_name)
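The restart signal described above can be sketched as a scan over one node's uptime samples: any decrease between consecutive scrapes is read as a restart. The sample values are hypothetical, and the helper is only an illustration, not something the dashboard ships.

```python
def count_restarts(uptime_samples):
    """Count how often the uptime gauge decreased between consecutive samples."""
    restarts = 0
    for previous, current in zip(uptime_samples, uptime_samples[1:]):
        if current < previous:  # uptime can only fall when the process restarted
            restarts += 1
    return restarts


# Hypothetical uptime values in seconds, one per scrape (15 s apart).
samples = [3600.0, 3615.0, 3630.0, 4.0, 19.0, 34.0]
print(count_restarts(samples))
```

A series that simply ends (a node that stopped reporting) produces no decrease, so it has to be detected separately, for example through the Reporting Nodes count.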

Per-node Health

A single table joining the key per-node signals so an unhealthy node stands out at a glance. The columns are the metrics documented in Resources below, each reduced by (pod_name):

  • Uptime — banyandb_system_up_time{…}
  • CPU (cores) — sum(rate(process_cpu_seconds_total{…}[$__rate_interval])) by (pod_name)
  • RSS — sum(process_resident_memory_bytes{…}) by (pod_name)
  • Memory % — max(banyandb_system_memory_state{…, kind="used_percent"}) by (pod_name)
  • Disk % — sum(banyandb_system_disk{…, kind="used"}) by (pod_name) / sum(banyandb_system_disk{…, kind="total"}) by (pod_name)

Resources

Resources metrics monitor per-node resource usage.

CPU Usage

CPU cores consumed per node. Sustained high utilization correlates with rising query/merge latency.

Expression: sum(rate(process_cpu_seconds_total{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (pod_name)

RSS Memory

Resident set size per node, in bytes. The peak over the rate interval is used so transient spikes are not hidden.

Expression: max by (pod_name) (max_over_time(process_resident_memory_bytes{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval]))

System Memory %

The percentage of system memory used per node. As it approaches the memory protector limit (--allowed-percent, default 75), query execution is throttled, so high memory surfaces as query slowdown first.

Expression: max(banyandb_system_memory_state{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="used_percent"}) by (pod_name)

Disk Usage %

The fraction of disk used per node, as a ratio between 0 and 1 (summed over all storage paths). BanyanDB rejects writes with STATUS_DISK_FULL once usage crosses the disk-full threshold (default 95%), so alert well before that, commonly at ~80–85%.

Expression: sum(banyandb_system_disk{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="used"}) by (pod_name) / sum(banyandb_system_disk{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="total"}) by (pod_name)

Network Usage

Bytes received and sent per second, per node and interface (name).

Expression1 (recv): sum(rate(banyandb_system_net_state{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="bytes_recv"}[$__rate_interval])) by (pod_name, name)

Expression2 (sent): sum(rate(banyandb_system_net_state{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="bytes_sent"}[$__rate_interval])) by (pod_name, name)

Disk by Path

BanyanDB can place data on multiple storage paths; these panels break banyandb_system_disk down by the path label so a single full volume is visible even when the node total looks healthy.

Disk Used by Path

Expression: sum(banyandb_system_disk{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="used"}) by (pod_name, path)

Disk Total by Path

Expression: sum(banyandb_system_disk{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="total"}) by (pod_name, path)

Disk Used % by Path

Expression: sum(banyandb_system_disk{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="used"}) by (pod_name, path) / sum(banyandb_system_disk{job=~"$job", container_name=~"$role", pod_name=~"$pod", kind="total"}) by (pod_name, path)
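The arithmetic below illustrates why the per-path breakdown matters. It is a sketch with hypothetical paths and sizes: one volume is almost full, yet the node-level ratio (used summed over paths divided by total summed over paths) still looks comfortable.

```python
GIB = 1024 ** 3

# Hypothetical samples for one node: path -> (used bytes, total bytes).
paths = {
    "/data/hot": (96 * GIB, 100 * GIB),
    "/data/index": (10 * GIB, 100 * GIB),
}

# Node-level ratio, as in Disk Usage %: sum(used) / sum(total).
node_ratio = sum(used for used, _ in paths.values()) / sum(total for _, total in paths.values())

# Per-path ratio, as in Disk Used % by Path.
path_ratio = {path: used / total for path, (used, total) in paths.items()}

print(f"node: {node_ratio:.0%}")
for path, ratio in path_ratio.items():
    print(f"{path}: {ratio:.0%}")
```

With these numbers the node reports 53% while /data/hot is at 96%, past the default 95% disk-full threshold mentioned under Disk Usage %.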

Go Runtime

Standard Go process metrics, per node. Rising goroutine counts, GC pause, or allocation rate are early indicators of memory pressure or a leak.

Goroutines

Expression: sum(go_goroutines{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (pod_name)

GC Pause (avg)

Average garbage-collection pause, derived from the GC duration summary.

Expression: sum(rate(go_gc_duration_seconds_sum{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (pod_name) / sum(rate(go_gc_duration_seconds_count{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (pod_name)
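The expression divides the growth of the summary's _sum by the growth of its _count. A worked sketch with two hypothetical scrapes follows; it ignores the extrapolation that rate() applies at window boundaries.

```python
def average_pause(sum_prev, sum_now, count_prev, count_now, window_seconds):
    """Average GC pause over the window, mirroring rate(_sum) / rate(_count)."""
    rate_sum = (sum_now - sum_prev) / window_seconds        # pause seconds per second
    rate_count = (count_now - count_prev) / window_seconds  # GC cycles per second
    if rate_count == 0:
        return float("nan")  # no GC cycle in the window; PromQL also yields NaN for 0/0
    return rate_sum / rate_count


# Hypothetical counters 60 s apart: 0.06 s of pause spread over 30 GC cycles.
avg = average_pause(sum_prev=1.20, sum_now=1.26, count_prev=400, count_now=430,
                    window_seconds=60)
print(f"{avg * 1000:.1f} ms")
```

The window length cancels out in the division, which is why the result is an average per GC cycle rather than a per-second figure.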

Heap In-Use

Heap memory currently in use, per node.

Expression: sum(go_memstats_heap_inuse_bytes{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (pod_name)

Heap Next-GC / Alloc Rate

The heap size that will trigger the next GC, alongside the allocation rate.

Expression1 (next_gc): sum(go_memstats_next_gc_bytes{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (pod_name)

Expression2 (alloc_rate): sum(rate(go_memstats_alloc_bytes_total{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (pod_name)


Workload dashboard

Business- and data-level throughput and latency, aggregated per group. Use this dashboard to see which business group / operation is slow, erroring, or backlogged.

Cluster Workload Summary

Cluster-wide RED stats. Each summand is wrapped in or vector(0) so a not-yet-registered counter reports 0 rather than “No Data”.

Cluster Write Rate

Total write operations per second across measures, streams and traces.

Expression: (sum(rate(banyandb_measure_total_written{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) or vector(0)) + (sum(rate(banyandb_stream_tst_total_written{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) or vector(0)) + (sum(rate(banyandb_trace_tst_total_written{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) or vector(0))

Cluster Query Rate

Query operations per second on the liaison servers.

Expression: sum(rate(banyandb_liaison_grpc_total_started{job=~"$job", container_name=~"$role", pod_name=~"$pod", method="query"}[$__rate_interval]))

Error Rate

Cluster error rate per minute, combining liaison gRPC, registry, stream-receive, queue-publisher, and liaison write-queue sync-loop errors. Several of these counters are lazily registered (absent until the first error), hence the or vector(0) guards.

Expression: (sum(rate(banyandb_liaison_grpc_total_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0)) + (sum(rate(banyandb_liaison_grpc_total_registry_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0)) + (sum(rate(banyandb_liaison_grpc_total_stream_msg_received_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0)) + (sum(rate(banyandb_queue_pub_total_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0)) + (sum(rate(banyandb_measure_total_sync_loop_err{job=~"$job", container_name="liaison", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0)) + (sum(rate(banyandb_stream_tst_total_sync_loop_err{job=~"$job", container_name="liaison", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0)) + (sum(rate(banyandb_trace_tst_total_sync_loop_err{job=~"$job", container_name="liaison", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0))
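The effect of the or vector(0) guards can be modelled in a few lines. In this sketch None stands for a term whose counter is not yet registered (an empty result in PromQL); the function name and the sample rates are hypothetical.

```python
from typing import Optional, Sequence


def errors_per_minute(terms: Sequence[Optional[float]], guard: bool) -> Optional[float]:
    """Add per-second error rates and scale to per-minute.

    guard=True models wrapping every term in `or vector(0)`;
    guard=False models adding the raw terms.
    """
    total = 0.0
    for rate in terms:
        if rate is None:
            if not guard:
                return None  # an empty operand makes the whole PromQL sum empty
            rate = 0.0       # `or vector(0)` substitutes 0 for the missing term
        total += rate * 60
    return total


# Hypothetical per-second rates: one counter is active, one is absent, one is idle.
terms = [0.5, None, 0.0]
print(errors_per_minute(terms, guard=True))
print(errors_per_minute(terms, guard=False))
```

Without the guard a single unregistered counter blanks the panel ("No Data") even though other terms are reporting errors.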

Liaison: Ingestion, Query & Publish

The liaison node’s front door — gRPC ingestion/query, the schema registry, the on-disk write queue (wqueue), and the tier-2 publish pipeline to data nodes.

Query Rate by Service

Query rate split by the service label (measure / stream / trace / property / topn).

Expression: sum(rate(banyandb_liaison_grpc_total_started{job=~"$job", container_name=~"$role", pod_name=~"$pod", method="query"}[$__rate_interval])) by (service)

Query Latency by Group

Average query latency in seconds, per group — total latency divided by total started. Because a distributed query fans out to all nodes, the slowest group dominates the overall query latency.

Expression: sum(rate(banyandb_liaison_grpc_total_latency{job=~"$job", container_name=~"$role", pod_name=~"$pod", method="query"}[$__rate_interval])) by (group) / sum(rate(banyandb_liaison_grpc_total_started{job=~"$job", container_name=~"$role", pod_name=~"$pod", method="query"}[$__rate_interval])) by (group)

gRPC Error Rate

Liaison error rate, broken down by service/method, plus registry and stream-receive errors. All terms are lazily registered, so they are simply absent on a healthy cluster.

Expression1 (grpc, by service/method): sum(rate(banyandb_liaison_grpc_total_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (service, method)

Expression2 (registry): sum(rate(banyandb_liaison_grpc_total_registry_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval]))

Expression3 (stream recv): sum(rate(banyandb_liaison_grpc_total_stream_msg_received_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval]))

Registry Operation Rate

The rate of schema-registry operations and non-query gRPC calls.

Expression1 (registry): sum(rate(banyandb_liaison_grpc_total_registry_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval]))

Expression2 (non-query): sum(rate(banyandb_liaison_grpc_total_started{job=~"$job", container_name=~"$role", pod_name=~"$pod", method!="query"}[$__rate_interval]))

Write Rate by Group

Write operations per second on the liaison, per group, for measures, streams and traces (charted as separate series rather than added, since each data type uses different group values).

Expression1: sum(rate(banyandb_measure_total_written{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression2: sum(rate(banyandb_stream_tst_total_written{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression3: sum(rate(banyandb_trace_tst_total_written{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Publish Throughput & Success

The tier-2 publisher’s success throughput by operation, alongside file-sync bytes sent by group. See the queue reference for the full model.

Expression1 (success): sum(rate(banyandb_queue_pub_total_finished{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (operation)

Expression2 (bytes): sum(rate(banyandb_queue_pub_sent_bytes{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Write Queue Length (wqueue)

The liaison buffers writes in an on-disk write queue (wqueue) before syncing each part to the data nodes. These gauges (scoped with the literal container_name="liaison") expose the queue depth per pod_name/group. The wqueue accumulates on disk — it does not drop data after a fixed window — so a sustained rise in any of these, especially pending, signals a slow or unavailable downstream data node and can eventually fill the liaison disk (STATUS_DISK_FULL). The matching *_total_sync_loop_* counters (started/finished/latency/bytes, and the lazily-registered *_total_sync_loop_err) describe the loop that drains it.

Expression1 (file parts): sum(banyandb_{measure,stream_tst,trace_tst}_total_file_parts{job=~"$job", container_name="liaison", pod_name=~"$pod"}) by (pod_name, group)

Expression2 (mem parts): sum(banyandb_{measure,stream_tst,trace_tst}_total_mem_part{job=~"$job", container_name="liaison", pod_name=~"$pod"}) by (pod_name, group)

Expression3 (pending): sum(banyandb_{measure,stream_tst,trace_tst}_pending_data_count{job=~"$job", container_name="liaison", pod_name=~"$pod"}) by (pod_name, group)

The panel charts all three families for measure, stream_tst and trace_tst as separate series (nine queries); the {measure,stream_tst,trace_tst} brace above is shorthand for the three concrete metric names.
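Since PromQL itself has no brace expansion, the shorthand has to be unrolled into concrete metric names before it can be queried. A minimal sketch that generates the nine queries from the three data types and three families named above:

```python
DATA_TYPES = ("measure", "stream_tst", "trace_tst")
FAMILIES = ("total_file_parts", "total_mem_part", "pending_data_count")

# Selector copied from the panel; note the literal container_name="liaison".
SELECTOR = '{job=~"$job", container_name="liaison", pod_name=~"$pod"}'

queries = [
    f"sum(banyandb_{data_type}_{family}{SELECTOR}) by (pod_name, group)"
    for family in FAMILIES
    for data_type in DATA_TYPES
]

for query in queries:
    print(query)
```

The variable names DATA_TYPES, FAMILIES and SELECTOR are only illustrative; the metric names they produce are the ones the shorthand stands for.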

Publish Send Latency p99

The 99th-percentile tier-2 send latency, per operation, from the publisher latency histogram.

Expression: histogram_quantile(0.99, sum(rate(banyandb_queue_pub_total_latency_bucket{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (le, operation))
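To make the p99 figure easier to interpret, here is a simplified sketch of how histogram_quantile estimates a quantile from cumulative buckets: it locates the bucket containing the target rank and interpolates linearly inside it. The bucket bounds and counts are hypothetical (the actual bounds of the publisher histogram are not stated here), and the sketch assumes 0 < q <= 1 and positive bucket bounds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    for index, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    # Linear interpolation within the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical latency buckets in seconds.
buckets = [(0.1, 900.0), (0.5, 980.0), (1.0, 995.0), (math.inf, 1000.0)]
p99 = histogram_quantile(0.99, buckets)
print(round(p99, 4))
```

The estimate is only as fine as the bucket layout: here every value between ranks 980 and 995 is assumed to be spread evenly across 0.5 s to 1.0 s.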

Data: Storage

Storage metrics monitor the data nodes’ on-disk state, grouped by the group tag. Switch the $pod filter between pod_name values to find a hot node or uneven data distribution.

Total Data

The total number of data elements (points/rows) stored, per group, for measures, streams and traces.

Expression1: sum(banyandb_measure_total_file_elements{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (group)

Expression2: sum(banyandb_stream_tst_total_file_elements{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (group)

Expression3: sum(banyandb_trace_tst_total_file_elements{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (group)

Merge File Rate

Merge-file operations per minute, per group. A surge means many small files are being merged, which raises disk I/O and CPU and can slow queries.

Expression1: sum(rate(banyandb_measure_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group) * 60

Expression2: sum(rate(banyandb_stream_tst_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group) * 60

Expression3: sum(rate(banyandb_trace_tst_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group) * 60

Merge File Latency

Average merge-file latency in seconds, per group (total merge latency ÷ merge operations). A surge indicates slow merges, often from high disk I/O.

Expression1: sum(rate(banyandb_measure_total_merge_latency{job=~"$job", container_name=~"$role", pod_name=~"$pod", type="file"}[$__rate_interval])) by (group) / sum(rate(banyandb_measure_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression2: sum(rate(banyandb_stream_tst_total_merge_latency{job=~"$job", container_name=~"$role", pod_name=~"$pod", type="file"}[$__rate_interval])) by (group) / sum(rate(banyandb_stream_tst_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression3: sum(rate(banyandb_trace_tst_total_merge_latency{job=~"$job", container_name=~"$role", pod_name=~"$pod", type="file"}[$__rate_interval])) by (group) / sum(rate(banyandb_trace_tst_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Merge File Partitions

Average number of partitions merged per merge-file operation, per group (merged parts ÷ merge operations). A surge suggests the server is under heavy write load.

Expression1: sum(rate(banyandb_measure_total_merged_parts{job=~"$job", container_name=~"$role", pod_name=~"$pod", type="file"}[$__rate_interval])) by (group) / sum(rate(banyandb_measure_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression2: sum(rate(banyandb_stream_tst_total_merged_parts{job=~"$job", container_name=~"$role", pod_name=~"$pod", type="file"}[$__rate_interval])) by (group) / sum(rate(banyandb_stream_tst_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression3: sum(rate(banyandb_trace_tst_total_merged_parts{job=~"$job", container_name=~"$role", pod_name=~"$pod", type="file"}[$__rate_interval])) by (group) / sum(rate(banyandb_trace_tst_total_merge_loop_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Data: Inverted Index

Inverted Index metrics monitor the series index of measures and streams, grouped by the group tag. Rapid growth or churn signals a cardinality problem that bloats the index and slows queries.

Series Write Rate

Series index update operations per second, per group, for measures and streams.

Expression1 (measure): sum(rate(banyandb_measure_inverted_index_total_updates{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression2 (stream): sum(rate(banyandb_stream_storage_inverted_index_total_updates{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Series Term Search Rate

Series term-search operations per second, per group. A high value means reads are fetching many series — often a high-cardinality symptom.

Expression1 (measure): sum(rate(banyandb_measure_inverted_index_total_term_searchers_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Expression2 (stream): sum(rate(banyandb_stream_storage_inverted_index_total_term_searchers_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Total Series

The total number of series (index documents) stored, per group, for measures and streams.

Expression1 (measure): sum(banyandb_measure_inverted_index_total_doc_count{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (group)

Expression2 (stream): sum(banyandb_stream_storage_inverted_index_total_doc_count{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (group)

Stream tst: Write Rate

Write rate into the stream time-series-table (tst) inverted index (the per-element index, distinct from the series index above), per group.

Expression: sum(rate(banyandb_stream_tst_inverted_index_total_updates{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Stream tst: Term Search Rate

Term-search rate on the stream tst inverted index, per group.

Expression: sum(rate(banyandb_stream_tst_inverted_index_total_term_searchers_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)

Stream tst: Total Documents

Total documents in the stream tst inverted index, per group.

Expression: sum(banyandb_stream_tst_inverted_index_total_doc_count{job=~"$job", container_name=~"$role", pod_name=~"$pod"}) by (group)

Data: Internal Queue

One panel per queue operation, each showing subscribe throughput (started/finished) and p99 latency, broken down by group. These are the data-node (subscribe) view of the internal queue; the full model is in the queue reference. The four operations are query, file-sync, batch-write and control.

Each panel uses the same three expressions, with operation pinned to the panel’s value (<op> is one of query | file-sync | batch-write | control):

Started: sum(rate(banyandb_queue_sub_total_started{job=~"$job", container_name=~"$role", pod_name=~"$pod", operation="<op>"}[$__rate_interval])) by (group)

Finished: sum(rate(banyandb_queue_sub_total_finished{job=~"$job", container_name=~"$role", pod_name=~"$pod", operation="<op>"}[$__rate_interval])) by (group)

p99 latency: histogram_quantile(0.99, sum(rate(banyandb_queue_sub_total_latency_bucket{job=~"$job", container_name=~"$role", pod_name=~"$pod", operation="<op>"}[$__rate_interval])) by (le, group))

batch-write covers both plain writes and the secondary-index sync (measure/stream series-index, stream local-index, trace sidx-series). All of these carry their business group, so the per-group breakdown is complete for every operation.


Internal queue metrics reference (queue_sub / queue_pub)

Liaison nodes run an internal gRPC queue server (server-queue-sub, wired via sub.NewServerWithPorts in pkg/cmdsetup/liaison.go) and queue clients (server-queue-pub) for the tier-1/tier-2 pipelines. Prometheus metrics use the namespaces banyandb_queue_sub_* and banyandb_queue_pub_* (built from observability.RootScope + queue_sub / queue_pub sub-scopes). Data nodes expose the same families where the corresponding services run.

Both namespaces share one model: the base metrics total_started, total_finished, total_latency (a histogram), and total_err, labeled by operation (batch-write / file-sync / query / control) and group, plus the remote endpoint of the flow — remote_node (the peer’s BanyanDB node name, equal to its /cluster/topology metadata.name), remote_role (liaison / data), and remote_tier (hot / warm / cold, data only). total_err adds an error_type label. File-sync additionally exposes byte counters: sent_bytes (pub) and received_bytes (sub). The local end of each flow is the scrape target itself (pod_name / node_role / node_type from the FODC proxy), so joining the scrape labels with the remote_* labels reconstructs the liaison↔data(hot/warm/cold) call graph. For a worked recipe that fuses these metrics with the FODC /cluster/topology node inventory to render the topology, see Cluster Topology Rendering.
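The label join described above can be sketched as follows: each sample contributes one edge from its scrape-side pod_name to its remote_node. The samples are hypothetical (including the node name banyandb-data-warm-0), and a real implementation would read them from the Prometheus API rather than a literal list.

```python
from collections import defaultdict

# Hypothetical rate samples of banyandb_queue_pub_total_finished: (labels, value per second).
samples = [
    ({"pod_name": "banyandb-liaison-0", "remote_node": "banyandb-data-hot-0",
      "remote_role": "data", "remote_tier": "hot", "operation": "batch-write"}, 120.0),
    ({"pod_name": "banyandb-liaison-0", "remote_node": "banyandb-data-hot-0",
      "remote_role": "data", "remote_tier": "hot", "operation": "file-sync"}, 3.0),
    ({"pod_name": "banyandb-liaison-0", "remote_node": "banyandb-data-warm-0",
      "remote_role": "data", "remote_tier": "warm", "operation": "file-sync"}, 1.5),
]

# Local end (pod_name) joined with the remote end (remote_node, remote_tier) gives one edge.
edges = defaultdict(float)
for labels, value in samples:
    edge = (labels["pod_name"], labels["remote_node"], labels["remote_tier"])
    edges[edge] += value

for (source, target, tier), rate in sorted(edges.items()):
    print(f"{source} -> {target} [{tier}]: {rate}/s")
```

Summing across operation collapses the graph to one edge per node pair; keeping operation in the key instead would give one edge per flow type.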

queue_sub — inbound server

Metric (suffix after banyandb_queue_sub_) | Type | Labels | Meaning
total_started, total_finished | Counter | operation, group, remote_node, remote_role, remote_tier | Subscribe RPCs started / finished, per operation and group. A persistent started − finished gap indicates backlog or stuck handlers.
total_latency | Histogram | operation, group, remote_node, remote_role, remote_tier | Handling latency (_bucket / _sum / _count); use histogram_quantile for p50/p99.
total_err | Counter | …, error_type | Errors by type. Lazily registered, so absent (not zero) on a healthy cluster.
received_bytes | Counter | operation, group, remote_node, remote_role, remote_tier | Bytes received, file-sync only (operation="file-sync").

Troubleshooting: a growing total_started − total_finished gap (or rising total_latency p99) for a group/operation points at slow or stuck consumers. total_err broken down by error_type distinguishes transport issues (stream_error, recv_error, checksum_mismatch, out_of_order) from completion issues (finish_sync_err, part_failed). For file-sync, received_bytes together with total_latency separates partial completion from healthy throughput.

queue_pub — outbound batch client

Metric (suffix after banyandb_queue_pub_) | Type | Labels | Meaning
total_started, total_finished | Counter | operation, group, remote_node, remote_role, remote_tier | Sends started / finished per operation and target node; the rate of total_finished is the successful-send throughput.
total_latency | Histogram | operation, group, remote_node, remote_role, remote_tier | Send latency (_bucket / _sum / _count); use histogram_quantile for p99.
total_err | Counter | …, error_type | Send errors by type. error_type is one of non_transient, canceled, stream_canceled, retry_exhausted, recv_error, server_rejected (Send path) or stream_error, recv_error, checksum_mismatch, out_of_order, session_not_found, completion_error (file-sync). Lazily registered.
sent_bytes | Counter | operation, group, remote_node, remote_role, remote_tier | Bytes sent, file-sync only.

Troubleshooting: total_err by error_type separates transport failures (recv_error) from application-level SendResponse errors (server_rejected) and exhausted retries (retry_exhausted). A persistent total_started − total_finished gap, or rising total_latency p99 for a given remote_node / remote_tier, indicates slow or unavailable data nodes.

Metrics are only registered when metadata implements metadata.Service and MetricsRegistry() is non-nil (e.g. after SetMetricsRegistry in bootstrap). NewWithoutMetadata() leaves queue_pub metrics disabled and logs a warning (queue_pub metrics disabled: ...). The total_err counters above are registered lazily on first occurrence, so they are simply absent (not zero) on a healthy cluster.

Example PromQL snippets

Saturation (scope by node with the proxy labels):

  • Subscribe throughput: sum(rate(banyandb_queue_sub_total_started{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (operation) (and the matching banyandb_queue_sub_total_finished)
  • Subscribe p99 latency: histogram_quantile(0.99, sum(rate(banyandb_queue_sub_total_latency_bucket{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (le, operation))
  • File-sync bytes received: sum(rate(banyandb_queue_sub_received_bytes{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (group)
  • Publisher success rate: sum(rate(banyandb_queue_pub_total_finished{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (operation)
  • Publisher errors by type: sum(rate(banyandb_queue_pub_total_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (operation, error_type)
  • Publisher file-sync bytes sent (by tier): sum(rate(banyandb_queue_pub_sent_bytes{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])) by (remote_tier)

Suggested alerts (tune thresholds per cluster):

  • Non-zero sustained rate(banyandb_queue_pub_total_err{error_type="retry_exhausted"}[5m]) on liaison.
  • A sustained rate(banyandb_queue_sub_total_started[5m]) - rate(banyandb_queue_sub_total_finished[5m]) gap (subscribe backlog), or total_latency p99 above an environment-specific ceiling, for a single group / operation.
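What "sustained" means for the backlog alert is left to each cluster. One possible reading, sketched below, is that the started minus finished gap stays above a floor for several consecutive evaluations. The function name, the floor of 1.0 operations per second, and the three-evaluation window are all hypothetical tuning values.

```python
def sustained_backlog(started_rates, finished_rates, min_gap, windows):
    """True when started - finished exceeded min_gap in each of the last `windows` evaluations."""
    gaps = [started - finished for started, finished in zip(started_rates, finished_rates)]
    recent = gaps[-windows:]
    return len(recent) == windows and all(gap > min_gap for gap in recent)


# Hypothetical 5m rates for one group / operation, one value per evaluation.
started = [50.0, 52.0, 55.0, 58.0, 60.0]
finished = [50.0, 51.5, 50.0, 49.0, 48.0]

print(sustained_backlog(started, finished, min_gap=1.0, windows=3))
```

Requiring consecutive breaches keeps a single slow scrape or a short burst from firing the alert, at the cost of a later notification.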

Aggregate pipeline error rate (optional)

To combine subscribe-side and publisher-side queue failures (per-minute scaling as elsewhere in this doc; each term wrapped in or vector(0) so a missing counter doesn’t blank the result):

Expression: (sum(rate(banyandb_queue_sub_total_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0)) + (sum(rate(banyandb_queue_pub_total_err{job=~"$job", container_name=~"$role", pod_name=~"$pod"}[$__rate_interval])*60) or vector(0))