First Occurrence Data Collection (FODC)
First Occurrence Data Collection (FODC) is an observability and diagnostics subsystem for BanyanDB.
It continuously collects runtime parameters, performance indicators, node states, and configuration
data from DB nodes, and also supports on-demand performance profiling and memory snapshots.
FODC has two primary goals:
- Ensure the stable operation of DB nodes across all lifecycle stages (bootstrap, steady state, scaling, failure, etc.).
- Provide trustworthy, structured data that supports capacity planning, performance analysis, and troubleshooting.
FODC adopts a Proxy + Agent deployment model and exposes a unified, ecosystem-friendly data interface to external systems (such as Prometheus and other observability platforms).
Overview
FODC provides multiple categories of data:
- Metric collection and short-term caching (small time window)
- Node topology and status, including runtime parameters and role states
- Node configuration collection
- On-demand performance profiling and memory snapshots
To accomplish this, FODC is deployed as:
- A central Proxy service
- Multiple Agents, typically co-located with BanyanDB nodes (sidecar pattern)
Agents connect to the Proxy via gRPC and register themselves. The Proxy then:
- Aggregates and normalizes data from all agents
- Exposes unified REST/Prometheus-style interfaces
- Issues on-demand diagnostic commands (profiling, snapshots, config capture, etc.) to one or more agents
Architecture
Deployment Model
FODC uses a one-to-one mapping between an Agent and a BanyanDB node:
- Each liaison node has one FODC Agent
- Each data node (hot / warm / cold) has one FODC Agent
- In manual deployment modes, the same 1:1 relationship must be preserved
Agents are typically deployed as sidecars in the same pod or host as the corresponding BanyanDB node.
Proxy–Agent Relationship
The Proxy acts as the control plane and data aggregator, while Agents act as the data plane local to each DB node.
Key characteristics:
- Agents connect to the Proxy using gRPC and a configured Proxy domain name.
- The connection is bi-directional:
- Agents stream node metrics, status, and configuration to the Proxy.
- The Proxy sends on-demand diagnostic commands back to the Agents.
ASCII Architecture Diagram
                 +-----------------------------------------+
                 |               FODC Proxy                |
                 |-----------------------------------------|
                 | - Agent registry                        |
                 | - Cluster topology view                 |
 External        | - Aggregated metrics (/metrics          |
 Clients & <---- |   and /metrics-windows)                 |
 Ecosystem       | - Lifecycle view (/cluster/lifecycle)   |
                 | - On-demand control APIs                |
                 +-----------------^-----------------------+
                                   |
                      gRPC bi-directional streams
                                   |
           -------------------------------------------------
           |                       |                       |
           v                       v                       v
  +------------------+    +------------------+    +------------------+
  | FODC Agent       |    | FODC Agent       |    | FODC Agent       |
  | (sidecar with    |    | (sidecar with    |    | (sidecar with    |
  | liaison node)    |    | data node - hot) |    | data node - warm)|
  |------------------|    |------------------|    |------------------|
  | - Scrape local   |    | - Scrape local   |    | - Scrape local   |
  |   Prometheus     |    |   Prometheus     |    |   Prometheus     |
  |   metrics        |    |   metrics        |    |   metrics        |
  | - Collect OS &   |    | - KTM / OS obs   |    | - KTM / OS obs   |
  |   KTM telemetry  |    |   metrics        |    |   metrics        |
  | - Execute on-    |    | - Execute on-    |    | - Execute on-    |
  |   demand profile |    |   demand profile |    |   demand profile |
  |   & heap dump    |    |   & heap dump    |    |   & heap dump    |
  +--------^---------+    +--------^---------+    +--------^---------+
           |                       |                       |
           |                       |                       |
    +------+--------+       +------+--------+       +------+--------+
    | BanyanDB      |       | BanyanDB      |       | BanyanDB      |
    | liaison node  |       | data node     |       | data node     |
    | (process)     |       | (hot tier)    |       | (warm tier)   |
    +---------------+       +---------------+       +---------------+
Additional data node tiers (e.g. datanode-cold) follow the same BanyanDB node ↔ FODC Agent 1:1 pattern.
Metric Collection and Prometheus Integration
Data Sources
Each FODC Agent collects metrics from:
- The local BanyanDB node
- Via its Prometheus /metrics HTTP endpoint
- Includes DB performance, internal queues, I/O stats, query latency, etc.
- Kernel & OS-level telemetry
- Through an integrated Kernel Telemetry Module (KTM) powered by eBPF
- Examples: OS page cache statistics, system I/O latency, CPU scheduling behavior
In-Memory Sliding Window Cache
Agents maintain a sliding window of recent metric samples in memory:
- A wake-up queue is used to buffer the last N collections.
- The time window is auto-tuned at startup based on:
- Sample interval
- Number of metrics
- Available memory constraints
- Target memory usage is kept low (around 30 MB per Agent) while still supporting:
- Short-term trend analysis
- Correlation during incident triage (e.g. spikes around first occurrence)
This design allows FODC to provide recent time-series context without depending on an external TSDB.
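The bounded window described above can be sketched as follows. This is a minimal illustration, not the Agent's actual implementation: the per-sample cost (`BYTES_PER_SAMPLE`), the function `window_capacity`, and the class `SlidingWindow` are hypothetical names, and the sizing rule simply divides an assumed memory budget by the cost of one collection.

```python
from collections import deque

# Assumed figures for illustration; the real sizing logic is not specified here.
MEMORY_BUDGET_BYTES = 30 * 1024 * 1024  # "around 30 MB per Agent"
BYTES_PER_SAMPLE = 64                   # hypothetical average cost of one stored sample


def window_capacity(num_metrics, budget=MEMORY_BUDGET_BYTES, per_sample=BYTES_PER_SAMPLE):
    """Return how many collections (N) fit into the memory budget."""
    per_collection = max(1, num_metrics) * per_sample
    return max(1, budget // per_collection)


class SlidingWindow:
    """Keeps only the last N collections; older ones are evicted automatically."""

    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)

    def add(self, timestamp, samples):
        # Appending to a full deque drops the oldest collection.
        self._buf.append((timestamp, samples))

    def between(self, start, end):
        """Return the collections whose timestamp lies in [start, end]."""
        return [(ts, s) for ts, s in self._buf if start <= ts <= end]

    def __len__(self):
        return len(self._buf)
```

Under these assumptions, the covered time window is roughly the capacity multiplied by the sample interval, so a node exposing more metrics gets a shorter window for the same memory budget.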
Agent Metric Exposure
Each FODC Agent exposes a Prometheus-compatible endpoint:
GET /metrics
- Returns the latest scraped metrics and local telemetry
- Can be scraped directly by:
- The FODC Proxy
- External observability systems (if desired and authorized)
Proxy Metric Exposure
The FODC Proxy aggregates the metrics of all registered agents and re-exposes them through two endpoints. Both accept optional role and pod_name query parameters to scope the result to a subset of nodes:
GET /metrics
- Returns the aggregated, latest per-node metrics from every agent in Prometheus text exposition format (Content-Type: text/plain; version=0.0.4).
- On each request the proxy collects the current sample from the agents on demand, then concatenates them, so scraping the proxy once yields every node's series. This is the single scrape target for a Prometheus-based setup; per-node identity is carried in the pod_name and container_name labels (see Observability › Metrics Providers).
- This endpoint exposes the aggregated node metrics, not a proxy-only view; the proxy's own health and agent counts are served separately by GET /health (status, online/total agents, uptime).
GET /metrics-windows
- Returns metrics as JSON: an array of time series, each with name, description, labels, agent_id, pod_name, and a time-sorted data array of {timestamp, value} points.
- When both start_time and end_time (RFC3339) are supplied, it returns the samples held in the agents' in-memory sliding window for that range; otherwise it falls back to the latest sample (the same data as /metrics, in JSON form).
- Each series carries whatever labels the underlying metric has (including node_role and container_name); richer cluster metadata such as node IDs, membership, and roles is served by GET /cluster/topology. This endpoint is intended for short-term trend queries and incident triage, not as a Prometheus scrape target.
This makes FODC a drop-in component for Prometheus-based ecosystems, while preserving richer semantic context about each node.
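As a rough client-side illustration of the /metrics-windows contract, the sketch below builds a query URL and reduces a response to per-series peaks. The helper names (`metrics_windows_url`, `peak_per_series`), the base URL, and the sample values are hypothetical; only the query parameters and the JSON field names (name, pod_name, data, timestamp, value) come from the description above.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode


def _rfc3339(dt):
    """Format an aware datetime as an RFC3339 UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def metrics_windows_url(base, start=None, end=None, role=None, pod_name=None):
    """Build a /metrics-windows URL; the range is sent only if both bounds are given."""
    params = {}
    if start is not None and end is not None:
        params["start_time"] = _rfc3339(start)
        params["end_time"] = _rfc3339(end)
    if role:
        params["role"] = role
    if pod_name:
        params["pod_name"] = pod_name
    query = urlencode(params)
    return f"{base}/metrics-windows" + (f"?{query}" if query else "")


def peak_per_series(payload):
    """Map (name, pod_name) to the maximum value among that series' data points."""
    peaks = {}
    for series in json.loads(payload):
        points = series.get("data") or []
        if points:
            key = (series["name"], series.get("pod_name"))
            peaks[key] = max(p["value"] for p in points)
    return peaks
```

Omitting either bound drops both from the query, which matches the documented fallback to the latest sample.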
Cluster Topology, Roles, and Runtime State
The FODC Proxy maintains an up-to-date view of cluster topology based on Agent registration:
- On startup, each Agent:
- Connects to the Proxy via gRPC
- Registers its:
- Node ID and role
- Basic runtime attributes and capabilities
- The Proxy aggregates these registrations into a logical cluster hierarchy structure.
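The registration flow above can be pictured with a small in-memory registry. This is a sketch under stated assumptions rather than the Proxy's real data model: `AgentRegistry`, its method names, and the heartbeat timeout are all hypothetical, and the online/offline decision is assumed to depend only on the age of the last heartbeat.

```python
import time


class AgentRegistry:
    """Toy in-memory registry; names and the timeout value are illustrative."""

    def __init__(self, heartbeat_timeout=30.0, clock=time.monotonic):
        self._timeout = heartbeat_timeout
        self._clock = clock
        self._agents = {}  # node_id -> {"role": str, "last_heartbeat": float}

    def register(self, node_id, role):
        # Re-registering a node replaces its old entry, preserving the 1:1 mapping.
        self._agents[node_id] = {"role": role, "last_heartbeat": self._clock()}

    def heartbeat(self, node_id):
        if node_id not in self._agents:
            raise KeyError(f"unregistered agent: {node_id}")
        self._agents[node_id]["last_heartbeat"] = self._clock()

    def status(self, node_id):
        """Report "online" while the last heartbeat is within the timeout."""
        age = self._clock() - self._agents[node_id]["last_heartbeat"]
        return "online" if age <= self._timeout else "offline"

    def by_role(self):
        """Group node IDs by role, the simplest form of a cluster hierarchy."""
        grouped = {}
        for node_id, info in sorted(self._agents.items()):
            grouped.setdefault(info["role"], []).append(node_id)
        return grouped
```

Injecting the clock keeps the sketch deterministic in tests; a real registry would also need locking and cleanup of long-dead entries.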
Topology & Status API
The Proxy exposes a cluster discovery endpoint:
GET /cluster/topology
- Triggers a topology collection across all registered agents and returns the merged snapshot as JSON: { "nodes": [...], "calls": [...] }.
- Each nodes entry carries:
  - Node identity: metadata.name and grpc_address
  - labels: e.g. pod_name and type (hot/warm/cold)
  - roles: role-name strings such as ROLE_META, ROLE_DATA, ROLE_LIAISON (extensible for future roles)
  - Agent status (online/offline) and last_heartbeat
- calls describes the node-to-node call graph reported by the agents (route-table membership, not data flow).
See Proxy APIs and CLI Flags for the full response schema. To turn this node inventory into a directed, weighted topology of the actual data flow — joining it with the queue metrics for per-edge throughput, latency, and errors — see Cluster Topology Rendering.
This simplifies integration with:
- Cluster dashboards
- Automated operations (e.g. scheduled checks before resharding / scaling)
- Higher-level diagnostics tooling that needs a consistent cluster graph
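For instance, a pre-scaling check might group the online nodes from a /cluster/topology response by role. The sketch below is illustrative only: `online_nodes_by_role` is a hypothetical helper, and it assumes that name is nested under metadata and that status and roles sit at the top level of each nodes entry, as the field list above suggests.

```python
import json


def online_nodes_by_role(topology_json):
    """Return {role: [node names]} for nodes whose agent status is "online"."""
    topology = json.loads(topology_json)
    by_role = {}
    for node in topology.get("nodes", []):
        if node.get("status") != "online":
            continue  # skip nodes whose agent is offline or has no status
        name = node.get("metadata", {}).get("name")
        for role in node.get("roles", []):
            by_role.setdefault(role, []).append(name)
    return by_role
```

A caller could then refuse to start a resharding job unless every expected role has at least one online node.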
Cluster Lifecycle and Group Information
Beyond topology, the Proxy aggregates per-group lifecycle data and per-pod lifecycle reports from the agents.
Lifecycle API
GET /cluster/lifecycle
- Triggers lifecycle-data collection from all agents that support the lifecycle stream and returns JSON with two sections:
  - groups: group lifecycle information (group name, catalog type, resource options such as shard count / segment interval / TTL, and data info), collected from the first agent that provides it (typically the liaison node).
  - lifecycle_statuses: per-pod lifecycle reports, i.e. the JSON report files read from each agent's lifecycle report directory.
- Agents that do not support the lifecycle stream are silently skipped.
This allows:
- Verifying group / resource-option consistency across the cluster
- Reviewing lifecycle (rotation / TTL / migration) reports per node
- Correlating group settings with observed behavior during incident triage
Crash Diagnostics API
GET /diagnostics
- Aggregates crash diagnostic records (structured panic records and on-disk crash artifacts) from all connected agents, with optional role and pod_name filters.
See Proxy APIs and CLI Flags for the full request/response schemas of both endpoints.
On-Demand Performance Profiling and Memory Snapshots
On-demand diagnostics are the first non-read-only capability exposed by FODC. They enable deep performance analysis while carefully controlling overhead.
Design Principles
- Opt-in and controlled: Diagnostic actions are only triggered through explicit API calls to the Proxy.
- Local execution, remote control: Agents perform the heavy work (profiling, snapshots) on the local node; the Proxy only orchestrates.
- Low default footprint: By default, Agents run in low CPU / low memory mode and do not perform expensive diagnostics.
- Burst resource usage when needed: Extra CPU/memory budget is consumed mainly during active profiling or snapshot sessions.
Typical On-Demand Actions
Exact APIs may vary by implementation, but commonly include:
- CPU profiling
- Short-term CPU usage profiling (e.g. pprof)
- Useful for identifying hot code paths under load
- Heap / memory snapshots
- Captures heap allocation state for leak or fragmentation analysis
- I/O / lock contention profiling
- Optional profiling of DB internal lock contention or I/O stalls
- Configuration snapshot on demand
- Force a re-capture of configuration and runtime flags at a specific point in time
- Apply RBAC or other authorization controls on Proxy APIs
Summary
FODC provides:
- Unified, structured observability for BanyanDB clusters (metrics, topology, configuration)
- Prometheus-friendly interfaces for easy ecosystem integration
- On-demand deep diagnostics (profiling, memory snapshots) orchestrated centrally but executed locally
- A lightweight, extensible Proxy + Agent architecture that respects resource constraints
This makes FODC a foundational component for reliable operation, performance analysis, and automated troubleshooting of BanyanDB deployments.