BanyanDB Clustering
BanyanDB Clustering introduces a robust and scalable architecture that comprises “Liaison Nodes” and “Data Nodes”, along with a property-based schema registry. This structure allows the system to distribute and manage time-series data effectively.
1. Architectural Overview
A BanyanDB installation includes two distinct types of nodes: Data Nodes and Liaison Nodes, along with a property-based schema registry for metadata management.
1.1 Data Nodes
Data Nodes hold all the raw time series data, metadata, and indexed data. They handle the storage and management of data, including streams, measures and traces, tag keys and values, as well as field keys and values.
Data Nodes also handle the local query execution. When a query is made, it is directed to a Liaison, which then interacts with Data Nodes to execute the distributed query and return results.
In addition to persisting raw data, Data Nodes also handle TopN aggregation calculations and other computational tasks.
1.2 Schema Registry
The schema registry is responsible for maintaining high-level metadata of the cluster, which includes:
- All nodes in the cluster
- All database schemas
In the default property mode, schema metadata is synchronized between nodes via an internal property-based protocol without requiring an external dependency.
1.3 Liaison Nodes
Liaison Nodes serve as gateways, routing traffic to Data Nodes. In addition to routing, they also provide authentication, TTL enforcement, and other security services to ensure secure and effective communication within the cluster.
Liaison Nodes are also responsible for the computational tasks associated with distributed queries. They build query tasks and fetch the matching data from Data Nodes.
For measure queries that include aggregation, Data Nodes perform a local map phase and the Liaison performs a reduce phase, so results remain correct across shards while limiting the amount of raw data sent over the network. See Distributed Measure Aggregation.
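To make the map/reduce split concrete, here is a minimal sketch of how an AVG aggregation can stay correct across shards: each Data Node returns only a partial (sum, count) instead of raw points, and the Liaison merges the partials. The function and type names are illustrative, not BanyanDB's actual API.

```go
package main

import "fmt"

// partial is the result of the local "map" phase on one Data Node:
// instead of shipping raw points, each shard returns only (sum, count).
type partial struct {
	sum   float64
	count int64
}

// localMap runs on a Data Node: aggregate the shard's raw values locally.
func localMap(values []float64) partial {
	p := partial{}
	for _, v := range values {
		p.sum += v
		p.count++
	}
	return p
}

// liaisonReduce runs on the Liaison: merge the partials from every shard,
// then finish the aggregation (here, a global average).
func liaisonReduce(parts []partial) float64 {
	var total partial
	for _, p := range parts {
		total.sum += p.sum
		total.count += p.count
	}
	if total.count == 0 {
		return 0
	}
	return total.sum / float64(total.count)
}

func main() {
	shard0 := localMap([]float64{10, 20})     // partial from Data Node 1
	shard1 := localMap([]float64{30, 40, 50}) // partial from Data Node 2
	fmt.Println(liaisonReduce([]partial{shard0, shard1})) // global AVG = 30
}
```

Note that averaging the per-shard averages (15 and 40) would give the wrong answer; carrying (sum, count) through the reduce phase is what keeps the result correct across shards.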
1.4 Standalone Mode
BanyanDB integrates multiple roles into a single process in the standalone mode, making it simpler and faster to deploy. This mode is especially useful for scenarios with a limited number of data points or for testing and development purposes.
In this mode, the single process performs the roles of the Liaison Node and Data Node, along with the schema registry. It receives requests, maintains metadata, processes queries, and handles data, all within a unified setup.
2. Communication within a Cluster
All nodes within a BanyanDB cluster communicate with other nodes according to their roles:
- The schema registry shares high-level metadata about the cluster.
- Data Nodes store and manage the raw time series data and communicate with the schema registry.
- Liaison Nodes distribute incoming data to the appropriate Data Nodes. They also handle distributed query execution and communicate with the schema registry.
Node Discovery
All nodes in the cluster are discovered through the configured node discovery mechanism (none, dns, or file). When a node starts up, it registers itself with the discovery service. The Liaison Nodes use this information to route requests to the appropriate nodes.
If Data Nodes become unreachable according to the discovery service, Liaison Nodes will not remove them from their routing list until they are also unreachable from the Liaison Nodes’ own perspective. This approach ensures that the system can continue to function even if some Data Nodes are temporarily unavailable.
3. Data Organization
Different nodes in BanyanDB are responsible for different parts of the database: Data Nodes hold the data itself, while Liaison Nodes manage the routing and processing of queries.
3.1 Schema Registry
The schema registry stores all high-level metadata that describes the cluster, including Group configurations (such as shard_num and replicas) and Data Node registration information. In the default property mode, this metadata is synchronized between nodes via an internal property-based protocol.
Liaison Nodes use this metadata to dynamically determine shard-to-node assignments using a deterministic round-robin algorithm. Rather than storing explicit shard allocation mappings, BanyanDB calculates assignments on-the-fly based on the current cluster topology. This design simplifies cluster scaling, as adding or removing nodes automatically triggers recalculation of assignments without manual intervention.
3.2 Data Nodes
Data Nodes store all raw time series data, metadata, and indexed data. On disk, the data is organized by <group>/shard-<shard_id>/<segment_id>/. The segment is designed to support retention policy.
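As an illustration of the on-disk layout, the sketch below builds a shard directory path in the `<group>/shard-<shard_id>/<segment_id>/` shape. The root directory and the segment-id format shown here are assumptions for the example; segments are the unit that the retention policy removes.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shardDir builds the on-disk layout <group>/shard-<shard_id>/<segment_id>/.
// The segment id format (e.g. a date-based id) is illustrative only; segments
// exist so that expired data can be dropped whole under the retention policy.
func shardDir(root, group string, shardID uint32, segmentID string) string {
	return filepath.Join(root, group, fmt.Sprintf("shard-%d", shardID), segmentID)
}

func main() {
	// Hypothetical root directory and segment id.
	fmt.Println(shardDir("/var/lib/banyandb", "measure-minute", 0, "seg-20240101"))
}
```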
3.3 Liaison Nodes
Liaison Nodes do not store data but manage the routing of incoming requests to the appropriate Data Nodes. They also provide authentication, TTL enforcement, and other security services.
They also handle the computational tasks associated with data queries, interacting directly with Data Nodes to execute queries and return results.
4. Determining Optimal Node Counts
When creating a BanyanDB cluster, choosing the appropriate number of each node type to configure and connect is crucial. The number of Data Nodes scales based on your storage and query needs. The number of Liaison Nodes depends on the expected query load and routing complexity.
If the write and read load comes from different sources, it is recommended to deploy separate Liaison Nodes for writes and reads. For instance, the write load might come from metrics, trace, or log collectors, while the read load comes from a web application.
This separation allows for more efficient routing of requests and better performance. It also allows for scaling out of the cluster based on the specific needs of each type of request. For instance, if the write load is high, you can scale out the write Liaison Nodes to handle the increased load.
The BanyanDB architecture allows for efficient clustering, scaling, and high availability, making it a robust choice for time series data management.
5. Writes in a Cluster
In BanyanDB, writing data in a cluster is designed to take advantage of the robust capabilities of underlying storage systems, such as Google Compute Persistent Disk or Amazon S3 (TBD). These platforms ensure high levels of data durability, making them an optimal choice for storing raw time series data.
5.1 Data Replication
Unlike some other systems, BanyanDB does not implement application-level replication, which can consume significant disk space. Instead, it delegates replication to the underlying storage systems. This simplifies the BanyanDB architecture, reduces the complexity of managing replication at the application level, and yields significant disk-space savings.
The comparison between using a storage system and application-level replication boils down to several key factors: reliability, scalability, and complexity.
Reliability: A storage system provides built-in data durability by automatically storing data across multiple systems. It’s designed to deliver 99.999999999% durability, ensuring data is reliably stored and available when needed. While replication can increase data availability, it’s dependent on the application’s implementation. Any bugs or issues in the replication logic can lead to data loss or inconsistencies.
Scalability: A storage system is highly scalable by design and can store and retrieve any amount of data from anywhere. As your data grows, the system grows with you. You don’t need to worry about outgrowing your storage capacity. Scaling application-level replication can be challenging. As data grows, so does the need for more disk space and compute resources, potentially leading to increased costs and management complexity.
Complexity: With the storage system handling replication, the complexity is abstracted away from the user. The user need not concern themselves with the details of how replication is handled. Managing replication at the application level can be complex. It requires careful configuration, monitoring, and potentially significant engineering effort to maintain.
Furthermore, the storage system might be cheaper. For instance, S3 can be more cost-effective because it eliminates the need for additional resources required for application-level replication. Application-level replication also requires ongoing maintenance, potentially increasing operational costs.
5.2 Data Sharding
Data distribution across the cluster is determined by the shard_num setting for a group and the specified entity in each resource, whether it is a stream, measure or trace. The combination of the resource’s name and its entity forms the sharding key, which guides data distribution to the appropriate Data Node during write operations.
For example, suppose the group measure-minute has 2 shards and the group stream-log has 3 shards. The data is distributed across these 5 shards based on the sharding key:
- Group measure-minute, Shard 0
- Group measure-minute, Shard 1
- Group stream-log, Shard 0
- Group stream-log, Shard 1
- Group stream-log, Shard 2
A measure named service_cpm belonging to measure-minute with an entity of service_id “frontend” will be written to a specific shard based on the hashed value of the sharding key service_cpm:frontend.
Similarly, a stream named system_log belonging to stream-log with an entity combination of service_id “frontend” and instance_id “10.0.0.1” will be written to a specific shard based on the hashed value of the sharding key system_log:frontend|10.0.0.1.
Note: If there are “:” or “|” in the entity, they will be prefixed with a backslash “\”.
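The key construction and escaping described above can be sketched as follows. The FNV-1a hash used here is an illustrative stand-in; the source does not specify which hash function BanyanDB applies to the sharding key.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// escape prefixes ":" and "|" inside an entity value with a backslash,
// so the separators in the sharding key stay unambiguous.
func escape(v string) string {
	v = strings.ReplaceAll(v, ":", `\:`)
	return strings.ReplaceAll(v, "|", `\|`)
}

// shardingKey joins the resource name and its entity values into
// <name>:<entity1>|<entity2>|...
func shardingKey(name string, entity []string) string {
	parts := make([]string, len(entity))
	for i, e := range entity {
		parts[i] = escape(e)
	}
	return name + ":" + strings.Join(parts, "|")
}

// shardFor hashes the sharding key into one of shardNum shards.
// FNV-1a is a stand-in for whatever hash BanyanDB actually uses.
func shardFor(key string, shardNum uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % shardNum
}

func main() {
	key := shardingKey("system_log", []string{"frontend", "10.0.0.1"})
	fmt.Println(key) // system_log:frontend|10.0.0.1
	fmt.Println("shard:", shardFor(key, 3))
}
```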
Liaison Nodes play a crucial role in this process by retrieving Group configurations and Data Node information from the schema registry. Using this metadata, Liaison Nodes dynamically calculate shard-to-node assignments using a deterministic round-robin algorithm.
This sharding strategy ensures that the write load is evenly distributed across the cluster, thereby enhancing write performance and overall system efficiency. BanyanDB sorts the shards by the Group name and the shard ID, then calculates node assignments using the formula: node = (shard_index + replica_id) % node_count. This deterministic calculation ensures consistent routing: the same shard always maps to the same nodes as long as the node list remains unchanged. When nodes are added or removed, assignments are automatically recalculated, eliminating the need to maintain explicit shard allocation mappings.
For example, consider the two groups above (5 shards in total) and a cluster with 3 Data Nodes. The shards are distributed as follows:
- Group measure-minute, Shard 0: Data Node 1
- Group measure-minute, Shard 1: Data Node 2
- Group stream-log, Shard 0: Data Node 3
- Group stream-log, Shard 1: Data Node 1
- Group stream-log, Shard 2: Data Node 2
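The distribution above can be reproduced with a short sketch of the deterministic round-robin calculation: sort the shards by (group name, shard id), then apply node = (shard_index + replica_id) % node_count. The function signature is illustrative, not BanyanDB's internal API.

```go
package main

import (
	"fmt"
	"sort"
)

// assign reproduces the deterministic round-robin mapping: shards are sorted
// by (group name, shard id), then node = (shard_index + replicaID) % nodeCount.
// It returns a map from "<group>/shard-<id>" to a zero-based node index.
func assign(groups map[string]uint32, replicaID, nodeCount int) map[string]int {
	type shard struct {
		group string
		id    uint32
	}
	var shards []shard
	for g, n := range groups {
		for i := uint32(0); i < n; i++ {
			shards = append(shards, shard{g, i})
		}
	}
	// Sort by group name, then shard id, so every Liaison computes
	// the same order and therefore the same assignments.
	sort.Slice(shards, func(i, j int) bool {
		if shards[i].group != shards[j].group {
			return shards[i].group < shards[j].group
		}
		return shards[i].id < shards[j].id
	})
	out := make(map[string]int)
	for idx, s := range shards {
		out[fmt.Sprintf("%s/shard-%d", s.group, s.id)] = (idx + replicaID) % nodeCount
	}
	return out
}

func main() {
	m := assign(map[string]uint32{"measure-minute": 2, "stream-log": 3}, 0, 3)
	for key, node := range m {
		fmt.Printf("%s -> Data Node %d\n", key, node+1)
	}
}
```

Running this for 5 shards over 3 nodes yields exactly the mapping listed above (measure-minute shard 0 on Data Node 1, stream-log shard 2 on Data Node 2, and so on), with no stored mapping table.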
5.3 Data Write Path
Here’s a text-based diagram illustrating the data write path in BanyanDB:
User
|
| API Request (Write)
|
v
------------------------------------
| Liaison Node | <--- Stateless Node, Routes Request
| (Identifies relevant Data Nodes |
| and dispatches write request) |
------------------------------------
|
v
----------------- ----------------- -----------------
| Data Node 1 | | Data Node 2 | | Data Node 3 |
| (Shard 1) | | (Shard 2) | | (Shard 3) |
----------------- ----------------- -----------------
- A user makes an API request to the Liaison Node. This request is a write request, containing the data to be written to the database.
- The Liaison Node, which is stateless, identifies the relevant Data Nodes that will store the data based on the entity specified in the request.
- The write request is executed across the identified Data Nodes. Each Data Node writes the data to its shard.
This architecture allows BanyanDB to execute write requests efficiently across a distributed system, leveraging the stateless nature and routing/writing capabilities of the Liaison Node, and the distributed storage of Data Nodes.
6. Queries in a Cluster
BanyanDB utilizes a distributed architecture that allows for efficient query processing. When a query is made, it is directed to a Liaison Node.
6.1 Query Routing
Liaison Nodes do not use shard-mapping information from the schema registry to execute distributed queries. Instead, they query all Data Nodes to retrieve the necessary data. Because the query load is comparatively low, it is practical for Liaison Nodes to fan out to all Data Nodes; this may increase network traffic, but it simplifies scaling out the cluster.
Compared to the write load, the query load is relatively low. For instance, in a time series database, the write load is typically 100x higher than the query load. This is because the write load is driven by the number of devices sending data to the database, while the query load is driven by the number of users accessing the data.
This strategy simplifies scaling out the cluster: when new Data Nodes join, Liaison Nodes continue to query all Data Nodes without any mapping changes, eliminating the need to preserve previous shard-mapping information.
6.2 Query Execution
Parallel execution significantly enhances the efficiency of data retrieval and reduces the overall query processing time. It allows for faster response times as the workload of the query is shared across multiple shards, each working on their part of the problem simultaneously. This feature makes BanyanDB particularly effective for large-scale data analysis tasks.
In summary, BanyanDB’s approach to querying leverages its unique distributed architecture, enabling high-performance data retrieval across multiple shards in parallel.
6.3 Query Path
User
|
| API Request (Query)
|
v
------------------------------------
| Liaison Node | <--- Stateless Node, Distributes Query
| (Access all Data nodes to |
| execute distributed queries) |
------------------------------------
| | |
v v v
----------------- ----------------- -----------------
| Data Node 1 | | Data Node 2 | | Data Node 3 |
| (Shard 1) | | (Shard 2) | | (Shard 3) |
----------------- ----------------- -----------------
- A user makes an API request to the Liaison Node. This request may be a query for specific data.
- The Liaison Node builds a distributed query to select all data nodes.
- The query is executed in parallel across all Data Nodes. Each Data Node executes a local query plan to process the data stored in its shard concurrently with the others.
- The results from each shard are then returned to the Liaison Node, which consolidates them into a single response to the user.
This architecture allows BanyanDB to execute queries efficiently across a distributed system, leveraging the distributed query capabilities of the Liaison Node and the parallel processing of Data Nodes.
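The scatter-gather flow above can be sketched with goroutines: one per Data Node running a local filter, with the Liaison consolidating the per-shard results. The in-memory shards and string matching stand in for real Data Nodes and query plans.

```go
package main

import (
	"fmt"
	"sync"
)

// queryAll fans a query out to every Data Node in parallel and consolidates
// the per-shard results on the Liaison side.
func queryAll(shards [][]string, match string) []string {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged []string
	)
	for _, shard := range shards {
		wg.Add(1)
		go func(data []string) { // one worker per Data Node
			defer wg.Done()
			var local []string // local query plan: filter this shard's data
			for _, row := range data {
				if row == match {
					local = append(local, row)
				}
			}
			mu.Lock()
			merged = append(merged, local...) // consolidate at the Liaison
			mu.Unlock()
		}(shard)
	}
	wg.Wait()
	return merged
}

func main() {
	shards := [][]string{{"a", "b"}, {"b", "c"}, {"b"}}
	fmt.Println(len(queryAll(shards, "b"))) // 3 matches gathered from 3 shards
}
```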
7. Failover
BanyanDB is designed to be highly available and fault-tolerant.
In case of a Data Node failure, the system can automatically recover and continue to operate.
Liaison Nodes have a built-in mechanism to detect the failure of a Data Node. When a Data Node fails, the Liaison Node automatically routes requests to other available Data Nodes holding the same shard. This ensures that the system remains operational even in the face of node failures. Because the query mode allows Liaison Nodes to access all Data Nodes, the system can continue to function even if some Data Nodes are unavailable. When the failed Data Nodes are restored, the system does not replay data to them, since the data can still be retrieved from other nodes.
In the case of a Liaison Node failure, the system can be configured to have multiple Liaison Nodes for redundancy. If one Liaison Node fails, the other Liaison Nodes can take over its responsibilities, ensuring that the system remains available.
Please note that any write request that triggers the failover process will be rejected; the client should re-send the request.
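Since a write that hits the failover window is rejected rather than queued, clients should be prepared to re-send it. The sketch below shows one way a client might do that with a simple bounded retry; the error value and the write function are stand-ins, not part of the BanyanDB client API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errFailover models the rejection a write receives while failover is in progress.
var errFailover = errors.New("write rejected: failover in progress")

// writeWithRetry re-sends a rejected write up to `attempts` times, sleeping
// `backoff` between tries, as the documentation advises clients to do.
func writeWithRetry(write func() error, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = write(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return err
}

func main() {
	calls := 0
	err := writeWithRetry(func() error {
		calls++
		if calls < 3 { // the first two attempts hit the failover window
			return errFailover
		}
		return nil
	}, 5, time.Millisecond)
	fmt.Println(err == nil, calls) // true 3
}
```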