Disk Management
BanyanDB provides comprehensive disk management capabilities to prevent disk space exhaustion and ensure optimal storage utilization. This includes both automatic forced retention cleanup and disk usage monitoring.
Overview
BanyanDB includes a disk monitor that can automatically trigger forced retention cleanup when disk usage exceeds configured thresholds. This feature helps prevent disk space exhaustion by automatically removing old data segments and snapshots.
Configuration
Data & Standalone Servers
The following flags are used to configure forced retention cleanup for data and standalone servers:
Measure Service
--measure-retention-high-watermark float
: Disk usage percentage that triggers forced retention cleanup (0-100, default: 95.0).--measure-retention-low-watermark float
: Disk usage percentage where forced retention cleanup stops (0-100, default: 85.0).--measure-retention-check-interval duration
: Interval for checking disk usage (default: 5m).--measure-retention-cooldown duration
: Cooldown period between forced segment deletions (default: 30s).--measure-retention-force-cleanup-enabled bool
: Enable forced retention cleanup when disk usage exceeds high watermark (default: false).
Stream Service
--stream-retention-high-watermark float
: Disk usage percentage that triggers forced retention cleanup (0-100, default: 95.0).--stream-retention-low-watermark float
: Disk usage percentage where forced retention cleanup stops (0-100, default: 85.0).--stream-retention-check-interval duration
: Interval for checking disk usage (default: 5m).--stream-retention-cooldown duration
: Cooldown period between forced segment deletions (default: 30s).--stream-retention-force-cleanup-enabled bool
: Enable forced retention cleanup when disk usage exceeds high watermark (default: false).
Trace Service
--trace-retention-high-watermark float
: Disk usage percentage that triggers forced retention cleanup (0-100, default: 95.0).--trace-retention-low-watermark float
: Disk usage percentage where forced retention cleanup stops (0-100, default: 85.0).--trace-retention-check-interval duration
: Interval for checking disk usage (default: 5m).--trace-retention-cooldown duration
: Cooldown period between forced segment deletions (default: 30s).--trace-retention-force-cleanup-enabled bool
: Enable forced retention cleanup when disk usage exceeds high watermark (default: false).
Liaison Servers
Liaison servers use the disk usage flags to manage their write queue and prevent disk space exhaustion:
--measure-max-disk-usage-percent int
: The maximum disk usage percentage allowed for measure data (0-100, default: 95).--stream-max-disk-usage-percent int
: The maximum disk usage percentage allowed for stream data (0-100, default: 95).--trace-max-disk-usage-percent int
: The maximum disk usage percentage allowed for trace data (0-100, default: 95).
Write Queue Mechanism: Liaison servers maintain a write queue that buffers incoming data before syncing it to data servers. When the queue fills up and disk usage exceeds the configured threshold, the liaison server throttles incoming writes to allow the sync process to catch up and free up disk space.
Property Service
--property-max-disk-usage-percent int
: The maximum disk usage percentage allowed (0-100, default: 95).
Property Service Behavior: The property service uses a similar disk management approach to liaison servers, where disk usage monitoring is handled at the write operation level rather than through forced cleanup.
How It Works
Data & Standalone Servers: Disk Management Process
When Force Cleanup is Enabled
- Monitoring: The disk monitor periodically checks disk usage on the service’s data path
- Trigger: When disk usage exceeds the high watermark, forced cleanup begins
- Cleanup: The system removes old data segments and snapshots in controlled steps
- Cooldown: A configurable cooldown period prevents thrashing between deletions
- Stop: Cleanup continues until disk usage falls below the low watermark
When Force Cleanup is Disabled (Default)
- Monitoring: The disk monitor periodically checks disk usage for metrics and monitoring only
- No Cleanup: No automatic segment or snapshot deletion occurs regardless of disk usage
- Write Throttling: Write operations are throttled/rejected when disk usage exceeds the high watermark
- Manual Management: Disk space management must be handled manually or through external processes
Results When High Watermark is Reached
When disk usage exceeds the high watermark threshold, the following actions occur:
Immediate Actions:
- Forced cleanup activation: The system sets
forced_retention_active
metric to 1 - Logging: An INFO-level log message is generated indicating disk usage above high watermark
- Metrics update:
forced_retention_runs_total
counter is incremented - Write throttling: New write requests may be throttled or rejected with
STATUS_DISK_FULL
error
Cleanup Process:
- Snapshot cleanup: First, old snapshots (older than 24 hours) are removed
- Disk usage recheck: After snapshot cleanup, disk usage is checked again
- Segment deletion: If still above low watermark, the oldest data segment is deleted
- Iterative process: The process repeats with cooldown periods between deletions
- Completion: Cleanup stops when disk usage falls below the low watermark
Data Preservation:
- Snapshots newer than 24 hours are always preserved
- Only one segment is deleted per iteration to maintain system stability
- The system follows an oldest-first deletion strategy across all groups
Liaison Servers: Write Queue Management
Liaison servers handle disk management differently due to their role as data coordinators:
- Write Queue: Incoming data is buffered in a write queue before being synced to data servers
- Queue Monitoring: The system monitors both disk usage and queue capacity
- Write Throttling: When disk usage exceeds the threshold, incoming writes are throttled
- Sync Process: The throttling allows the sync process to catch up and transfer data to data servers
- Queue Drainage: As data is synced and removed from the queue, disk space is freed up
- Write Resumption: Once disk usage drops below the threshold, normal write operations resume
Results When Max Disk Usage Percent is Reached
When disk usage exceeds the configured max disk usage percentage threshold, the following actions occur:
Immediate Actions:
- Health check failure: The service’s
CheckHealth()
method returnsSTATUS_DISK_FULL
error - Write rejection: All incoming write requests are rejected with
STATUS_DISK_FULL
status - Logging: A WARN-level log message is generated indicating disk usage is too high
- Service status: The service becomes read-only until disk usage decreases
Error Response Details:
- Status Code:
STATUS_DISK_FULL
(status code 6) - Error Message: “disk usage is too high, stop writing”
- Client Impact: Write operations fail immediately with the disk full error
- Recovery: Writes resume automatically when disk usage drops below the threshold
Queue Behavior:
- Write Queue: Data already in the queue continues to be processed and synced
- New Writes: New write requests are rejected at the health check level
- Sync Process: Continues to drain the queue and sync data to data servers
Property Service: Write Operation Management
The property service handles disk management through write operation health checks rather than forced cleanup:
- Health Check: Property update operations check disk usage before processing
- Write Rejection: When disk usage exceeds the threshold, property updates are rejected
- Read Operations: Query and delete operations continue to work normally
- No Cleanup: The property service does not perform automatic data cleanup
Results When Max Disk Usage Percent is Reached (Property Service)
When disk usage exceeds the configured max disk usage percentage threshold, the following actions occur:
Immediate Actions:
- Health check failure: Property update operations'
CheckHealth()
method returnsSTATUS_DISK_FULL
error - Update rejection: All property update requests are rejected with
STATUS_DISK_FULL
status - Logging: A WARN-level log message is generated indicating disk usage is too high
- Service status: Property update operations become read-only until disk usage decreases
Error Response Details:
- Status Code:
STATUS_DISK_FULL
(status code 6) - Error Message: “disk usage is too high, stop writing”
- Client Impact: Property update operations fail immediately with the disk full error
- Recovery: Property updates resume automatically when disk usage drops below the threshold
Operation Behavior:
- Property Updates: Rejected at the health check level when disk usage is too high
- Property Queries: Continue to work normally (no disk usage check)
- Property Deletes: Continue to work normally (no disk usage check)
- Snapshots: Continue to work normally (no disk usage check)
- Repair Operations: Continue to work normally (no disk usage check)
Data Preservation
- Snapshots newer than 24 hours are always preserved during cleanup
- The system follows an oldest-first deletion strategy across all groups
- Only one segment is deleted per iteration to maintain system stability
- Liaison servers preserve data integrity by ensuring all queued data is synced before cleanup
Important Considerations
Data & Standalone Servers
- The high watermark must be greater than the low watermark for each service
- When disk usage exceeds the high watermark, the service will start forced cleanup and may throttle writes
- The disk monitor measures usage on the service’s data path only
- Snapshots newer than 24 hours are always preserved during cleanup
Liaison Servers
- Write throttling occurs when disk usage exceeds the configured threshold
- The throttling mechanism allows the sync process to catch up with incoming data
- Liaison servers do not perform automatic data cleanup - they rely on data servers for that
- The write queue acts as a buffer to handle temporary spikes in data ingestion
- Proper tuning of the disk usage threshold is crucial for optimal performance
Monitoring and Metrics
The disk monitor exposes the following metrics for each service (measure, stream, trace) under the storage.retention.{service}
namespace:
forced_retention_active{service}
(gauge): Whether forced retention cleanup is currently active (1 = active, 0 = inactive)forced_retention_runs_total{service}
(counter): Total number of forced retention cleanup runsforced_retention_segments_deleted_total{service}
(counter): Total number of segments deleted during forced retention cleanupforced_retention_last_run_seconds{service}
(gauge): Timestamp of the last forced retention cleanup runforced_retention_cooldown_seconds{service}
(gauge): Cooldown period between forced segment deletionsdisk_usage_percent{service}
(gauge): Current disk usage percentage for the servicesnapshots_deleted_total{service}
(counter): Total number of snapshots deleted during cleanup
Configuration Examples
Standalone/Data Server Configuration
With Force Cleanup Enabled
banyand standalone \
--measure-retention-high-watermark=90.0 \
--measure-retention-low-watermark=80.0 \
--measure-retention-check-interval=2m \
--measure-retention-force-cleanup-enabled=true \
--stream-retention-high-watermark=85.0 \
--stream-retention-low-watermark=75.0 \
--stream-retention-force-cleanup-enabled=true \
--trace-retention-high-watermark=95.0 \
--trace-retention-low-watermark=85.0 \
--trace-retention-force-cleanup-enabled=true
With Force Cleanup Disabled (Default - Write Throttling Only)
banyand standalone \
--measure-retention-high-watermark=90.0 \
--stream-retention-high-watermark=85.0 \
--trace-retention-high-watermark=95.0
# Force cleanup flags omitted = disabled by default
# Only write throttling occurs when watermarks are exceeded
Liaison Server Configuration
banyand liaison \
--measure-max-disk-usage-percent=90 \
--stream-max-disk-usage-percent=85 \
--trace-max-disk-usage-percent=95
Property Service Configuration
banyand data \
--property-max-disk-usage-percent=90
Environment Variables
# For standalone/data servers with force cleanup enabled
export BYDB_MEASURE_RETENTION_HIGH_WATERMARK=90.0
export BYDB_MEASURE_RETENTION_LOW_WATERMARK=80.0
export BYDB_MEASURE_RETENTION_FORCE_CLEANUP_ENABLED=true
export BYDB_STREAM_RETENTION_HIGH_WATERMARK=85.0
export BYDB_STREAM_RETENTION_LOW_WATERMARK=75.0
export BYDB_STREAM_RETENTION_FORCE_CLEANUP_ENABLED=true
export BYDB_TRACE_RETENTION_HIGH_WATERMARK=95.0
export BYDB_TRACE_RETENTION_LOW_WATERMARK=85.0
export BYDB_TRACE_RETENTION_FORCE_CLEANUP_ENABLED=true
# For standalone/data servers with force cleanup disabled (default)
export BYDB_MEASURE_RETENTION_HIGH_WATERMARK=90.0
export BYDB_STREAM_RETENTION_HIGH_WATERMARK=85.0
export BYDB_TRACE_RETENTION_HIGH_WATERMARK=95.0
# Force cleanup environment variables omitted = disabled by default
# For liaison servers
export BYDB_MEASURE_MAX_DISK_USAGE_PERCENT=90
export BYDB_STREAM_MAX_DISK_USAGE_PERCENT=85
export BYDB_TRACE_MAX_DISK_USAGE_PERCENT=95
# For property service
export BYDB_PROPERTY_MAX_DISK_USAGE_PERCENT=90
Troubleshooting
High Disk Usage
If you’re experiencing high disk usage:
- Check the current disk usage metrics:
disk_usage_percent{service}
- Verify that forced retention is active:
forced_retention_active{service}
- Monitor cleanup progress:
forced_retention_segments_deleted_total{service}
- Adjust watermarks if needed to trigger cleanup earlier
Write Throttling
If writes are being throttled:
- For data/standalone servers: Check if forced retention is running and wait for cleanup to complete
- For liaison servers:
- Check if the write queue is full and sync process is behind
- Monitor the sync process performance to data servers
- Consider reducing the
*-max-disk-usage-percent
value to trigger throttling earlier - Ensure data servers have sufficient capacity to receive synced data
- Monitor the
forced_retention_active
metric to see if cleanup is in progress
Liaison Server Specific Issues
If liaison servers are experiencing persistent write throttling:
- Sync Process Bottleneck: Check if data servers are overloaded or have high disk usage
- Network Issues: Verify network connectivity between liaison and data servers
- Queue Size: Consider increasing write queue capacity if temporary spikes are common
- Threshold Tuning: Lower the disk usage threshold to provide more buffer space
Property Service Specific Issues
If property service is experiencing persistent write rejection:
- Disk Space: Check available disk space on the property service data path
- Threshold Tuning: Lower the
property-max-disk-usage-percent
value to provide more buffer space - Data Cleanup: Consider manually cleaning up old property data if automatic cleanup is not available
- Read Operations: Note that property queries and deletes continue to work even when updates are rejected
Configuration Validation
Ensure your configuration is valid:
- High watermark > Low watermark (for data/standalone servers)
- All percentage values are between 0 and 100
- Check interval and cooldown are positive durations
- Liaison server thresholds should account for expected queue size and sync latency
- Property service threshold should account for expected property data growth and available disk space