# Property Background Repair Observability
Based on the Property Background Repair Strategy documentation, this article explains how to visualize and monitor each synchronization cycle to enhance observability and debugging.
This feature is enabled by default. You can control whether the data is recorded through the `--property-repair-obs-enabled` option.
## Tracing
Tracing is used to record the operation flow at each node during gossip-based Property repair. This allows for fast and accurate diagnosis in the event of issues or inconsistencies during the repair process.
All trace data is written to the `_property_repair_spans` stream group and can be queried for inspection and analysis.
| Tag Name | Type | Description |
|---|---|---|
| `tracing_id` | string | Unique trace ID for the entire repair task, typically `sender_node_id + start_time`. |
| `span_id` | int | Unique span ID within the trace. |
| `parent_span_id` | int | ID of the parent span (0 if root span). |
| `sender_node_id` | string | ID of the sender node where the repair was executed. |
| `current_node_id` | string | ID of the node where the span was executed. |
| `start_time` | int64 | Unix timestamp of span start (in milliseconds). |
| `end_time` | int64 | Unix timestamp of span end (in milliseconds). |
| `duration` | int64 | Total duration of the span, in milliseconds. |
| `message` | string | Descriptive log of the action performed in this span. |
| `tags_json` | string | JSON-formatted key-value tags attached to the span (e.g. `group`, `shard_id`). |
| `is_error` | bool | Whether this span encountered an error. |
| `error_reason` | string | Error message or failure reason if `is_error` is true. |
| `sequence_number` | int | The round number of the gossip propagation. |
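As a quick illustration of how these fields relate, the sketch below rebuilds the span tree of a single repair task and prints each span's duration. It assumes the spans have already been queried from the `_property_repair_spans` stream and decoded into plain dictionaries keyed by the tag names above; the `spans` sample data is hypothetical.

```python
from collections import defaultdict

# Hypothetical span records, as they might look after being decoded from
# the _property_repair_spans stream; field names follow the table above.
spans = [
    {"tracing_id": "node-1+1718000000000", "span_id": 1, "parent_span_id": 0,
     "current_node_id": "node-1", "duration": 120, "message": "repair round started",
     "is_error": False},
    {"tracing_id": "node-1+1718000000000", "span_id": 2, "parent_span_id": 1,
     "current_node_id": "node-2", "duration": 45, "message": "compare tree summary",
     "is_error": False},
    {"tracing_id": "node-1+1718000000000", "span_id": 3, "parent_span_id": 1,
     "current_node_id": "node-2", "duration": 30, "message": "update property",
     "is_error": True},
]

def print_trace_tree(spans):
    """Group spans by parent_span_id and print them as an indented tree."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_span_id"]].append(span)

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s["span_id"]):
            flag = " [ERROR]" if span["is_error"] else ""
            print(f"{'  ' * depth}- {span['message']} "
                  f"(node={span['current_node_id']}, {span['duration']} ms){flag}")
            walk(span["span_id"], depth + 1)

    walk(0, 0)  # root spans have parent_span_id == 0

print_trace_tree(spans)
```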
All Property-related metadata is encapsulated within the `tags_json` field of each trace span.
This allows for flexible and structured logging of contextual information. The following are key fields commonly included in `tags_json`:
| Tag Name | Description |
|---|---|
| `target_node` | The ID of the target node involved in the current operation. |
| `group_name` | The name of the Property group being synchronized. |
| `shard_id` | The shard identifier being processed. |
| `operate_type` | The type of operation being performed, such as `"send_summary"`, `"compare_leaf"`, or `"update_property"`. |
| `property_id` | The identifier of the specific Property being updated or compared. |
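Because `tags_json` is stored as a JSON string, it must be decoded before its fields can be filtered on. The sketch below shows one possible way to group spans by operation type and collect failed operations per Property group; the `spans` input is a hypothetical list of decoded span records like the one shown earlier.

```python
import json
from collections import Counter

def summarize_operations(spans):
    """Decode tags_json on each span and count spans per (group_name, operate_type)."""
    op_counts = Counter()
    failures = []
    for span in spans:
        tags = json.loads(span.get("tags_json", "{}"))
        key = (tags.get("group_name", "unknown"), tags.get("operate_type", "unknown"))
        op_counts[key] += 1
        if span.get("is_error"):
            failures.append((tags.get("group_name"), tags.get("property_id"),
                             span.get("error_reason")))
    return op_counts, failures

# Hypothetical usage with spans queried from _property_repair_spans:
# op_counts, failures = summarize_operations(spans)
# for (group, op), n in op_counts.items():
#     print(f"{group}/{op}: {n} spans")
```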
## Data TTL
By default, the system automatically retains all background repair records for three days.
This retention period can be configured via the `--property-repair-history-days` option.
## Metrics
All metrics are reported through the internal self-observation system and can be queried using standard tools as described in the Observability documentation.
In the context of Property background repair, the following key metrics are exposed:
| Metric Name | Type | Description |
|---|---|---|
| `property_repair_success_count` | Counter | Total number of Properties successfully repaired across all nodes. |
| `property_repair_failure_count` | Counter | Total number of Properties that failed to repair due to validation, write errors, or version conflicts. |
| `property_repair_gossip_abort_count` | Counter | Total number of gossip repair sessions that were forcefully aborted due to unrecoverable errors or unavailable peers. |
| `property_repair_finished_count` | Counter | Total number of times the repair process was triggered (either scheduled or event-based). |
| `property_repair_finished_latency` | Counter | Total latency of the whole Property repair process. |
| `property_repair_per_node_sync_finished` | Counter | The number of completed synchronization operations executed by each node during Property repair. |
| `property_repair_per_node_sync_latency` | Counter | The latency, in seconds, of completed synchronization operations executed by each node during Property repair. |
| `property_repair_total_propagation_count` | Counter | Total number of successful propagations across all nodes. |
| `property_repair_total_propagation_percent` | Histogram | The percentage of propagation achieved in each round, calculated as `total_propagation_count / max_propagation_count`. |