Celestia Testnet Log Analysis
Source code used for this analysis: celestia-log-analysis
Introduction:
Analyzing logs is crucial for understanding the behavior and performance of any node, be it a validator node or a data-availability node. Log data provides detailed insight into system performance metrics, error occurrences, and other significant events. However, making sense of this raw data can be a complex task. That's where data visualization steps in, transforming raw log data into meaningful, digestible insights. In this post, we'll walk you through our log analysis process, explaining how we leveraged Python to visualize the data and uncover key patterns and correlations.
Agenda:
The main objective of our analysis is to investigate two primary aspects of the node's performance: the time taken for sampling headers and the occurrence of errors. We want to understand how these metrics behave over time and whether there is any correlation between them. To do this, we'll visualize the data through different types of plots, each providing a unique perspective on the data.
Before we delve into the visualizations, let's understand how we extracted the necessary data from the log files:
Data Extraction from Log Files:
We started with raw log files, each line containing a wealth of information. The log line we were particularly interested in looked like this:
2023-05-13T00:00:12.575Z [INF] [headerfs] finished sampling headers, "from": 472391, "to": 472391, "errors": 0, "time_taken": 0.31604745
This line indicates the timestamp, the header range sampled ("from" and "to"), the number of errors that occurred during the operation, and the time taken for the operation.
We used Python's built-in re library to extract this data using regular expressions. The extracted data was then stored in CSV format, ready for analysis. Here's how the data looked after the transformation:
timestamp,from_height,to_height,errors,time_taken
2023-05-13T00:00:12.575Z,472391,472391,0,0.31604745
2023-05-13T00:00:22.960Z,280502,280601,0,4200.943984111
...
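For reference, here is a minimal sketch of that extraction step, assuming the log lines follow the format shown above; the file names and the exact regular expression are illustrative rather than the precise pattern we used:

import csv
import re

# Matches the timestamp and the key/value fields of a "finished sampling headers" line.
LINE_RE = re.compile(
    r'^(?P<timestamp>\S+)\s+\[INF\].*finished sampling headers, '
    r'"from": (?P<from_height>\d+), "to": (?P<to_height>\d+), '
    r'"errors": (?P<errors>\d+), "time_taken": (?P<time_taken>[\d.]+)'
)

def logs_to_csv(log_path, csv_path):
    """Extract sampling records from a raw log file into a CSV."""
    with open(log_path) as log_file, open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["timestamp", "from_height", "to_height", "errors", "time_taken"])
        for line in log_file:
            match = LINE_RE.search(line)
            if match:
                writer.writerow([
                    match["timestamp"],
                    match["from_height"],
                    match["to_height"],
                    match["errors"],
                    match["time_taken"],
                ])

logs_to_csv("node.log", "sampling.csv")  # hypothetical file names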
Now, let's jump into the analysis and visualizations (code sketches for the plots follow the list):
- Line Plot of Time Taken for Sampling Headers Over Time:
The first visualization shows how the time taken for sampling headers changes over time. We plotted a line graph with timestamps on the x-axis and the time taken for sampling headers on the y-axis. This visualization helps identify trends and spot potential irregularities in system performance over time.
- Line Plot of Time Taken for Sampling Headers Over Time (Smoothed and Outlier Removed):
While the first plot provides a general overview, it can be influenced by outliers and abrupt changes. Therefore, we created a smoothed version of the graph, minimizing the impact of noise and outliers to highlight the underlying trend better. We utilized the Savitzky-Golay filter from the SciPy library for smoothing the graph.
- Histogram of Errors:
To understand the distribution of errors in our system, we plotted a histogram. This plot helps identify the most frequent error counts and any unusual occurrences, providing insight into how often the node's sampling runs encounter errors.
- Histogram of Time Taken:
Similar to the error histogram, this visualization represents the distribution of time taken for operations. This plot aids in understanding the most common time durations and spotting any outliers, thus providing insights into the node's efficiency.
- Scatter Plot of Sampling Time vs. Errors:
In our quest to find a correlation between the time taken for sampling and the number of errors, we plotted a scatter plot. This plot helps identify if there's any relationship between these two variables, i.e., whether an increase in the time taken for sampling corresponds to an increase or decrease in the number of errors. Such an analysis assists in spotting potential bottlenecks and inefficiencies in the node's operation.
- Heatmap of Errors by Hour of the Day:
Our final visualization is a heatmap that illustrates how the number of errors varies by the hour of the day. This plot is crucial for detecting patterns in errors based on the time of day, such as whether certain hours are more prone to errors than others. This information can be instrumental for system maintenance and downtime planning. To create it, we added an 'hour' column to our DataFrame and used it along with the 'errors' column to build a pivot table, which was then visualized as a heatmap (a sketch of this step appears after the list).
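To illustrate the first two plots, here is a minimal sketch using pandas, Matplotlib, and SciPy's savgol_filter, assuming the CSV produced above; the outlier cutoff, window length, and polynomial order are illustrative values, not the exact parameters we used:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

df = pd.read_csv("sampling.csv", parse_dates=["timestamp"]).sort_values("timestamp")

# Raw time series of sampling duration.
plt.figure(figsize=(12, 4))
plt.plot(df["timestamp"], df["time_taken"], linewidth=0.8)
plt.xlabel("Timestamp")
plt.ylabel("Time taken")
plt.title("Time taken for sampling headers over time")
plt.tight_layout()
plt.show()

# Smoothed version: trim extreme outliers (here, above the 99th percentile)
# and apply a Savitzky-Golay filter; window length and polynomial order are illustrative.
trimmed = df[df["time_taken"] <= df["time_taken"].quantile(0.99)]
smoothed = savgol_filter(trimmed["time_taken"], window_length=51, polyorder=3)

plt.figure(figsize=(12, 4))
plt.plot(trimmed["timestamp"], smoothed, color="tab:orange")
plt.xlabel("Timestamp")
plt.ylabel("Time taken (smoothed)")
plt.title("Smoothed time taken for sampling headers")
plt.tight_layout()
plt.show()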
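The two histograms and the scatter plot can be produced along these lines; the bin counts and figure layout are again illustrative:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sampling.csv", parse_dates=["timestamp"])

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution of error counts per sampling run.
axes[0].hist(df["errors"], bins=20)
axes[0].set_xlabel("Errors per run")
axes[0].set_ylabel("Frequency")
axes[0].set_title("Histogram of errors")

# Distribution of time taken per sampling run.
axes[1].hist(df["time_taken"], bins=50)
axes[1].set_xlabel("Time taken")
axes[1].set_ylabel("Frequency")
axes[1].set_title("Histogram of time taken")

# Relationship between sampling time and error count.
axes[2].scatter(df["time_taken"], df["errors"], s=10, alpha=0.5)
axes[2].set_xlabel("Time taken")
axes[2].set_ylabel("Errors")
axes[2].set_title("Sampling time vs. errors")

plt.tight_layout()
plt.show()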
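Finally, a sketch of the hour-of-day heatmap, assuming seaborn for the rendering; pivoting dates against hours is one reasonable layout, not necessarily the exact one we used:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sampling.csv", parse_dates=["timestamp"])

# Derive date and hour-of-day columns from the timestamp.
df["date"] = df["timestamp"].dt.date
df["hour"] = df["timestamp"].dt.hour

# Pivot: one row per day, one column per hour, cells hold the total error count.
errors_by_hour = df.pivot_table(index="date", columns="hour", values="errors", aggfunc="sum")

plt.figure(figsize=(12, 4))
sns.heatmap(errors_by_hour, cmap="Reds", linewidths=0.3)
plt.xlabel("Hour of day")
plt.ylabel("Date")
plt.title("Errors by hour of the day")
plt.tight_layout()
plt.show()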
Conclusion:
Through these visualizations, we aimed to provide a comprehensive overview of the node's performance, focusing on the time taken for sampling headers and the occurrence of errors. Each plot offered unique insights, and collectively they provided a holistic view of the node's behavior over time. This analysis can be an effective tool for performance tuning, troubleshooting, and proactive maintenance of the node.
Remember, the key to efficient system performance lies not just in generating logs but in effectively analyzing them to drive actionable insights. As the saying goes, "What gets measured, gets managed!"
We hope you found this guide useful and that it provides a good starting point for you to perform your own log analysis. Happy analyzing!