Ivan Drobyshev, Business Data Analyst, G&L
Media content delivery generates a lot of logs. This is a fact well understood at G&L, since we facilitate the distribution of audio and video content, live and on-demand, for some major broadcasters and official bodies to end users. We know well that log data has no lesser commercial value than the content itself. Log misdelivery can lead to short-term profit losses for streaming and broadcasting service providers. These issues can affect advertising exposure assessment, long-term planning, and more. Providing accurate data and analytics alongside our core services is our dedication, duty, bread and butter.
Maintaining log consistency stands as a critical task. With log sizes fluctuating, especially during soccer tournaments or any other major events, distinguishing between normal situations and incidents, such as indexing errors or misdelivery, becomes crucial.
A sizeable share of the CDN logs generated while distributing our customers’ content is indexed in Elasticsearch 8.6.0. It incorporates two built-in features that hold promise for identifying consistency issues: Anomaly Detection (to identify an incident) and a classification model. We assessed both to determine their suitability for our customers’ needs.
First one out
Anomaly Detection didn’t make the cut due to its purely analytical (not predictive) purpose: to gain insights into the overall past picture. When configured properly, it seems to be “more focused” on sudden drops in figures rather than peaks, which is exactly what we need. Yet, the method lacked sensitivity: it identified discrete data points, not periods.
The Single Metric Viewer displaying the results of Elastic Anomaly Detector job. Only the central portion is related to an actual incident!
So, the Anomaly Detection tool from Elasticsearch may hint that something abnormal happened at a specific moment. Whether the hint is genuine, and how long the incident lasted, is left to a manual check.
Machine Learning with Elasticsearch
The built-in ML functionality’s documentation doesn’t exactly unfold the red carpet of clarity, but at least it drops the model type hint: a gradient-boosted decision tree ensemble. That’s the only explanation we get. Feature rescaling? Class weighting? Who knows? Ah, the mysteries of proprietary magic! We do get a warning “…to avoid highly imbalanced situations”, however. That’s our situation with an obnoxious 10 to 0.47 class-to-class ratio, so let’s keep this warning in mind.
Now, let’s get down to business – metrics business. We wanted to minimize the number of missed incidents while maximizing rightfully detected anomalies; other outcomes are slightly less relevant. So, we kept track of three metrics:
- ROC AUC score – a numerical whiz at quantifying the sensitivity-versus-noise dance.
- Accuracy – the share of correct predictions among all predictions.
- Recall – the ratio of correctly detected incidents to all actual incidents (detected plus missed), which answers the question “How many relevant items are retrieved?” (thanks, Wikipedia, for the question phrasing assist).
The ideal (and rarely obtainable) score of each metric is 1. In Data Science, anything close to the highest score possible raises an eyebrow and urges a scientist to double- and triple-check the results instead of celebrating them immediately.
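To make the three definitions concrete, here is a minimal pure-Python sketch. The labels and scores are invented toy values, not our log data, and the function names are ours for illustration. In practice the same figures come from `accuracy_score`, `recall_score`, and `roc_auc_score` in scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Detected incidents divided by all actual incidents."""
    return tp / (tp + fn)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in sample size, fine for a demo.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, hypothetical data: 95 normal points, 5 incident points.
y_true = [0] * 95 + [1] * 5
scores = [0.1] * 95 + [0.9, 0.8, 0.7, 0.4, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
rec = recall(tp, fn)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.2f} recall={rec:.2f} roc_auc={auc:.2f}")
```

With an imbalanced toy set like this one, accuracy stays high even though two of the five incidents are missed, which is why we did not rely on accuracy alone.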
And that’s what we had to do during our initial tests with the Elastic classification model. All three metrics returned results no less than 0.995! Okay, the dataset was well labeled. The values of features varied significantly between classes, making the classification task supposedly easy. There were some mispredicted classes, yet the overall results seemed TOO impressive.
When results are too promising
Log misdelivery/mis-indexing is a rare event, so no validation subsample was available to us. The only way to validate the results was to apply other strategies to the current data. Here is a brief overview of what we did to retest the findings:
- Removing multicollinearity by deleting correlated features.
- Undersampling the majority class to match the minority.
- Upsampling the minority class with synthetic records (the SMOTE technique) to two different values.
- Running each procedure in Elastic, and then doing the same using Python implementations of logistic regression (with feature rescaling), Random Forest classification, and alternative gradient boosting on decision trees from the CatBoost library. Class weighting was not forgotten either.
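Two of the rebalancing steps above can be sketched with the standard library alone. This is an illustration on invented data, not our pipeline: `undersample` and `balanced_class_weights` are hypothetical helper names, and the weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), the same one scikit-learn applies for `class_weight="balanced"`. SMOTE is left out because it needs a dedicated library such as imbalanced-learn.

```python
import random
from collections import Counter


def undersample(rows, labels, seed=0):
    """Randomly drop rows so every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy, hypothetical data: 950 normal records and 50 incident records.
labels = [0] * 950 + [1] * 50
rows = list(range(len(labels)))  # stand-ins for feature vectors

small_rows, small_labels = undersample(rows, labels)
weights = balanced_class_weights(labels)
print(Counter(small_labels), weights)
```

Undersampling throws data away, while class weights keep every record and instead make mistakes on the rare class more expensive during training. Trying both is a cheap way to check whether a suspiciously good score survives a change in class balance.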
Metrics and a confusion matrix from the Elastic classification model.
0% errors are not equal to 0 errors!
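One way a display can show 0% errors while errors exist is plain rounding of a row-normalised confusion matrix. The sketch below uses invented counts and a hypothetical `percent_matrix` helper. Whether Kibana rounds its figures in exactly this way is an assumption on our side.

```python
def percent_matrix(matrix):
    """Express each cell as a whole-number percentage of its row total."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([round(100 * cell / total) for cell in row])
    return result


# Hypothetical counts. Rows: actual class. Columns: predicted class.
counts = [
    [99_700, 300],  # actual normal:   predicted normal, predicted incident
    [20, 4_680],    # actual incident: predicted normal, predicted incident
]

percents = percent_matrix(counts)
errors = counts[0][1] + counts[1][0]
print(percents, "misclassified records:", errors)
```

Both off-diagonal cells round down to 0%, yet 320 records are misclassified, so the raw counts are worth reading alongside the percentages.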
The results obtained with standalone models were consistent with our prior findings from Elastic and sometimes even exceeded the latter. Some metrics went as low as 0.985, which is still incredible in the Data Science world. Yet we learned from class-balanced tests that such scores are achievable and aren’t necessarily a sign of the model being overfitted.
Observing worse confusion matrix figures and better metric values from the Elasticsearch tool was a bit confusing. The only plausible explanation is some tricky Elastic metric calculation procedure, whereas independent models were evaluated with tools from the open-source scikit-learn library widely accepted in the Data Science community.
So, is Elastic ML any good for incident detection?
Elasticsearch offers a good entry-level solution for classification problems. Its performance on well-labeled data is enough to identify events that span a continuous series of data points, a category that log misdelivery and indexing issues fall under. The user-friendly Kibana interface provides the means to get the model ready (train it) without extensive prior knowledge of ML libraries and environments (being familiar with general concepts might be handy, though). Deploying the model so that it detects incidents in real time requires a more in-depth understanding of Elastic operations. Nevertheless, those operations are managed from the same Kibana interface (or corresponding APIs), using the same stack: no additional software, plugins, scripts, etc.
What’s the catch, then? Isn’t that ideal? Well, let’s start with the fact that ML instruments are available only with “Platinum” and “Enterprise” subscriptions. Free license users must rely solely on standalone models with their own stack and resources.
Also, there’s a trade-off: lower performance. Yes, the metrics are incredible! Yet, confusion matrices show that native Elastic models may be more likely to give false alarms or miss an actual anomaly. Luckily, log misdelivery/mis-indexing is rarely a single discrete event. Its consecutive nature mitigates the risk of missing an incident: at least some of the abnormal values in a series will be detected.
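A back-of-the-envelope calculation shows why a consecutive incident is hard to miss entirely. The sketch assumes each abnormal point is detected independently with the same per-point recall. That assumption is optimistic, since errors on neighbouring points tend to be correlated, and the numbers are illustrative rather than measured.

```python
def chance_to_miss_whole_incident(per_point_recall, n_points):
    """Probability that none of n_points abnormal values is flagged.

    Assumes independent detections with identical per-point recall.
    """
    return (1 - per_point_recall) ** n_points


# Illustrative values only: a modest per-point recall of 0.9.
single = chance_to_miss_whole_incident(0.9, 1)
series = chance_to_miss_whole_incident(0.9, 5)
print(f"single point missed: {single:.3f}, five-point incident missed: {series:.6f}")
```

Under these assumptions, a one-point event is missed about one time in ten, while a five-point incident goes completely unnoticed roughly once in a hundred thousand cases.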
When there’s a need to identify more discrete standalone events that generate a single log message, one might want to rely on other tools independent from the Elastic stack. In our experience, CatBoostClassifier outperforms the alternatives, even without hyperparameter tuning. Still, any open-source library of choice offers more control and transparency, which proprietary solutions often lack.
What did we choose for our application? After giving it long thought, we went down a third road. We recently started a project to improve our log metadata collection pipelines. We know from first-hand experience that log misdelivery can be identified from this metadata. The tests we performed on Elastic built-in instruments and standalone models made it clear that implementing a single computation-costly feature would require some effort. So, we decided to take a whole other approach with log metadata. But that’s a story for another article.