Analyzing a Sample Twitter Volume Dataset
Let's begin by loading and visualizing a sample Twitter volume dataset for Apple:
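A minimal sketch of this step, assuming the data comes as a CSV with `timestamp` and `value` columns (as in the NAB `Twitter_volume_AAPL.csv` file; the file name and column names are assumptions, so adjust them to your copy):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names; adjust to your copy of the data.
df = pd.read_csv("Twitter_volume_AAPL.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").rename(columns={"value": "volume"})

# Raw series on top, log scale below to expose the daily cycle.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
df["volume"].plot(ax=ax1, title="Twitter volume (AAPL)")
df["volume"].plot(ax=ax2, logy=True, title="Twitter volume (log scale)")
plt.tight_layout()
plt.show()
```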
From this plot, we can see that there are a number of spikes (anomalies) in our data. These spikes in volume are the ones we want to identify.
Looking at the second plot (log scale), we can see that the Twitter volume data shows a clear daily cycle, with higher activity during the day and lower activity at night. This seasonal pattern is common in social media data, as it reflects the day-night activity of users. The data also exhibits a weekly seasonality, but we'll ignore it here.
Removing Seasonal Trends
We want to make sure this cycle doesn't interfere with our conclusions, so we'll remove it by performing a seasonal decomposition.
First, we'll calculate the moving average (MA) of the volume, which captures the trend. Then, we'll compute the ratio of the observed volume to the MA, which gives us the multiplicative seasonal effect.
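A sketch of these two steps, continuing from the `df` frame above and assuming 5-minute sampling (so one day is roughly 288 observations; adjust the window to your sampling frequency):

```python
# One day of 5-minute observations (assumed sampling frequency).
PERIODS_PER_DAY = 288

# Centered moving average over one day captures the trend.
df["ma"] = df["volume"].rolling(PERIODS_PER_DAY, center=True).mean()

# Observed-to-trend ratio: the multiplicative seasonal effect.
df["ratio"] = df["volume"] / df["ma"]

# Average the ratio by time of day to get the daily seasonal profile.
seasonal_profile = df["ratio"].groupby(df.index.time).mean()
```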
As expected, the seasonal component follows a day/night cycle, peaking during daytime hours and bottoming out at night.
To proceed with the decomposition, we need to calculate the expected value of the volume given the multiplicative seasonal effect found above.
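Continuing the sketch, the expected volume is the trend times the seasonal factor for the corresponding time of day:

```python
# Map each timestamp to its time-of-day seasonal factor, then combine
# with the trend: expected = trend * seasonal factor.
df["seasonal"] = pd.Series(df.index.time, index=df.index).map(seasonal_profile)
df["expected"] = df["ma"] * df["seasonal"]
```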
Analyzing Residuals and Detecting Anomalies
The final component of the decomposition is the error resulting from subtracting the expected value from the true value. We can consider this measure as the de-meaned volume, accounting for seasonality:
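In code (still following the sketch above), the residual is simply the observed volume minus the expected one:

```python
# De-meaned, seasonality-adjusted volume: observed minus expected.
df["residual"] = df["volume"] - df["expected"]

# Its distribution is what we examine next.
df["residual"].hist(bins=100)
plt.show()
```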
Interestingly, the residual distribution closely follows a Pareto distribution. This property allows us to use the Pareto distribution to set a threshold for detecting anomalies: we can flag any residuals that fall above a certain percentile (e.g., 0.9995) as potential anomalies.
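One way to implement this, sketched under the same assumptions, is to fit a Pareto distribution to the residuals with scipy and read off the desired quantile (scipy's `fit` estimates the shape, location, and scale from the data):

```python
from scipy import stats

# Fit a Pareto distribution to the residuals (NaNs from the rolling
# window are dropped) and take its 0.9995 quantile as the threshold.
resid = df["residual"].dropna()
shape, loc, scale = stats.pareto.fit(resid)
threshold = stats.pareto.ppf(0.9995, shape, loc=loc, scale=scale)
```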
Now, I have to add a big disclaimer: this property is not "true" per se. In my experience in social listening, I've observed that it holds for most social data, apart from some right skewness in datasets with many anomalies.
In this specific case, we have well over 15k observations, so we'll set the percentile threshold at 0.9995. Given this threshold, roughly 5 anomalies per 10,000 observations will be flagged (assuming a perfect Pareto fit).
Therefore, if we check which observations in our data have a residual above the 0.9995 percentile, we get the following signals:
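As a final sketch, the flagging compares each residual's position under the fitted Pareto CDF against the chosen percentile:

```python
# CDF value of each residual under the fitted Pareto; values above
# 0.9995 are flagged as anomalies.
cdf_vals = stats.pareto.cdf(df["residual"], shape, loc=loc, scale=scale)
df["anomaly"] = cdf_vals > 0.9995

# Overlay the flagged points on the raw volume series.
ax = df["volume"].plot(figsize=(12, 4), label="volume")
df.loc[df["anomaly"], "volume"].plot(ax=ax, style="ro", label="anomaly")
ax.legend()
plt.show()
```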
From this graph, we see that the observations with the highest volumes are highlighted as anomalies. Of course, if we want more or fewer signals, we can adjust the chosen percentile, keeping in mind that lowering it will increase the number of signals.