Scale Model, one of the newest companies to launch out of betaworks, helps identify, follow, and reach communities on Twitter. While there’s a great visual dashboard that gives us a way to look at what’s bubbling up from within communities, it is still hard to evaluate which items appear on a regular basis and which are novel. For example, in the US politics model, the #WakeUpAmerica hashtag is used regularly by conservatives and hence appears on the dashboard quite often. Wouldn’t it be great to know when activity around a certain hashtag is unusual? Or, more specifically, deviates from the expected behavior? Since we can’t expect users to be continuously glued to our dashboard, it would be great if we could send out notifications whenever something important happens.
In the following post, we detail work done by Rohit Jain, a Master’s student at Cornell Tech, who spent the summer with the betaworks data team. Rohit’s work lays the groundwork for a number of new features we hope to integrate into Scale Model.
Anomaly Detection: the Seasonal Hybrid ESD approach
This problem is common to a variety of areas like DevOps, fault detection, and automatic monitoring. From existing research around Twitter trend analysis, we found three interesting classes of methods: a parametric approach, a probabilistic approach, and the Seasonal Hybrid ESD approach.
The parametric approach requires labeled data, and the probabilistic approach doesn’t seem to work nicely with periodic data. The Seasonal Hybrid ESD approach, described in detail in Vallis et al (2014), was primarily proposed for long-term anomaly detection, but seemed like a great candidate for the cyclical time series data we have. Below, we briefly describe this approach.
As a first step, we plotted the frequency of tweets containing a certain hashtag every 6 minutes for the community discussing politics in the United States. We observed the following patterns:
- Some hashtags like #tcot exhibit a daily periodic pattern where the number of tweets containing that hashtag remains steady throughout the day, then falls dramatically during the night.
- Hashtags like #LoveWins were triggered by the historic Supreme Court ruling in favor of marriage equality, generating a big, sudden spike.
- #CruzCrew has a weekly cycle, where hashtag usage spikes on a certain day on a weekly basis.
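As a concrete sketch of this bucketing step, 6-minute frequency counts can be computed with a pandas resample. The tweet records, timestamps, and field names below are made up for illustration; this is an assumed setup, not our production pipeline:

```python
import pandas as pd

# Toy tweet stream: in practice these rows come from a community's Twitter feed.
tweets = [
    {"created_at": "2015-07-01 09:01:00", "hashtag": "#tcot"},
    {"created_at": "2015-07-01 09:04:30", "hashtag": "#tcot"},
    {"created_at": "2015-07-01 09:08:00", "hashtag": "#tcot"},
    {"created_at": "2015-07-01 09:02:00", "hashtag": "#CruzCrew"},
]

df = pd.DataFrame(tweets)
df["created_at"] = pd.to_datetime(df["created_at"])

# One frequency count per hashtag per 6-minute bucket.
counts = (
    df.set_index("created_at")
      .groupby("hashtag")
      .resample("6min")
      .size()
)
print(counts.loc["#tcot"])  # 2 tweets in the 09:00 bucket, 1 in the 09:06 bucket
```

Each hashtag’s bucket counts then form the time series we analyze below.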
We’re certainly interested in identifying significant trends such as #LoveWins as early as possible, but we also want to know when periodic hashtags deviate from the expected behavior. Our assumption is that whenever the number of tweets containing a certain hashtag deviates from expected behavior, there is an event responsible for it.
We begin by decomposing each hashtag’s time series of frequencies using STL, a decomposition procedure introduced in Cleveland et al (1990). This process yields trend, seasonal, and residual components for each time series, such as those displayed below. With these components in hand, we replace the trend component for each observed frequency cycle with the median for that cycle. We then remove the seasonal and trend components from the time series and apply the generalized extreme studentized deviate (ESD) anomaly detection technique, introduced in Rosner (1983), to the residual component. This test yields the anomalies in the hashtag’s frequency time series, which we then filter by statistical significance.
We applied this approach to our data and added a number of heuristics to cluster anomalies based on time of occurrence and similarity of tweets. In the graph below, red dots represent all the anomalies detected, and yellow dots represent the starting points of clustered anomalies. The text in the box is the text of the top tweet for the anomalous period.
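One simple time-based clustering heuristic is sketched below: group anomalous buckets that occur close together and keep the first bucket of each run as the cluster’s starting point. The gap threshold is an illustrative value, and the tweet-similarity part of our heuristics is omitted here.

```python
def cluster_anomalies(indices, gap=3):
    """Group anomaly bucket indices into runs separated by more than `gap` buckets."""
    clusters = []
    for idx in sorted(indices):
        if clusters and idx - clusters[-1][-1] <= gap:
            clusters[-1].append(idx)   # close enough: extend the current cluster
        else:
            clusters.append([idx])     # too far apart: start a new cluster
    return clusters

runs = cluster_anomalies([4, 5, 7, 20, 21, 50])
starts = [run[0] for run in runs]  # cluster starting points (the yellow dots)
print(runs, starts)
```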
Finally, we added a ranking algorithm that scores these anomalies based on the element of surprise, in order to identify the anomalies that should trigger a user notification. Though the results look promising, we need a better way to evaluate and measure this (or any other) method’s performance. To do so, we need labeled data, which is not readily available (e.g., what counts as a meaningful abnormal event in the Animal Lovers model?).
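A minimal version of such a surprise score might rank each anomalous bucket by how far its residual sits from the typical residual, measured in robust (MAD-based) standard deviations. The scoring formula below is an illustration, not our production ranking:

```python
import numpy as np

def surprise_scores(residual, anomaly_indices):
    """Score anomalies by a robust z-score of their residual (MAD-based)."""
    med = np.median(residual)
    mad = np.median(np.abs(residual - med))
    scale = mad if mad > 0 else 1.0  # guard against a zero scale
    return {i: abs(residual[i] - med) / scale for i in anomaly_indices}

residual = np.array([0.1, -0.2, 0.0, 12.0, 0.3, -6.0])
scores = surprise_scores(residual, [3, 5])
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # index 3 (the larger spike) ranks first
```

Only the top-ranked anomalies would then be pushed out as notifications.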
Our solution? We created a Slack bot that not only notifies us whenever an anomaly is detected, but also asks users to vote on whether the anomaly corresponds to a real outlying trend on Twitter.
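A bare-bones sketch of such a notification, assuming a Slack incoming webhook: the channel name, message wording, and webhook URL below are placeholders, and voting can be as simple as reacting to the message with :+1: or :-1:.

```python
import json

def build_alert(hashtag, count, expected, top_tweet):
    """Format a detected anomaly as a Slack webhook payload with a vote prompt."""
    text = (
        f"Anomaly detected for {hashtag}: {count} tweets in the last "
        f"6 minutes (expected ~{expected}).\n"
        f"Top tweet: {top_tweet}\n"
        "Real trend on Twitter? Vote :+1: or :-1:."
    )
    return {"channel": "#scale-model-alerts", "text": text}

payload = build_alert("#WorldElephantDay", 480, 60, "Happy #WorldElephantDay!")
body = json.dumps(payload)
# In production this body would be POSTed to the webhook, e.g.:
# requests.post(WEBHOOK_URL, data=body)  # WEBHOOK_URL is a placeholder
```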
To give a few examples, these are the notifications we received from the US politics community during the recent Republican debate on Fox News:
For our (beloved) animal lovers community, we received the following notification about #WorldElephantDay at around 8 am EST.
We also used this technique to analyze data over a longer period of time, with the aim of generating a monthly report that summarizes important events for several Scale Model communities. Below you can see the top five surprising hashtags, with a related tweet for each, in the US Politics and Animal Lovers communities for the period 23 June to 15 July 2015.
The technique has yielded promising results so far, identifying relevant trends within several key communities in a timely manner. We also “push” these trends to the Scale Model team via Slack, so that event detection no longer requires frequent community monitoring. Further work will introduce smarter spam detection and evaluate alternative techniques using data we’ve collected so far. Ultimately, we would like to roll this feature out to all Scale Model communities and help our customers stay abreast of #NewTrends in communities in real time.