In the following post, we detail work done by Rohit Jain, a Master’s student at Cornell Tech, who spent the summer with the betaworks data team. Rohit’s work lays the groundwork for a number of new features we hope to integrate into Scale Model.
Anomaly Detection: seasonal hybrid ESD approach
This problem is common to a variety of areas like DevOps, fault detection, and automatic monitoring. From existing research around Twitter trend analysis, here are some interesting methods we found:
The parametric approach in #1 requires labeled data, and the probabilistic approach proposed in #2 doesn’t seem to work nicely with periodic data. The Seasonal Hybrid ESD approach in #3, described in detail in Vallis et al. (2014), had been proposed primarily for long-term anomaly detection, but seemed like a great candidate for the cyclical time series data we have. Below, we briefly describe this approach.
As a first step, we plotted the frequency of tweets containing a certain hashtag every 6 minutes for the community discussing politics in the United States. We observed the following patterns:
We’re certainly interested in identifying significant trends such as #lovewins as early as possible, but we also want to know when periodic hashtags deviate from their expected behavior. Our assumption is that whenever the number of tweets containing a certain hashtag deviates from expected behavior, there is an event responsible for it.
We begin by decomposing each hashtag’s time series of frequencies using STL, a decomposition procedure introduced in Cleveland et al. (1990). This process yields trend, seasonal, and residual components for each time series, such as those displayed below.
With these components in hand, we replace the trend component for each observed frequency cycle with the median for that cycle. We then remove the seasonal and trend components from the time series and apply the generalized extreme studentized deviate (ESD) anomaly detection technique introduced in Rosner (1983) to the residual component. This test yields the anomalies in the hashtag’s frequency time series, which we then threshold based on their statistical significance.
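To make the shape of this pipeline concrete, here is a deliberately simplified sketch in Python. It is not the code we ran: per-phase medians stand in for the STL seasonal component, and a fixed cutoff on a median/MAD score stands in for Rosner's critical values, so it only mimics the "remove the most extreme residual and retest" loop of the generalized ESD test. The function names, the `threshold` of 3.5 and the sample series are all assumptions made for illustration.

```python
from statistics import median


def seasonal_median_residuals(series, period):
    """Residuals after removing a per-cycle median and a per-phase median.

    Assumes len(series) >= period. Both medians are stand-ins for STL output.
    """
    detrended = []
    for start in range(0, len(series), period):
        cycle = series[start:start + period]
        cycle_median = median(cycle)  # trend stand-in for this cycle
        detrended.extend(x - cycle_median for x in cycle)
    # Seasonal stand-in: the median across cycles at each phase of the period.
    seasonal = [median(detrended[p::period]) for p in range(period)]
    return [x - seasonal[i % period] for i, x in enumerate(detrended)]


def extreme_deviates(residuals, max_anomalies, threshold=3.5):
    """Repeatedly flag the most extreme residual, ESD-style, using median/MAD."""
    remaining = dict(enumerate(residuals))
    flagged = []
    for _ in range(max_anomalies):
        if len(remaining) < 3:
            break
        values = list(remaining.values())
        center = median(values)
        mad = median([abs(v - center) for v in values])
        if mad == 0:
            break  # no spread left to measure deviations against
        index, value = max(remaining.items(), key=lambda kv: abs(kv[1] - center))
        # 1.4826 rescales the MAD so it is comparable to a standard deviation.
        if abs(value - center) / (1.4826 * mad) <= threshold:
            break
        flagged.append(index)
        del remaining[index]
    return flagged


def detect_anomalies(series, period, max_anomalies):
    """Indices of anomalous observations in a periodic count series."""
    residuals = seasonal_median_residuals(series, period)
    return extreme_deviates(residuals, max_anomalies)


# Made-up counts with a period of 5 and one injected spike at index 12.
counts = [10, 20, 30, 21, 11,
          11, 20, 31, 20, 10,
          10, 21, 60, 20, 10,
          10, 20, 30, 19, 11]
print(detect_anomalies(counts, period=5, max_anomalies=3))
```

Medians are used throughout because a single large spike barely moves them, which is the same reason the hybrid approach swaps the trend for a median before testing the residuals.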
We applied this approach to our data and added a number of heuristics to cluster anomalies based on the time of occurrence and similarity of tweets. In the graph below, red dots represent all the anomalies detected and the yellow dots represent the starting point for clustered anomalies. The text in the box is the text of the top tweet for the anomalous period.
Finally, we added a ranking algorithm to score these anomalies based on the element of surprise, in order to identify the anomalies that should trigger a user notification. Though the results look promising, we need a better way to evaluate and measure this (or any other) method’s performance. To do so, we need labeled data that is not readily available (e.g. “what is a good abnormal event in the Animal Lovers model…”).
Our solution? We created a Slack bot that not only notifies us whenever an anomaly is detected, but also urges users to vote on whether the anomaly corresponds to an outlying trend on Twitter.
Just to give a few examples, these are the notifications we received from the US politics community during the recent Republican debate on Fox News:
For our (beloved) animal lovers community, we received the following notification about #WorldElephantDay at around 8 am EST.
We also used this technique to analyze data for a longer period of time, with the aim of generating a monthly report that summarizes important events for several Scale Model communities. Below you can see the five most surprising hashtags, with a related tweet for each, in the US Politics and animal lovers communities for the period 23 June – 15 July 2015.
The technique has yielded promising results so far, identifying relevant trends within several key communities in a timely manner. We also “push” these trends to the Scale Model team via Slack, so that event detection no longer requires frequent community monitoring. Further work will introduce smarter spam detection and evaluate alternative techniques using data we’ve collected so far. Ultimately, we would like to roll this feature out to all Scale Model communities and help our customers stay abreast of #NewTrends in communities in real time.
Rohit & Alex
This is a project developed by Chris Hidey, who spent this summer interning with us at the betaworks data science team, focusing on natural language understanding and event detection in news streams.
In Chris’ words:
The project I worked on this summer was to develop a method that algorithmically generates timelines around a given news subject. A “subject” can be any topic or event, such as the Sony hacks or the FIFA corruption scandal or the ongoing news coverage of Hillary Clinton or Donald Trump, or even specific issues such as the tax policy of presidential candidates. The goal is to determine key events over a specified time window that indicate new or significant developments in the story. The result is a retrospective look at how these events unfolded within a particular topic’s lifetime.
(It is important to note that the developed approach outlined below is not limited to social data; ultimately the only requirement is textual data with timestamps.)
The data for this project was derived from social media, consisting primarily of tweets with URLs, grouped by subject using a bag-of-words approach. One signal that is available to help determine events is the number of new links we see about a given topic over time. This time series data indicates when the news coverage around a subject has peaked, based on the velocity of links published about it.
Although this velocity signal can be used to determine when an unusual event has happened, it is difficult to understand what _actually_ happened or whether this is a follow-up news report about a story we’d seen in the past. Fortunately, we have lexical information available: the words in the tweets and in the “slug”. The slug is the section of the URL that both identifies and describes a link and usually contains keywords selected to promote the link in search results (e.g. nytimes.com/politics/first-draft/2015/07/07/marco-rubio-attacks-higher-education-cartel-and-jabs-rivals). Using some NLP magic, we are able to extract the title, text and description of articles. Using this lexical signal, we then represent an event as a collection of words that occur together often and around the same time. Ideally we want to identify events that are new and significant, so we need to balance uniqueness with frequency.
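As a small illustration of the slug as a lexical signal, the sketch below pulls keywords out of the last path segment of a URL using Python's standard `urllib.parse`. The helper name `slug_keywords` is made up for this example, and treating the final path segment as the slug is an assumption that will not hold for every publisher.

```python
from urllib.parse import urlparse


def slug_keywords(url):
    """Return the hyphen-separated words in the last path segment of a URL."""
    # urlparse only recognizes the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return []
    return [word for word in segments[-1].lower().split("-") if word]


print(slug_keywords(
    "nytimes.com/politics/first-draft/2015/07/07/"
    "marco-rubio-attacks-higher-education-cartel-and-jabs-rivals"
))
```

A real pipeline would likely also drop stop words such as "and" before counting co-occurrences.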
To model this problem, I represent the counts of word co-occurrence over time as a tensor and then do a PARAFAC decomposition. This is a similar approach to Latent Semantic Analysis but we are also considering the temporal element. In LSA, we have a matrix of terms and documents that we can use to determine the latent factors. For this data, we have terms, documents, and time, but each document is only associated with a single timestamp so just adding temporal information would not provide additional benefit. Thus we can think of one term as a “document” and each term is represented by a matrix of context vectors over time.
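To picture the data structure, here is a minimal sketch of how a term-by-context-by-time count tensor could be assembled from timestamped documents. It only builds the tensor; the PARAFAC decomposition itself is left out. The input format (a time bucket plus a list of terms per document) and the function name are assumptions for illustration, and the sample terms are made up.

```python
from itertools import permutations


def cooccurrence_tensor(docs):
    """Build tensor[i][j][t]: documents in bucket t containing terms i and j.

    docs is an iterable of (time_bucket, list_of_terms) pairs.
    """
    docs = [(bucket, sorted(set(terms))) for bucket, terms in docs]
    vocab = sorted({term for _, terms in docs for term in terms})
    buckets = sorted({bucket for bucket, _ in docs})
    term_index = {term: i for i, term in enumerate(vocab)}
    bucket_index = {bucket: t for t, bucket in enumerate(buckets)}
    tensor = [[[0] * len(buckets) for _ in vocab] for _ in vocab]
    for bucket, terms in docs:
        t = bucket_index[bucket]
        # Count each ordered pair so the tensor is symmetric in its term axes.
        for a, b in permutations(terms, 2):
            tensor[term_index[a]][term_index[b]][t] += 1
    return vocab, buckets, tensor


vocab, buckets, tensor = cooccurrence_tensor([
    (0, ["fifa", "arrest"]),
    (0, ["fifa", "arrest", "zurich"]),
    (1, ["fifa", "blatter"]),
])
# tensor[i] is the matrix of context vectors over time for term vocab[i].
print(vocab)
print(tensor[vocab.index("fifa")])
```

Each slice `tensor[i]` corresponds to the "term as document" view described above: one row per context term, one column per time bucket.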
[Read more about the tensor decomposition process, data preprocessing and algorithm in this IPython notebook]
The tensor decomposition method generates a list of sub-topics around the news story and then lets us track these sub-topics over time. In the figure below, we show sub-topics generated from analyzing all links about FIFA. The x-axis represents the days elapsed since we started tracking this subject (0 is at July 1 2015). Each colored line represents a sub-topic discovered via the method. The y-axis indicates the score attained by the sub-topic at a certain point in time. Every time a sub-topic peaks, it marks the point where an important event most likely occurred. It is evident that the majority of big stories about FIFA occur after May 27, 2015, when seven FIFA officials were arrested. This is where we find the sub-topics that reach the highest peaks.
For each sub-topic, the algorithm generates a representative set of words which we can map back to links using their slugs. As a result, for each event we have a list of ranked links that best represent it. With different datasets, this method can produce interesting results that allow for a retrospective analysis into subjects, events and topics.
Here is an example of a visual timeline populated by this method’s output, including major events around coverage of Donald Trump’s campaign:
Another cool thing about this algorithm is that it is largely customizable: you can specify a period of time and the number K of top events to return within that period. This allows us to expand or narrow the number of items presented within any chosen period. For example, below we see the top-10 sub-events around Hillary Clinton’s campaign in July 2015:
In a world where we are overwhelmed with a multitude of information, our attention is pulled every which way. It is natural to feel like we are always playing “catch-up” with the news. Many times, it is hard to consume a current story without enough background knowledge. This algorithm’s output, a retrospective timeline of major events, could be an interesting way to understand the context for any current story. Perhaps this lowers the bar and furthers our readers’ appreciation of current news stories.
Chris & Suman
Here’s the overview:
On October 29th and December 18th, 2014, something very strange happened to the iTunes top apps chart. Like an earthquake shaking up the region, all app positions in the chart were massively rearranged, some booted off completely. These two extremely volatile days displayed rank changes that are orders of magnitude higher than the norm: lots of apps moving around, lots of uncertainty.
If you build apps for iOS devices, you know that the success of your app is contingent on chart placement. If you use apps on iPhones and iPads, you should realize just how difficult it is for app developers to get you to download their app. Apple deploys an algorithm that identifies the Top Apps across various categories within its iTunes app store. This is effectively a black box. We don’t know exactly how it works, yet many have come to the conclusion that the dominant factor affecting chart placement is the number of downloads within a short period of time.
If a bunch of people all of a sudden download your app, you climb up the charts and, as a result, gain significant visibility, which leads to many more downloads. Some estimate that topping the charts may lead to tens of thousands of downloads per day.
Encoded within the iTunes app store algorithm is the power to make or break an app. If you get on its good side, you do really well, and if not, you lose.
If these volatile days are deliberate, shouldn’t we be informed? There are over 9 million registered developers who have shipped 1.2 million apps into iTunes. Algorithmic glitches on Wall Street can set off hundreds of millions of dollars in losses. What’s the dollar cost to entrepreneurs affected by these iTunes glitches? These are people who pour countless hours and resources into adding value to Apple’s ecosystem. Whether running experiments or A/B tests, shouldn’t Apple show due respect by taking issues like this seriously?
While the app store’s ranking algorithm is opaque, there’s much to be learned by looking at its output over time. In his work on Algorithmic Accountability, Nick Diakopoulos highlights ways to investigate the inner workings of algorithmic systems by tracking inputs and outputs.
Analyzing this type of data gives us a way to hold accountable systems of power, in this case, Apple and its algorithm.
Perhaps Apple is not aware of these glitches? Or maybe my data is flawed? I’ll let you be the judge of that. I did manage to find another person complaining about abnormal chart rank fluctuation around the same time. If you’ve witnessed something similar, please add a note or get in touch.
Read the full piece here.
Massive pro-democracy protests took place in Hong Kong last week. More than half a million individuals (most of whom were students) decided to occupy Central, the heart of Hong Kong. I chronologically tracked five news events reported from Hong Kong during the protests and studied their evolution within the Facebook media ecology. Using Facebook trending data, I was able to spot which news stories became trends and which failed to, and which trends persisted while others died off.
Online news media has rapidly transformed into a mobile, real-time phenomenon. Many news stories compete within the Facebook ecosystem to make it onto the trending list. Trends stay on the top-10 trending list when enough people see a post and share it themselves. Few news stories make it onto the trending list, and even fewer stay there for long periods of time. A sustained presence on the trending list promises increased attention from users and the possibility of further sharing, which in turn keeps the story trending.
Here are the highlights from the original post:
We found three driving factors that determine whether a news story will stay on the trending list. Two of the factors are well known in news-cycle evolution, but the third one seems to be a Facebook-only phenomenon. Our data provides evidence that these factors led to disproportionate attention to the Hong Kong protests in different geographical communities on Facebook. However, the interesting thing about this dataset is that we can quantitatively measure the impact of each of these factors on the news story.
People don’t share when they are sleeping (at least we hope not). Diurnal patterns are common in social media, and Facebook is no exception. A piece of news that breaks late in the evening has a lower chance of sustaining as a trend.
2. Number of Competing News Stories
The number of competing news trends in a geographical community affects trend sustainability. Competing stories reflect the ecological conflict that a piece of news faces to break into the top-10 list and maintain its spot. Using a technique called Likelihood Estimation, we can estimate the chance that a news story will get onto the trending list. Note that this is not a measure of sustainability; it is only a reading of the probability of breaking into the top-10 trend list at all.
More competing news stories mean a lower likelihood that any particular story will make an appearance in the top-10 trending list. This likelihood varies strongly by geographical region. For example, the likelihood of a news story making it as a Facebook trend is nearly 1.75 times as high in Australia as in the USA. [INTERACTIVE visualization]
3. The Escape Velocity
The potential of a news story to sustain in the top-10 list appears to be strongly influenced by a key phenomenon: how long the trend can maintain a top-3 spot on the trending list. If the news piece maintains such a top-3 spot for more than 1.5 hours after breaking into the trending list, then it has a significant chance of persisting as a trend for the next ~12-18 hours.
In fact, between Aug. 26th and Sept. 4th, I found that only 12% of the news stories that break on a given day end up occupying ~72% of the trending slots for the next ~16–18 hours. What’s common among these 12% of stories? They had all risen into the top 3 and survived there for at least 1.5 hrs. In fact, stories that did not last for 1.5 hrs in the top-3 had a 57% chance of falling off the top-10 list within the next 6 hours. Thus, the first 1.5 hours on the trending list are critical to a news story’s longevity and a powerful indicator of trend sustenance on Facebook.
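The top-3 rule is easy to express in code. The sketch below assumes a story's rank is sampled at a fixed interval (with `None` when it is off the list) and measures the longest unbroken stretch spent in the top 3; the sampling format, the 15-minute interval and the function names are assumptions for illustration rather than a description of how the trending data was actually collected.

```python
def longest_top_run(ranks, interval_minutes, top_n=3):
    """Longest continuous stretch, in minutes, spent at rank <= top_n.

    ranks holds one sampled rank per interval; None means off the list.
    """
    best = current = 0
    for rank in ranks:
        if rank is not None and rank <= top_n:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best * interval_minutes


def has_escape_velocity(ranks, interval_minutes, min_minutes=90):
    """True if the story held a top-3 spot for at least min_minutes."""
    return longest_top_run(ranks, interval_minutes) >= min_minutes


# Made-up ranks sampled every 15 minutes after a story enters the list.
sustained = [8, 5, 3, 2, 2, 1, 3, 2, 4, 6]
short_lived = [9, 3, 2, 5, 3, None]
print(has_escape_velocity(sustained, 15), has_escape_velocity(short_lived, 15))
```

Counting samples is only an approximation of elapsed time, so a finer sampling interval gives a tighter estimate of the 1.5-hour threshold.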
The initial number of shares within the first 1.5 hrs is critical in giving the news enough ‘escape velocity’. This velocity enables the trend to last in the top-3 long enough in the face of competing trends. This is what makes a marked difference, and it is why we observe the same piece of breaking news sustaining as a trend in one geographical region while dying in another.
But what’s so special about the top-3 trending slots? Possibly a subtle Facebook design feature!
Read our entire research here: beta.works/fbtrends. Also, check out our interactive timelines of FB trends for Hong Kong news events: beta.works/hk_trends
Using Twitter trending topics data, I show how the country’s attention gradually turned to the events transpiring in Ferguson, what the driving factors were, and what this tells us about certain latent kinetics of the Social Web. The data indicates there is specific evidence of persistent attention in certain cities and an ensuing social contagion effect that possibly swung national attention to Ferguson.
Here are some highlights from the post:
Imagine a city’s attention as a moving carousel, with an impending trend as someone trying to get aboard. The greater the volatility, the faster the carousel is spinning. If you are an impending trend, it is much easier to jump onto a slowly spinning (less volatile) merry-go-round of attention and stick to it than it is to get on and tag along on a carousel that is spinning very fast (more volatile).
The more Ferguson-related trends occur in the Trending Topic List of a location, the higher the amplitude of the signal.
The signal was highest in St. Louis, the epicenter of the news.
Geographical information diffusion on Twitter is driven by the social interconnection structure among users in various cities and the sensitivity of a city’s tweeting population to new topics. Complex contagion is the study of social networks as conduits for ‘infectious’ idea/topic transmission. The simplest way to study contagious information spread is to examine when the various cities got infected with a Ferguson-related trend.
The data shows that on Aug. 10th, Ferguson-related trends were observed in four cities, namely the epicenter St. Louis itself, Miami, Boston and New York. The time between subsequent infections in every case was more than 3 hours.
Past midnight on Aug. 11th, Ferguson-related trends had persisted in New York for about 15 hours. Then something interesting happened. The trend first appeared in Washington DC, where it trended for about 70 minutes before erupting into a national trend, infecting 12 cities in the next 3 hours! This is a remarkable escalation in dispersion, given that Ferguson-related topics had trended in only 4 cities in the previous 53.4 hours. Following the topic’s persistence in New York and inception in Washington DC, 62% of the remaining cities got infected in the next 5 hours.
Such points in a social phenomenon’s space-time, like the one observed in Washington DC just after midnight, are called ‘tipping points’ or ‘critical mass points’. At the tipping point, a huge fraction of the cities adopt a previously rare trending topic over a drastically short period of time.
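The infection-time bookkeeping behind this kind of analysis can be sketched in a few lines of Python. The example records the first time a topic trends in each city and then counts how many cities were newly infected inside a time window. The function names, the input format and the placeholder cities and timestamps are all invented for illustration and are not the Ferguson data.

```python
from datetime import datetime, timedelta


def first_infections(observations):
    """Earliest trending time per city, sorted by time.

    observations is an iterable of (city, datetime) trend sightings.
    """
    first = {}
    for city, when in observations:
        if city not in first or when < first[city]:
            first[city] = when
    return sorted(first.items(), key=lambda item: item[1])


def cities_infected_between(first_seen, start, window):
    """Cities whose first infection falls in [start, start + window)."""
    return [city for city, when in first_seen if start <= when < start + window]


t0 = datetime(2000, 1, 1)  # arbitrary placeholder origin
sightings = [
    ("City A", t0),
    ("City B", t0 + timedelta(hours=4)),
    ("City A", t0 + timedelta(hours=1)),
    ("City C", t0 + timedelta(hours=5)),
    ("City D", t0 + timedelta(hours=30)),
]
timeline = first_infections(sightings)
print(cities_infected_between(timeline, t0 + timedelta(hours=3), timedelta(hours=3)))
```

A sudden jump in the number of cities returned for consecutive windows is one simple way a tipping point would show up in this representation.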
What the study tells us
The elementary reason why such social contagions and tipping points are decisive in information dissemination is that they draw national attention to a trending topic quickly and, in many cases, can be the difference between the news going national or not. A possible hypothesis for why this phenomenon emerges in the Ferguson scenario is that several journalists and media outlets are based in New York, and the city might be a critical node en route to a national trend. Sustained attention in New York (~15 hours) could be a key factor in making this “trend” tip.
Read the entire post here
Here are some highlights:
… on social networks:
The landscape is much more nuanced, and highly personalized. We construct a representation of our interest by choosing to follow or like specific pages. The more we engage with certain type of content, the more similar content is made visible in our feeds. Recommendation and scoring functions learn from our social connections and our actions online, constructing a model that optimizes for engagement; the more engagement, the more traffic, clicks, likes, shares, and so forth, the higher the company’s supposed value. Our capitalistic markets appreciate a growing value.
… on algorithmic filtering:
The better we get at modeling user preferences, the more accurately we construct recommendation engines that fully capture user attention. In a way, we are building personalized propaganda engines that feed users content which makes them feel good and throws away the uncomfortable bits.
We used to be able to hold media accountable for misinforming the public. Now we only have ourselves to blame.
… on capitalism vs. democracy:
Personalized online spaces are architected to keep us coming back for more. Content that is likely to generate more clicks, or traffic is prioritized in our feeds, while what makes us uncomfortable, fades into the ether. We construct our social spaces — we may choose to follow a user, like a page or subscribe to updates from a given topic.
The underlying algorithmics powering this recommendation engine help reinforce our values and bake more of the same voices into our information streams.
…and a beautiful explanatory network visualization: https://www.youtube.com/watch?v=ajmE0hjkM24
Read the entire article here
Weather Data: Poncho has a data-intensive backend, which collects weather data from approximately 42,000 zip codes across the US. Each zip code can be thought of as a data point in vector form, with features such as temperature, humidity, wind speed, precipitation, a natural language summary (‘cloudy’, ‘rain’), etc. Some features contain time series information, such as precipitation or temperature variation over the next 36 hrs. Given these feature vectors, our goal is to find zip codes that possess striking similarity in weather patterns over the next x hours (where x = 12, 24, or 36, depending on the cycle of computation). During each cycle (see figure below, left), our algorithm is tasked with detecting all zip codes that bear substantially similar weather given the dynamic variability in weather conditions. In other words, we must cluster zip codes by weather similarity.
Weather Subspaces: Poncho uses a dynamic heuristic subspace clustering algorithm, which differs from traditional data clustering techniques such as k-means or spectral clustering. This choice is primarily motivated by constraints including: (1) the inherent nature of weather data, which is extremely dynamic and (2) the required alignment of retrieved cluster centroids to the editorial message standards. What this means is that many of the data dimensions are irrelevant to editors/creative writers under different weather conditions. Instead of searching over the entire vector space of data, we search for patterns in specific subspaces within the data space. The choice of subspaces is strongly driven by editorial messaging rules. For example, a message for (windspeed > 12 mph + rain) vs. (windspeed > 12 mph + no rain) will be quite different (e.g., for the first condition the message might be: ‘you will find it difficult to hold on to your umbrella’). In fact, the priority of these conditions changes seasonally, e.g., you probably want to omit talking about humidity in the winter.
Thus, clusters are extracted not just based on the sensor data, but also the heuristic rules. These rules determine which subspaces of weather to prioritize when clustering. Searching in specific subspaces enables extraction of cluster centroids which can be easily transformed into intuitive and actionable messages. Our algorithm is an extension of the general subspace clustering, adjusted for weather data specifically. In the figure below (right), (A) shows the k-means cluster whereas (B) shows the result of dynamic heuristic subspace clustering. Notice if the wind speed is greater than a certain threshold, we need to separate rainy vs. no-rain areas into different clusters.
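As a rough illustration of how editorial rules can carve the data into subspaces, the sketch below groups zip codes by which rules their forecast satisfies. It covers only the rule-driven partitioning step, not the clustering that happens inside each subspace, and it is not Poncho's production algorithm. The 12 mph wind threshold comes from the example above; the rain rule, field names, function names and zip-code labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical editorial rules: a readable label plus a test on one forecast.
RULES = [
    ("windy", lambda forecast: forecast["wind_mph"] > 12),
    ("rain", lambda forecast: forecast["precip_in"] > 0),
]


def subspace_key(forecast, rules=RULES):
    """Tuple of rule labels that this forecast satisfies."""
    return tuple(label for label, test in rules if test(forecast))


def partition_by_rules(forecasts, rules=RULES):
    """Group zip codes whose forecasts satisfy the same set of rules."""
    groups = defaultdict(list)
    for zip_code, forecast in forecasts.items():
        groups[subspace_key(forecast, rules)].append(zip_code)
    return dict(groups)


# Made-up forecasts keyed by placeholder zip-code labels.
forecasts = {
    "Z1": {"wind_mph": 15, "precip_in": 0.2},
    "Z2": {"wind_mph": 14, "precip_in": 0.0},
    "Z3": {"wind_mph": 5, "precip_in": 0.0},
    "Z4": {"wind_mph": 18, "precip_in": 0.4},
}
print(partition_by_rules(forecasts))
```

Because each group already corresponds to a combination of editorial conditions, a centroid computed within a group maps directly onto a message template, which is the property the subspace approach is after.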
Weather Cluster Patterns: Shown below are results of dynamic heuristic subspace clusters in New York state. Normally we would assume a cluster to contain geographically closer (neighboring) locations exclusively. However, this is not always true. Notice how Long Island and some parts of Rochester have quite similar weather patterns for this day. There are other factors at play which can cause distant locations to bear very similar weather conditions, such as distance from a water body, terrain etc.
In some states, we find weather cluster boundaries ordered by their distance from the ocean. A good example of this pattern is Massachusetts, where we often detect four clusters – ordered by distance from the coast. This feels intuitive. Weather conditions near the coast are strongly influenced by the ocean. As we go more inland, the clusters resemble clear demarcations in weather boundaries, typically based on wind speed and precipitation levels.
Delivering Weather Forecasts: Once clusters within a state are determined, they are presented in an admin panel (shown below). Each entry in the admin panel refers to one cluster. Each cluster contains a multitude of zip codes that share very similar weather patterns. The admin panel allows for a single message to be broadcast to all the geo-locations within that cluster. The message is constructed based on the centroid of the particular cluster, in addition to the editorial voice that humanizes the forecast experience.
One of the key things we aim to achieve with this form of data science is to connect algorithmic results to interpretable artifacts. A priority is making the cluster results as actionable as possible for editors/writers. This post takes a bird’s eye view of a critical pipeline: from the writers’ message composition factors, to translating them into mathematical heuristics and partitioning the data using dynamic heuristic subspace clustering, followed by feeding the cluster results back to the admin panel so messages can be broadcast to users in specific zip codes.
Related Topics: Natural Language Generation, Bipartite Matching, Subspace Clustering
We began by grabbing Twitter data using the public APIs, looking for tweets that have certain keywords, hashtags and Twitter handles connected to March Madness events and games. Our dataset grew rapidly: hundreds of thousands of tweets from different games, referencing teams, players and many other random facts. But surprisingly, we found very little hate, even after using a number of sentiment analysis libraries and services. Even polling the data for certain common words or phrases, to put together a training set for our own classifier, yielded poor results. We just couldn’t find sufficient examples in the data.
Weirdly enough, we came across numerous users who tweeted about their “love/hate” relationship with March Madness. But not much anger or hate, and very little directed at specific teams. Perhaps this is a byproduct of being part of a networked public: one’s actions are not only seen by one’s peers but are visible to the network at large, so one might be more cautious before displaying strong sentiment.
March madness is a love / hate relationship. I hate it right now.
— Dylan Lamb (@dylanlamb_) March 30, 2014
Constructing their Fan-base
Instead we chose to focus on each team’s networked audience (“fan-base”). If we can identify users rooting for each team, we’d be able to pull out some potentially interesting facts about each cluster of university supporters and run some broader comparisons. For the final four teams, we chose a number of popularly followed Twitter handles used to represent each team. Then we collected all users who retweeted those accounts into sets, making the *very crude* assumption that a retweet mostly represents an endorsement. While certainly not true in political discourse, especially when trying to make content visible to one’s audience for critique, this was rare in our case. And even if it did happen, the event would pretty much be an outlier in our data (users not part of the connected component).
Finally, for every user in one of the four sets, we grabbed information about the Twitter accounts that they follow. The Twitter graph represents both friendship and interest. By using network analysis techniques, we were able to run comparisons across the different teams.
For the digg post, the editors pulled out the following data points: devices used, user bios, and mainstream media preference (Fox News, CNN, MSNBC) reflected by who they follow. Some excerpts:
“Florida is uh, well, the only place where Fox News is popular. And it’s not even that popular. In fact, Florida is the only school where Fox News even shows up in the data. Obviously, this is some sort of conspiracy related to Benghazi.”
“Aaron Rodgers is more popular than sex. Well, actually we don’t know how popular sex is, but 50.25% of Wisconsin fans follow Aaron Rodgers and he didn’t even go to Wisconsin.”
Analyzing the Networks
At the end of this process we were left with some very interesting graphs representing the different fan-bases. For example, if we take all the Kentucky Wildcats fans, and insert each Twitter user as a node, and connect them by who they follow on Twitter, a fairly dense cluster emerges. The larger a node, the higher its degree in the network – the more connected it is to other nodes. The average degree for this graph is 11.731. This means that on average, every user is following or is followed by a total of ~12 other users from this same network. The higher the number, the more densely connected the community. BTW – the median degree for this graph is 5, still quite high.
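For readers who want to reproduce the degree figures on their own data, here is a minimal sketch that computes the average and median degree from a list of follow relationships. It counts followers plus followees, so a mutual follow adds two to each user's degree, and users with no edges inside the network are not counted; both choices are assumptions about how the numbers above were derived. The function name and the sample edges are made up.

```python
from statistics import mean, median


def degree_stats(follow_edges):
    """Average and median total degree over users that appear in any edge.

    follow_edges is an iterable of (follower, followed) pairs.
    """
    degree = {}
    for follower, followed in set(follow_edges):  # ignore duplicate edges
        degree[follower] = degree.get(follower, 0) + 1
        degree[followed] = degree.get(followed, 0) + 1
    values = list(degree.values())
    return mean(values), median(values)


# One hub account following four others.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e")]
average, middle = degree_stats(edges)
print(average, middle)
```

The gap between the two statistics is informative: a few heavily connected hubs pull the average up while the median stays close to what a typical fan looks like.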
Wisconsin’s fan-base looks quite different. Generally the number of retweets for Wisconsin team accounts is significantly higher compared to the other three teams. The size of the Wisconsin user set is almost three times as large as that of Kentucky’s. While the graph is more dense, there are fewer central nodes, and many more tiny specks representing less connected users. Still, the average degree for nodes in this graph is 16.89, and the median, 8. Both higher than the Kentucky fan graph. So even though the Wisconsin fan-base is much larger in our sample, it is also much more inter-connected.
If we double down on this WI graph, and run a community detection algorithm, we can start to identify regional friend networks and interest groups, outlined by similarly colored areas. For each of these clusters, we can see trends: users in the bottom purple region tend to have “Green Bay, WI” in their bios, while those in the red region have either La Crosse or Galesville, WI. Many of them are high school or college students, on various sports teams, connected to their friends in the network. I was surprised to see many of them using Vine. Curious if that’s a general trend across student populations, or simply a regional one.
This is a pretty standard analysis, and it only scratches the surface of the audience insight that can be reached. We’ll be highlighting much more of this type of work here, and getting deeper into methods and techniques.
Our attempt to alleviate the first problem was InstaRank, an algorithm that scores Instapaper’d articles based on several signals originating from user activity on the content (especially whether they read/like the article), its source and semantic value. The second challenge, that of temporal significance, is much more interesting. Of course, we did not want to provide users with just another feed or stream where interest slips away with time. We wanted users to be able to take stock: to be presented with reading content that is as absorbing and interesting in one month as in six months or a year. Additionally, we wanted users to have a concept of time associated with this content: when was it popular or relevant?
Instapaper Daily. We introduced Instapaper Daily last year. It is essentially a diurnal timeline that features the most popular article on Instapaper each day (and extends all the way to the beginning of 2013). There are also categorical timelines, allowing users to compare the most popular ‘entertainment’ vs. ‘sports’ articles on each past day. It is a great way to explore quality content that the user might have missed out on, adding a retrospective lens on what had been popular.
Thus, on each day, Instapaper Daily features the single most popular article in each category, algorithmically chosen by our data-driven backend. Sometimes, the daily slot is occupied by a prime quality article published on the day. On other occasions, the featured article concerns one big piece of news that happened on the day. We have 7 categorical timelines: ‘politics’, ‘arts & entertainment’, ‘business’, ‘computer & internet’, ‘science and tech’, ‘health’ and ‘sports’. All of this Instapaper Daily data is available as an RSS feed. It adds up to a fascinating dataset for exploring story topics and topic longevity along each timeline. The following analysis uses the last three months of timeline data.
Story Topics. Let me explain what we mean by ‘topic’ in reference to a story. Topics are extracted from each article based on its title and summary. In simple scenarios topics are just named entities in the text – say if ‘Amazon’ is in the title of a technology timeline article, we attribute it to topic: Amazon. However, if the title word space is too sparse, we employ a DBpedia-based categorization scheme, where identifier words (& named entities) are matched to certain topics. Thus, ‘Beyonce’ might map to topic ‘music’ whereas ‘The Hunger Games’ will map to topic ‘movie’. Sometimes, identifier words will occur in the URL itself, allowing us to recognize if the topic of the article is ‘baseball’ or ‘football’. For certain timelines like ‘Computers & Internet’, we focus on very precise topics based on named entities like Google, whereas for other timelines like ‘Arts & Entertainment’, broader topics are preferred (‘music’, ‘movie’, ‘books’). The resolution of the topic category can be adjusted based on the facet of information we want to explore in the data. The extraction of topics from the text requires several tools, including NLTK and NER tools for entity extraction and a DBpedia-based categorization that uses Ensemble Decision Trees to pick one principal topic per article. When the categorizer fails to find a relevant topic for an article, it is classified as ‘other’.
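To make the fallback order concrete, here is a deliberately simplified sketch. It swaps the real components (NER, the DBpedia-based categorizer and the decision-tree ensemble) for small hand-written lookup tables, so every table, name and parameter below is a hypothetical stand-in rather than our production code. It only illustrates the order of checks: named entity first, then identifier words, then the URL, and finally ‘other’.

```python
from urllib.parse import urlparse

# Hypothetical lookup tables standing in for NER and the DBpedia categorizer.
KNOWN_ENTITIES = {"amazon": "Amazon", "google": "Google", "apple": "Apple"}
IDENTIFIER_TOPICS = {"beyonce": "music", "the hunger games": "movie"}
URL_TOPICS = {"baseball", "football"}


def assign_topic(title, summary, url, precise=True):
    """Pick one principal topic for an article, or 'other' if nothing matches.

    `precise` mimics timelines where exact named entities are preferred;
    set it to False to skip straight to the broader topics.
    """
    # Naive substring matching: good enough for a sketch, too loose for real use.
    text = f"{title} {summary}".lower()
    if precise:
        for key, topic in KNOWN_ENTITIES.items():
            if key in text:
                return topic
    for phrase, topic in IDENTIFIER_TOPICS.items():
        if phrase in text:
            return topic
    # Fall back to identifier words that appear as URL path segments.
    for part in urlparse(url).path.lower().split("/"):
        if part in URL_TOPICS:
            return part
    return "other"


# Example with a made-up URL.
topic = assign_topic("Opening day recap", "", "https://example.com/sports/baseball/recap")
```

The `precise` flag is one simple way to express the adjustable topic resolution mentioned above.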
Topical Composition. The distribution of topic occurrence in each timeline is intriguing. While we do not observe any surprising anomalies, the featured topics do reveal aggregate interest patterns of Instapaper users. For example, in the Computer and Internet category, we find that Google and Apple stories dominate the timeline, followed by Facebook and stories involving NSA privacy. In the Science and Tech timeline, the majority of stories revolve around space exploration, the education system, genetics and physics. International news and economic issues dominate the Politics timeline, followed by stories about organizational systems (e.g., the judiciary, the revenue service etc.), the NSA, Obama and gun control. The ‘Business’ timeline is led by articles about investment and startup business, followed by business culture and, somewhat surprisingly, the food industry. The Sports category is the most reactive to real-world events, which is why, although most stories are about American football and soccer, the Winter Olympics was the persistent topic during the Sochi games. Movie articles predominate in the Arts/Entertainment timeline, followed by articles about TV shows, music, books and online media. Health is a fascinating timeline, where we found users are most interested in healthcare articles, followed by stories on drugs and antibiotics, eating habits, psychotherapy, abortion and smoking issues.
Life of a topic. What really fascinated me is how often the same topic surfaced on Instapaper Daily timelines. Although continuous growth and decay patterns (e.g., that of a Twitter meme) are not visible in discrete categorical timelines, we can still find patterns as to how long and how often a topic appears on the timeline, reflecting the fact that it significantly captured user interest. I was curious to see if certain topics persist in contiguous slots of the timeline, meaning they held user attention for more than one day. The other interesting aspect is how many times a certain topic appeared on the timeline with other topics in between, which would mean it ‘reoccurred’. The two metrics I discuss here are Persistence and Recurrence. The maximum number of contiguous timeline slots occupied by a topic is its Persistence. The number of times a topic reoccurs in the timeline after breaks is its Recurrence. We convert the timeline into a binary array for each topic, where a day on which the topic occurs is represented by ‘1’, and any other day by ‘0’. Then, persistence and recurrence can be calculated as:
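As a minimal sketch following the definitions above, the metrics can be computed from the binary array like this. Two things here are assumptions for illustration: occurrence is taken to be the total number of days a topic appears, and recurrence is taken to be the number of separate runs of ‘1’s minus one (that is, the number of times the topic comes back after a break). The function names are my own.

```python
from itertools import groupby


def run_lengths(presence):
    """Lengths of each contiguous run of 1s in a binary timeline."""
    return [sum(1 for _ in run) for value, run in groupby(presence) if value == 1]


def occurrence(presence):
    """Total number of days the topic appears (assumed definition)."""
    return sum(presence)


def persistence(presence):
    """Longest contiguous stretch of days occupied by the topic."""
    runs = run_lengths(presence)
    return max(runs) if runs else 0


def recurrence(presence):
    """Number of times the topic returns after a break (assumed: runs - 1)."""
    runs = run_lengths(presence)
    return max(len(runs) - 1, 0)


# Illustrative 10-day timeline for a single topic.
timeline = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
metrics = (occurrence(timeline), persistence(timeline), recurrence(timeline))
```

For this illustrative timeline the topic occupies three separate stretches, the longest lasting three days, so it has returned twice after a break.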
A common way to visualize multivariate data is parallel coordinates, which is also very useful in analyzing n-dimensional categorical time series data. The real advantage of parallel coordinates (over orthogonal coordinates) is that the number of dimensions that can be visualized is only restricted by the horizontal resolution of the screen. The two key factors in creating a parallel coordinate representation are the order and the scaling of the axes. We follow the order: occurrence – persistence – recurrence. Scaling is a natural aftereffect of interpolation among consecutive pairs of variables. Our data reveals that a relative scaling of 24:7:20 for the occurrence:persistence:recurrence axes fits the parallel coordinates best.
Visualized below is the occurrence, persistence and recurrence of topics on various categorical timelines of Instapaper Daily. In the table, ‘name’ represents the topic and ‘group’ identifies its categorical timeline.
Parallel Coordinate Representation of Persistence/Recurrence of Topics
[Link to full page viz]
Topic Patterns. Using the parallel coordinates, we found four common topic patterns in various categorical timelines of Instapaper Daily. They exhibit how certain topics sustain and possibly regain user attention in time. Topics around stories generally behave in four different ways:
concave/convex patterns of relative persistence
Pattern B is the most infectious and sustains reader attention. These are topics that usually occur rarely but persist, bearing a concave curve on the three axes. Examples include stories about Amazon drone delivery and the Sochi Olympics. Pattern D involves topics that occur periodically but persist, and users seem to enjoy reading about them during or right after that period. Examples include ‘football’, which gains interest periodically close to game days, or even international news (often a crisis) that captures user attention.
Pattern C involves story topics that receive repeated visibility. They usually have high occurrence but low persistence; thus they exhibit a convex curve on the parallel coordinates. These include articles about TV shows (like Sherlock), discussions about investments in the business category and several topics in ‘science’ (like space, physics etc.). Finally, the most frequently observed pattern is A – topics that usually have low occurrence and where persistence settles at approximately 1/3 of the occurrence, resulting in an almost horizontal line across the parallel coordinates. Examples include discussion about ‘phone carriers’ and ‘food’ in the business timeline, and stories about ‘energy solutions’ in the science timeline. In terms of the parallel coordinates, the more convex a topic’s curve, the lower its persistence relative to its occurrence. Conversely, the more concave a topic’s curve, the greater its persistence relative to the number of occurrences.
What is it about Flappy Bird that made it so successful, and why did it take so long for the game to go “viral”? The app stores are littered with thousands of free casual games that use similarly addictive gameplay. What can we learn about the rise in uptake of this game specifically? And can we perhaps identify a tipping point where engagement around the game crossed a certain threshold, gaining momentum that was impossible to stop?
These were some of the questions that I set out to answer earlier this week. At betaworks, we have a unique longitudinal view of mainstream media and social media streams. Our services, at varied scales, span content from publishers and social networks, giving us the ability to analyze the attention given to events over time. Inspired by Zach Will’s analysis of the Flappy Bird phenomenon through scraped iTunes review data, I wanted to see what else we could learn about the massive adoption of this game, specifically through the lens of digital and social media streams.
My data shows two clear tipping points, where there was a significant rise in user adoption of the game. The first, January 22nd, happened when the phrase ‘Flappy Bird’ started trending on Twitter across all major cities in the United States. It would continue trending for the next 6 days, driving increased visibility and further adoption. The second, on February 2nd, was the point at which media coverage of the game quadrupled on a daily basis. Most media outlets were clearly late to the game, covering the phenomenon only after it had topped all app store charts and was already a massive success.
We live in a networked world, where social streams drive massive attention spikes flocking from one piece of content to another. While it is a chaotic system, difficult to fully predict or control, there are early warning signals we can heed, as events unfold.
First, a little explanation. The bar chart above highlights three signals over the period of the past month. All values on the y-axis are normalized to display percentage from the max:
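For anyone building a similar chart, the normalization mentioned above can be sketched as follows. This is only an illustration of the "percentage of the max" idea, with a hypothetical function name and made-up daily counts, not the exact code behind the chart.

```python
def percent_of_max(values):
    """Scale a series so its largest value becomes 100 (percent of max)."""
    peak = max(values, default=0)
    if peak == 0:
        # Empty or all-zero series: nothing to scale against.
        return [0.0 for _ in values]
    return [100.0 * v / peak for v in values]


# Made-up daily counts for one signal.
daily_counts = [2, 5, 10, 40]
normalized = percent_of_max(daily_counts)
```

Normalizing each signal separately puts series of very different magnitudes on the same y-axis, at the cost of hiding their absolute sizes.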
Media (& Social Media) Attention over Time
Throughout the first weeks of January we barely see any coverage of the game. The first story comes from ibtimes.co.uk on Jan. 24th. Described as the most frustrating game ever, it highlights a number of tweets where users complain about the game’s difficulty. This was the common sentiment seen on social networks:
Why On Earth Is This Borderline Crappy, Impossibly Hard Game The Most Popular Download On The App Store? http://t.co/oOrMRjR5dv
— Charlie Warzel (@cwarzel) January 23, 2014
The game’s difficulty was getting users to tweet out to their friends, and enough tweets were making the game a trend on Twitter (more on that below), which then powered a feedback loop: more tweets, more trends, and so on… Another early article comes from nativex.com, positively reviewing the game and its incredible adoption rate. At that point it had already hit the #1 spot in the iTunes App store for free apps.
As people continued to play it, they kept tweeting out their frustration, their anger and their high scores. Every day there was a growing number of users posting screenshots to Tumblr blogs, Instagram and Twitter accounts. But on February 2nd, something changed. While the amount of media coverage (in purple) grows steadily, the number of users sharing content related to the game began to quadruple on a daily basis (in green).
We see significantly higher levels of engaged users on social networks responding to and sharing stories about the game starting on Feb 2nd.
In order to better understand why this happened, we can break down the coverage by sources (plot below). At first we see the nativex.com piece (1/24-25) and people mainly posting screenshots of their scores using pic-twit.co as well as links to the game in the iTunes app store. One post on wpcentral.com details Dong’s tweet promise to release the game for the Windows Phone. A gaming blog detailed tips and cheats for the game, a subreddit had hundreds of comments, and my personal favorite, a super geeky blog post calculated Flappy Bird’s size given its velocity-time graph while in free fall. Throughout the period, there were lots and lots of people sharing screenshots of their scores.
On Jan 31st, VentureBeat published a story questioning the game’s success, and the following day published a listicle-style story on why the game is so successful. Over the next few days we see a rapid increase in Instagram shares of scores, likely due to the heightened media coverage, starting on Feb 2nd, when we see posts from TechCrunch, Fast Company and The Atlantic, as well as Huffington Post and entrepreneur.com. This explains the significantly larger population of users posting and responding to these articles.
Twitter Trending Topics across Geographical Regions
By tracking where the phrase ‘Flappy Bird’ appeared in Twitter’s trending topics lists across geographic locations, we can identify when and where the game became popular. We use a dispersion plot (below) to show these trends across geographic locations over time. The lower a city appears, the earlier Flappy Bird trended in that location. The x-axis denotes days. The first trend happened on January 22nd in Jackson, Mississippi, spreading on to Las Vegas, Birmingham, Harrisburg and Baton Rouge. Within a day, it was trending across all major cities in the United States. On January 28th, the game started trending in the UK, then it spread to Canada, and later to East Asia, Europe and South America (larger screenshots of the dispersion plot can be found in my project site drop). When analyzing trends, this type of behavior isn’t uncommon. We tend to see trends spread within countries first, before hopping over to another country or region.
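The ordering used in a plot like this can be sketched in a few lines. The snippet assumes the raw data is a list of (location, day) pairs, one for each day the phrase showed up in a location’s trending list; that input format and the function name are assumptions for illustration. The two first-trend dates come from the text above, and the follow-up days are illustrative.

```python
from datetime import date


def first_trend_dates(observations):
    """Return (location, first day trending) pairs, earliest first.

    Ties on the same day are broken alphabetically by location.
    """
    first_seen = {}
    for location, day in observations:
        if location not in first_seen or day < first_seen[location]:
            first_seen[location] = day
    return sorted(first_seen.items(), key=lambda item: (item[1], item[0]))


observations = [
    ("United Kingdom", date(2014, 1, 28)),
    ("Jackson", date(2014, 1, 22)),
    ("Jackson", date(2014, 1, 23)),         # illustrative follow-up day
    ("United Kingdom", date(2014, 1, 29)),  # illustrative follow-up day
]
ordering = first_trend_dates(observations)
```

Plotting locations along one axis in this order, with days on the other, gives the staircase shape that makes country-by-country spread easy to spot.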
Based on this diagram along with the previous charts, some closing thoughts:
Finally, it is important to highlight this post describing potential bot-related activity around the game’s earlier time in the app store. This is based on oddly worded reviews and sheer number of downloads for a suspiciously new game. If true, this points to a troubling phenomenon, where it is not meritocracy that gets your app ranked at the top of the charts, but rather how well you can manipulate the system. When launching web services, this is called Search Engine Optimization. When setting up social media accounts, it has become common to purchase followers or likes and boost the account’s presence.
So when launching an app, specifically a casual game, is it necessary to manipulate the app store scoring mechanisms in order to even have a chance at success?
I’d love to hear your thoughts or comments.
Gilad | @gilgul]]>