Retrospective Event Detection in News

This is a project developed by Chris Hidey, who spent this summer interning with us at the betaworks data science team, focusing on natural language understanding and event detection in news streams.


In Chris’ words:

The project I worked on this summer was to develop a method that algorithmically generates timelines around a given news subject. A “subject” can be any topic or event, such as the Sony hacks or the FIFA corruption scandal or the ongoing news coverage of Hillary Clinton or Donald Trump, or even specific issues such as the tax policy of presidential candidates. The goal is to determine key events over a specified time window that indicate new or significant developments in the story. The result is a retrospective look at how these events unfolded within a particular topic’s lifetime.

(It is important to note that the developed approach outlined below is not limited to social data; ultimately the only requirement is textual data with timestamps.)


The data for this project was derived from social media, consisting primarily of tweets with URLs, grouped by subject using a bag-of-words approach. One signal that is available to help determine events is the number of new links we see about a given topic over time. This time series data indicates when the news coverage around a subject has peaked, based on the velocity of links published about it.
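As a rough illustration of the velocity signal, here is a minimal sketch that counts new links per day and flags unusually busy days. The function names and the z-score threshold are assumptions made for this example, not the method used in the project, and days with no links at all are simply absent from the counts.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def daily_link_counts(link_dates):
    """Count how many new links were first seen on each day."""
    return Counter(link_dates)


def peak_days(counts, z_threshold=2.0):
    """Return the days whose link count sits more than z_threshold
    standard deviations above the mean daily count."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly flat coverage, nothing stands out
    return sorted(day for day, n in counts.items()
                  if (n - mu) / sigma > z_threshold)
```

A day returned by `peak_days` only says that coverage spiked; it says nothing yet about what happened, which is where the lexical signal comes in.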


Although this velocity signal can be used to determine when an unusual event has happened, it is difficult to understand what _actually_ happened or whether this is a follow-up news report about a story we’d seen in the past. Fortunately, we have lexical information available – the words in the tweets and the “slug”. The slug is the section of the URL that both identifies and describes a link and usually contains keywords selected to promote the link in search results. Using some NLP magic, we are able to extract the title, text and description of articles. Using this lexical signal we then represent an event as a collection of words that occur together often and around the same time. Ideally we want to identify events that are new and significant, so we need to balance uniqueness with frequency.

To model this problem, I represent the counts of word co-occurrence over time as a tensor and then do a PARAFAC decomposition. This is a similar approach to Latent Semantic Analysis but we are also considering the temporal element. In LSA, we have a matrix of terms and documents that we can use to determine the latent factors. For this data, we have terms, documents, and time, but each document is only associated with a single timestamp so just adding temporal information would not provide additional benefit. Thus we can think of one term as a “document” and each term is represented by a matrix of context vectors over time.
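To make the decomposition step concrete, here is a minimal NumPy sketch of PARAFAC (also known as CP decomposition) fitted with alternating least squares. It assumes a dense tensor of shape (terms, context terms, time steps); the function names, the random initialization and the fixed iteration count are choices made for this example, and the notebook's actual preprocessing and implementation may differ.

```python
import numpy as np


def parafac_als(X, rank, n_iter=100, seed=0):
    """Approximate a 3-way tensor X as a sum of `rank` rank-one tensors.

    Returns factor matrices A (terms x rank), B (contexts x rank) and
    C (time x rank), updated in turn by alternating least squares.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem for one
        # factor while the other two are held fixed.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C


def reconstruct(A, B, C):
    """Rebuild the tensor implied by the three factor matrices."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

In this reading, each column of `C` traces how strongly one latent factor is expressed over time, and the largest entries in the matching column of `A` suggest which words characterize it.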

[Read more about the tensor decomposition process, data preprocessing and algorithm in this IPython notebook]

The tensor decomposition method generates a list of sub-topics around the news story and provides the ability to track these sub-topics over time. In the figure below, we show sub-topics generated from analyzing all links about FIFA. The x-axis represents the days elapsed since we started tracking this subject (0 is at July 1 2015). Each colored line represents a sub-topic discovered via the method. The y-axis indicates the score attained by the sub-topic at a certain point in time. Every time a sub-topic peaks, it marks the point where an important event most likely occurred. It is evident that the majority of big stories about FIFA occur after May 27, 2015, when seven FIFA officials were arrested. This is where we find the sub-topics that reach the highest peaks.

Screen Shot 2015-08-25 at 10.30.54 AM


For each sub-topic, the algorithm generates a representative set of words which we can map back to links using their slugs. As a result, for each event we have a list of ranked links that best represent it. With different datasets, this method can produce interesting results that allow for a retrospective analysis into subjects, events and topics.

Here is an example of a visual timeline populated by this method’s output, including major events around coverage of Donald Trump’s campaign:



[click on the gif to play it]

Another cool thing about this algorithm is that it is largely customizable: you can specify a period of time and the number of top-K events to return within that period. This allows us to expand or narrow the number of items presented within any chosen period. For example, below we see the top-10 sub-events around Hillary Clinton’s campaign in July 2015:
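The period and top-K selection could be sketched as follows. This assumes each sub-topic already has a per-day score series (for instance from the temporal factor of the decomposition); the function name, the input layout and the "one peak per sub-topic" rule are assumptions made for illustration.

```python
def top_k_events(scores, start_day, end_day, k):
    """Pick the k strongest sub-topic peaks inside [start_day, end_day).

    scores maps a sub-topic label to its list of per-day scores.
    Returns (label, peak_day, peak_score) tuples, strongest first.
    """
    peaks = []
    for topic, series in scores.items():
        window = series[start_day:end_day]
        if not window:
            continue  # the series does not reach this period
        peak_score = max(window)
        peak_day = start_day + window.index(peak_score)
        peaks.append((peak_score, peak_day, topic))
    peaks.sort(key=lambda p: p[0], reverse=True)
    return [(topic, day, score) for score, day, topic in peaks[:k]]
```

Raising `k` or widening the window expands the timeline; lowering `k` or shrinking the window narrows it to the biggest developments.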



[click on the gif to play it]

In a world where we are overwhelmed with a multitude of information, our attention is pulled every which way. It is natural to feel like we are always playing “catch-up” with the news. Many times, it is hard to consume a current story without enough background knowledge. This algorithm’s output – a retrospective timeline of major events – could be an interesting way to understand the context for any current story. Perhaps this lowers the bar and furthers our readers’ appreciation for current news stories.

Chris & Suman

Will your news trend on Facebook? Driving factors behind Facebook trending topics

Earlier this year, Facebook announced the launch of trending topics on its newsfeed page. Like Twitter trends, which reflect the attention landscape in the Twittersphere, Facebook trends showcase the most popular news stories within the Facebook world — algorithmically determined from publicly shared posts of its 757 million daily users. Here at betaworks, we have always been curious to explore how news events spread on Facebook via trending topics.

Massive pro-democracy protests took place in Hong Kong last week. More than half a million individuals (most of whom were students) decided to occupy Central, the heart of Hong Kong. I chronologically tracked five news events reported from Hong Kong during the protests and studied their evolution within the Facebook media ecology. Using Facebook trending data, I was able to spot which news stories became trends and which failed to, and which trends persisted while others died off.

Screen Shot 2014-10-12 at 12.49.10 AM

Online news media has rapidly transformed into a mobile, real-time phenomenon. Many news stories compete within the Facebook ecosystem to make it into the trending list. A trend stays on the top-10 trending list when enough people see a post and share it themselves. Few news stories make it to the trending list, and even fewer stay on the list for long periods of time. A sustained presence on the trending list promises increased attention from users and the possibility of further sharing, which in turn helps the story remain trending.

Here are the highlights from the original post:

We found three driving factors that determine whether a news story will stay on the trending list. Two of the factors are well known in news-cycle evolution, but the third one seems to be a Facebook-only phenomenon. Our data provides evidence that these factors led to disproportionate attention to the Hong Kong protests in different geographical communities on Facebook. The interesting thing about this dataset is that we can quantitatively measure the impact of each of these factors on the news story.

1. Diurnal Patterns

People don’t share when they are sleeping (at least we hope not). Diurnal patterns are common in social media, and Facebook is no exception. A piece of news that breaks late in the evening has a lower chance of sustaining as a trend.


2. Number of Competing News Stories

The number of competing news trends in a geographical community affects trend sustainability. Competing stories reflect the ecological conflict that a piece of news faces to break into the top-10 list and maintain its spot. Using a technique called Likelihood Estimation, we can estimate the chance that a news story will get into the trending list. Note that this is not a measure of sustainability; it’s only a reading of the probability of even breaking into the top-10 trend list.

Screen Shot 2014-10-12 at 2.13.30 AM

Competing news stories lower the likelihood that any particular story will make an appearance in the top-10 trending list. This likelihood is unbalanced and depends strongly on the geographical region. For example, the likelihood of a news story making it as a Facebook trend is nearly 1.75 times as high in Australia compared to the USA. [INTERACTIVE visualization]
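The post does not spell out the estimator, so treat the following as a sketch under a simple assumption: each story in a region is an independent yes/no trial, in which case the maximum likelihood estimate of "breaking into the top-10" is just the fraction of stories that did. The function names are hypothetical, and the counts in the usage note are placeholders, not figures from the study.

```python
def trend_likelihood(n_trended, n_stories):
    """Maximum likelihood estimate of the chance that a story breaks
    into the top-10 list, under an assumed Bernoulli model."""
    if n_stories <= 0:
        raise ValueError("need at least one observed story")
    return n_trended / n_stories


def likelihood_ratio(region_a, region_b):
    """How many times more likely a story is to trend in region_a than
    in region_b. Each region is a (n_trended, n_stories) pair."""
    return trend_likelihood(*region_a) / trend_likelihood(*region_b)
```

For example, `likelihood_ratio((30, 200), (10, 200))` compares a region where 30 of 200 stories trended with one where 10 of 200 did. More competing stories in a region means a larger denominator for roughly the same ten slots, which is why the estimate drops.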

3. The Escape Velocity

The potential of a news story to sustain in the top-10 list appears to be strongly influenced by a key phenomenon: how long the trend can maintain a top-3 spot on the trending list. If the news piece maintains such a top-3 spot for more than 1.5 hours after breaking into the trending list, then it has a significant chance of persisting as a trend for the next ~12-18 hours.

In fact, between Aug. 26th and Sept. 4th, I found that only 12% of the news stories that break on a given day end up occupying ~72% of the trending slots for the next ~16–18 hours. What’s common among these 12% of stories? They had all risen as high as the top 3 and survived there for at least 1.5 hrs. In fact, stories that did not last 1.5 hrs in the top 3 had a 57% chance of falling off the top-10 list within the next 6 hours. Thus, the first 1.5 hours on the trending list are critical to a news story’s longevity and a powerful indicator of trend sustenance on Facebook.
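The top-3 rule can be checked mechanically once a story's rank has been sampled at regular intervals. The sketch below assumes a list of ranks (with `None` when the story is off the list), a hypothetical 10-minute sampling interval, and treats each sample as standing for one full interval. Only the 1.5 hour threshold comes from the analysis above.

```python
def longest_top3_streak_hours(ranks, sample_minutes=10):
    """Longest unbroken stretch spent in the top 3, in hours.

    ranks is a list of trending positions sampled every sample_minutes;
    None means the story was not on the list at that sample.
    """
    longest = current = 0
    for rank in ranks:
        if rank is not None and rank <= 3:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # the streak is broken
    return longest * sample_minutes / 60


def has_escape_velocity(ranks, sample_minutes=10, threshold_hours=1.5):
    """True if the story held a top-3 spot for at least threshold_hours."""
    return longest_top3_streak_hours(ranks, sample_minutes) >= threshold_hours
```

A story that passes this check is one we would expect, based on the figures above, to keep trending for many more hours.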


The initial number of shares within the first 1.5 hrs is critical in giving the news enough ‘escape velocity’. This velocity enables the trend to last in the top-3 long enough in the face of competing trends. And this is what makes a marked difference; why we observe the same piece of breaking news being able to sustain as a trend in one geographical region while dying in another.

But what’s so special about the top-3 trending slots? Possibly a subtle Facebook design feature!

Read our entire research here: . Also, check out our interactive timelines of FB trends for Hong Kong news events:

The Digital Flames of Ferguson

I recently wrote a post articulating the fragmentation of media coverage in Ferguson and the impact of social media in spreading news about the events occurring there, which eventually gained national attention.

Using Twitter trending topics data, I show how the country’s attention gradually turned to the events transpiring in Ferguson, what the driving factors were, and what this tells us about certain latent kinetics of the Social Web. The data shows evidence of persistent attention in certain cities and an ensuing social contagion effect that possibly swung national attention to Ferguson.

Here are some highlights from the post:

  • Understanding attention in the Twittersphere and trending topics:

Imagine a city’s attention as a moving carousel, with an impending trend as someone trying to get aboard. The greater the volatility, the faster the carousel is spinning. If you are an impending trend, it is much easier to jump onto a slowly spinning (less volatile) merry-go-round of attention and stick to it than it is to get on and tag along on a carousel which is spinning very fast (more volatile).

  • How the attention signal varied in different cities:

The more Ferguson-related trends occur in the Trending Topic List of a location, the higher the amplitude of the signal.

The signal was highest in St. Louis, the epicenter of the news.
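One simple way to picture this signal: at every time step, count how many entries in a city's trending list match the topic. The sketch below assumes a list of trending-list snapshots per city and a hand-picked keyword list; both the matching rule and the names are assumptions made for illustration.

```python
def attention_signal(snapshots, keywords):
    """Number of topic-related trends in each trending-list snapshot.

    snapshots is a list of trending-topic lists, one per time step;
    a trend counts as related if it contains any keyword (case-insensitive).
    """
    lowered = [k.lower() for k in keywords]
    signal = []
    for trends in snapshots:
        hits = sum(1 for trend in trends
                   if any(k in trend.lower() for k in lowered))
        signal.append(hits)
    return signal
```

Plotting the returned list over time for each city gives one amplitude curve per location.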

Screen Shot 2014-08-26 at 6.22.11 PM

  • Contagion effects due to persistent attention and tipping points

Geographical information diffusion on Twitter is driven by the social interconnection structure among users in various cities and the sensitivity of a city’s tweeting population to new topics. Complex contagion is the study of social networks as conduits for ‘infectious’ idea/topic transmission. The simplest way to study contagious information spread is to examine when the various cities got infected with a Ferguson-related trend.

The data shows that on Aug. 10th, Ferguson-related trends were observed in four cities, namely the epicenter St. Louis itself, Miami, Boston and New York. The time between subsequent infections in every case was more than 3 hours.

Screen Shot 2014-08-26 at 6.27.56 PM

Past midnight on Aug. 11th, Ferguson-related trends had persisted in New York for about 15 hours. Then something interesting happens. The trend first appears in Washington DC. It trends there for about 70 minutes before erupting into a national trend, infecting 12 cities in the next 3 hours! This is a remarkable escalation in dispersion, given that Ferguson-related topics trended in only 4 cities in the previous 53.4 hours. Following the topic’s persistence in New York and inception in Washington DC, 62% of the remaining cities get infected in the next 5 hours.

Such points in a social phenomenon’s space-time, like the one observed in Washington DC just after midnight, are called ‘tipping points’ or ‘critical mass points’. At the tipping point, a huge fraction of the cities adopt a previously rare trending topic over a drastically short period of time.
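The timing analysis above boils down to a few operations on "first infection" times. Here is a minimal sketch, assuming we have, for each city, the hour (measured from some common start) at which a related trend first appeared. The function names and input layout are assumptions, and the values in any usage should be read as placeholders rather than the study's data.

```python
def infection_order(first_seen):
    """Cities sorted by when a related trend first appeared there.

    first_seen maps a city name to its first-appearance time in hours.
    """
    return sorted(first_seen.items(), key=lambda kv: kv[1])


def gaps_between_infections(first_seen):
    """Hours elapsed between each infection and the next one."""
    times = [t for _, t in infection_order(first_seen)]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def newly_infected(first_seen, start, window):
    """Cities first infected within `window` hours after `start`."""
    return sorted(city for city, t in first_seen.items()
                  if start < t <= start + window)
```

Long gaps followed by a burst of very short ones is the signature of a tipping point: `gaps_between_infections` shrinks sharply, and `newly_infected` around that moment returns a large share of the cities.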

What the study tells us

The elementary reason why such social contagions and tipping points are decisive in information dissemination is that they draw national attention to a trending topic quickly and, in many cases, can be the difference between the news going national or not. A possible hypothesis for why this phenomenon emerges in the Ferguson scenario is that several journalists and media outlets are based in New York, and the city might be a critical node en route to a national trend. Sustained attention in New York (~15 hours) could be a key factor in making this “trend” tip.

Read the entire post here 

Israel, Gaza, War & Data

Gilad Lotan, chief scientist at betaworks, recently wrote a post that was featured by NPR, the BBC and several other media outlets. Gilad shows that social media is being actively used in the art of personalizing propaganda, and while war rages on the ground in Gaza and across Israeli skies, there’s an all-out information war unraveling in social networked spaces.



Here are some highlights:

… on social networks:

The landscape is much more nuanced, and highly personalized. We construct a representation of our interest by choosing to follow or like specific pages. The more we engage with certain type of content, the more similar content is made visible in our feeds. Recommendation and scoring functions learn from our social connections and our actions online, constructing a model that optimizes for engagement; the more engagement, the more traffic, clicks, likes, shares, and so forth, the higher the company’s supposed value. Our capitalistic markets appreciate a growing value.

… on algorithmic filtering:

The better we get at modeling user preferences, the more accurately we construct recommendation engines that fully capture user attention. In a way, we are building personalized propaganda engines that feed users content which makes them feel good and throws away the uncomfortable bits. 

We used to be able to hold media accountable for misinforming the public. Now we only have ourselves to blame. 

… on capitalism vs. democracy: 

Personalized online spaces are architected to keep us coming back for more. Content that is likely to generate more clicks, or traffic is prioritized in our feeds, while what makes us uncomfortable, fades into the ether. We construct our social spaces — we may choose to follow a user, like a page or subscribe to updates from a given topic.

The underlying algorithmics powering this recommendation engine help reinforce our values and bake more of the same voices into our information streams.

…and a beautiful explanatory network visualization:

Read the entire article here

A data-driven approach to verbalize weather forecasts at scale

Few datasets are as rich, complex, dynamic, near-chaotic and close to the real world as weather data. In an age where apps are intuitive and extremely user-friendly, consumers still receive weather forecasts in a traditional format – predominantly through numbers such as temperature, humidity etc. A new weather app called Poncho aims to disrupt this – by transforming weather prediction numbers into personalized verbal messages. In this blog post we detail how we implemented a data-driven approach to scale the Poncho service across the US by detecting specific subspaces in geo-spatial weather patterns. Each subspace is matched with an editorial message, which is strongly guided by how humans turn weather predictions into actionable insights.

Weather Data: Poncho has a data-intensive backend, which collects weather data from approximately 42,000 zip codes across the US. Each zip code can be thought of as a data point in vector form, with features such as temperature, humidity, wind speed, precipitation, a natural language summary (‘cloudy’, ‘rain’) etc. Some features contain time series information – such as precipitation or temperature variation over the next 36 hrs. Given these feature vectors, our goal is to find zip codes that possess striking similarity in weather patterns over the next x hours (where x = 12, 24 or 36, depending on the cycle of computation). During each cycle (see figure below, left), our algorithm is tasked with detecting all zip codes that bear substantially similar weather given the dynamic variability in weather conditions. In other words, we must cluster zip codes by weather similarity.
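To give a feel for what "similar over the next x hours" can mean, here is a small sketch that compares two zip codes' hourly forecasts for one feature over a chosen horizon. The root-mean-square difference and the function name are assumptions made for illustration; the post does not say which similarity measure Poncho uses.

```python
import math


def forecast_distance(series_a, series_b, horizon_hours):
    """Root-mean-square difference between two hourly forecast series,
    restricted to the first horizon_hours values (e.g. 12, 24 or 36)."""
    n = min(horizon_hours, len(series_a), len(series_b))
    if n <= 0:
        raise ValueError("nothing to compare")
    squared = sum((a - b) ** 2 for a, b in zip(series_a[:n], series_b[:n]))
    return math.sqrt(squared / n)
```

Two zip codes can look identical over a 12 hour horizon and still drift apart at 36 hours, which is one reason the horizon has to be part of the comparison.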

Weather Subspaces: Poncho uses a dynamic heuristic subspace clustering algorithm, which differs from traditional data clustering techniques such as k-means or spectral clustering. This choice is primarily motivated by constraints including: (1) the inherent nature of weather data, which is extremely dynamic and (2) the required alignment of retrieved cluster centroids to the editorial message standards. What this means is that many of the data dimensions are irrelevant to editors/creative writers under different weather conditions. Instead of searching over the entire vector space of data, we search for patterns in specific subspaces within the data space. The choice of subspaces is strongly driven by editorial messaging rules. For example, a message for (windspeed > 12 mph + rain) vs. (windspeed > 12 mph + no rain) will be quite different (e.g., for the first condition the message might be: ‘you will find it difficult to hold on to your umbrella’). In fact, the priority of these conditions changes seasonally, e.g., you probably want to omit talking about humidity in the winter.

Thus, clusters are extracted based not just on the sensor data, but also on the heuristic rules. These rules determine which subspaces of weather to prioritize when clustering. Searching in specific subspaces enables extraction of cluster centroids which can be easily transformed into intuitive and actionable messages. Our algorithm is an extension of general subspace clustering, adjusted for weather data specifically. In the figure below (right), (A) shows the k-means clusters whereas (B) shows the result of dynamic heuristic subspace clustering. Notice that if the wind speed is greater than a certain threshold, we need to separate rainy vs. no-rain areas into different clusters.
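The idea of letting editorial rules pick the subspace can be sketched in a few lines. This is not Poncho's algorithm: it is a simplified illustration in which a hypothetical rule function assigns each zip code to a rule-defined subspace, and a centroid is then computed over only the features that matter there. The field names (`wind_mph`, `precip_in`, `temp_f`) are made up; the 12 mph threshold is the one from the example above.

```python
from collections import defaultdict

WIND_THRESHOLD_MPH = 12  # threshold from the windspeed example in the text


def subspace_key(obs):
    """Map one zip code's forecast to a rule-defined subspace.
    The rule set here is hypothetical and deliberately tiny."""
    if obs["wind_mph"] > WIND_THRESHOLD_MPH:
        # Above the wind threshold, rain vs. no rain changes the message.
        return ("windy", "rain" if obs["precip_in"] > 0 else "no_rain")
    return ("calm",)


def cluster_by_subspace(forecasts):
    """Group zip codes by the subspace their forecast falls into."""
    clusters = defaultdict(list)
    for zip_code, obs in forecasts.items():
        clusters[subspace_key(obs)].append(zip_code)
    return dict(clusters)


def centroid(forecasts, zip_codes, features):
    """Average the chosen features over the zip codes in one cluster."""
    n = len(zip_codes)
    return {f: sum(forecasts[z][f] for z in zip_codes) / n for f in features}
```

A fuller version would run a proper clustering step inside each subspace and swap the rule set by season, but even this toy version shows why windy-and-rainy areas never end up sharing a message with windy-and-dry ones.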

Screen Shot 2014-05-06 at 11.37.39 AM

The dynamic heuristic subspace clustering (C) detects weather clusters based on heuristic rules determined by Poncho editors. This creates more intuitive clusters than traditional clustering like k-means (B).


Weather Cluster Patterns: Shown below are results of dynamic heuristic subspace clusters in New York state. Normally we would assume a cluster to contain geographically closer (neighboring) locations exclusively. However, this is not always true. Notice how Long Island and some parts of Rochester have quite similar weather patterns for this day. There are other factors at play which can cause distant locations to bear very similar weather conditions, such as distance from a water body, terrain etc.

New York Weather Clusters


In some states, we find weather cluster boundaries ordered by their distance from the ocean. A good example of this pattern is Massachusetts, where we often detect four clusters – ordered by distance from the coast. This feels intuitive. Weather conditions near the coast are strongly influenced by the ocean. As we go more inland, the clusters resemble clear demarcations in weather boundaries, typically based on wind speed and precipitation levels.

Massachusetts Weather Clusters


Delivering Weather Forecasts: Once the clusters within a state are determined, they are presented in an admin panel (shown below). Each entry in the admin panel refers to one cluster. Each cluster contains a multitude of zip codes that share very similar weather patterns. The admin panel allows for a single message to be broadcast to all the geo-locations within that cluster. The message is constructed based on the centroid of the particular cluster, in addition to the editorial voice that humanizes the forecast experience.

Delivering Weather Forecasts


One of the key things we aim to achieve with this form of data science is to connect algorithmic results to interpretable artifacts. A priority is making the cluster results as actionable as possible for editors/writers. This post takes a bird’s eye view of a critical pipeline – from the writers’ message composition factors to translating them into mathematical heuristics and partitioning the data using dynamic heuristic subspace clustering, followed by feeding the cluster results back to the admin panel so messages can be broadcast to users in specific zip codes.

Related Topics: Natural Language Generation, Bipartite Matching, Subspace Clustering