This is a project developed by Chris Hidey, who spent this summer interning with us at the betaworks data science team, focusing on natural language understanding and event detection in news streams.
In Chris’ words:
The project I worked on this summer was to develop a method that algorithmically generates timelines around a given news subject. A “subject” can be any topic or event, such as the Sony hacks or the FIFA corruption scandal or the ongoing news coverage of Hillary Clinton or Donald Trump, or even specific issues such as the tax policy of presidential candidates. The goal is to determine key events over a specified time window that indicate new or significant developments in the story. The result is a retrospective look at how these events unfolded within a particular topic’s lifetime.
(It is important to note that the developed approach outlined below is not limited to social data; ultimately the only requirement is textual data with timestamps.)
The data for this project was derived from social media, consisting primarily of tweets with URLs, grouped by subject using a bag-of-words approach. One signal available to help determine events is the number of new links we see about a given topic over time. This time series indicates when the news coverage around a subject has peaked, based on the velocity of links published about it.
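As a rough sketch of this velocity signal, here is a minimal, stdlib-only example; the sample tweets and URLs are illustrative, not from the actual dataset, and the real pipeline would of course operate at a much larger scale:

```python
from collections import Counter
from datetime import date

# Hypothetical sample: (day, url) pairs for tweets already matched to one subject.
tweets = [
    (date(2015, 5, 27), "http://example.com/fifa-officials-arrested"),
    (date(2015, 5, 27), "http://example.com/fifa-officials-arrested"),
    (date(2015, 5, 27), "http://example.com/swiss-raid-zurich-hotel"),
    (date(2015, 5, 28), "http://example.com/fifa-officials-arrested"),
    (date(2015, 6, 2),  "http://example.com/blatter-resigns"),
]

def link_velocity(tweets):
    """Count *new* (first-seen) links per day -- the velocity signal."""
    seen = set()
    new_links_per_day = Counter()
    for day, url in sorted(tweets):
        if url not in seen:
            seen.add(url)
            new_links_per_day[day] += 1
    return dict(new_links_per_day)

velocity = link_velocity(tweets)
# date(2015, 5, 27) -> 2 new links; date(2015, 6, 2) -> 1
```

A spike in this per-day count is exactly the kind of anomaly the paragraph describes, but on its own it says nothing about *what* happened.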
Although this velocity signal can tell us when an unusual event has happened, it is difficult to understand what _actually_ happened, or whether a spike is just a follow-up report about a story we’d seen in the past. Fortunately, we have lexical information available – the words in the tweets and in the URL “slug”. The slug is the section of the URL that both identifies and describes a link, and it usually contains keywords selected to promote the link in search results (e.g. nytimes.com/politics/first-draft/2015/07/07/marco-rubio-attacks-higher-education-cartel-and-jabs-rivals). Using some NLP magic, we are also able to extract the title, text and description of articles. Using this lexical signal, we then represent an event as a collection of words that occur together often and around the same time. Ideally we want to identify events that are both new and significant, so we need to balance uniqueness with frequency.
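Pulling keywords out of a slug is straightforward; a small sketch using only the standard library (the stopword list here is a toy placeholder for whatever filtering the real pipeline applies):

```python
from urllib.parse import urlparse

STOPWORDS = {"and", "the", "a", "of", "in", "to"}  # illustrative, not exhaustive

def slug_keywords(url):
    """Extract keyword tokens from a URL's slug (its last path segment)."""
    path = urlparse(url).path
    slug = path.rstrip("/").split("/")[-1]
    tokens = slug.lower().split("-")
    # Drop stopwords and purely numeric tokens (dates, article ids).
    return [t for t in tokens if t and t not in STOPWORDS and not t.isdigit()]

kws = slug_keywords(
    "http://nytimes.com/politics/first-draft/2015/07/07/"
    "marco-rubio-attacks-higher-education-cartel-and-jabs-rivals"
)
# ['marco', 'rubio', 'attacks', 'higher', 'education', 'cartel', 'jabs', 'rivals']
```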
To model this problem, I represent the counts of word co-occurrence over time as a three-way tensor and then compute a PARAFAC decomposition. This approach is similar to Latent Semantic Analysis, but it also accounts for the temporal element. In LSA, we have a matrix of terms and documents from which we can determine latent factors. For this data, we have terms, documents, and time, but each document is associated with only a single timestamp, so simply adding temporal information would provide little additional benefit. Instead, we can think of one term as a “document”: each term is represented by a matrix of context vectors over time, and stacking these matrices gives a term × term × time tensor.
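The post does not include code, but the core operation can be sketched in a few dozen lines of numpy: a CP (PARAFAC) decomposition fit by alternating least squares. This is a minimal illustration, not the project's implementation – in practice one would reach for a dedicated library such as tensorly – and the toy tensor below stands in for the term × term × time co-occurrence counts:

```python
import numpy as np

def khatri_rao(u, v):
    """Column-wise Kronecker product: (I*J, R) from (I, R) and (J, R)."""
    return (u[:, None, :] * v[None, :, :]).reshape(-1, u.shape[1])

def unfold(x, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def parafac(x, rank, n_iter=100, seed=0):
    """CP/PARAFAC decomposition via alternating least squares.

    Returns one factor matrix per mode; for the co-occurrence tensor these
    would correspond to terms, context terms, and time.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in x.shape]
    for _ in range(n_iter):
        for n in range(3):
            others = [f for m, f in enumerate(factors) if m != n]
            kr = khatri_rao(*others)                       # (prod of other dims, R)
            gram = np.prod([f.T @ f for f in others], axis=0)
            factors[n] = unfold(x, n) @ kr @ np.linalg.pinv(gram)
    return factors

# Toy example: build an exact rank-2 tensor and recover its structure.
rng = np.random.default_rng(1)
a, b, c = (rng.random((d, 2)) for d in (8, 8, 10))
x = np.einsum("ir,jr,kr->ijk", a, b, c)
fa, fb, fc = parafac(x, rank=2)
x_hat = np.einsum("ir,jr,kr->ijk", fa, fb, fc)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)  # small after convergence
```

Each of the `rank` components couples a term pattern with a temporal profile, which is what makes the sub-topic tracking in the next section possible.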
The tensor decomposition produces a list of sub-topics around the news story and lets us track those sub-topics over time. In the figure below, we show sub-topics generated from analyzing all links about FIFA. The x-axis represents the days elapsed since we started tracking this subject (day 0 is July 1, 2015). Each colored line represents a sub-topic discovered by the method. The y-axis indicates the score attained by the sub-topic at a given point in time. Whenever a sub-topic peaks, it marks a point where an important event most likely occurred. It is evident that the majority of big stories about FIFA occur after May 27, 2015, when seven FIFA officials were arrested; this is where we find the sub-topics that reach the highest peaks.
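Since the temporal factor gives each sub-topic a score per day, flagging candidate event days reduces to peak-finding on that series. A simple local-maximum rule – illustrative only, not necessarily the post's exact criterion – looks like this:

```python
def peak_days(scores, threshold=0.5):
    """Indices where a sub-topic's daily score is a local maximum above threshold."""
    peaks = []
    for t in range(1, len(scores) - 1):
        if scores[t] > scores[t - 1] and scores[t] >= scores[t + 1] and scores[t] >= threshold:
            peaks.append(t)
    return peaks

# Toy daily scores for one sub-topic (one column of the temporal factor matrix).
daily_scores = [0.1, 0.2, 0.9, 0.4, 0.3, 0.7, 0.6, 0.1]
peaks = peak_days(daily_scores)  # [2, 5]
```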
For each sub-topic, the algorithm generates a representative set of words, which we can map back to links using their slugs. As a result, for each event we have a ranked list of links that best represent it. Applied to different datasets, this method can produce interesting results that allow for a retrospective analysis of subjects, events and topics.
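One simple way to do this mapping – a sketch under the assumption that each link has already been reduced to its slug keywords – is to rank links by how many of the event's representative words their slug shares:

```python
def rank_links(event_words, links):
    """Rank links by overlap between the event's representative words
    and each link's slug keywords; links with no overlap are dropped."""
    event_words = set(event_words)
    scored = [(len(event_words & set(kws)), url) for url, kws in links.items()]
    return [url for score, url in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]

# Hypothetical event words and slug keywords.
event_words = ["blatter", "resigns", "fifa"]
links = {
    "http://example.com/sepp-blatter-resigns-fifa": ["sepp", "blatter", "resigns", "fifa"],
    "http://example.com/fifa-sponsors-react": ["fifa", "sponsors", "react"],
    "http://example.com/world-cup-qualifiers": ["world", "cup", "qualifiers"],
}
ranked = rank_links(event_words, links)
# The resignation link (3 shared words) ranks above the sponsors link (1);
# the unrelated link drops out entirely.
```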
Here is an example of a visual timeline populated by this method’s output, including major events around coverage of Donald Trump’s campaign:
[click on the gif to play it]
Another cool thing about this algorithm is that it is highly customizable: you can specify a period of time and the top-K events within that period. This allows us to expand or narrow the number of items presented within any chosen period. For example, below we see the top-10 sub-events around Hillary Clinton’s campaign in July 2015:
[click on the gif to play it]
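Once events carry a date and a score, this windowed top-K selection is a small filter-and-sort; the event records below are invented purely for illustration:

```python
from datetime import date

# Hypothetical event records: (day, score, label) from the timeline output.
events = [
    (date(2015, 7, 4),  0.9, "july 4th statement"),
    (date(2015, 7, 13), 0.7, "fundraising totals"),
    (date(2015, 6, 16), 0.8, "campaign launch"),
    (date(2015, 7, 25), 0.6, "town hall"),
]

def top_k_events(events, start, end, k):
    """Top-k events by score within [start, end], presented chronologically."""
    in_window = [e for e in events if start <= e[0] <= end]
    top = sorted(in_window, key=lambda e: -e[1])[:k]
    return sorted(top)  # back to time order for the timeline

july_top2 = top_k_events(events, date(2015, 7, 1), date(2015, 7, 31), 2)
# Keeps the two highest-scoring July events, in chronological order.
```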
In a world where we are overwhelmed with a multitude of information, our attention is pulled every which way. It is natural to feel like we are always playing “catch-up” with the news. Often, it is hard to follow a current story without enough background knowledge. This algorithm’s output – a retrospective timeline of major events – could be an interesting way to understand the context of any current story. Perhaps this lowers the bar to entry and furthers our readers’ appreciation of current news stories.