How ‘Instapaper Daily’ remembers the most interesting stories

Instapaper facilitates time-shifted reading, enabling you to bookmark links that appear in your stream and ‘read later’ at your leisure. In reality, however, users can only save stories they have been exposed to in the first place, which requires continuous attention to the stream. Having to re-check streams every hour can be distracting. The challenge, then, is to automatically find quality content in the stream and present an interface for the user to browse at a later time. The problem is thus two-fold: (1) how do you algorithmically compile interesting and popular content, and (2) how do you present the temporal significance of that content to the user?

Our attempt to address the first problem was InstaRank, an algorithm that scores Instapaper’d articles based on several signals originating from user activity on the content (especially whether users read/like the article), its source and its semantic value. The second challenge, that of temporal significance, is much more interesting. We did not want to provide users with just another feed or stream where interest slips away with time. We wanted users to be able to take stock: to present reading content that is as absorbing and interesting in one month as in six months or a year. Additionally, we wanted users to have a concept of time associated with this content: when was it popular or relevant?

Instapaper Daily. We introduced Instapaper Daily last year. It is essentially a daily timeline that features the most popular article on Instapaper each day (and extends all the way back to the beginning of 2013). There also exist categorical timelines, allowing users to compare the most popular ‘entertainment’ vs. ‘sports’ articles on each past day. It is a great way to explore quality content that the user might have missed, adding a retrospective lens on what has been popular.

[Figure: Instapaper Daily timeline (instapaperDaily.png)]

Thus, on each day, Instapaper Daily features the single most popular article in each category, algorithmically chosen by our data-driven backend. Sometimes the daily slot is occupied by a prime-quality article published that day. On other occasions, the featured article concerns one big piece of news that happened that day. We have seven categorical timelines: ‘politics’, ‘arts & entertainment’, ‘business’, ‘computer & internet’, ‘science and tech’, ‘health’ and ‘sports’. All of this Instapaper Daily data is available as an RSS feed. Together it makes a fascinating dataset for exploring story topics and topic longevity along each timeline. The following analysis uses the last three months of timeline data.

Story Topics. Let me explain what we mean by ‘topic’ in reference to a story. Topics are extracted from each article based on its title and summary. In simple scenarios, topics are just named entities in the text: if ‘Amazon’ is in the title of a technology timeline article, we attribute it to topic: Amazon. However, if the title word space is too sparse, we employ a DBpedia-based categorization scheme, where identifier words (and named entities) are matched to certain topics. Thus, ‘Beyonce’ might map to topic ‘music’ whereas ‘The Hunger Games’ will map to topic ‘movie’. Sometimes identifier words occur in the URL itself, allowing us to recognize whether the topic of the article is ‘baseball’ or ‘football’. For certain timelines like ‘Computers & Internet’, we focus on very precise topics based on named entities like Google, whereas for other timelines like ‘Arts & Entertainment’, broader topics are preferred (‘music’, ‘movie’, ‘books’). The resolution of the topic category can be adjusted based on the facet of information we want to explore in the data. Extracting topics from the text requires several tools, including NLTK and NER tools for entity extraction and a DBpedia-based categorization that uses an ensemble of decision trees to pick one principal topic per article. When the categorizer fails to find a relevant topic for an article, it is classified as ‘other’.
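The fallback order described above can be illustrated with a small sketch. This is not our production pipeline: the lookup tables below are hypothetical stand-ins for the NER tools and the DBpedia-based categorizer, and the function name `assign_topic` is made up for illustration. It only shows the order of the checks (named entity, then identifier words, then URL, then ‘other’).

```python
import re

# Hypothetical lookup tables; the real system relies on NER tools and DBpedia.
ENTITY_TOPICS = {"amazon": "Amazon", "google": "Google"}
IDENTIFIER_TOPICS = {"beyonce": "music", "the hunger games": "movie"}
URL_TOPICS = ("baseball", "football")


def contains_phrase(phrase, text):
    """True if phrase appears in text as a whole word or phrase."""
    return re.search(rf"\b{re.escape(phrase)}\b", text) is not None


def assign_topic(title, summary="", url=""):
    """Pick one principal topic for an article, or 'other' if nothing matches."""
    text = f"{title} {summary}".lower()
    # 1. Simple scenario: a known named entity appears in the text.
    for entity, topic in ENTITY_TOPICS.items():
        if contains_phrase(entity, text):
            return topic
    # 2. Identifier words mapped to a broader topic.
    for phrase, topic in IDENTIFIER_TOPICS.items():
        if contains_phrase(phrase, text):
            return topic
    # 3. Identifier words that occur in the URL itself.
    for word in URL_TOPICS:
        if word in url.lower():
            return word
    # 4. Nothing relevant was found.
    return "other"
```

For example, `assign_topic("Amazon tests drone delivery")` yields the topic ‘Amazon’, while an article whose only hint is a URL path containing ‘baseball’ falls through to the URL check.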

Topical Composition. The distribution of topic occurrence in each timeline is intriguing. While we do not observe any surprising anomalies, the featured topics do reveal aggregate interest patterns of Instapaper users. For example, in the Computer and Internet category, Google and Apple stories dominate the timeline, followed by Facebook and stories involving NSA privacy. In the Science and Tech timeline, the majority of stories revolve around space exploration, the education system, genetics and physics. International news and economic issues dominate the Politics timeline, followed by stories about organizational systems (e.g., the judiciary, the revenue service), the NSA, Obama and gun control. The Business timeline is led by articles about investment and startups, followed by business culture and, somewhat surprisingly, the food industry. The Sports category is the most reactive to real-world events, which is why, although most stories are about American football and soccer, the Winter Olympics was the persistent topic during the Sochi games. Movie articles dominate the Arts/Entertainment timeline, followed by articles about TV shows, music, books and online media. Health is a fascinating timeline, where we found users are most interested in healthcare articles, followed by stories on drugs and antibiotics, eating habits, psychotherapy, abortion and smoking.

Life of a topic. What really fascinated me is how often the same topic surfaced on Instapaper Daily timelines. Although continuous growth and decay patterns (e.g., that of a Twitter meme) are not visible in discrete categorical timelines, we can still find patterns in how long and how often a topic appears on the timeline, reflecting the fact that it significantly captured user interest. I was curious to see if certain topics persist in contiguous slots of the timeline, meaning they held user attention for more than one day. The other interesting aspect is how many times a certain topic reappeared on the timeline with other topics in between, which would mean it ‘recurred’. The two metrics I discuss here are Persistence and Recurrence. The maximum number of contiguous timeline slots occupied by a topic is its Persistence. The number of times a topic reoccurs in the timeline after breaks is its Recurrence. We convert the timeline into a binary array for each topic, where the topic is represented by ‘1’ if it occurs on that day and by ‘0’ otherwise. Then, persistence and recurrence can be calculated as:

[Equation image: formulas for persistence and recurrence]
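As a minimal sketch of these definitions, the snippet below computes the three quantities from a topic’s binary timeline. Two points are assumptions on my part rather than something stated above: ‘occurrence’ is taken to be the total number of days the topic appears, and ‘recurrence’ is taken to be the number of runs after the first one (that is, the number of times the topic comes back after a break). The function name `timeline_metrics` is hypothetical.

```python
from itertools import groupby


def timeline_metrics(bits):
    """Return (occurrence, persistence, recurrence) for a 0/1 timeline."""
    # Length of every maximal run of consecutive 1s, in timeline order.
    runs = [sum(1 for _ in group) for value, group in groupby(bits) if value == 1]
    # Assumption: occurrence is the total number of days the topic appeared.
    occurrence = sum(runs)
    # Persistence: the longest contiguous stretch of days.
    persistence = max(runs, default=0)
    # Assumption: recurrence counts reappearances, i.e. runs after the first.
    recurrence = max(len(runs) - 1, 0)
    return occurrence, persistence, recurrence
```

For the timeline `[1, 1, 0, 1, 0, 0, 1, 1, 1]` the runs have lengths 2, 1 and 3, so the topic has occurrence 6, persistence 3 and, under the assumption above, recurrence 2.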

A common way to visualize multivariate data is parallel coordinates, which are also very useful for analyzing n-dimensional categorical time series data. The real advantage of parallel coordinates (over orthogonal coordinates) is that the number of dimensions that can be visualized is restricted only by the horizontal resolution of the screen. The two key factors in creating a parallel coordinate representation are the order and the scaling of the axes. We follow the order: occurrence – persistence – recurrence. Scaling is a natural aftereffect of interpolation among consecutive pairs of variables. Our data reveals that a relative scaling of 24:7:20 for the occurrence:persistence:recurrence axes fits the parallel coordinates best.

Visualized below are the occurrence, persistence and recurrence of topics on various categorical timelines of Instapaper Daily. In the table, ‘name’ represents the topic and ‘group’ identifies its categorical timeline.

Parallel Coordinate Representation of Persistence/Recurrence of Topics

[Link to full page viz]

Topic Patterns. Using the parallel coordinates, we found four common topic patterns in various categorical timelines of Instapaper Daily. They exhibit how certain topics sustain and possibly regain user attention in time. Topics around stories generally behave in four different ways:

  • A – The topic occurs rarely, and does not persist.
  • B – The topic occurs rarely, but it persists when it occurs (the concave topic)
  • C – The topic occurs often, but does not persist relative to its occurrence (the convex topic)
  • D – The topic occurs sometimes and persists relative to its occurrence

[Figure: concave/convex patterns of relative persistence]

Pattern B is the most infectious and sustains reader attention. These are topics that occur rarely but persist when they do, tracing a concave curve across the three axes. Examples include stories about Amazon drone delivery and the Sochi Olympics. Pattern D involves topics that occur periodically and persist, and users seem to enjoy reading about them during or right after that period. Examples include ‘football’, which gains interest periodically close to game days, or even international news (often a crisis) that captures user attention.

Pattern C involves story topics that receive repeated visibility. They usually have high occurrence but low persistence; thus they exhibit a convex curve on the parallel coordinates. These include articles about TV shows (like Sherlock), discussions about investments in the business category and several topics in ‘science’ (like space and physics). Finally, the most frequently observed pattern is A: topics that usually have low occurrence and whose persistence sits at approximately 1/3 of the occurrence, resulting in an almost horizontal line across the parallel coordinates. Examples include discussions about ‘phone carriers’ and ‘food’ in the business timeline, and stories about ‘energy solutions’ in the science timeline. In terms of the parallel coordinates, the more convex a topic’s curve, the lower its persistence relative to its occurrence. Conversely, the more concave a topic’s curve, the greater its persistence relative to its occurrence.

Final Remarks. 

  • In a world where our attention is dissected in various ways every single day, it’s fascinating to explore what can sustain user interest. Do persistent topics cover a global event or some outrageous vision of entrepreneurism? Of equal interest is what readers want to devour repeatedly, so that the topic keeps recurring on the featured timelines. Our analysis is grounded in the reading habits of Instapaper users. This is in contrast to page views and social shares, both of which have been used to measure attention (perhaps with low fidelity).
  • There’s a lot more exploration we could do on this data. Should we have included the top 100 InstaRank’d articles per day instead of the top 7 categorical links? Yes, though we would have to be wary of introducing categorical bias (more political articles than sports articles in the top 100); alternatively, it would require normalization. Next, how do we deal with different topic resolutions for different timelines? This constraint is fundamentally imposed by the implicit nature of data generation in different categories. A movie title might occur once in the ‘entertainment’ timeline, but ‘Google’ occurs 15 times in ‘computer/internet’. This compels you to use broader (more general) topics when the named entity space of some timeline is very sparse. It is also the reason we avoid comparing average persistence between the ‘computer/internet’ and ‘sports’ timelines, since their topic resolutions are somewhat different (‘Google’ vs. ‘Arsenal FC’ would be fairer). Finally, I would like to test how auto-regressive models perform on this kind of data.
  • I scrupulously strive to reduce the physical understanding of some phenomenon into a computational problem, because deciphering phenomena statistically does not necessarily entail our ability to automate them. Once we relate the phenomenon (and its measure) to some computational equivalent, we can delve into how, and whether, the process can be automated. Here, we used ‘Persistence’ to represent the continued attention on some topic on a categorical timeline. But what is ‘Persistence’, computationally? In Computer Science, the classic maximum subarray problem aims to find the largest sum of a contiguous run of numbers in a 1-D array. ‘Persistence’ is the sum of the maximum subarray in a timeline envisioned as a bit array, where ‘1’ indicates that the topic occurred on that day and ‘0’ otherwise, provided each ‘0’ is given a large negative weight so that a subarray cannot span a gap (with plain zeros, the maximum sum would simply equal the topic’s total occurrence). The solution was quite obvious then; I used dynamic programming, which finds the result in O(n).
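The maximum subarray framing can be sketched with Kadane’s algorithm, the standard O(n) dynamic program for that problem. This is an illustration under stated assumptions, not the exact code behind the analysis: the function name `persistence_max_subarray` is hypothetical, and the choice of weighting each ‘0’ day as -n (where n is the timeline length) is my own way of making sure that no subarray crossing a gap can beat a pure run of 1s.

```python
def persistence_max_subarray(bits):
    """Persistence of a 0/1 timeline via Kadane's maximum subarray algorithm."""
    n = len(bits)
    # Assumption: weight each 0 as -n so a subarray never profits from
    # bridging a gap; each 1 contributes +1.
    weights = [1 if b else -n for b in bits]
    best = current = 0  # the empty subarray (sum 0) is allowed
    for w in weights:
        # Best sum of a subarray ending at this day, or 0 if we restart.
        current = max(0, current + w)
        best = max(best, current)
    return best
```

On the timeline `[1, 1, 0, 1, 0, 0, 1, 1, 1]` the running sum resets at every ‘0’ and peaks at 3 on the final run, which matches the longest contiguous stretch of days.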
