It’s been an exciting, data-filled month. With the launch of Digg Reader, we’ve been ingesting many millions of RSS feeds of all shapes and sizes. As we continue crunching this dataset, we wanted to share some interesting insights from it. RSS remains an important information transmission protocol across the web, spanning multiple languages and topics. We see a significant number of feeds in Mandarin and in Japanese.
The mass adoption of RSS as a publishing/subscription protocol means that it is a great way to bypass the growing number of walled gardens created by social networks, where homophily is dominant. With feed subscriptions, by contrast, we see much more cross-domain bridging.
In this post we visualize two datasets of RSS feeds. The first is a co-subscription network graph of the most-subscribed feeds. The second includes all feeds that users categorized as ‘data’ related, giving us a nuanced way to untangle the topicality of published content.
The feed co-subscription graph includes the most-subscribed feeds (in this case the top 25,000) and organizes them by co-subscription: the closer two feeds are in the graph, the more their subscribers overlap. What emerges are clearly defined regions, or clusters, of feeds that are much more densely connected to each other than to the rest of the graph. The different colored regions tend to represent language differences, as highlighted below. This is not surprising, since a user who subscribes to a feed in Mandarin will most likely subscribe to other content in Mandarin. But the regions aren’t based only on language. For example, the dense region in yellow to the right of the graph includes technical feeds focused on web development. If a user subscribes to one of these technical feeds, they are most likely subscribed to others as well.
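To make "overlap in terms of subscribers" concrete, here is a minimal sketch of how such edge weights could be computed. It assumes Jaccard similarity as the overlap measure and a hypothetical mapping from feed name to a set of subscriber ids; the post does not say which measure or layout algorithm was actually used.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Fraction of the combined audience that subscribes to both feeds."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_subscription_edges(subscribers: dict, min_overlap: float = 0.0) -> list:
    """Return (feed_a, feed_b, weight) for each pair of feeds sharing subscribers."""
    edges = []
    for a, b in combinations(sorted(subscribers), 2):
        weight = jaccard(subscribers[a], subscribers[b])
        if weight > min_overlap:
            edges.append((a, b, weight))
    return edges


# Hypothetical toy data: feed name -> set of subscriber ids.
subscribers = {
    "xkcd": {"u1", "u2", "u3"},
    "daring_fireball": {"u2", "u3", "u4"},
    "solidot": {"u5"},
}
edges = co_subscription_edges(subscribers)
print(edges)
```

A force-directed layout fed with these weights would pull heavily co-subscribed feeds together, which is one plausible way to arrive at regions like those described above.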
Within the English content region, the popular feeds split into three main types:
1) The central cluster (in light green) represents popular blogs
- XKCD is pretty high up there
- John Gruber’s Daring Fireball
- Joel Spolsky’s blog (Joel on Software)
2) To the right, we see a dense yellow cluster highlighting various technical web development-focused blogs, such as John Resig, HTML5 Watch and MooTools. There’s no clear central node here, meaning that if a user subscribes to one of these tech-dev blogs, they most likely also subscribe to others.
3) And to the left, the light blue region includes RSS feeds from mainstream media (MSM), including Business Insider, The Atlantic Wire and The New Yorker.
A large number of heavily subscribed RSS feeds are not in English. The dark green and blue regions of the graph represent content in Mandarin; many of these feeds host technical content, but not all. William Long’s blog on Sina is one of the most heavily subscribed blogs in Mandarin. Others include the popular Solidot (a Slashdot copy) and various non-technology feeds, such as BBC China and Songshuhui.net, a website for science enthusiasts.
One topic we’re interested in exploring further is bridging. What content (or feed) tends to bridge between the Chinese blogosphere and the English one? Which online publishers help information flow across language barriers?
About 34% of the top feeds displayed in this visualization use WordPress to generate the feed. Other popular feed generators include Blogger and Tumblr. Unsurprisingly, within the Russian content cluster we see many more feeds generated by LiveJournal. The Arabic content feeds skew slightly towards Blogger, while the popular Korean feeds tend to use Tistory. Around 33% of the feeds use their own RSS generators; many of these belong to media outlets and other non-hosted publishers.
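One way to produce a breakdown like this is to read the optional `<generator>` element that RSS 2.0 feeds can carry in their `<channel>`. The sketch below is only an illustration under that assumption: the platform list, the sample feeds and the "other" bucket are hypothetical, and the post does not describe how the numbers above were actually derived.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Platforms named in the post; matching is a simple substring check (an assumption).
KNOWN_PLATFORMS = ("wordpress", "blogger", "tumblr", "livejournal", "tistory")


def feed_generator(rss_xml: str) -> str:
    """Classify an RSS 2.0 document by the text of its channel/generator element."""
    root = ET.fromstring(rss_xml)
    text = (root.findtext("channel/generator") or "").lower()
    for platform in KNOWN_PLATFORMS:
        if platform in text:
            return platform
    return "other"  # no generator element, or an unrecognized one


# Hypothetical sample feeds, trimmed to the elements that matter here.
SAMPLE_FEEDS = [
    "<rss version='2.0'><channel><title>A</title>"
    "<generator>https://wordpress.org/?v=3.5</generator></channel></rss>",
    "<rss version='2.0'><channel><title>B</title>"
    "<generator>Tumblr (3.0; @b)</generator></channel></rss>",
    "<rss version='2.0'><channel><title>C</title></channel></rss>",
]

counts = Counter(feed_generator(xml) for xml in SAMPLE_FEEDS)
print(counts)
```

Atom feeds keep their generator in a namespaced element, so a fuller version would need to handle both formats.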
Organizing Feeds by Labels
The ways in which users tag or label feeds give us the ability to better organize the content. By looking at the folder names under which feeds are placed, we can get a sense of which label or topic best fits a feed or group of feeds. Given our inclination towards data geekery, we decided to run a little experiment on the top feeds that were ever tagged with variants of the word ‘data’. This could be ‘dataviz’ or simply ‘data’. Either way, the fact that a user placed a feed within a folder referencing ‘data’ means that the source most likely publishes content that’s somewhat related to data.
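The selection step could be sketched roughly as follows. The `(user_id, folder_name, feed)` record shape and the case-insensitive substring match are assumptions made for illustration; note that a substring match on ‘data’ would also pick up folders such as ‘databases’.

```python
from collections import Counter


def data_tagged_feeds(folder_records) -> Counter:
    """Count, per feed, how many users filed it under a folder mentioning 'data'.

    folder_records: iterable of (user_id, folder_name, feed) tuples (hypothetical schema).
    """
    tagged = Counter()
    for _user_id, folder_name, feed in folder_records:
        if "data" in folder_name.lower():
            tagged[feed] += 1
    return tagged


# Hypothetical folder assignments.
records = [
    ("u1", "DataViz", "JunkCharts"),
    ("u2", "data", "JunkCharts"),
    ("u2", "data", "R-bloggers"),
    ("u3", "comics", "XKCD"),
    ("u4", "Databases", "MySQL Performance Blog"),
]
tagged = data_tagged_feeds(records)
print(tagged.most_common())
```

Keeping the per-feed count makes it easy to restrict the experiment to the top feeds, for example with `tagged.most_common(n)`.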
The next step is to create a network graph where nodes are feeds and connections reflect the number of labels two feeds share. The more labels a set of feeds share, the more likely they are part of the same logical group. The resulting graph is utterly fascinating.
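As a rough sketch of that construction, the snippet below weights each pair of feeds by the number of labels they have in common. The labels attached to each feed are made up for illustration, and the real pipeline may normalize or threshold these counts differently.

```python
from itertools import combinations


def shared_label_edges(feed_labels: dict) -> dict:
    """Map each feed pair (a, b) to the number of labels both feeds carry."""
    edges = {}
    for a, b in combinations(sorted(feed_labels), 2):
        shared = feed_labels[a] & feed_labels[b]
        if shared:  # only connect feeds that share at least one label
            edges[(a, b)] = len(shared)
    return edges


# Hypothetical labels collected from user folders.
feed_labels = {
    "Information Is Beautiful": {"data", "dataviz", "design"},
    "JunkCharts": {"dataviz", "design"},
    "R-bloggers": {"data", "stats"},
}
label_edges = shared_label_edges(feed_labels)
print(label_edges)
```

Feeds joined by heavier edges sit closer together once the graph is laid out, which is what lets topical groups become visible.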
We see five distinct regions emerging from our data graph. While the central group in blue boasts the largest feeds (the most followed of all), it is also the least interesting group. These are all very popular blogs, which many users would have classified into a variety of categories.
The largest group (in purple) represents feeds that are tech heavy. With numerous feeds covering database technologies such as MySQL, MongoDB and PostgreSQL, this group spans content in both English and Mandarin. This means that many users who subscribe to the ‘MySQL Performance Blog’ are likely to also read the ‘Alibaba DBA Team’ blog as well as other Mandarin content.
The second largest group of feeds (in red) represents content about data visualization. This includes David McCandless’s popular ‘Information Is Beautiful’ blog, as well as the always entertaining JunkCharts. The green region on top is all about data science. Here we have The Natural Language Processing Blog, R-bloggers, mathbabe and Hilary Mason’s blog. The tiny region at the bottom (orange) includes data journalism-related feeds, including Jonathan Stray, Hacks/Hackers and Data Miner UK.
Using this relatively simple method, we identified dominant feeds that have to do with data and also classified them into four very different categories. Based on this mapping, I’ve already updated my steadily growing Digg Reader ‘data’ folder. By generalizing this method, we’re hoping to provide similar insights and recommendations for users.