Digg Reader Visualized: what millions of RSS feeds look like

It’s been an exciting, data-filled month. With the launch of Digg Reader, we’ve been ingesting many millions of RSS feeds of all shapes and sizes. As we continue crunching this dataset, we wanted to share some interesting insights from it. RSS remains an important information transmission protocol across the web, spanning multiple languages and topics. We see a significant number of feeds in Mandarin and in Japanese.

The mass adoption of RSS as a publishing/subscription protocol means that it is a great way to bypass the growing number of walled gardens created by social networks, where homophily is dominant. With feed subscription, by contrast, we see much more cross-domain bridging.

In the following post we visualize two datasets representing RSS feeds. The first is a co-subscription network graph, including the top most subscribed feeds. The second includes all feeds that had been categorized by users as ‘data’ related, giving us a nuanced way to untangle topicality of published content.

Feed Co-Subscription

The feed co-subscription graph includes the most subscribed feeds (in this case the top 25,000) and organizes them by co-subscription: the closer two feeds are in the graph, the more their subscribers overlap. What emerges are clearly defined regions, or clusters, of feeds that are much more densely connected to one another than to the rest of the graph. The different colored regions tend to represent language differences, as highlighted below. This is not surprising, since a user who subscribes to a feed in Mandarin will most likely subscribe to other content in Mandarin. But the defined regions aren’t based only on language. For example, the dense region in yellow to the right of the graph includes technical feeds focused on web development. If a user subscribes to one of these technical feeds, they are most likely subscribed to others as well.
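To make the idea of co-subscription concrete, here is a minimal sketch of how the edge weights of such a graph could be counted. It is not the pipeline used for the visualization: the input shape (a mapping from user to the set of feeds they follow), the function name and the feed identifiers are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def co_subscription_weights(subscriptions):
    """Count shared subscribers for every pair of feeds.

    `subscriptions` maps a user id to the set of feeds that user follows
    (a hypothetical input format). The result maps a sorted feed pair to
    the number of users subscribed to both feeds.
    """
    weights = defaultdict(int)
    for feeds in subscriptions.values():
        # Every pair of feeds in one user's list gains one shared subscriber.
        for a, b in combinations(sorted(feeds), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Made-up users and feed identifiers, for illustration only.
subscriptions = {
    "u1": {"solidot", "williamlong", "bbc-china"},
    "u2": {"solidot", "williamlong"},
    "u3": {"jresig", "mootools"},
    "u4": {"jresig", "mootools", "solidot"},
}

weights = co_subscription_weights(subscriptions)
```

A force-directed layout would then pull heavily weighted pairs closer together, which is what produces the clusters described above. Counting pairs per user grows quadratically with the size of each user’s subscription list, so a real run over millions of users would likely need sampling or a cap on list size.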

[Figure: feed co-subscription graph]

Within the English content region, the popular feeds split into three main types:

1) The central cluster (in light green) represents popular blogs.

2) To the right, we see a dense yellow cluster highlighting various technical web development-focused blogs, such as John Resig, HTML5 Watch and MooTools. There’s no clear central node here, meaning that if a user subscribes to one of these tech-dev blogs, they most likely also subscribe to others.

3) And to the left, the light blue region includes RSS feeds from mainstream media (MSM) outlets, including Business Insider, The Atlantic Wire and The New Yorker.

Non-English Content

A large number of heavily subscribed RSS feeds are not in English. The dark green and blue regions of the graph represent content in Mandarin; many of these feeds host technical content, though not all. William Long’s blog on Sina is one of the most heavily subscribed blogs in Mandarin. Others include the popular Solidot (a Slashdot copy) and various non-technology feeds, such as BBC China and Songshuhui.net, a website for science enthusiasts.

One topic we’re interested in exploring further is bridging. What content (or feed) tends to bridge between the Chinese blogosphere and the English one? Which online publishers help information flow across language barriers?

RSS Generators

About 34% of the top feeds displayed in this visualization use WordPress to generate the feed. Other popular feed generators include Blogger and Tumblr. Unsurprisingly, within the Russian content cluster we see many more feeds generated by LiveJournal, the Arabic content feeds skew slightly towards Blogger, and the popular Korean feeds tend to use Tistory. Around 33% of the feeds use their own RSS generators; many of these belong to media outlets and other non-hosted publishers.
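One plausible way to arrive at numbers like these is to read the `generator` element that both RSS 2.0 and Atom feeds can carry. The sketch below assumes that approach; the post does not say how the generators were actually detected, and the helper names, the substring matching and the sample feeds are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ATOM_NS = "{http://www.w3.org/2005/Atom}"
# Platforms named in the post; anything else is counted as "other".
KNOWN = ("wordpress", "blogger", "tumblr", "livejournal", "tistory")


def feed_generator(xml_text):
    """Return the feed's generator text plus its uri attribute, or ''."""
    root = ET.fromstring(xml_text)
    node = root.find("channel/generator")        # RSS 2.0 location
    if node is None:
        node = root.find(ATOM_NS + "generator")  # Atom location
    if node is None:
        return ""
    return " ".join([(node.text or "").strip(), node.get("uri", "")]).strip()


def classify(generator):
    """Map a raw generator string to a platform name (assumed heuristic)."""
    lowered = generator.lower()
    for name in KNOWN:
        if name in lowered:
            return name
    return "other"


# Tiny hand-written sample feeds, not real data.
RSS_SAMPLE = (
    '<rss version="2.0"><channel><title>A</title>'
    "<generator>http://wordpress.org/?v=3.5.1</generator></channel></rss>"
)
ATOM_SAMPLE = (
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>B</title>'
    '<generator uri="http://www.blogger.com">Blogger</generator></feed>'
)
CUSTOM_SAMPLE = '<rss version="2.0"><channel><title>C</title></channel></rss>'

counts = Counter(
    classify(feed_generator(xml))
    for xml in (RSS_SAMPLE, ATOM_SAMPLE, CUSTOM_SAMPLE)
)
```

Feeds with no `generator` element fall into "other" here, which roughly corresponds to the self-hosted group mentioned above, although a missing element does not prove a custom generator.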

Organizing Feeds by Labels

The ways in which users tag or label feeds give us the ability to better organize the content. By looking at the folder names under which feeds are placed, we can get a sense for which label or topic best fits a feed or group of feeds. Given our inclination towards data geekery, we decided to run a little experiment on the top feeds that were ever tagged with variants of the word ‘data’. This could be ‘dataviz’ or simply ‘data’. But the fact that a user placed a feed within a folder referencing ‘data’ means that source most likely publishes content that’s somewhat related to data.

The next step is to create a network graph where nodes are still feeds, but connections now reflect the number of labels two feeds share. The more labels a set of feeds share, the more likely they are part of the same logical group. The resulting graph is utterly fascinating.
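As a rough sketch of that construction, the snippet below weights each pair of feeds by the number of folder labels they have in common. The input format (feed mapped to a set of already normalized labels), the `min_shared` threshold and the sample labels are assumptions for illustration, not a description of the production code.

```python
from itertools import combinations


def shared_label_edges(feed_labels, min_shared=1):
    """Build weighted edges between feeds from the labels they share.

    `feed_labels` maps a feed to the set of folder names users filed it
    under (hypothetical input). An edge is kept only if the two feeds
    share at least `min_shared` labels; its weight is the shared count.
    """
    edges = {}
    for a, b in combinations(sorted(feed_labels), 2):
        shared = len(feed_labels[a] & feed_labels[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


# Made-up labels for a few feeds mentioned in the post.
feed_labels = {
    "junkcharts": {"data", "dataviz", "charts"},
    "information-is-beautiful": {"data", "dataviz", "design"},
    "mysql-performance-blog": {"data", "database", "mysql"},
    "r-bloggers": {"data", "datascience", "stats"},
}

all_edges = shared_label_edges(feed_labels)
strong_edges = shared_label_edges(feed_labels, min_shared=2)
```

Because every feed in this dataset carries some variant of ‘data’, all pairs share at least one label; raising `min_shared` (or ignoring the seed label) is one way to let the finer groupings stand out.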

We see five distinct regions emerging from our data graph. While the central group in blue boasts the largest feeds (most followed compared to the rest), it is also the least interesting group. These are all very popular blogs, which many users will have filed under a wide variety of categories.

The largest group (in purple) represents feeds that are tech heavy. With numerous feeds covering database technologies such as MySQL, MongoDB and PostgreSQL, this group spans content in both English and Mandarin. This means that many users who subscribe to the ‘MySQL Performance Blog’ are likely to also read the ‘Alibaba DBA Team’ blog as well as other Mandarin content.

The second largest group of feeds (in red) represents content about data visualization. This includes the popular David McCandless ‘Information Is Beautiful’ blog, as well as the always entertaining JunkCharts. The green region on top is all about data science. Here we have The Natural Language Processing Blog, R-bloggers, mathbabe and Hilary Mason’s blog. The tiny region at the bottom (orange) includes data journalism-related feeds, including Jonathan Stray, Hacks/Hackers and Data Miner UK.

Using this relatively simple method, we not only identified the dominant feeds that have to do with data, but also classified them into four very different categories. Using this mapping, I’ve already updated my steadily growing Digg Reader ‘data’ folder. By generalizing this method, we’re hoping to provide similar insight and recommendations for users.

Digg Originals Data Piece: when your Twitter friend turns out to be the Boston bomber

We published this piece in collaboration with the Digg team as a first experiment in original content. The post studies the Twitter network of Dzhokhar Tsarnaev, the younger of the two Boston bombers, who had an active Twitter account, and highlights some of the unintended consequences that come with publicizing something or someone, making them significantly more visible, and thus exposing potentially revealing information.

“We live in a networked age, where you can be found not only through the data trails that you leave behind, but also based on the patterns around your network – who you’re connected to. A lot can be inferred about someone just by looking at their friends and how they’re interconnected on social networks. But what happens when one of your Twitter connections happens to be one of the alleged Boston Marathon bombers?

 

The more the media pointed to Dzhokhar’s friends online, the more they closed down, eventually turning private and shutting down their Twitter profiles completely. The more his high school friends were pointed to by the media, the more they were harassed for humanizing their friend, Dzhokhar. In a networked environment, the potential for visibility exists. But visibility comes with consequences.”

 

[Figure: Twitter network graph (jtsar_network)]

Can you click me now?

From the betaworks “Department of Insights That Make Sense, But Are Cool to See Play Out in the Real World” (better known as DITMSBACSPORW), we bring you a snippet of statistics about bit.ly clicks from mobile devices. Using two months of click history from one of SocialFlow’s customers, a user with a Twitter account with over 500,000 followers, we separated the clicks coming from mobile devices (such as iPhones and Android-powered phones) from all other clicks (such as from desktop computers). We could then see how mobile browsing varied over the course of each day as a percentage of total browsing.
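The calculation described here can be sketched as follows. This is a minimal illustration, assuming each click arrives as a timestamp (already in Eastern time) plus a precomputed mobile flag; how clicks were actually classified as mobile is not stated in the post, and the sample clicks are invented.

```python
from collections import defaultdict
from datetime import datetime


def mobile_share_by_hour(clicks):
    """Percentage of clicks from mobile devices per (weekday, hour) slot.

    `clicks` is an iterable of (timestamp, is_mobile) pairs, a hypothetical
    input format. Weekdays follow datetime.weekday(): Monday is 0.
    """
    totals = defaultdict(int)
    mobile = defaultdict(int)
    for timestamp, is_mobile in clicks:
        slot = (timestamp.weekday(), timestamp.hour)
        totals[slot] += 1
        if is_mobile:
            mobile[slot] += 1
    return {slot: 100.0 * mobile[slot] / totals[slot] for slot in totals}


# Invented clicks: five on a Saturday evening, four on a Monday afternoon.
clicks = [
    (datetime(2010, 7, 3, 20, 5), True),
    (datetime(2010, 7, 3, 20, 17), True),
    (datetime(2010, 7, 3, 20, 30), False),
    (datetime(2010, 7, 3, 20, 41), False),
    (datetime(2010, 7, 3, 20, 59), False),
    (datetime(2010, 7, 5, 13, 2), True),
    (datetime(2010, 7, 5, 13, 9), False),
    (datetime(2010, 7, 5, 13, 20), False),
    (datetime(2010, 7, 5, 13, 48), False),
]

shares = mobile_share_by_hour(clicks)
```

Averaging each slot over the two months of history would then give one curve per day of the week, which is the shape the highlights refer to.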

Some highlights:

➢ Mobile usage is really important – at the peak (at 8 PM on Saturdays), an average of 41% of all clicks came from mobile devices.
➢ Weekends generally have a higher percentage than weekdays. (It’s good to know you’re not just shifting from following Twitter on your office PC to following Twitter on your home PC. You’re using Twitter on your phone!)
➢ Friday night looks more like a weekend, and Sunday night looks more like a weeknight. (Your mother will be happy to know you’re staying home on Sunday nights.)
➢ People are less likely to click from their mobile devices in the middle of the day and in the wee hours of the morning than during the morning rush hour or at night.

While conventional wisdom might hold that the power of Twitter is based on the brevity of its 140-character messages and the convenience of receiving these messages on the go, it’s becoming clear that people on the move are also reading long-form content that appears (in link form) in their Twitter timelines.

SocialFlow’s algorithms help publishers select which of their potential Twitter and Facebook messages will be most interesting to their followers and fans, and it sends those selected messages at the best time. As web browsing on mobile devices increases, the right time to send a message will often be when people are on the go. That’s all the more reason to have a tool help you send messages – you’re too busy partying at night (and, um, compulsively checking your phone) to be at your desk sending tweets.


UPDATE: Editor’s note

We had a few questions regarding the timezone of the graph and whether we mapped clicks to the time in their respective zones.

1) All times are Eastern.
2) We did not normalize clicks to the times in their respective zones. Ultimately, when you send a tweet, it will go to all of your followers at the same time, regardless of what timezone they are in. We therefore felt it was appropriate to lump all the timezones into one.
3) As the client writes English-language articles, the vast majority of the client’s followers are clicking from US timezones.
4) The general trends, i.e. weekend higher than weekday, night higher than day, still hold even without accounting for different timezones.

Around the World [Cup] in 30 Days

Like millions of people around the world, we watched the World Cup matches unfold, while fans everywhere tweeted and clicked through to news, updates, and more.

We collected tweets from the Twitter API that used World Cup hashtags (such as #worldcup and #copamundial), extracted the bit.ly URLs, and plotted the click traffic on a world map. (Twitter also created its own visualization.)
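The filtering and extraction step might look something like the sketch below. It assumes the tweets are already available as plain text strings; the regular expression, the function name and the sample tweets (including their bit.ly paths) are made up for illustration, and the Twitter API calls themselves are left out.

```python
import re

# Matches bit.ly short links; the allowed path characters are an assumption.
BITLY_URL = re.compile(r"https?://bit\.ly/[A-Za-z0-9_-]+")
HASHTAGS = {"#worldcup", "#copamundial"}


def bitly_links(tweets, hashtags=HASHTAGS):
    """Collect bit.ly URLs from tweets that use one of the given hashtags."""
    links = []
    for text in tweets:
        # Lowercase each word and drop trailing punctuation before matching.
        words = {word.lower().rstrip(".,!?:;") for word in text.split()}
        if words & hashtags:
            links.extend(BITLY_URL.findall(text))
    return links


# Invented tweets with placeholder links.
tweets = [
    "Vamos! #CopaMundial http://bit.ly/abc123",
    "Great goal http://bit.ly/zzz999",
    "#worldcup, recap: http://bit.ly/Def-456 and http://example.com/x",
]

links = bitly_links(tweets)
```

Each extracted link can then be looked up in the click logs, with clicks grouped by country and hour to feed the map.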

Each frame of the video represents approximately one hour. The color ranges from white (low traffic) to dark blue (highest traffic), and is normalized by click count per country.
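The exact normalization is not spelled out, so the following is only one possible reading: each country’s hourly click counts are scaled by that country’s own busiest hour, and the result is mapped linearly from white to dark blue. The function names, the RGB value chosen for dark blue and the sample counts are all assumptions.

```python
WHITE = (255, 255, 255)
DARK_BLUE = (0, 0, 139)


def normalize_per_country(hourly_clicks):
    """Scale each country's hourly series to the range 0..1.

    `hourly_clicks` maps a country code to a list of click counts, one per
    hourly frame. Each series is divided by its own peak (assumed scheme).
    """
    scaled = {}
    for country, series in hourly_clicks.items():
        peak = max(series, default=0)
        scaled[country] = [count / peak if peak else 0.0 for count in series]
    return scaled


def traffic_color(level):
    """Interpolate linearly from white (0.0) to dark blue (1.0)."""
    level = min(max(level, 0.0), 1.0)
    return tuple(
        round(start + (end - start) * level)
        for start, end in zip(WHITE, DARK_BLUE)
    )


# Invented hourly counts for two countries.
hourly_clicks = {"NL": [10, 40, 5], "ES": [20, 50, 45]}
levels = normalize_per_country(hourly_clicks)
```

Under this reading a small country and a large one both reach dark blue at their own peak, which makes a sudden drop in one country (as in the Netherlands after the final) visible regardless of its absolute traffic.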

You can see the peaks during the US vs Ghana and Brazil vs Netherlands games and, most intriguingly, the immediate drop in traffic in the Netherlands after its loss in the final, while the attention in Spain persisted through the next morning.

This effect is clearly visible if we zoom in on Europe:

If you want to relive the final, dramatic game, we loved this recreation of the final moments in LEGO:

LeBron, Twitter Phenomenon

Some of the data and arguments included in this post were first incorporated into Peter Kafka’s July 8, 2010 post entitled “LeBron James and the Giant Twitter Link” (http://bit.ly/9Hh2Mh).

As was widely reported, LeBron James made his initial foray into the world of Twitter last week and quickly amassed close to 300,000 followers. While many reports and posts have analyzed the diffusion of LeBron’s original tweet and the speed with which he gained followers, the data we have here at betaworks gives us unique insight into the awesome distribution surrounding LeBron’s first tweets and the lessons it holds for the Twitter universe in general. LeBron’s tweet, despite a dearth of initial followers, serves as an extreme counterexample to the common misconception that “More followers equals more clicks.” As the LeBron example shows, and as we know from other betaworks data, relevance of message and follower engagement are far more important.

Click-Click-Click

With the simplest of messages…

Check http://bit.ly/apxlxx (lebronjames.com) for updated info on my decision.

…LeBron’s bit.ly link had one of the most popular first days ever, both in terms of the volume and velocity of clicks by individual users.


The velocity profile (though not the absolute level) for the first hour of the shortened link is fairly typical of popular links – they peak soon after being published, drop sharply over the next ten minutes, and then decline slowly but steadily. LeBron’s shortened link is unique because of its sheer magnitude coupled with the fact that his number of Twitter followers is relatively low (more on this point later). Furthermore, people’s interest just doesn’t seem to die; 24 hours later, the link still gets about 50 clicks a minute. Below are some statistics on the massive click-storm unleashed by that single tweet.

Total followers 24 hours after post: ~300,000
Total bit.ly clicks 24 hours after post: ~180,000
% of first day’s clicks within first hour: ~25% (compared to ~50% for a typical popular tweet)
Peak velocity: ~3,000 clicks per minute (reached almost immediately)
Time to 50% of peak velocity: ~5 minutes
Time to 25% of peak velocity: ~14 minutes
Time to 12.5% of peak velocity: ~39 minutes
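For readers who want to compute the same kind of statistics for their own links, here is a minimal sketch. It assumes the click history is available as a list of clicks per minute starting at publication; the function names and the sample series are invented and do not reproduce the LeBron data.

```python
def time_to_fraction_of_peak(clicks_per_minute, fraction):
    """Minutes after the peak until velocity first falls to `fraction` of it.

    Returns None if the series never drops that low.
    """
    peak = max(clicks_per_minute)
    peak_minute = clicks_per_minute.index(peak)
    for minute in range(peak_minute, len(clicks_per_minute)):
        if clicks_per_minute[minute] <= fraction * peak:
            return minute - peak_minute
    return None


def share_within(clicks_per_minute, minutes):
    """Fraction of all clicks in the series that arrived in the first `minutes`."""
    return sum(clicks_per_minute[:minutes]) / sum(clicks_per_minute)


# Made-up clicks-per-minute series, NOT the real LeBron data.
series = [3000, 2500, 2000, 1700, 1600, 1500, 1100, 900, 750, 600]

half = time_to_fraction_of_peak(series, 0.5)
quarter = time_to_fraction_of_peak(series, 0.25)
eighth = time_to_fraction_of_peak(series, 0.125)
early_share = share_within(series, 5)
```

With a full day of per-minute counts, `share_within(series, 60)` corresponds to the first-hour share in the list above, and the three decay times correspond to the 50%, 25% and 12.5% rows.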

LeBron’s Lessons

LeBron’s post brings into clear focus the importance of a tweet’s relevance to potential readers and the idea that it is better to have fewer dedicated followers than more uninterested ones.

With all the hoopla surrounding LeBron – Decision 2010, sports writers and fans had whipped themselves into a frenzy, jumping on any and every piece of LeBron-related news they could get their hands on. LeBron’s post, table scraps in terms of its informational content, was consumed ravenously by a large swath of people precisely because they were starving for anything at that moment. At another time, as popular as LeBron may be, it is unlikely that the same type of tweet with so little actual news would have generated the same kind of response.

The other lesson we drew is that it is a far better strategy to ensure that your readers actually care about your posts than to amass an ever-growing legion of followers who couldn’t care less. LeBron started with almost no followers, yet the response was crazy.

These patterns and lessons bolster the thesis behind one of betaworks’ companies, SocialFlow. SocialFlow provides a tool for Twitter publishers to ensure that their posts go out when their followers actually care about the content, with the proven ability to increase click-through rates and follower retention. In the SocialFlow data, the cliche seen in the response to LeBron’s post, of quality (relevance) over quantity, plays out again and again: one of their clients that focuses on cycling and has about 17,500 followers generates a higher number of clicks, not just a higher click-through rate, than do some Twitter accounts that gained over a million followers because of their placement on suggested lists. In the world of micro-blogging, relevance and quality beat quantity any day.

Though he didn’t sign on with our hometown Knicks, we still appreciate LeBron for the important insights he has highlighted as a Twitter Phenomenon.