A data-driven approach to verbalize weather forecasts at scale

Few datasets are as rich, complex, dynamic, near-chaotic and close to the real world as weather data. In an age where apps are intuitive and extremely user-friendly, consumers still receive weather forecasts in a traditional format, predominantly through numbers such as temperature and humidity. A new weather app called Poncho aims to disrupt this by transforming weather prediction numbers into personalized verbal messages. In this blog post we detail how we implemented a data-driven approach to scale the Poncho service across the US by detecting specific subspaces in geo-spatial weather patterns. Each subspace is matched with an editorial message that is strongly guided by how humans translate weather predictions into actionable insights.

Weather Data: Poncho has a data-intensive backend, which collects weather data from approximately 42,000 zip codes across the US. Each zip code can be thought of as a data point in vector form, with features such as temperature, humidity, wind speed, precipitation and a natural language summary (‘cloudy’, ‘rain’). Some features contain time series information, such as precipitation or temperature variation over the next 36 hours. Given these feature vectors, our goal is to find zip codes that possess striking similarity in weather patterns over the next x hours (where x = 12, 24 or 36, depending on the cycle of computation). During each cycle (see figure below, left), our algorithm is tasked with detecting all zip codes that bear substantially similar weather given the dynamic variability in weather conditions. In other words, we must cluster zip codes by weather similarity.

Weather Subspaces: Poncho uses a dynamic heuristic subspace clustering algorithm, which differs from traditional data clustering techniques such as k-means or spectral clustering. This choice is primarily motivated by two constraints: (1) the inherent nature of weather data, which is extremely dynamic, and (2) the required alignment of retrieved cluster centroids with the editorial message standards. What this means is that many of the data dimensions are irrelevant to editors/creative writers under different weather conditions. Instead of searching over the entire vector space of data, we search for patterns in specific subspaces within the data space. The choice of subspaces is strongly driven by editorial messaging rules. For example, a message for (windspeed > 12 mph + rain) vs. (windspeed > 12 mph + no rain) will be quite different (e.g., for the first condition the message might be: ‘you will find it difficult to hold on to your umbrella’). In fact, the priority of these conditions changes seasonally, e.g., you probably want to omit talking about humidity in the winter.

Thus, clusters are extracted not just based on the sensor data, but also on the heuristic rules. These rules determine which subspaces of weather to prioritize when clustering. Searching in specific subspaces enables extraction of cluster centroids which can be easily transformed into intuitive and actionable messages. Our algorithm is an extension of general subspace clustering, adjusted for weather data specifically. In the figure below (right), (B) shows the k-means clusters whereas (C) shows the result of dynamic heuristic subspace clustering. Notice that if the wind speed is greater than a certain threshold, we need to separate rainy vs. no-rain areas into different clusters.


The dynamic heuristic subspace clustering (C) detects weather clusters based on heuristic rules determined by Poncho editors. This creates more intuitive clusters than traditional clustering like k-means (B).
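As a rough illustration of the idea (not Poncho's actual algorithm), the sketch below groups zip codes by projecting each forecast onto only the features that a set of editorial rules selects. It is exact-match bucketing rather than a full subspace clustering method. The 12 mph wind threshold comes from the example above; the field names, the seasonal humidity rule, the humidity band width and the sample forecasts are all hypothetical.

```python
from collections import defaultdict

WIND_THRESHOLD_MPH = 12  # threshold taken from the editorial example above


def subspace_key(record, season):
    """Project one zip code's forecast onto the features editors care about.

    Hypothetical rules: wind and rain always matter; humidity is only
    considered outside of winter, bucketed into bands of 20 percentage points.
    """
    windy = record["wind_mph"] > WIND_THRESHOLD_MPH
    rainy = record["precip_in"] > 0
    key = [("windy", windy), ("rainy", rainy)]
    if season != "winter":
        key.append(("humidity_band", int(record["humidity_pct"] // 20)))
    return tuple(key)


def cluster_zip_codes(records, season):
    """Group zip codes whose forecasts agree on every rule-selected feature."""
    clusters = defaultdict(list)
    for zip_code, record in records.items():
        clusters[subspace_key(record, season)].append(zip_code)
    return dict(clusters)


# Made-up forecasts, for illustration only.
forecasts = {
    "10001": {"wind_mph": 15, "precip_in": 0.2, "humidity_pct": 80},
    "11201": {"wind_mph": 14, "precip_in": 0.1, "humidity_pct": 35},
    "14604": {"wind_mph": 16, "precip_in": 0.0, "humidity_pct": 40},
}

winter_clusters = cluster_zip_codes(forecasts, season="winter")
summer_clusters = cluster_zip_codes(forecasts, season="summer")
```

In winter the humidity dimension is ignored, so the two windy and rainy zip codes share one cluster and one message; in summer the same forecasts split further because humidity re-enters the subspace.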

 

Weather Cluster Patterns: Shown below are results of dynamic heuristic subspace clusters in New York state. Normally we would assume a cluster to contain geographically closer (neighboring) locations exclusively. However, this is not always true. Notice how Long Island and some parts of Rochester have quite similar weather patterns for this day. There are other factors at play which can cause distant locations to bear very similar weather conditions, such as distance from a water body, terrain etc.

New York Weather Clusters

 

In some states, we find weather cluster boundaries ordered by their distance from the ocean. A good example of this pattern is Massachusetts, where we often detect four clusters – ordered by distance from the coast. This feels intuitive. Weather conditions near the coast are strongly influenced by the ocean. As we go more inland, the clusters resemble clear demarcations in weather boundaries, typically based on wind speed and precipitation levels.

Massachusetts Weather Clusters

 

Delivering Weather Forecasts: Once clusters within a state are determined, they are presented in an admin panel (shown below). Each entry in the admin panel refers to one cluster. Each cluster contains a multitude of zip codes that share very similar weather patterns. The admin panel allows a single message to be broadcast to all the geo-locations within that cluster. The message is constructed based on the centroid of the particular cluster, in addition to the editorial voice that humanizes the forecast experience.

Delivering Weather Forecasts

 

One of the key things we aim to achieve with this form of data science is to connect algorithmic results to interpretable artifacts. A priority is making the cluster results as actionable as possible for editors/writers. This post takes a bird’s eye view of a critical pipeline: from the writers’ message composition factors, to translating them into mathematical heuristics and partitioning the data using dynamic heuristic subspace clustering, to feeding the cluster results back to the admin panel so messages can be broadcast to users in specific zip codes.

Related Topics: Natural Language Generation, Bipartite Matching, Subspace Clustering

No #MarchMadness Hate on Twitter?

Last week the Digg editors came to us with an idea. Wouldn’t it be great if we could do a joint data post on March Madness, but instead of the usual approach of looking at the accounts with the most likes or retweets, take a very different one? Their idea was to find the most disliked team on social media, and map that out.

We began by grabbing Twitter data using the public APIs, looking for tweets that have certain keywords, hashtags and Twitter handles connected to March Madness events and games. Our dataset grew rapidly to hundreds of thousands of tweets from different games, referencing teams, players and many other random facts. But surprisingly, even after running a number of sentiment analysis libraries and services over it, we found very little hate. Even polling the data for certain common words or phrases, in order to put together a training set for our own classifier, yielded poor results. We just couldn’t find sufficient examples in the data.

Weirdly enough, we came across numerous users who tweeted about their “love/hate” relationship with March Madness. But there was not much anger or hate, and very little directed at specific teams. Perhaps this is a byproduct of being part of a networked public: one’s actions are not only seen by peers but visible to the network at large, so one might be more cautious before displaying strong sentiment.

Constructing their Fan-base

Instead we chose to focus on each team’s networked audience (“fan-base”). If we can identify users rooting for each team, we’d be able to pull out some potentially interesting facts about each cluster of university supporters and run some broader comparisons. For the final four teams, we chose a number of popularly followed Twitter handles used to represent each team. Then we collected all users who retweeted those accounts into sets, making the *very crude* assumption that a retweet mostly represents an endorsement. While certainly not true in political discourse, especially when trying to make content visible to one’s audience for critique, this was rare in our case. And even if it did happen, the event would pretty much be an outlier in our data (users not part of the connected component).

Finally, for every user in one of the four sets, we grabbed information about the Twitter accounts that they follow. The Twitter graph represents both friendships and interests. By using network analysis techniques, we were able to run comparisons across the different teams.
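The two steps above (collect retweeters per team, then look at who they follow) can be sketched with plain sets. This is a minimal illustration under stated assumptions, not the code used for the post: the function names, handles and sample records are hypothetical, and a retweet is treated as an endorsement, as in the crude assumption described earlier.

```python
def build_fan_sets(retweets, team_handles):
    """Map each team to the set of users who retweeted any of its handles.

    retweets: iterable of (user, retweeted_handle) pairs.
    team_handles: dict of team name -> set of handles representing the team.
    """
    handle_to_team = {
        handle: team
        for team, handles in team_handles.items()
        for handle in handles
    }
    fans = {team: set() for team in team_handles}
    for user, handle in retweets:
        team = handle_to_team.get(handle)
        if team is not None:  # ignore retweets of unrelated accounts
            fans[team].add(user)
    return fans


def share_following(fan_set, follows, account):
    """Fraction of a fan set that follows the given account."""
    if not fan_set:
        return 0.0
    followers = sum(1 for user in fan_set if account in follows.get(user, set()))
    return followers / len(fan_set)


# Hypothetical handles and users, for illustration only.
team_handles = {"Team A": {"team_a_hoops"}, "Team B": {"team_b_hoops"}}
retweets = [
    ("u1", "team_a_hoops"),
    ("u2", "team_a_hoops"),
    ("u1", "team_a_hoops"),  # repeat retweets do not double-count a user
    ("u3", "team_b_hoops"),
    ("u4", "unrelated_account"),
]
follows = {"u1": {"some_athlete"}, "u2": {"some_news"}, "u3": {"some_athlete"}}

fans = build_fan_sets(retweets, team_handles)
team_a_share = share_following(fans["Team A"], follows, "some_athlete")
```

A statistic such as "50.25% of Wisconsin fans follow Aaron Rodgers" is this kind of ratio, computed over a much larger fan set.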

For the Digg post, the editors pulled out the following data points: devices used, user bios, and mainstream media preference (Fox News, CNN, MSNBC) reflected by who they follow. Some excerpts:

“Florida is uh, well, the only place where Fox News is popular. And it’s not even that popular. In fact, Florida is the only school where Fox News even shows up in the data. Obviously, this is some sort of conspiracy related to Benghazi.”


“Aaron Rodgers is more popular than sex. Well, actually we don’t know how popular sex is, but 50.25% of Wisconsin fans follow Aaron Rodgers and he didn’t even go to Wisconsin.”



Analyzing the Networks

At the end of this process we were left with some very interesting graphs representing the different fan-bases. For example, if we take all the Kentucky Wildcats fans, and insert each Twitter user as a node, and connect them by who they follow on Twitter, a fairly dense cluster emerges. The larger a node, the higher its degree in the network – the more connected it is to other nodes. The average degree for this graph is 11.731. This means that on average, every user is following or is followed by a total of ~12 other users from this same network. The higher the number, the more densely connected the community. BTW – the median degree for this graph is 5, still quite high.

Kentucky MBB Fans
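For readers who want to reproduce the degree figures, here is a minimal sketch that counts each follow relationship toward the total degree of both endpoints, matching the "following or is followed by" reading above. The node names and edges are made up; only the standard library is used.

```python
from collections import Counter
from statistics import mean, median


def degree_stats(nodes, follow_edges):
    """Return (average, median) total degree for a follow graph.

    follow_edges: iterable of (follower, followed) pairs, both assumed to be
    members of `nodes`. Each edge adds one to the degree of both endpoints.
    """
    degree = Counter({node: 0 for node in nodes})
    for follower, followed in follow_edges:
        degree[follower] += 1
        degree[followed] += 1
    values = [degree[node] for node in nodes]
    return mean(values), median(values)


# Toy network: d is an isolated user with no connections inside the fan-base.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "a"), ("c", "a")]

avg_degree, med_degree = degree_stats(nodes, edges)
```

A few highly connected accounts pull the average up more than the median, which is why the post reports an average of 11.731 next to a median of 5.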

Wisconsin’s fan-base looks quite different. Generally the number of retweets for Wisconsin team accounts is significantly higher compared to the other three teams. The Wisconsin user set is almost three times as large as Kentucky’s. While the graph is denser, there are fewer central nodes, and many more tiny specks representing less connected users. Still, the average degree for nodes in this graph is 16.89, and the median is 8, both higher than in the Kentucky fan graph. So even though the Wisconsin fan-base is much larger in our sample, it is also much more inter-connected.


If we double down on this WI graph, and run a community detection algorithm, we can start to identify regional friend networks and interest groups, outlined by similarly colored areas. For each of these clusters, we can see trends: users in the bottom purple region tend to have “Green Bay, WI” in their bios, while those in the red region have either La Crosse or Galesville, WI. Many of them are high school or college students, on various sports teams, connected to their friends in the network. I was surprised to see many of them using Vine. I’m curious whether that’s a general trend across student populations, or simply a regional one.


This is a pretty standard analysis, and it only scratches the surface of the audience insight that can be reached. We’ll be highlighting much more of this type of work here, and getting deeper into methods and techniques.

How ‘Instapaper Daily’ remembers the most interesting stories

Instapaper facilitates time-shifted reading, enabling you to bookmark links which appear on your stream and ‘read later’ at your leisure. However, the reality is that users can only save stories that they have been exposed to in the first place, which requires continuous attention to the stream. Having to re-check streams every hour can be distracting. The challenge then, is to automatically find quality content from the stream, and present an interface for the user to browse through at a later time. The problem is thus two-fold: (1) how do you algorithmically compile interesting and popular content and (2) how do you present the temporal significance of the content to the user.

Our attempt to alleviate the first problem was to create InstaRank, an algorithm that scores Instapaper’d articles based on several signals originating from user activity on the content (especially whether they read/like the article), its source and semantic value. The second challenge, that of temporal significance, is much more interesting. Of course, we did not want to provide users with just another feed or stream where interest slips away with time. We wanted users to be able to take stock, and to present reading content that is as absorbing and interesting in one month as in six months or a year. Additionally, we wanted users to have a concept of time associated with this content: when was it popular or relevant?

Instapaper Daily. We introduced Instapaper Daily last year. It is essentially a diurnal timeline that features the most popular article on Instapaper each day (and extends all the way to the beginning of 2013). There are also categorical timelines, allowing users to compare the most popular ‘entertainment’ vs. ‘sports’ articles on each past day. It is a great way to explore quality content that the user might have missed out on, adding a retrospective lens on what had been popular.


Thus, on each day, Instapaper Daily features the single most popular article in each category, algorithmically chosen by our data-driven backend. Sometimes, the daily slot is occupied by a prime quality article published on the day. On other occasions, the featured article concerns one big piece of news that happened on the day. We have 7 categorical timelines: ‘politics’, ‘arts & entertainment’, ‘business’, ‘computer & internet’, ‘science and tech’, ‘health’ and ‘sports’. All of this Instapaper Daily data is available as an RSS feed. It adds up to a fascinating dataset for exploring story topics and topic longevity along each timeline. The following analysis uses the last three months of timeline data.

Story Topics. Let me explain what we mean by ‘topic’ in reference to a story. Topics are extracted from each article based on its title and summary. In simple scenarios topics are just named entities in the text: if ‘Amazon’ is in the title of a technology timeline article, we attribute it to topic: Amazon. However, if the title word space is too sparse, we employ a DBpedia-based categorization scheme, where identifier words (and named entities) are matched to certain topics. Thus, ‘Beyonce’ might map to topic ‘music’ whereas ‘The Hunger Games’ will map to topic ‘movie’. Sometimes, identifier words will occur in the URL itself, allowing us to recognize if the topic of the article is ‘baseball’ or ‘football’. For certain timelines like ‘Computers & Internet’, we focus on very precise topics based on named entities like Google, whereas for other timelines like ‘Arts & Entertainment’, broader topics are preferred (‘music’, ‘movie’, ‘books’). The resolution of the topic category can be adjusted based on the facet of information we want to explore in the data. Extracting topics from the text requires several tools, including NLTK and NER tools for entity extraction and a DBpedia-based categorization that uses Ensemble Decision Trees to pick one principal topic per article. When the categorizer fails to find a relevant topic for an article, it is classified as ‘other’.
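A toy sketch of the fallback order described above: try a precise named entity in the title, then a broader entity-to-topic lookup, then keywords in the URL, and finally ‘other’. The lookup tables and the `pick_topic` function are hypothetical stand-ins; simple substring matching replaces the NER tooling and the DBpedia-based categorizer, which are not reproduced here.

```python
# Hypothetical lookup tables standing in for NER output and the
# DBpedia-based categorizer; the entries echo the examples in the text.
KNOWN_ENTITIES = {"amazon": "Amazon", "google": "Google"}
ENTITY_TO_TOPIC = {"beyonce": "music", "the hunger games": "movie"}
URL_KEYWORDS = ("baseball", "football")


def pick_topic(title, url, precise=True):
    """Pick one principal topic for an article, or 'other' if nothing matches.

    precise=True mimics timelines that prefer named entities as topics;
    precise=False skips straight to the broader topic lookup.
    """
    text = title.lower()
    if precise:
        for needle, topic in KNOWN_ENTITIES.items():
            if needle in text:
                return topic
    for needle, topic in ENTITY_TO_TOPIC.items():
        if needle in text:
            return topic
    lowered_url = url.lower()
    for keyword in URL_KEYWORDS:
        if keyword in lowered_url:
            return keyword
    return "other"


tech_topic = pick_topic("Amazon tests drone delivery", "http://example.com/tech")
music_topic = pick_topic("Beyonce drops a surprise album", "http://example.com/a")
sport_topic = pick_topic("Opening day preview", "http://example.com/baseball/preview")
unknown_topic = pick_topic("Untitled", "http://example.com/x")
```

The `precise` flag is one way to model the adjustable topic resolution: a timeline with a sparse named-entity space would call it with `precise=False` to get broader topics.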

Topical Composition. The distribution of topic occurrence in each timeline is intriguing. While we do not observe any surprising anomalies, the featured topics do reveal aggregate interest patterns of Instapaper users. For example, in the Computers & Internet category, Google and Apple stories dominate the timeline, followed by Facebook and stories involving NSA privacy. In the Science and Tech timeline, the majority of stories revolve around space exploration, the education system, genetics and physics. International news and economic issues dominate the Politics timeline, followed by stories about organizational systems (e.g., the judiciary, the revenue service), the NSA, Obama and gun control. The ‘Business’ timeline is led by articles about investment and startup business, followed by business culture and, somewhat surprisingly, the food industry. The Sports category is most reactive to real-world events, which is why, although most stories are about American football and soccer, the Winter Olympics was the persistent topic during the Sochi games. Movie articles predominate in the Arts/Entertainment timeline, followed by articles about TV shows, music, books and online media. Health is a fascinating timeline, where we found users are most interested in healthcare articles, followed by stories on drugs and antibiotics, eating habits, psychotherapy, abortion and smoking issues.

Life of a topic. What really fascinated me is how often the same topic surfaced on Instapaper Daily timelines. Although continuous growth and decay patterns (e.g., that of a Twitter meme) are not visible in discrete categorical timelines, we can still find patterns in how long and how often a topic appears on the timeline, reflecting the fact that it significantly captured user interest. I was curious to see if certain topics persist in contiguous slots of the timeline, meaning they held user attention for more than one day. The other interesting aspect is how many times a certain topic appeared on the timeline with other topics in between, which would mean it ‘reoccurred’. The two metrics I discuss here are Persistence and Recurrence. The maximum number of contiguous timeline slots occupied by a topic is its Persistence. The number of times a topic reoccurs in the timeline after breaks is its Recurrence. We convert the timeline into a binary array for each topic, where the topic is represented by ‘1’ if it occurs on that day and by ‘0’ otherwise. Then, persistence and recurrence can be calculated as:

[Image: formulas for Persistence and Recurrence]
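Since the formulas are only available as an image, here is a minimal sketch that follows the verbal definitions above. Persistence is the longest run of consecutive 1s. For recurrence, the sketch assumes "reoccurs after breaks" means the number of separate runs minus one (the first appearance is not a reoccurrence); that reading is an interpretation, and the sample timeline is made up.

```python
from itertools import groupby


def occurrence(timeline):
    """Total number of days the topic appeared."""
    return sum(timeline)


def persistence(timeline):
    """Length of the longest run of consecutive 1s (0 if the topic never appears)."""
    return max(
        (len(list(run)) for bit, run in groupby(timeline) if bit == 1),
        default=0,
    )


def recurrence(timeline):
    """Number of times the topic comes back after a gap.

    Assumption: counted as (number of separate runs of 1s) - 1.
    """
    runs = sum(1 for bit, _ in groupby(timeline) if bit == 1)
    return max(runs - 1, 0)


# Made-up nine-day timeline for one topic.
timeline = [1, 1, 0, 0, 1, 1, 1, 0, 1]

stats = (occurrence(timeline), persistence(timeline), recurrence(timeline))
```

For this timeline the topic appears on six days, its longest streak is three days, and it comes back twice after a break.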

A common way to visualize multivariate data is parallel coordinates, which is also very useful in analyzing n-dimensional categorical time series data. The real advantage of parallel coordinates (over orthogonal coordinates) is that the number of dimensions that can be visualized is restricted only by the horizontal resolution of the screen. The two key factors in creating a parallel coordinate representation are the order and the scaling of the axes. We follow the order: occurrence – persistence – recurrence. Scaling is a natural aftereffect of interpolation among consecutive pairs of variables. Our data reveals that a relative scaling of 24:7:20 for the occurrence:persistence:recurrence axes fits the parallel coordinates best.

Visualized below are the occurrence, persistence and recurrence of topics on various categorical timelines of Instapaper Daily. In the table, ‘name’ represents the topic and ‘group’ identifies its categorical timeline.

Parallel Coordinate Representation of Persistence/Recurrence of Topics


Topic Patterns. Using the parallel coordinates, we found four common topic patterns in various categorical timelines of Instapaper Daily. They exhibit how certain topics sustain and possibly regain user attention in time. Topics around stories generally behave in four different ways:

  • A – The topic occurs rarely, and does not persist.
  • B – The topic occurs rarely, but it persists when it occurs (the concave topic)
  • C – The topic occurs often, but does not persist relative to its occurrence (the convex topic)
  • D – The topic occurs sometimes and persists relative to its occurrence

  concave/convex patterns of relative persistence


Pattern B is the most infectious and sustains reader attention. These are topics that usually occur rarely but persist, bearing a concave curve on the three axes. Examples include stories about Amazon drone delivery and the Sochi Olympics. Pattern D involves topics that occur periodically but persist, and users seem to enjoy reading about them during or right after that period. Examples include ‘football’, which gains interest periodically close to game days, or even international news (often a crisis) that captures user attention.

Pattern C involves story topics that receive repeated visibility. They usually have high occurrence but low persistence; thus they exhibit a convex curve on the parallel coordinates. These include articles about TV shows (like Sherlock), discussions about investments in the business category and several topics in ‘science’ (like space and physics). Finally, the most frequently observed pattern is A: topics that usually have low occurrence and where persistence sits at approximately 1/3 of the occurrence, resulting in an almost horizontal line across the parallel coordinates. Examples include discussion about ‘phone carriers’ and ‘food’ in the business timeline, and stories about ‘energy solutions’ in the science timeline. In terms of the parallel coordinates, the more convex a topic’s curve, the lower its persistence relative to its occurrence. Conversely, the more concave a topic’s curve, the greater its persistence relative to the number of occurrences.

Final Remarks. 

  • In a world where our attention is dissected in various ways every single day, it’s fascinating to explore what can sustain user interest. Do persistent topics cover a global event or some outrageous vision of entrepreneurism? Of equal curiosity is what readers want to devour repeatedly, so that the topic keeps recurring on the featured timelines. Our analysis is grounded in reading habits of Instapaper users. This is in contrast to page-views and social shares, both of which have been used to measure attention (perhaps with low fidelity).
  • There’s a bunch of exploration we could further do on this data. Should we have included the top 100 InstaRank’d articles per day instead of the top 7 categorical links? Yes, though we have to be wary about the possible introduction of categorical bias (more political articles than sports in top 100); alternately it would require normalization. Next, how to deal with the scenario of different topic resolutions for different timelines? This constraint is fundamentally imposed by the implicit nature of data generation in different categories. A movie title might occur once in the ‘entertainment’ timeline, but ‘Google’ occurs 15 times in ‘computer/internet’. Therefore, it compels you to have broader (more general) topics when the named entity space of some timeline is very sparse. This is also the reason why we avoid comparing average persistence between ‘computer/internet’ and ‘sports’ timelines, since topic resolutions are somewhat different (‘Google’ vs. ‘Arsenal FC’ would be fairer). Finally, I would like to test how auto-regressive models perform on this kind of data.
  • I scrupulously strive to reduce the physical understanding of some phenomenon into a computational problem, because deciphering phenomena statistically does not necessarily entail our ability to automate them. Once we relate the phenomenon (and its measure) to some computational equivalent, we can delve into how, and if at all, the process can be automated. Here, we used ‘Persistence’ to represent the continued attention on some topic on a categorical timeline. But what is ‘Persistence’, computationally? In Computer Science, the classic maximum subarray problem aims to find the largest sum of a contiguous subset of numbers in a 1-D array. ‘Persistence’ is the sum of the maximum subarray in a timeline envisioned as a bit array, where ‘1’ indicates that the topic occurred on that day and ‘0’ otherwise (with each ‘0’ weighted as a large negative value so that the subarray cannot bridge a gap). The solution was quite obvious then; I used dynamic programming, which finds the result in O(n).
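The maximum subarray framing only yields the longest run if a ‘0’ is costly enough that the best subarray never spans a gap; with raw 0/1 values the maximal sum would simply equal the total number of occurrences. Below is a minimal sketch using Kadane's algorithm (the standard O(n) dynamic programming solution), assuming each ‘0’ is scored as −n, where n is the timeline length. The penalty value and function names are assumptions of this sketch, not taken from the original analysis.

```python
def max_subarray_sum(values):
    """Kadane's algorithm: largest sum of a non-empty contiguous slice, in O(n)."""
    best = current = values[0]
    for value in values[1:]:
        # Either extend the running slice or start a new one at this element.
        current = max(value, current + value)
        best = max(best, current)
    return best


def persistence_via_max_subarray(timeline):
    """Longest run of 1s, expressed as a maximum subarray problem.

    Assumption: each 0 is replaced by -n so that bridging a gap always
    costs more than any run of 1s could gain.
    """
    if not timeline:
        return 0
    penalty = -len(timeline)
    scored = [1 if bit else penalty for bit in timeline]
    # A timeline with no 1s gives a negative maximum; clamp it to 0.
    return max(max_subarray_sum(scored), 0)


# Made-up seven-day timeline: runs of length 2 and 3 separated by a gap.
streak = persistence_via_max_subarray([1, 1, 0, 1, 1, 1, 0])
```

Because no run can be longer than n − 1 once a gap exists, a penalty of −n is always large enough; any sufficiently negative weight would work equally well.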

A Data-Driven Take on Flappy Bird’s Meteoric Success

Much has been written about the meteoric rise and abrupt demise of Flappy Bird, the highly addictive mobile game that seems to have captured the world over the past couple of weeks. Dong Nguyen, an independent game developer in Vietnam, who launched the game obscurely last May, decided to take it down from all app stores after achieving heights previously touched only by major franchises like Candy Crush and Angry Birds, ending the frenzy around the frustratingly difficult game, while adding to the already heightened media spectacle.

What is it about Flappy Bird that made it so successful, and why did it take so long for the game to go “viral”? The app stores are littered with thousands of free casual games that use similarly addictive gameplay. What can we learn about the rise in uptake of this game specifically? And can we perhaps identify a tipping point where engagement around the game crossed a certain threshold, gaining momentum that was impossible to stop?

These were some of the questions that I set out to answer earlier this week. At betaworks, we have a unique longitudinal view of mainstream media and social media streams. Our services, at varied scales, span content from publishers and social networks, giving us the ability to analyze the attention given to events over time. Inspired by Zach Will’s analysis of the Flappy Bird phenomenon through scraped iTunes review data, I wanted to see what else we could learn about the massive adoption of this game, specifically through the lens of digital and social media streams.

My data shows two clear tipping points, where there was a significant rise in user adoption of the game. The first, January 22nd, happened when the phrase ‘Flappy Bird’ started trending on Twitter across all major cities in the United States. It would continue trending for the next 6 days, driving increased visibility and further adoption. The second, on February 2nd, was the point at which daily media coverage of the game quadrupled. Most media outlets were clearly late to the game, covering the phenomenon only after it had topped all app store charts and was already a massive success.

We live in a networked world, where social streams drive massive attention spikes flocking from one piece of content to another. While it is a chaotic system, difficult to fully predict or control, there are early warning signals we can heed, as events unfold.


This plot shows the aggregate number of unique trending topic locations (in red), versus unique media sources covering the game on a daily basis (in purple), versus the number of unique users sharing links related to the game in social streams (in green).


Digg Reader Visualized: what millions of RSS feeds look like

It’s been an exciting data-filled month. With the launch of Digg Reader, we’ve been ingesting many millions of RSS feeds of all shapes and sizes. As we continue crunching this dataset, we wanted to share some interesting insights around it. RSS remains an important information transmission protocol across the web, spanning multiple languages and topics. We see a significant number of feeds in Mandarin and in Japanese.

The mass adoption of RSS as a publishing/subscription protocol means that it is a great way to bypass the growing number of walled gardens created by social networks, where homophily is dominant. With feed subscription, by contrast, we see much more cross-domain bridging happening.

In the following post we visualize two datasets representing RSS feeds. The first is a co-subscription network graph, including the top most subscribed feeds. The second includes all feeds that had been categorized by users as ‘data’ related, giving us a nuanced way to untangle topicality of published content.

Feed Co-Subscription

The feed co-subscription graph includes the top most subscribed feeds (in this case 25,000) and organizes them by co-subscription: the closer two feeds are in the graph, the more overlap they have in terms of subscribers. What emerges are clearly defined regions, or clusters, which outline feeds that are much more inter-connected amongst each other. The different colored regions tend to represent language differences, as highlighted below. This is not surprising, since a user who subscribes to a feed in Mandarin, will most likely subscribe to other content in Mandarin. But the defined regions aren’t only based on language differences. For example, the dense region in yellow to the right of the graph includes technical feeds focused on web development. If a user subscribes to one of these technical feeds, they most likely are also subscribed to others.

feed co-subscription
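One simple way to turn "overlap in terms of subscribers" into an edge weight is the Jaccard similarity of two feeds' subscriber sets; a layout algorithm can then pull strongly overlapping feeds together. The sketch below is an illustration under that assumption. The post does not say which overlap measure or layout was actually used, and the feed names and subscriber ids are made up.

```python
from itertools import combinations


def co_subscription_edges(subscribers_by_feed, min_shared=1):
    """Weight each pair of feeds by the Jaccard overlap of their subscribers.

    subscribers_by_feed: dict of feed -> set of subscriber ids.
    Pairs with fewer than `min_shared` common subscribers get no edge.
    """
    edges = {}
    for feed_a, feed_b in combinations(sorted(subscribers_by_feed), 2):
        subs_a = subscribers_by_feed[feed_a]
        subs_b = subscribers_by_feed[feed_b]
        shared = len(subs_a & subs_b)
        union = len(subs_a | subs_b)
        if union and shared >= min_shared:
            edges[(feed_a, feed_b)] = shared / union
    return edges


# Made-up feeds and subscribers, for illustration only.
subscribers_by_feed = {
    "dev_blog_1": {"u1", "u2", "u3"},
    "dev_blog_2": {"u2", "u3", "u4"},
    "news_feed": {"u5"},
}

edges = co_subscription_edges(subscribers_by_feed)
```

Comparing every pair is quadratic in the number of feeds, so for something on the order of 25,000 feeds an inverted index from subscriber to feeds would likely be a more practical starting point.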

Within the English content region, the popular feeds split into three main types:

1) The central cluster (in light green) represents popular blogs

 

2) To the right, we see a dense yellow cluster highlighting various technical web development-focused blogs, such as John Resig, HTML5 Watch and MooTools. There’s no clear central node here, meaning that if a user subscribes to one of these tech-dev blogs, they most likely also subscribe to others.

3) And to the left, the light blue region includes RSS feeds from mainstream media (MSM), including Business Insider, The Atlantic Wire and The New Yorker feeds.

Non-English Content

A large number of heavily subscribed RSS feeds are not in English. The dark green and blue regions of the graph represent content in Mandarin; many of these feeds host technical content, though not exclusively. William Long’s blog on Sina is one of the most heavily subscribed blogs in Mandarin. Others include the popular Solidot (a Slashdot copy) and various non-technology feeds, such as BBC China and Songshuhui.net, a website for science enthusiasts.


Some topics we’re interested in exploring further are around bridging. What content (or feed) tends to bridge between the Chinese blogosphere and the English one? Which online publishers help information flow across language barriers?

RSS Generators

About 34% of the top feeds displayed in this visualization use WordPress to generate the feed. Other popular feed generators include Blogger and Tumblr. Unsurprisingly, within the Russian content cluster, we see many more feeds generated by LiveJournal; the Arabic content feeds skew slightly towards Blogger, while the popular Korean feeds tend to use Tistory. Around 33% of the feeds use their own RSS generators; many of these feeds are media and other non-hosted publishers.

Organizing Feeds by Labels

The ways in which users tag or label feeds give us the ability to better organize the content. By looking at the folder names under which feeds are placed, we can get a sense for which label or topic best fits a feed or group of feeds. Given our inclination towards data geekery, we decided to run a little experiment on the top feeds that were ever tagged with variants of the word ‘data’. This could be ‘dataviz’ or simply ‘data’. But the fact that a user placed a feed within a folder referencing ‘data’ means that source most likely publishes content that’s somewhat related to data.

The next step is to create a network graph where nodes are feeds, but connections reflect the number of shared labels feeds have. The more shared labels a set of feeds have, the more likely they are a part of the same logical group. The resulting graph is utterly fascinating.
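A minimal sketch of that construction: keep feeds that were ever filed under a folder whose name contains ‘data’, then weight each pair of feeds by how many distinct folder labels they share. The input shape, the substring match on ‘data’, the lowercasing of labels and the sample assignments are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def shared_label_edges(folder_assignments, keyword="data"):
    """Build weighted edges between feeds from the folder labels they share.

    folder_assignments: iterable of (feed, folder_name) pairs across all users.
    Returns a dict of (feed_a, feed_b) -> number of distinct shared labels.
    """
    labels_by_feed = defaultdict(set)
    for feed, folder in folder_assignments:
        labels_by_feed[feed].add(folder.strip().lower())

    # Keep only feeds that were ever placed in a folder mentioning the keyword.
    selected = {
        feed
        for feed, labels in labels_by_feed.items()
        if any(keyword in label for label in labels)
    }

    # Invert to label -> feeds, then count co-occurring pairs per label.
    feeds_by_label = defaultdict(set)
    for feed in selected:
        for label in labels_by_feed[feed]:
            feeds_by_label[label].add(feed)

    weights = Counter()
    for feeds in feeds_by_label.values():
        for pair in combinations(sorted(feeds), 2):
            weights[pair] += 1
    return dict(weights)


# Made-up folder assignments, for illustration only.
assignments = [
    ("feed_a", "Data"),
    ("feed_a", "dataviz"),
    ("feed_b", "dataviz"),
    ("feed_b", "data"),
    ("feed_c", "data"),
    ("feed_d", "cooking"),
]

label_edges = shared_label_edges(assignments)
```

Feeds that share several labels (here `feed_a` and `feed_b`) end up with heavier edges, which is what lets a layout or community detection step pull them into the same region.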


We see five distinct regions emerging from our data graph. While the central group in blue boasts the largest feeds (most followed compared to the rest), it is also the least interesting group. These are all very popular blogs, which would have been classified into various categories by many users.

The largest group (in purple) represents feeds that are tech heavy. With numerous feeds covering database technologies such as MySQL, MongoDB and PostgreSQL, this group spans content in both English and Mandarin. This means that many users who subscribe to the ‘MySQL Performance Blog’ are likely to also read the ‘Alibaba DBA Team’ blog as well as other Mandarin content.

The second largest group of feeds (in red) represents content about data visualization. This includes the popular David McCandless ‘Information Is Beautiful’ blog, as well as the always entertaining JunkCharts. The green region on top is all about data science. Here we have The Natural Language Processing Blog, R-bloggers, math babe and Hilary Mason’s blog. The tiny region at the bottom (orange) includes data journalism-related feeds, including Jonathan Stray, Hacks/Hackers and Data Miner UK.

Using this relatively simple method, we identified dominant feeds that have to do with data and also classified them into four very different categories. Using this mapping, I’ve already updated my steadily growing Digg Reader ‘data’ folder. By generalizing this method, we’re hoping to provide similar insight and recommendations for users.

Digg Originals Data Piece: when your Twitter friend turns out to be the Boston bomber

We published this piece in collaboration with the Digg team as a first experiment in original content. The post studies the Twitter network of Dzhokhar Tsarnaev, the younger of the two Boston bombers, who had an active Twitter account, and highlights some of the unintended consequences that come with publicizing something or someone, making them significantly more visible, and thus exposing potentially revealing information.

“We live in a networked age, where you can be found not only through the data trails that you leave behind, but also based on the patterns around your network – who you’re connected to. A lot can be inferred about someone just by looking at their friends and how they’re interconnected on social networks. But what happens when one of your Twitter connections happens to be one of the alleged Boston Marathon bombers?

The more the media pointed to Dzhokhar’s friends online, the more they closed down, eventually turning private and shutting down their Twitter profiles completely. The more his high school friends were pointed to by the media, the more they were harassed for humanizing their friend, Dzhokhar. In a networked environment, the potential for visibility exists. But visibility comes with consequences.”

Can you click me now?

From the betaworks “Department of Insights That Make Sense, But Are Cool to See Play Out in the Real World” (better known as DITMSBACSPORW), we bring you a snippet of statistics about bit.ly clicks from mobile devices. Using two months of click history from one of SocialFlow’s customers, a user with a Twitter account with over 500,000 followers, we separated the clicks coming from mobile devices (such as iPhones and Android-powered phones) from all other clicks (such as from desktop computers). We could then see how mobile browsing varied over the course of each day as a percentage of total browsing.
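A rough Python sketch of this kind of breakdown is below. The user-agent matching is a deliberately crude assumption, and the sample clicks are invented. The real analysis may well have classified devices differently.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: a click counts as "mobile" if its user agent mentions one of these.
MOBILE_MARKERS = ("iphone", "android")


def is_mobile(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in MOBILE_MARKERS)


def mobile_share_by_hour(clicks):
    """clicks: iterable of (datetime, user_agent) pairs.

    Returns {(weekday, hour): fraction of clicks that were mobile},
    where weekday follows datetime.weekday() (Monday is 0).
    """
    totals = defaultdict(int)
    mobile = defaultdict(int)
    for timestamp, user_agent in clicks:
        slot = (timestamp.weekday(), timestamp.hour)
        totals[slot] += 1
        if is_mobile(user_agent):
            mobile[slot] += 1
    return {slot: mobile[slot] / totals[slot] for slot in totals}


# Invented sample: five clicks on a Saturday during the 8 PM hour.
sample_clicks = [
    (datetime(2010, 7, 10, 20, 1), "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0)"),
    (datetime(2010, 7, 10, 20, 5), "Mozilla/5.0 (Linux; U; Android 2.1)"),
    (datetime(2010, 7, 10, 20, 9), "Mozilla/5.0 (Windows NT 6.1)"),
    (datetime(2010, 7, 10, 20, 30), "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6)"),
    (datetime(2010, 7, 10, 20, 45), "Mozilla/5.0 (X11; Linux x86_64)"),
]
share = mobile_share_by_hour(sample_clicks)
```

Here two of the five Saturday 8 PM clicks are mobile, so that slot's share is 2/5. Averaging each slot over many weeks of history would give one figure per hour of the week.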

Some highlights:

➢ Mobile usage is really important – at the peak (at 8 PM on Saturdays), an average of 41% of all clicks came from mobile devices.
➢ Weekends generally have a higher percentage than weekdays. (It’s good to know you’re not just shifting from following Twitter on your office PC to following Twitter on your home PC. You’re using Twitter on your phone!)
➢ Friday night looks more like a weekend, and Sunday night looks more like a weeknight. (Your mother will be happy to know you’re staying home on Sunday nights.)
➢ People are less likely to click from their mobile devices in the middle of the day and wee hours of the morning than during morning rush hour or at night.

While conventional wisdom might hold that the power of Twitter is based on the brevity of its 140-character messages and the convenience of receiving these messages on the go, it’s becoming clear that people on the move are also reading long-form content that appears (in link form) in their Twitter timelines.

SocialFlow’s algorithms help publishers select which of their potential Twitter and Facebook messages will be most interesting to their followers and fans, and it sends those selected messages at the best time. As web browsing on mobile devices increases, the right time to send a message will often be when people are on the go. That’s all the more reason to have a tool help you send messages – you’re too busy partying at night (and, um, compulsively checking your phone) to be at your desk sending tweets.


UPDATE: Editor’s note

We had a few questions regarding the timezone of the graph and whether we mapped clicks to the time in their respective zones.

1) All times are Eastern.
2) We did not normalize clicks to the times in their respective zones. Ultimately, when you send a tweet, it will go to all of your followers at the same time, regardless of what timezone they are in. We therefore felt it was appropriate to lump all the timezones into one.
3) As the client writes English-language articles, the vast majority of the client’s followers are clicking from US timezones.
4) The general trends, i.e. weekend higher than weekday, night higher than day, still hold even without accounting for different timezones.

Around the World [Cup] in 30 Days

Like millions of people around the world, we watched the World Cup matches unfold. People around the world tweeted and clicked through to news, updates, and more.

We collected tweets from the Twitter API (Twitter also created their own visualization) that used World Cup hashtags (such as #worldcup and #copamundial), extracted the bit.ly URLs, and plotted the click traffic on a world map.

Each frame of the video is approximately one hour. The color ranges from white (low traffic) to dark blue (highest traffic), and is normalized by click count per country.
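The two data steps described above, pulling bit.ly links out of hashtagged tweets and normalizing traffic per country, might look roughly like the Python sketch below. The regular expressions, the function names and the sample numbers are our own assumptions. In particular, we read "normalized by click count per country" as dividing each country's hourly clicks by that country's own peak hour, which is only one plausible interpretation.

```python
import re

BITLY_RE = re.compile(r"https?://bit\.ly/[A-Za-z0-9]+")
HASHTAG_RE = re.compile(r"#\w+")
WORLD_CUP_TAGS = {"#worldcup", "#copamundial"}


def bitly_links(tweets):
    """Collect bit.ly URLs from tweets that carry a World Cup hashtag."""
    links = []
    for text in tweets:
        tags = set(HASHTAG_RE.findall(text.lower()))
        if tags & WORLD_CUP_TAGS:
            links.extend(BITLY_RE.findall(text))
    return links


def normalize_per_country(hourly_clicks):
    """Scale each country's hourly clicks to 0..1 relative to its own peak."""
    normalized = {}
    for country, series in hourly_clicks.items():
        peak = max(series, default=0)
        normalized[country] = [c / peak if peak else 0.0 for c in series]
    return normalized


# Invented examples, for illustration only.
tweets = [
    "Goal! http://bit.ly/abc123 #WorldCup",
    "unrelated link http://bit.ly/zzz999",
    "#copamundial sin enlace",
]
links = bitly_links(tweets)

intensity = normalize_per_country({"NL": [10, 40, 0], "ES": [5, 10, 10]})
```

With this scaling, an intensity of 0.0 maps to white and 1.0 to the darkest blue. A small country and a large one can both reach full color at their own peaks.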

You can see the peaks in the US vs Ghana and Brazil vs Netherlands games and, most intriguingly, the immediate lack of traffic in the Netherlands after losing while the attention in Spain persisted through the next morning.

This effect is clearly visible if we zoom in on Europe:

If you want to relive the final, dramatic game, we loved this recreation of the final moments in LEGO:

LeBron, Twitter Phenomenon

Some of the data and arguments included in this post were first incorporated into Peter Kafka’s July 8, 2010 post entitled “LeBron James and the Giant Twitter Link” (http://bit.ly/9Hh2Mh).

As was widely reported, LeBron James made his initial foray into the world of Twitter last week and quickly amassed close to 300,000 followers. While many reports and posts have analyzed the diffusion of LeBron’s original tweet and the speed with which he gained followers, the data we have here at betaworks provides unique insight into the awesome distribution surrounding LeBron’s first tweets and its lessons for the Twitter universe in general. LeBron’s tweet, despite a dearth of initial followers, serves as an extreme counterexample to the common misconception that “More followers equals more clicks.” As the LeBron example shows, and as we know from other betaworks data, relevance of message and follower engagement are far more important.

Click-Click-Click

With the simplest of messages…

Check http://bit.ly/apxlxx (lebronjames.com) for updated info on my decision.

…LeBron’s bit.ly link had one of the most popular first days ever, both in terms of the volume and velocity of clicks by individual users.


The velocity profile (though not the absolute level) for the first hour of the shortened link is fairly typical of popular links – they peak soon after being published, drop sharply over the next ten minutes, and then decline slowly but steadily. LeBron’s shortened link is unique because of its sheer magnitude coupled with the fact that his number of Twitter followers is relatively low (more on this point later). Furthermore, people’s interest just doesn’t seem to die; 24 hours later, the link still gets about 50 clicks a minute. Below are some statistics on the massive click-storm unleashed by that single tweet.

Total followers 24 hours after post: ~300,000
Total bit.ly clicks 24 hours after post: ~180,000
% of first day’s clicks within first hour: ~25% (compared to ~50% for a typical popular tweet)
Peak velocity: ~3,000 clicks per minute (reached almost immediately)
Time to 50% of peak velocity: ~5 minutes
Time to 25% of peak velocity: ~14 minutes
Time to 12.5% of peak velocity: ~39 minutes
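Note that each successive halving takes longer (about 5, then 9, then 25 minutes), which matches the sharp-then-slow decline described above. As a sketch of how "time to X% of peak velocity" could be read off a clicks-per-minute series, here is a small Python helper. The function name and the sample series are hypothetical and are not the actual click data.

```python
def time_to_fraction_of_peak(clicks_per_minute, fraction):
    """Minutes after the peak until velocity first falls to fraction * peak.

    Returns None if the series never drops that low.
    """
    peak = max(clicks_per_minute)
    peak_minute = clicks_per_minute.index(peak)
    for minute in range(peak_minute + 1, len(clicks_per_minute)):
        if clicks_per_minute[minute] <= fraction * peak:
            return minute - peak_minute
    return None


# Made-up series: peaks immediately, drops sharply, then declines slowly.
velocity = [3000, 2500, 2000, 1700, 1600, 1500, 1200, 750, 700]

half_life = time_to_fraction_of_peak(velocity, 0.5)      # 5 minutes
quarter_life = time_to_fraction_of_peak(velocity, 0.25)  # 7 minutes
```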

LeBron’s Lessons

LeBron’s post puts in clear focus the importance of tweet relevance to potential readers and the idea that it is better to have fewer dedicated followers than more disinterested followers.

With all the hoopla surrounding LeBron – Decision 2010, sports writers and fans had whipped themselves into a frenzy, jumping on any and every piece of LeBron-related news they could get their hands on. LeBron’s post, table scraps in terms of its informational content, was consumed ravenously by a large swath of people precisely because they were starving for anything at that moment. At another time, as popular as LeBron may be, it is unlikely that the same type of tweet with so little actual news would have generated the same kind of response.

The other lesson we drew is that it is a far better strategy to ensure that your readers actually care about your posts than to amass an ever-growing legion of followers who couldn’t care less. LeBron started with almost no followers, yet the response was crazy.

These patterns and lessons bolster the thesis behind one of betaworks’ companies, SocialFlow. SocialFlow provides a tool that helps Twitter publishers send their posts when their followers actually care about the content, with a proven ability to increase click-through rates and follower retention. In the SocialFlow data, the cliche seen in the response to LeBron’s post, quality (relevance) over quantity, plays out again and again. One of their clients, which focuses on cycling and has about 17,500 followers, generates a higher number of clicks, not just a higher click-through rate, than do some Twitter accounts that gained over a million followers because of their placement on suggested lists. In the world of micro-blogging, relevance and quality beat quantity any day.

Though he didn’t sign on with our hometown Knicks, we still appreciate LeBron for the important insights he has highlighted as a Twitter Phenomenon.