Last week the Digg editors came to us with an idea: wouldn’t it be great to do a joint data post on March Madness that, instead of the usual look at the accounts with the most likes or retweets, took a very different approach? Their idea was to find the most disliked team on social media, and map that out.
We began by grabbing Twitter data using the public APIs, looking for tweets containing certain keywords, hashtags and Twitter handles connected to March Madness events and games. Our dataset grew rapidly: hundreds of thousands of tweets from different games, referencing teams, players and many other random facts. But surprisingly, even after running the data through a number of sentiment analysis libraries and services, we found very little hate. Polling the data for certain common words or phrases, in order to put together a training set and build our own classifier, also yielded poor results. We just couldn’t find sufficient examples in the data.
Weirdly enough, we came across numerous users who tweeted about their “love/hate” relationship with March Madness, but not much anger or hate, and very little directed at specific teams. This is perhaps a byproduct of being part of a networked public: one’s actions are seen not only by one’s peers but by the network at large, so one might be more cautious before displaying strong sentiment.
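To make the phrase-polling step a bit more concrete, here is a minimal sketch of pulling candidate training examples by matching seed phrases. It is an illustration under assumptions, not the code used for this post: the function name `find_candidates`, the seed phrases and the second and third sample tweets are all hypothetical.

```python
import re


def find_candidates(tweets, seed_phrases):
    """Return the tweets that contain any seed phrase (case-insensitive)."""
    # Hypothetical helper: escape each phrase and match on word boundaries.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in seed_phrases) + r")\b",
        re.IGNORECASE,
    )
    return [text for text in tweets if pattern.search(text)]


# Illustrative data only; the seed phrases are assumptions.
tweets = [
    "March madness is a love / hate relationship. I hate it right now.",
    "What a game tonight! #MarchMadness",
    "Can't stand watching my bracket fall apart",
]
seed_phrases = ["hate", "can't stand"]

candidates = find_candidates(tweets, seed_phrases)
```

Here `candidates` holds the first and third tweets. Neither expresses hostility toward a specific team, which is the kind of mismatch that made it hard to assemble a useful training set.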
March madness is a love / hate relationship. I hate it right now.
— Dylan Lamb (@dylanlamb_) March 30, 2014
Constructing their Fan-base
Instead we chose to focus on each team’s networked audience (“fan-base”). If we could identify users rooting for each team, we’d be able to pull out some potentially interesting facts about each cluster of university supporters and run some broader comparisons. For the final four teams, we chose a number of popularly followed Twitter handles used to represent each team. Then we collected all users who retweeted those accounts into sets, making the *very crude* assumption that a retweet mostly represents an endorsement. That assumption certainly doesn’t hold in political discourse, where people often retweet to make content visible to their audience for critique, but such retweets were rare in our case. And even when they did happen, the user would pretty much be an outlier in our data (not part of the connected component).
Finally, for every user in one of the four sets, we grabbed information about the Twitter accounts that they follow. The Twitter graph represents both friendship and interest. By using network analysis techniques, we were able to run comparisons across the different teams.
For the Digg post, the editors pulled out the following data points: devices used, user bios, and mainstream media preference (Fox News, CNN, MSNBC) reflected by who they follow. Some excerpts:
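The set-building step can be sketched roughly as follows. This is a minimal illustration that assumes the retweets are already available as `(retweeter, original_author)` pairs; the handles `uk_example` and `wi_example` and the function name `build_fan_sets` are hypothetical placeholders, not the accounts or code we actually used.

```python
from collections import defaultdict


def build_fan_sets(retweets, team_handles):
    """Group retweeting users by the team whose handle they retweeted."""
    # Reverse lookup: lowercase handle -> team name.
    handle_to_team = {
        handle.lower(): team
        for team, handles in team_handles.items()
        for handle in handles
    }
    fans = defaultdict(set)
    for retweeter, original_author in retweets:
        team = handle_to_team.get(original_author.lower())
        if team is not None:
            # Crude assumption: a retweet counts as an endorsement.
            fans[team].add(retweeter)
    return dict(fans)


# Hypothetical handles and users, for illustration only.
team_handles = {"Kentucky": {"uk_example"}, "Wisconsin": {"wi_example"}}
retweets = [
    ("alice", "uk_example"),
    ("bob", "WI_example"),
    ("alice", "uk_example"),    # repeat retweets collapse into one set entry
    ("carol", "someone_else"),  # not a team handle, so ignored
]

fan_sets = build_fan_sets(retweets, team_handles)
```

Using sets means a user who retweets a team account many times still counts once, which keeps heavy retweeters from inflating the size of a fan-base.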
“Florida is uh, well, the only place where Fox News is popular. And it’s not even that popular. In fact, Florida is the only school where Fox News even shows up in the data. Obviously, this is some sort of conspiracy related to Benghazi.”
“Aaron Rodgers is more popular than sex. Well, actually we don’t know how popular sex is, but 50.25% of Wisconsin fans follow Aaron Rodgers and he didn’t even go to Wisconsin.”
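A figure like the 50.25% above comes down to a share-of-followers count over a fan set. A minimal sketch, assuming the follow data is a dict mapping each user to the set of accounts they follow; `follow_share`, the user names and the handle `some_athlete` are hypothetical.

```python
def follow_share(fans, following, account):
    """Fraction of users in `fans` who follow `account`."""
    if not fans:
        return 0.0
    count = sum(1 for user in fans if account in following.get(user, set()))
    return count / len(fans)


# Illustrative data only.
fans = {"u1", "u2", "u3", "u4"}
following = {
    "u1": {"some_athlete", "news_outlet"},
    "u2": {"some_athlete"},
    "u3": {"news_outlet"},
    # "u4" has no follow data and counts as not following
}

share = follow_share(fans, following, "some_athlete")
```

With this toy data, two of the four fans follow the account, so `share` is 0.5. Running the same function over each team's fan set gives numbers that can be compared across teams.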
Analyzing the Networks
At the end of this process we were left with some very interesting graphs representing the different fan-bases. For example, if we take all the Kentucky Wildcats fans, and insert each Twitter user as a node, and connect them by who they follow on Twitter, a fairly dense cluster emerges. The larger a node, the higher its degree in the network – the more connected it is to other nodes. The average degree for this graph is 11.731. This means that on average, every user is following or is followed by a total of ~12 other users from this same network. The higher the number, the more densely connected the community. BTW – the median degree for this graph is 5, still quite high.
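As a rough sketch of how these numbers can be computed: keep only the follow edges where both users belong to the fan set, then take the mean and median of each node's degree. The sketch assumes degree counts both directions (following plus followed by), so a mutual follow adds two to each user's degree; the function names and toy data are hypothetical.

```python
from statistics import mean, median


def induced_follow_edges(fans, following):
    """Keep only (follower, followed) pairs where both users are fans."""
    return sorted(
        (user, followed)
        for user in fans
        for followed in following.get(user, ())
        if followed in fans and followed != user
    )


def degree_stats(fans, edges):
    """Return (average degree, median degree), counting in and out edges."""
    degree = {user: 0 for user in fans}
    for follower, followed in edges:
        degree[follower] += 1
        degree[followed] += 1
    values = list(degree.values())
    return mean(values), median(values)


# Toy network: five fans, one of whom ("e") is isolated.
fans = {"a", "b", "c", "d", "e"}
following = {"a": {"b", "c", "outside"}, "b": {"a"}, "d": {"a"}}

edges = induced_follow_edges(fans, following)
avg_degree, median_degree = degree_stats(fans, edges)
```

In this toy network the degrees are 4, 2, 1, 1 and 0, giving an average of 1.6 and a median of 1. The gap between the two mirrors what we see in the real graphs: a few highly connected nodes pull the average well above the median.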
Wisconsin’s fan-base looks quite different. Generally the number of retweets for Wisconsin team accounts is significantly higher compared to the other three teams. The size of the Wisconsin user set is almost three times as large as that of Kentucky’s. While the graph is more dense, there are fewer central nodes, and many more tiny specks representing less connected users. Still, the average degree for nodes in this graph is 16.89, and the median is 8, both higher than in the Kentucky fan graph. So even though the Wisconsin fan-base is much larger in our sample, it is also much more inter-connected.
If we drill down into this WI graph and run a community detection algorithm, we can start to identify regional friend networks and interest groups, outlined by similarly colored areas. For each of these clusters, we can see trends: users in the bottom purple region tend to have “Green Bay, WI” in their bios, while those in the red region have either La Crosse or Galesville, WI. Many of them are high school or college students, on various sports teams, connected to their friends in the network. I was surprised to see many of them using Vine. I’m curious whether that’s a general trend across student populations, or simply a regional one.
This is a pretty standard analysis, and it only scratches the surface in terms of the audience insight that can be reached. We’ll be highlighting much more of this type of work here, and getting deeper into methods and techniques.
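For readers unfamiliar with this kind of clustering, here is a small, self-contained sketch of one common approach: greedily merging groups of nodes as long as the merge increases modularity. It is a stand-in for illustration, not necessarily the algorithm behind the colored regions described here. It also assumes follow edges are treated as undirected, and its naive pairwise search only suits tiny graphs.

```python
from itertools import combinations


def greedy_modularity_communities(edge_list):
    """Naive greedy modularity merging on a small undirected graph."""
    # Deduplicate edges and drop self-loops.
    edges = {frozenset(e) for e in edge_list if e[0] != e[1]}
    m = len(edges)
    nodes = sorted({n for e in edges for n in e})
    degree = {n: 0 for n in nodes}
    for e in edges:
        for n in e:
            degree[n] += 1

    communities = [{n} for n in nodes]  # start with one node per community
    while len(communities) > 1:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(communities)), 2):
            a, b = communities[i], communities[j]
            # Number of edges running between the two communities.
            between = sum(
                1 for e in edges if len(e & a) == 1 and len(e & b) == 1
            )
            d_a = sum(degree[n] for n in a)
            d_b = sum(degree[n] for n in b)
            # Modularity change if a and b were merged.
            gain = between / m - (d_a * d_b) / (2 * m * m)
            if gain > best_gain + 1e-12:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break  # no merge improves modularity
        i, j = best_pair
        merged = communities.pop(j)  # j > i, so index i stays valid
        communities[i] |= merged
    return communities


# Toy graph: two triangles of friends joined by a single edge.
edge_list = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]
clusters = greedy_modularity_communities(edge_list)
```

On the toy graph this recovers the two triangles as separate clusters. On a real fan graph, each resulting cluster could then be inspected for shared bio terms such as a city name.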