When I heard that Tyler Gray at Public Knowledge was looking for someone to do some analysis on tweets that mentioned SOPA, I thought I might try Cytoscape (an open source tool used for biomedical research, but handy for large scale data visualization) to show some of the relationships between people discussing the controversial bill on Twitter.
The result is a graph of the most active users referencing SOPA
Public Knowledge worked with the Brick Factory to set up their slurp140 tool to record approximately 1.5 million tweets which Tyler sent me in the form 350mb CSV file. I first used Google Refine to clean and narrow the set down to only tweets which were replies to someone else. This left approximately 80,000 tweets which I then imported into R. I then ranked all of usernames by how often they appeared both as senders and recipients, and then picked the approximate top 1,000 users. Since replies are sent from one user to another, the graph is directed: each edge has a direction with an origin and an arrow pointing at the recipient. There are 1,021 nodes identified by their Twitter usernames, and 1,757 edges a good portion of which are labeled with the content of their tweet.
Visualizing networks this large is more of an art than a science
I've tried to strike a balance between visual complexity, aesthetics and readability of tweets, but you'll find that this isn't always successful. Sometimes tweets run into nodes, sometimes edges run into labels, and sometimes the graph feels like a total mess. But that messiness is part of what made the SOPA debate on so interesting over the last month.
Thousands of people participating with plenty of cross talk.
The colors and sizes of the nodes and edges are coded in the following ways:
- A node and its label size is maps to the number of tweets both posted by a user and and mentioning a user. (Ex: @BarackObama is a huge node because so many people were tweeting at him about SOPA).
- Node color represents the number of outgoing tweets. The greener the node, the more replies a user posted. (Ex: @Digiphile sent a lot of tweets mentioning SOPA.)
- Edge thickness represents "edge betweeness" which is how many "shortest paths" that run through it. This is a rough measure of how central a given tweet is in a network. (Ex: @declanm and @mmasnick have a thick line connecting them because many other nodes are connected to the two through that tweet.)
- Edge color represents the language of the tweet. (Ex: Tweets in English are blue, Spanish are yellow.)
The nodes are positioned using an "force directed" algorithm which is typically designed for undirected graphs, but I found it to be the most visually compelling of Cytoscape's layout options. To learn more about force directed graphs, take a look at this d3 tutorial visualizing the characters in Victor Hugo's Les Misérables.
Thanks again to Public Knowledge, The Brick Factory for providing the infrastructure to record the tweets, and everyone who has helped fight against SOPA and PIPA over the last couple of months, especially those who tweeted about it.