The Data Behind My Ideal Bookshelf

My girlfriend / lady partner, Thessaly La Force, recently published a book with the artist Jane Mount called “My Ideal Bookshelf.” In it, Thessaly interviews over 100 people and Jane paints their bookshelves:

The books that we choose to keep –let alone read– can say a lot about who we are and how we see ourselves. In My Ideal Bookshelf, dozens of leading cultural figures share the books that matter to them most; books that define their dreams and ambitions and in many cases helped them find their way in the world. Contributors include Malcolm Gladwell, Thomas Keller, Michael Chabon, Alice Waters, and Tony Hawk among many others.

As I observed Jane and Thessaly compile the book over the last year, I couldn’t help but think about all the fun opportunities I could have exploring the data behind the shelves.

Contributor Neighbors

Each of the 101 contributors Thessaly interviewed picked as many books as they thought represented their ideal bookshelf, and I knew some of them would pick identical books.

So what would a taste graph linking contributors to each other using the books on their shelves look like?

Previously, I had worked with Cytoscape to render network graphs, but this seemed like a good opportunity to make something interactive and also a perfect first project to really use d3. I can’t wait to do more with it.

Hover over each node to see the contributor’s neighbors.

The layout of the graph is done using d3’s force-directed graph layout implementation. Each node represents a contributor’s shelf and is colored by the contributor’s profession.

Each active link is then colored by the neighbor’s profession. The nodes at the center of the graph have the most neighbors and exert the most pull over the rest of the graph. Try clicking and dragging a node to get a good feel for its centrality.

Genres and Professions

The distribution of contributors’ professions skews heavily toward writers, so the professions aren’t evenly represented: 41 out of the 101 bookshelves were from writers, and there were a total of 35 unique professions represented.

In the graph above, I picked books chosen by professions which were represented by two or more contributors. Each circle is sized in proportion to the share of books chosen by contributors in that profession.

For example, fiction books make up the plurality — 31% — of books chosen by writers. Similarly, 55% of the books chosen by photographers were classified as photography.

This next graph is a violin plot showing the distribution of page counts of the books chosen by each profession:

Excluding the Oxford English Dictionary, which comes in at 22,000 pages, legal scholars (Larry Lessig & Jonathan Zittrain) earned the highest average page count of 475, and all the books chosen by the two photographers had page counts under 500.

Year vs. Page Count

Taking it a step further, here is almost every book (again, excluding the OED) plotted by the year it was published (X-axis) against the log10 of its page count (Y-axis).

The size of each point represents the number of ratings the book had accumulated on Google Books, and its color represents the book’s position on the contributor’s shelf.

The darker the point, the farther to the left of the shelf the book was placed. Conversely, small teal dots represent books with relatively fewer ratings which were placed towards the right of the contributor’s shelf.

Unfortunately, not all the books had publishing dates that made sense — Google reported the dates of Shakespeare’s works anywhere from 1853 to 1970 to 2010, and when would you say the Bible was published? 1380? 1450?

Excusing these erroneous points, I think the graph still works. Check out the cluster of books published in the early 20th century chosen by the writers in the upper left of the graph: they represent popular early 20th century books with page counts between 200 and 500.

Summary Stats

  • Number of shelves: 101
  • Number of Books Chosen: 1,645
  • Unique Books According to Google’s API: 1,431
  • Average number of books chosen: 16.28
  • Average Page Count: 381.2
  • Average Year of Publication: 1992
  • Top 5 Chosen Books:
    1. Lolita (chosen by 8 contributors)
    2. Moby Dick (chosen by 7 contributors)
    3. Jesus’ Son (chosen by 5 contributors)
    4. The Wind-Up Bird Chronicle (chosen by 5 contributors)
    5. Ulysses (chosen by 5 contributors)
  • Top 5 Authors:
    1. William Shakespeare (10 different books)
    2. Ernest Hemingway (7 different books)
    3. Graham Greene (7 different books)
    4. Anton Chekhov (6 different books)
    5. Edith Wharton (6 different books)
  • Contributor with the most books: James Franco
  • Contributor with the most shared books: James Franco
  • Longest Book: The Oxford English Dictionary, Second Edition: Volume XX* chosen by Stephin Merritt, 22,000 pages.
  • Shortest Book: Pac-Mastery: Observations and Critical Discourse by C.F. Gordon chosen by Tauba Auerbach, 12 pages

* Jane was only able to paint one volume of Stephin’s OED for his shelf, but the authors agreed it could stand as a synecdoche for his choice of the entire edition.

Overall…

The data I cobbled together from My Ideal Bookshelf is far from perfect, but I think it does a good job of illustrating some of the larger themes and relationships behind the book.

For example, the two legal scholars tended to pick, on average, some of the longest books in the set, and professionals tend to pick books related to their jobs (check out the large proportion of photography books chosen by photographers). Also: if you’re James Franco and pick a ton of books, you’re gonna have a lot of neighbors in a network graph.

I also discovered how skewed the dataset was towards the choices of writers — I jumped in expecting a diverse set of contributors which might be useful for representing an ideal ideal bookshelf — but that would have been a difficult case to make when 40% of the contributors were from one profession.

This might sound like an obvious observation (and something I should have known having spent so much time thinking about the book), but it wasn’t something I was able to really observe until looking at a simple histogram of their professions. So remember: it’s always worth thinking critically about whether your samples are representative of the underlying distribution, and simple exploratory data analysis can really help you out there.

Bigger picture, I think this skew demonstrates the nature of coming up with an ideal list of anything: no matter who you ask, the task is essentially a subjective one. Here, it’s biased towards the network Thessaly and Jane were able to tap to make the book.

Cleaning and Reconciling the Data

While creating the book, Thessaly and Jane had carefully compiled an organized spreadsheet listing each contributor and their chosen books, but I knew there could be subtle typos here and there. These typos could throw off the larger analysis: if two entries for the same book had titles that were even slightly different or missing some punctuation, then aggregating based on their titles would be problematic. I also wanted additional data about the books (data which Thessaly and Jane didn’t record), such as the year the book was published, or the number of pages in the book.

So I figured I’d kill two birds with one stone and look for an API which I could automatically search using the title they had entered into the spreadsheet, and get a best-guess at the “true” book it represented. The API would also hopefully return a lot of useful metadata that I could use down the line.

It turns out Google’s Books API is the perfect tool for such a job. I could send a book title to it and get back the book that Google thought I was looking for. This allowed me to lean heavily on Google’s excellent search technology to reconcile book titles that might have had typos, while also retrieving the individual book’s metadata. While I could have used something like Levenshtein distance to try to find book titles that were close to each other in the original dataset, I wouldn’t have been able to retrieve any additional metadata.
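For comparison, here is a minimal sketch of the Levenshtein alternative mentioned above. It is written in Python purely for illustration; the function names and the sample titles are assumptions, not part of the original analysis.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def normalize(title: str) -> str:
    """Lowercase and drop punctuation so trivial differences do not count."""
    return "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace()).strip()


# Hypothetical spreadsheet entries that refer to the same book.
print(levenshtein(normalize("Moby Dick"), normalize("Moby-Dick")))
```

A small distance between two normalized titles suggests they are the same book, but as noted above this only groups titles and brings back no metadata.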

A quick side note for copyright nerds: the Google Books API played into the HathiTrust case recently, and I’d like to imagine use cases like this were part of the reasoning behind declaring Google’s use of copyrighted materials a fair use.

Google’s Books API allows 1,000 queries a day, but since the list contained thousands, I had to write in and ask for a quota extension. Thanks to whoever at Google granted that: I now get to hit it 10,000 times a day, which was enough to iterate on building a script to compile the data.

Not surprisingly, Google’s API returns a ton of metadata. Everything from a thumbnail representing the book’s cover, to the number of pages, to the medium it was originally published, to its ISBN10 and ISBN13 … the list goes on. I tried to choose fields that I knew would be interesting to aggregate on, but also ones that would help me uniquely identify the books.
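As a rough illustration of picking fields out of a search result, here is a Python sketch (the original script was written in Ruby, and this is not that script). It builds a title search URL for the Books API volumes endpoint and reads a few `volumeInfo` fields from a parsed response. The output key names and the sample response are hypothetical placeholders, and no network request is made.

```python
from typing import Optional
from urllib.parse import urlencode

VOLUMES_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def build_search_url(title: str, max_results: int = 1) -> str:
    """Build a volumes search URL restricted to matches in the title."""
    query = urlencode({"q": f"intitle:{title}", "maxResults": max_results})
    return f"{VOLUMES_ENDPOINT}?{query}"


def extract_fields(response: dict) -> Optional[dict]:
    """Pull a few fields from the first search result, or None if there is none."""
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    info = first.get("volumeInfo", {})
    return {
        "google_book_id": first.get("id"),
        "google_title": info.get("title"),
        "authors": info.get("authors", []),
        "published_date": info.get("publishedDate"),
        "page_count": info.get("pageCount"),
        "categories": info.get("categories", []),  # frequently absent
        "ratings_count": info.get("ratingsCount"),
    }


# Hand-written stand-in for a parsed JSON response; the values are placeholders.
sample_response = {
    "items": [
        {"id": "example-id", "volumeInfo": {"title": "Example Title", "pageCount": 123}}
    ]
}
print(build_search_url("Moby Dick"))
print(extract_fields(sample_response))
```

Using `.get` with defaults keeps the sketch tolerant of results that lack a field, which matters for sparse metadata such as categories.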

One particular piece of metadata that was missing was the genre of the book — only 28% of the books returned from Google had category information. Another option would have been to set up a Mechanical Turk Task to ask humans to try to determine the books’ genres. This kind of book ontology is actually a very difficult and somewhat subjective problem. Just think of how complicated the Dewey Decimal system is.

Finally, not all data is created equal. I’ve manually corrected a handful of incorrect classifications from Google where the search results clearly did not return the right book, but it’s certainly possible not all books were recognized or reconciled properly.

The tools

Aside from d3 for the interactive plot at the top of this post, I used R and ggplot2 to create the static graphs.

The script I used to query the Google Books API was written in Ruby, and exported the data to a CSV which I then loaded into MySQL and Google Docs to manually review and spot check.

Here’s the query I used to generate the data necessary for the force-directed graph:

-- Self-join the shelf table to pair up contributors who chose the same book.
SELECT
  ideal_bookshelf_one.contributor_id AS source,
  ideal_bookshelf_one.google_title AS book,
  ideal_bookshelf_two.contributor_id AS target
FROM
  ideal_bookshelf AS ideal_bookshelf_one
  JOIN ideal_bookshelf AS ideal_bookshelf_two
    ON ideal_bookshelf_one.google_book_id = ideal_bookshelf_two.google_book_id
WHERE
  -- Skip rows that pair a contributor with themselves.
  ideal_bookshelf_one.contributor_id != ideal_bookshelf_two.contributor_id;

Sequel Pro’s “Copy as JSON” was extremely helpful here: it took relatively little effort to munge the SQL results into the array of nodes and links required by d3’s force layout.
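The munging step could look something like the following Python sketch. It assumes rows shaped like the query output (source, book, target) and a force layout that takes link endpoints as indices into the nodes array. The sample rows are hypothetical, and whether the original graph collapsed the mirrored A-B / B-A rows is also an assumption.

```python
import json


def to_force_graph(rows):
    """Convert (source, book, target) rows into d3-style nodes and links."""
    index_of = {}  # contributor id -> position in the nodes list
    nodes = []
    links = []
    seen_pairs = set()
    for source, book, target in rows:
        for contributor in (source, target):
            if contributor not in index_of:
                index_of[contributor] = len(nodes)
                nodes.append({"name": contributor})
        # The self-join returns every pair twice (A-B and B-A);
        # keep one link per pair and book.
        pair = (min(source, target), max(source, target), book)
        if pair in seen_pairs:
            continue
        seen_pairs.add(pair)
        links.append(
            {"source": index_of[source], "target": index_of[target], "book": book}
        )
    return {"nodes": nodes, "links": links}


# Hypothetical rows shaped like the query output above.
rows = [
    (1, "Lolita", 2),
    (2, "Lolita", 1),
    (1, "Ulysses", 3),
    (3, "Ulysses", 1),
]
print(json.dumps(to_force_graph(rows), indent=2))
```

A real version would also attach each contributor’s profession to the node so it can be used for coloring.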

If you liked this post…

Pick up a copy of Thessaly and Jane’s book today!

Kickstarter Fulfillment and Product Development: A story of Dogfood and Data Validation

You could think of this post as telling the story of two Kickstarter projects. Since it’s a long post, here’s a quick summary:

  1. I recently ran a Kickstarter project.
  2. I wanted to share all the financials and details of how I shipped my rewards.
  3. I discovered we could do a better job helping creators process their backers’ addresses.
  4. We recently deployed a change to backer surveys that should do just that.

So I hope this post will educate Kickstarter creators on how to smoothly fulfill their rewards, but also shed a little light on how we do product development at Kickstarter.

The Kickstarter project was pretty simple (I FOUGHT SOPA AND ALL I GOT WAS THIS STUPID T-SHIRT), and the other project was actually a Kickstarter feature built by our product team, which we (and I use “we” loosely; Jed, Tieg, Daniella and Meaghan did all the work) shipped last week:

The address confirmation tool helps backers validate their addresses when filling out reward surveys from creators.

Just as Netflix or Amazon ask you to confirm your shipping address if it doesn’t precisely match an exact address, Kickstarter will now ask backers to confirm a more precise one so that creators can feel more confident about shipping their rewards off. This feature also has the benefit of cutting down some of the data correction creators might face when shipping rewards.

The development of this feature is a good example of “dogfooding” your own product, which is software jargon for the act of actually using the tools you’ve built.

Dogfooding is one of the best possible ways to understand and improve your product, so I’m always interested in getting feedback from people that have run their own Kickstarter projects.

The SOPA shirt project was pretty straightforward. It only really had two reward tiers, but as with all Kickstarter fulfillment, there were a lot of details to get right.

Getting one of those details right — backer addresses — made me realize we needed a better way to ensure we were delivering valid backer mailing addresses to project creators at the most crucial part of their project: reward fulfillment.

I also want this post to serve as a bit of a guide for fulfilling Kickstarter projects on a similar scale. So first, some details on how I planned to fulfill my rewards.

Estimating Income and Costs

To determine my actual net dollars from Kickstarter, Amazon, and the credit card fees, I had to download a CSV of my account activity from Amazon’s FPS platform, sum the pledges and subtract the credit card fees for each pledge.

Credit card fees came to 4.8% and Kickstarter’s fee was 5%, so I netted 90.2% of actual pledges, or $3,497.
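The arithmetic can be checked with a short Python sketch. It assumes both fees are taken as a share of the same pledged total. The gross pledge amount is not stated here, so the last step only back-calculates it from the net figure.

```python
CARD_FEE_RATE = 0.048        # credit card fees as a share of pledges
KICKSTARTER_FEE_RATE = 0.05  # Kickstarter's fee


def net_share(card_fee_rate: float, platform_fee_rate: float) -> float:
    """Share of pledged dollars left after both fees."""
    return 1.0 - card_fee_rate - platform_fee_rate


def net_amount(pledged: float, card_fee_rate: float, platform_fee_rate: float) -> float:
    """Dollars left from a pledged total after both fees."""
    return pledged * net_share(card_fee_rate, platform_fee_rate)


share = net_share(CARD_FEE_RATE, KICKSTARTER_FEE_RATE)
print(f"net share: {share:.1%}")

# Gross pledges are an inferred value: the $3,497 net divided by the net share.
implied_gross = 3497 / share
print(f"implied gross pledges: ${implied_gross:,.2f}")
```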

I had a chicken and egg problem trying to estimate which shirts to even offer in the survey.

Since different shirt sizes and different colors were different prices (the cost per shirt varies from $6.32 for a Grey Small to $12.95 for a White XXL), it was crucial to have a rough estimate of what the shirt size and color choice distribution would be.

I knew a lot of people were going to order large and medium shirts, but if I had too many XXLs, it might not have been affordable to offer the white shirts. And if it turned out that I couldn’t afford the white shirts, I didn’t want to even offer the white shirts in the survey.

So I took a look at Yancey’s shirt distribution. For his project he had the following distribution of sizes:

  • S: 13%
  • M: 26%
  • L: 30%
  • XL: 21%
  • XXL: 10%

These ratios enabled me to estimate what the distribution of sizes would be for my shirts. I guessed that 2/3rds of my backers would want the grey shirt, and the rest would want the white shirt.

Using these numbers and the bulk costs from ApparelSource.com, I was able to determine that I could afford to offer both the grey and white shirts in my survey.
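A sketch of that estimate in Python, under stated assumptions: the size shares are Yancey’s, the grey share is the two-thirds guess, and the order size of 250 is a hypothetical number chosen for illustration. Only the cheapest and priciest shirt prices are given above, so the sketch bounds the cost instead of computing it exactly. A real estimate would multiply each count by its own bulk price.

```python
# Size shares from Yancey's project, as fractions.
SIZE_SHARE = {"S": 0.13, "M": 0.26, "L": 0.30, "XL": 0.21, "XXL": 0.10}
GREY_SHARE = 2 / 3      # guessed share of backers choosing grey
CHEAPEST_SHIRT = 6.32   # Grey Small
PRICIEST_SHIRT = 12.95  # White XXL


def estimate_counts(total_shirts, size_share, grey_share):
    """Expected number of shirts for each (size, color) combination."""
    counts = {}
    for size, share in size_share.items():
        counts[(size, "grey")] = total_shirts * share * grey_share
        counts[(size, "white")] = total_shirts * share * (1 - grey_share)
    return counts


def cost_bounds(total_shirts, cheapest, priciest):
    """Lowest and highest possible blank-shirt cost for the whole order."""
    return total_shirts * cheapest, total_shirts * priciest


TOTAL_SHIRTS = 250  # hypothetical order size, for illustration only
counts = estimate_counts(TOTAL_SHIRTS, SIZE_SHARE, GREY_SHARE)
low, high = cost_bounds(TOTAL_SHIRTS, CHEAPEST_SHIRT, PRICIEST_SHIRT)
for (size, color), n in sorted(counts.items()):
    print(f"{size:>3} {color:<5} {n:5.1f}")
print(f"cost between ${low:,.2f} and ${high:,.2f}")
```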

Here’s what the actual distribution of sizes turned out to be:

  • S 13.85%
  • M 29.44%
  • L 32.90%
  • XL 15.58%
  • XXL 4.33%

My numbers weren’t far off from Yancey’s. I ended up having fewer XXLs than he did, which surprised me. The grey / white distribution ended up being 77% grey, 23% white.

I threw together a quick graph using ggplot2:

One observation is that the distribution of white shirts looks like it skews smaller. Anecdotally, that might be explained by women preferring the white shirt (again, anecdotal), and since women’s sizes tend to run smaller, the distribution of white shirts had proportionally more smaller sizes.

Buying the Shirts

ApparelSource seems to operate multiple websites with basically the same layout and offerings. They had the best prices on CANVAS shirts I could find, though oddly, the prices varied between their different sites by a couple of cents. Some sites had them in stock and others didn’t. This made no sense to me because they’re all being shipped from the same warehouse.

I ordered one of each shirt style to confirm they were the style I wanted, and then prepared to make the bulk order.

My order was extremely specific (13 Small White, etc.) so I think it raised some flags on ApparelSource’s side. I had to spend some time working out the best way to pay them. It turned out paying them via PayPal was the easiest way.

This is a version of the sheet I used to calculate costs. I think working with something like this sheet is crucial for staying sane when fulfilling any Kickstarter project.

Printing

Yancey had also suggested I check out Kayrock so I dropped them a line to get a quote. Kayrock had a source they recommended for buying the actual shirts which would have saved me over $1,000, but I wanted to stick with my choice of CANVAS shirts from ApparelSource because I had already tested their quality.

In the original email Kayrock had quoted the print job as “plastisol ink, no shirt rolling, no double hit, no handling fee, no rush fee, no production prepress, no color change fee”, and gave me the quote of $479 + $80 for printing and setup.

What they didn’t tell me was that if I didn’t use their apparel quote and instead had the shirts shipped to them directly, they were going to charge me $0.40 per shirt for handling. This added $100 to the total cost, so my Kayrock fees were $660.

Despite this issue, their work and responsiveness over e-mail was very good, and I would recommend checking them out if you’re interested in some great NY screen printing.

Once Kayrock finished printing the shirts that I had shipped directly to them, I headed to Greenpoint to pick them up myself so I could start packing and sending t-shirts from Kickstarter HQ.

Backer Addresses and Postage

In general, backer addresses tended to be malformed in a couple of ways. The first was that people tended to confuse “Line 1” and “Line 2”: they either put their Apt # first, then street, or their street, then company.

I also ran into problems with 20+ addresses from the North East because their zip-codes had their leading zeros removed.

Zip codes like 06897 became 6897 when opened in Google Docs or Excel because they were converted to numbers.
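One way to repair such values after the fact is to treat them as text and pad them back to five digits. The Python sketch below assumes US five-digit ZIP codes; the helper name and the sample rows are hypothetical.

```python
import csv
import io


def restore_zip(value) -> str:
    """Left-pad a US ZIP code that lost its leading zeros."""
    text = str(value).strip()
    if text.isdigit() and len(text) < 5:
        return text.zfill(5)
    return text


# A tiny hypothetical export; the rows are made up.
raw = "name,zip\nJane Doe,6897\nJohn Smith,95401\n"
rows = list(csv.DictReader(io.StringIO(raw)))  # csv keeps every field as a string
for row in rows:
    row["zip"] = restore_zip(row["zip"])
    print(row["name"], row["zip"])
```

Padding cannot tell a truncated code from a genuinely wrong one, so it is a repair for this specific spreadsheet problem and not a validation step.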

In addition someone put “Meow, Meow” as their address. They didn’t get their shirt.

To save on shipping, I did my research and chose First Class mail, which let me specify the ounces for each package. After weighing each style of shirt in an envelope (I correctly assumed the labels weren’t going to add any more weight), I had three tiers:

  • S / M white – 5oz – $1.98 postage
  • M / L shirts – 6oz – $2.15 postage
  • XL, XXL shirts – 7oz – $2.31 postage

Sorting by whose shirts had already been delivered (all KSR staff, some friends and family), and removing those people from my CSV yielded 183 packages, summing to $370. I’d recommend keeping careful track of the rewards you ship vs. the rewards you intend to hand deliver.

The cost to ship the same shirts using flat rate envelopes would have been $942.45, so I managed to save a lot by carefully weighing and picking First Class mail instead.
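The comparison can be laid out in a few lines of Python. The rates and totals are the figures quoted above. The lookup helper is a hypothetical convenience, and it assumes a package is charged at the next whole ounce.

```python
import math

# First Class rates by package weight in ounces.
FIRST_CLASS_RATE = {5: 1.98, 6: 2.15, 7: 2.31}
PACKAGES = 183
FIRST_CLASS_TOTAL = 370.00  # rounded total actually paid
FLAT_RATE_TOTAL = 942.45    # quoted cost of flat rate envelopes


def postage_for(ounces: float) -> float:
    """Look up postage for a package, rounding its weight up to a whole ounce."""
    tier = math.ceil(ounces)
    if tier not in FIRST_CLASS_RATE:
        raise ValueError(f"no rate recorded for {ounces} oz")
    return FIRST_CLASS_RATE[tier]


flat_rate_each = FLAT_RATE_TOTAL / PACKAGES
first_class_each = FIRST_CLASS_TOTAL / PACKAGES
savings = FLAT_RATE_TOTAL - FIRST_CLASS_TOTAL

print(f"flat rate per envelope: ${flat_rate_each:.2f}")
print(f"first class per envelope (average): ${first_class_each:.2f}")
print(f"savings: ${savings:.2f}")
```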

After having lunch with my mom on a Saturday, she offered to help me stuff envelopes and ship the shirts. This was actually a lot of fun and I always forget how rewarding manual labor can be when the majority of my day to day work involves keyboards and LCD screens.

My mom suggested coming up with a color-coded system to manage shirt sizes & colors (envelopes marked with Blue represented a Large White shirt, for example). This worked well, but only because I had access to many different colored Sharpies.

Fulfillment would have been chaos without some kind of organized system for tracking sizes and colors.

Tyvek envelopes are great, until you realize they’re a little pricey; Staples sold 50 for $28, so $0.56 per envelope which added to the shipping cost. I probably could have ordered them for free from USPS if I had planned the shipping session in advance. In general I highly recommend using Tyvek envelopes as they’re much lighter, waterproof, and probably more durable than the cardboard flat-rate envelopes.

Endicia Software

I’m not sure how I would have addressed and processed all the envelopes without Endicia, which is batch mailing software. I get the feeling many other creators have used it to fulfill their rewards. Endicia is free to use for 30 days, and then $15 a month to keep using it afterwards.

The software allows you to import a CSV, but the requirements are very strict. The fields must be (in the following order):

  • Name
  • Company
  • Address1
  • Address2
  • City
  • State
  • Zip

In order to correlate each person with a label and envelope with a t-shirt, I merged their size and color selection into the name field. So it looked like this for an XXL Grey shirt:

John Smith XXL/G
3105 Sturges Ridge
Santa Rosa, California 95401

This might have been a little odd for backers (are there privacy implications for exposing their shirt size to the postmaster?), but Endicia didn’t allow any other fields, so there was nowhere else to put this information on the label.
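Preparing that import file might look like the Python sketch below. The column order is the one listed above. The survey record layout, the “W” code for white, and the presence of a header row are assumptions made for illustration.

```python
import csv
import io

ENDICIA_COLUMNS = ["Name", "Company", "Address1", "Address2", "City", "State", "Zip"]
COLOR_CODE = {"grey": "G", "white": "W"}  # "W" is an assumed code; only "G" appears above


def to_endicia_row(backer: dict) -> list:
    """Build one row in the required column order, with size/color appended to the name."""
    label_name = f'{backer["name"]} {backer["size"]}/{COLOR_CODE[backer["color"]]}'
    return [
        label_name,
        backer.get("company", ""),
        backer["address1"],
        backer.get("address2", ""),
        backer["city"],
        backer["state"],
        backer["zip"],  # kept as a string so leading zeros survive
    ]


# Hypothetical survey record built from the sample label above.
backers = [
    {"name": "John Smith", "size": "XXL", "color": "grey",
     "address1": "3105 Sturges Ridge", "city": "Santa Rosa",
     "state": "California", "zip": "95401"},
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(ENDICIA_COLUMNS)  # whether a header row is expected is an assumption
for backer in backers:
    writer.writerow(to_endicia_row(backer))
print(buffer.getvalue())
```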

I bought $500 worth of postage, but only ended up using $370, and was able to get a refund for the leftover balance when I canceled my Endicia account.

Printing Labels

Printing labels from Endicia is very scary. You are shown this dialog box (but replace 7 with 180, worth $370):

I did one test print of a label and it seemed OK, so I went through with the big batch. After that Endicia automatically opened a file inside my /tmp/private directory with a number of rejected addresses, most of which had the ZIP code problem. This was a time consuming and stressful process because any mistake would mean lost money on postage.

After I was done printing the postage from Endicia, I realized I could have sent the batch to a PDF instead of the actual printer, and it would have mitigated some of the risk. Without storing my labels somewhere, if my printer had gone offline or my computer had crashed, it was unclear how I would recover labels that were still stuck in the queue. I hated trusting the printer software this much, but the overall process went pretty smoothly.

Internal Kickstarter Post-Mortem

Once I shipped my shirts, I wrote up an internal email to Kickstarter staff detailing my fulfillment process. That email became the kernel for a number of discussions about how we could better improve the data processing creators face when delivering their rewards to backers.

One solution to the dirty-address problem was to use an external service to validate the addresses supplied by backers.

Using an external service to validate backer addresses turned out to solve two problems.

First, it would ensure backers had the chance to confirm a valid address, which is always a good thing. Second, it would add a hyphen and the 4-digit add-on to US zip-codes, converting zip-codes like 11217 to strings such as 11217-1142.

That extra hyphen would prevent applications like Google Docs and Excel from converting zip-codes like 06897 into numbers like 6897 (technically, the application would recognize the cell as containing a string so it wouldn’t attempt integer coercion).

Preventing zip-codes from losing their preceding zero would mean software like Endicia wouldn’t reject addresses from the North East. So we decided to give it a shot.

First, Jed did research on what external services we might like to use. Strike Iron emerged as one of the best options, so we began sending legacy addresses to their API in order to evaluate how many they’d be able to validate and correct. This back testing showed that Strike Iron would be able to supply valid addresses for the vast majority of the sample of the surveys we tried. It also showed that Strike Iron would even be able to suggest corrections for many otherwise undeliverable addresses.

Since then, it’s been really exciting to watch Jed and Tieg build the actual data flow, and I’m super proud of how it turned out.

So even though this is a somewhat behind-the-scenes change in the way Kickstarter processes data, it’s exciting to know that it will make the lives of many creators just a little bit easier.

Visualizing SOPA on Twitter

When I heard that Tyler Gray at Public Knowledge was looking for someone to do some analysis on tweets that mentioned SOPA, I thought I might try Cytoscape (an open source tool used for biomedical research, but handy for large scale data visualization) to show some of the relationships between people discussing the controversial bill on Twitter.

The result is a graph of the most active users referencing SOPA

Public Knowledge worked with the Brick Factory to set up their slurp140 tool to record approximately 1.5 million tweets, which Tyler sent me in the form of a 350 MB CSV file. I first used Google Refine to clean and narrow the set down to only tweets which were replies to someone else. This left approximately 80,000 tweets which I then imported into R. I then ranked all of the usernames by how often they appeared both as senders and recipients, and then picked the approximate top 1,000 users. Since replies are sent from one user to another, the graph is directed: each edge has a direction with an origin and an arrow pointing at the recipient. There are 1,021 nodes identified by their Twitter usernames, and 1,757 edges, a good portion of which are labeled with the content of their tweet.
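The ranking step was done in R. The Python sketch below only illustrates the idea: count each username’s appearances as sender or recipient, keep the most active ones, and retain the directed edges between them. The sample replies are hypothetical, and restricting edges to pairs where both users made the cut is an assumption.

```python
from collections import Counter


def top_users(replies, limit):
    """Rank usernames by how often they appear as sender or recipient."""
    activity = Counter()
    for sender, recipient in replies:
        activity[sender] += 1
        activity[recipient] += 1
    return [user for user, _ in activity.most_common(limit)]


def edges_among(replies, users):
    """Keep only replies where both ends are in the chosen user set."""
    keep = set(users)
    return [(s, r) for s, r in replies if s in keep and r in keep]


# Hypothetical (sender, recipient) pairs; real rows would come from the cleaned CSV.
replies = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "erin"),
]
users = top_users(replies, limit=2)
print(users)
print(edges_among(replies, users))
```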

Visualizing networks this large is more of an art than a science

I’ve tried to strike a balance between visual complexity, aesthetics and readability of tweets, but you’ll find that this isn’t always successful. Sometimes tweets run into nodes, sometimes edges run into labels, and sometimes the graph feels like a total mess. But that messiness is part of what made the SOPA debate so interesting over the last month.

Thousands of people participating with plenty of cross talk.

The colors and sizes of the nodes and edges are coded in the following ways:

  • A node’s size and its label size map to the number of tweets both posted by a user and mentioning that user. (Ex: @BarackObama is a huge node because so many people were tweeting at him about SOPA.)
  • Node color represents the number of outgoing tweets. The greener the node, the more replies a user posted. (Ex: @Digiphile sent a lot of tweets mentioning SOPA.)
  • Edge thickness represents “edge betweenness,” which is how many “shortest paths” run through the edge. This is a rough measure of how central a given tweet is in the network. (Ex: @declanm and @mmasnick have a thick line connecting them because many other nodes are connected to the two through that tweet.)
  • Edge color represents the language of the tweet. (Ex: Tweets in English are blue, Spanish are yellow.)
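To make the edge betweenness measure in the list above concrete, here is a small Python sketch that counts shortest paths through each directed edge of an unweighted graph, using Brandes-style accumulation. It is an illustration only: the graph is hypothetical, the scores are not normalized, and Cytoscape’s own calculation may differ in its details.

```python
from collections import deque


def edge_betweenness(adjacency):
    """Count shortest paths through each directed edge.

    adjacency maps each node to the list of nodes it points at.
    When several shortest paths tie, each gets an equal fraction of the credit.
    """
    score = {(u, v): 0.0 for u, targets in adjacency.items() for v in targets}
    for start in adjacency:
        distance = {start: 0}
        path_count = {start: 1}
        parents = {}
        visit_order = []
        queue = deque([start])
        # Breadth-first search records shortest distances and path counts.
        while queue:
            u = queue.popleft()
            visit_order.append(u)
            for v in adjacency.get(u, ()):
                if v not in distance:
                    distance[v] = distance[u] + 1
                    path_count[v] = 0
                    parents[v] = []
                    queue.append(v)
                if distance[v] == distance[u] + 1:
                    path_count[v] += path_count[u]
                    parents[v].append(u)
        # Walk back from the farthest nodes, sharing credit among parent edges.
        dependency = {node: 0.0 for node in visit_order}
        for w in reversed(visit_order):
            for u in parents.get(w, []):
                credit = path_count[u] / path_count[w] * (1 + dependency[w])
                score[(u, w)] += credit
                dependency[u] += credit
    return score


# Hypothetical reply graph: a replies to b, and b replies to c.
chain = {"a": ["b"], "b": ["c"], "c": []}
print(edge_betweenness(chain))
```

An edge scores highly when many node pairs can only reach each other through it, which is why a single reply between two well-connected users can end up as a thick line.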

The nodes are positioned using a “force-directed” algorithm, which is typically designed for undirected graphs, but I found it to be the most visually compelling of Cytoscape’s layout options. To learn more about force-directed graphs, take a look at this d3 tutorial visualizing the characters in Victor Hugo’s Les Misérables.

To really browse the graph visit GigaPan where I’ve uploaded a 32,000 x 32,000 pixel version.

I highly recommend GigaPan’s full screen mode. I’ve also created a couple snapshots on GigaPan that highlight interesting nodes: @BarackObama, @GoDaddy, and @LamarSmithTX21 and @DarellIssa.

If you really want, you can also download the 36mb gigapixel file, the Cytoscape source file, and the PDF vector version of the network graph.

Thanks again to Public Knowledge, The Brick Factory for providing the infrastructure to record the tweets, and everyone who has helped fight against SOPA and PIPA over the last couple of months, especially those who tweeted about it.

There’s No Such Thing as a Compulsory License for a Photo

My friend Andy has a terrific post up about his ordeal settling with the photographer Jay Maisel over the threat of a copyright lawsuit. Chances are, if you’re reading this, you know about that. If you haven’t read Andy’s story, go and read it and then come back.

There’s one pointed question I’ve seen crop up in a number of conversations about the settlement:

Isn’t it wrong that Andy chose to pay the licensing fees for the music but not for the photograph?

This question makes the assumption that Andy could have paid the licensing fees to Maisel like he did for the music. He couldn’t have. This is because Jay Maisel refused to license the image and there’s no compulsory license for photography like there is for musical compositions.

A compulsory license is what it sounds like: the owner of the underlying musical composition is required, by law, to license it to anyone who wants to use it at a predetermined rate. This prohibits song writers from picking and choosing who gets to perform their works. It also allows Andy to license, at a fair rate, the underlying song compositions from a Miles Davis album to make a new album of original recordings (remember, copyrights to recordings are different from copyrights to the compositions of a song).

The copyright of photographic works, unlike works of music composition, is not subject to a compulsory license.

This means that photographers, unlike song writers, can forbid anyone from reusing their work, whether it is for a poster or for an album cover.

Now, artists like Jay Maisel obviously enjoy this absolute control over their work because it lets them dictate who uses their art and when. Songwriters, unfortunately, aren’t afforded this control over their published works.

So while no one could have prevented Andy from recording an album of remixed music written by Miles Davis — not even Miles Davis himself if he were alive — the same does not hold for a photo of Miles Davis.

Remember, Maisel admitted he would have refused to license to Andy the rights to the photo. So Andy’s only option, short of not using the photo at all, was to use the 8-bit remix cover and wager it was a fair use.

That Andy could, in one case, hire artists to legally remix music by paying a compulsory license, but in another case had to make an expensive and risky bet on fair use (a bet that didn’t pan out) feels unfair.

Put another way: why are composers required to license their compositions at a fair rate to anyone, yet virtually every other type of artist doesn’t have to play by the same rule?

I doubt anyone would argue that song composition is a lesser art or any less deserving of full royalties than other arts.

One reason is that the practicalities of compulsory rights for photographs (and other works) are hard to imagine. Music compositions are written, then performed, then recorded, whereas photographs are snapped and then printed. There’s no underlying right in a photograph (thank goodness) to its “composition” like there is for a piece of music. So that is part of why compulsory licenses for photos don’t exist.

But I think another part of the story is that the law has evolved the musical compulsory license as an implicit acknowledgement that music compositions are both malleable and fundamental components of our culture. Compulsory licenses make possible everything from karaoke bars to cover bands to remixes like Andy’s. The alternative, allocating complete power to composers over who reuses their work, yields transactional costs on culture that are simply too high. The law hasn’t felt the same way about visual works.

So will other art forms, like photography, adopt compulsory licenses? I doubt it, but I can’t help but think they’d be a great compromise in light of Andy’s settlement. I asked Andy over email whether he would have paid a mechanical license for the photo:

“Absolutely. If the laws and protocols for remixing photos were as clear and fair as covering music, I would’ve bought a mechanical license for the photo in a heartbeat. But the laws around visual art are frustratingly vague, and requiring someone’s permission to create art that doesn’t affect the market for the original doesn’t seem right. I didn’t think it would be a problem, especially considering the scope of my project, but I was wrong. Nobody should need a law degree to understand whether art is legal or not.”

CrowdFlower NYU Event

I did a panel on Crowdsourcing at SXSW this spring with Lukas, the CEO of CrowdFlower. Even if you made it to that, you should definitely know about this great event at NYU on April 13th:

CrowdFlower and NYU present: NYC Crowdsourcing Meetup
Join CrowdFlower for its first ever New York City meetup co-hosted with NYU

Wednesday, April 13
6:30-9pm
44 West 4th Street, room M3-110

Pizza, beer, and thought-provoking conversation about the future of work.
Come listen, ask, and debate how crowdsourcing is changing everything from philanthropy and urban planning to creative design and enterprise solutions.

Confirmed Speakers:

Bartek Ringwelski, CEO and Co-Founder of SkillSlate
Panos Ipeirotis, Associate Professor at Stern School of Business, NYU
Lukas Biewald, CEO and Co-Founder of CrowdFlower
Amanda Michel, Director of Distributed Reporting at ProPublica
Trebor Scholz, Associate Professor in Media & Culture at The New School University
Todd Carter, CEO and Co-Founder of Tagasauris

See you there!

Sprint: A Portrait of a Dysfunctional Company


I’ve been trying to become a Sprint customer for almost a week.

I want to order two 4G Motorola modems for a redundant Internet connection at work.

I call the Sprint phone number to determine price, terms of service and what their refund policy is.

The telephone sales representative sounds like he’s communicating through a CB radio from Southern India. He gets pushy and wants me to sign up right then and there, but I say no thanks, I’ll try to go into the store and do it. That was my first mistake. There must be a Sprint store near where I work, and indeed, I find one on Google Maps. I call the phone number and listen to a beep-beep-beep-boop-this-number-has-been-disconnected message. I use Google Maps to find the phone number printed on the side of the store’s awning and call that number; no one picks up.

I locate another Sprint store, in the West Village, and call them to ask if they have the modems in stock. Nope. Do they get them in often? Nope. Do they know of any other stores that might carry them? Nope. Is it something that stores even carry? Not really, the woman on the phone has never seen them in stock and is pretty sure I have to order them online. I begin to order the modem online and get a Request Timeout error that looks like it’s from a default Apache configuration. This does not inspire a lot of confidence.

Bad

Finally get to the end of the online order form and am told that there is a problem with my account: my identity and credit need to be verified further. An email is sent to the same effect, encouraging me to call a phone number for more information. I call said phone number only to be told my information isn’t in the system, and that Sprint can’t help me yet. I need to wait between 24 and 48 hours to call back to verify my information. OK. I call back in the next couple of hours and try to get to the bottom of what they need from me: a copy of a utility bill with our information, faxed to a number with a cover sheet listing my email and an Application Number (not order number), in order to verify my identity.

I send the fax and wait.

The next day, with no confirmation or indication that they’ve been able to verify my identity, I call back and ask what the status of my order is. I wait on hold.

Worse

That is strange, Sprint says, they can’t find any evidence of my fax being sent in.

It turns out the representative that handled my previous call didn’t make a “note” of the impending fax so therefore no one thought to look for it.

Where did my fax go? Sprint isn’t sure, they can’t say because my order didn’t have a note in it saying that a fax was coming. So I just sent my utility bill to your fax machine and now no one knows what happened to it? Yes, that appears to be the case.

I’m going to have to send the fax in again, but this time they’ll make a note of it and be sure that it will be associated with my order. Will I get any confirmation that they’ve received my fax? Can I check? No, a credit department representative will be in contact with me within 24-48 hours to tell me whether I have enough credit to order the modems. How? They’ll either call or email.

Two days later an email arrives from Sprint’s Business Credit department saying that a line has been approved for me and that I am eligible for up to 5 devices and a $150 deposit per device and so on. I reply: OK does that mean my order can go through? No, the Business Credit department informs me that they do not handle orders. What should I do now? I have to call one of another dozen Sprint phone numbers (if anyone needs a company directory, let me know, I got ‘em all).

I get back on the phone with Sprint with someone who sounds more competent than the rest of the people I’ve been dealing with. She shows a modicum of empathy. “I’m just trying to give your company money. I want to be a customer and you’re making this very difficult for me.” “Yes, I understand” she says and starts dealing with the credit department.

Hold music and an advertisement from Sprint warning of the dangers of App Stores and espousing their company’s “personalization” taunt me. They play it over and over again and I try to remember to beg the representatives to just put me on silent hold so I won’t have to listen to this. But every time I forget, and I start to understand the potential of repetitive auditory stimuli for inducing psychosis.

Finally the woman comes back on to tell me that we’ll have to cancel the old order and try again. Does that mean I have to give you all the information I gave the website over the phone? No. But what she really means is yes. Yes, that is exactly what I’m going to have to do: as she goes through the list of information I recognize every question as a field from Sprint’s online checkout form. My own name and business address cease being meaningful concepts as I mumble them drearily into the phone.

More hold.

She’s back: ok we’re almost there. We need to charge the credit card. Get the credit card.

Getting the credit card. After an hour on hold, I can hear the excitement in both of our voices.

Whose credit card is this? It’s the company card. Who is that?

Can I speak to the card holder please? No, he’s our CEO and is very busy right now.

The only way we can charge that card is if I speak to the cardholder.

I hand my boss the phone and he says yes and gets asked what he just agreed to.

Back on the phone with the woman, ready to charge the card for the deposit plus first month of the modems. So very close.

The credit card was denied. My boss gets a phone call; it’s American Express verifying the charges. He spends 10 minutes on the phone with Amex trying to navigate their voice prompts to allow Sprint to charge our card.

Finally the modems are charged, and I get sent a confirmation email from Sprint for the new order.

Terrible

Back at work, Monday morning, check the status of the account: Your order has been canceled. Please call Sprint for more details.

I get passed from one person to another, and signing off, the first person says to me: for more information on your order, you can check sprint.com/myorder. I tell him that the reason I’m calling is that that page told me to call this number. He repeats himself and hangs up.

After half an hour on the phone, having talked to 4 different people, and at least one manager, I’m told that Sprint cannot verify my identity and they are incapable of fulfilling my order (or any order) over the phone. This concept no longer makes any sense to me, but I’m defeated at this point.

Surreal

Is there any possible way I can order these modems? Yes, I’m told, I can go to a store and order them.

What? Your stores don’t carry the modems. The stores told me to order them from the web.

Let me find you a store sir.

My apologies, sir, but our store locator is broken.

So I can’t order two modems from you– I can’t become your customer because your store locator is broken?

Maybe there’s a store outside of the city you can go to.

I have to leave NYC to order a modem from your company?

Our store locator isn’t working.

Do you have any idea when it will work?

No, they haven’t given us a time frame when it will work.

So not only can’t I order from your website, but I can’t even determine which stores I could go to?

I’m sorry sir.

They ask me for a new zip code and I give it to them. 10 minutes of hold later, I’m given 3 different stores with 3 different phone numbers. I call the first store.

Do you carry the modems?

No, you can only get them on the website because they’re for home use.

Sprint told me that I could get a store like yours to verify my identity and then I could order the modems over the web. Does that make sense to you?

Yes, that makes sense, but we can’t actually do that because we’re not actually a Sprint store.

So I was just told to call a store that Sprint thinks is a Sprint store but isn’t actually a Sprint store?

You’ll have to go to the main Sprint store to verify your identity. The one on 23rd.

Sell your Sprint Stock

Here’s a graph comparing AT&T’s (blue) and Verizon’s (red) stock with Sprint’s (green):

It is no mystery that Sprint is not doing well, and many people believe they are on the verge of bankruptcy. With sales (not customer) service like this, that’s not surprising. It shouldn’t be this hard to become a customer of anyone’s business.

Markov Chaining Kickstarter Blurbs

So I’m back from a wonderful couple of days of hanging out with the Kickstarter team near the beach. I managed to rent a surfboard and catch a couple of waves too. During a margarita-induced what-if-session, someone encouraged me to try and auto-generate some blurbs from the Kickstarter homepage. These are typically 150-200 character descriptions of projects that our community team labors over and refreshes daily.

I had worked with Markov Chains for Dan Shiffman’s class “Programming A to Z” at ITP and had done two projects using them: ROBODRUDGE (autogenerated Matt Drudge headlines) and The Rutabaga (an April Fools’ joke claiming that Google was attempting to compete with The Onion by using auto-generated news headlines). So they seemed like an obvious place to start. I found Eric Hodel’s Markov Chain ruby code readily usable and went from there.

To give some background on Markov Chains: the basic principle is to use probability to auto-generate new sequences based on old patterns. Sometimes these sequences are numerical, sometimes they’re musical, and sometimes they’re characters and words.
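As a rough illustration of that principle, here is a minimal word-level sketch in Python. It is not Eric Hodel’s Ruby code and not the script I actually ran; the function names (`build_chain`, `generate`), the `order` parameter, and the sample blurbs are all made up for this example. The idea is simply to record which words follow each pair of words in the source text, then walk those records at random.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        # Repeated followers stay in the list, so common continuations
        # are proportionally more likely to be picked later.
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, max_words=30, rng=None):
    """Walk the chain from a random starting state until it dead-ends or hits max_words."""
    rng = rng or random.Random()
    state = rng.choice(sorted(chain))  # sorted only to make seeded runs repeatable
    output = list(state)
    while len(output) < max_words:
        followers = chain.get(state)
        if not followers:
            break  # this state only appeared at the very end of the source text
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


if __name__ == "__main__":
    # Hypothetical stand-ins for real homepage blurbs.
    corpus = (
        "Anna is making a film about a town that builds a giant kite. "
        "Ben is making a zine about a band that builds a tiny robot."
    )
    print(generate(build_chain(corpus), max_words=20))
```

With a larger `order` the output sticks closer to the source text; with a smaller one it gets more scrambled. Two words of context is just the assumption made here.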

An example of exquisite corpse from Wikipedia.

Another way to think of Markov chains is as a computer’s attempt to play Exquisite Corpse: it is fed a certain amount of existing information and then it attempts to extrapolate a similar pattern. A classic example is to feed a Markov Chain Engine Shakespeare; not only is it readily available in raw text and in the public domain, but Markov Chain generated Shakespeare looks strikingly similar to the real thing (thanks to Jim’s Random Notes for the work):

If they in thou, thy love, that old,
Thought, that which yet do the heat did excess
My love concord never saw my woeful taste
At ransoms you for that so unprovident:
For thereby beauty show’st I am fled,
Althoughts to the dead, that care
With should thus shall by their fair,
Where too much my jade,
Her loves, my heart.

I pulled down a document with a majority of the text that Kickstarter has used to blurb projects on the homepage. Below is a subset of an almost infinite list of hilarious and sometimes disturbing auto-generated project blurbs:

  • It features monstrous puppets, mystic sex rituals, yellowface assassins, wildly stylized violence, and a songwriter who created that all-important childhood fave Schoolhouse Rock.
  • To promote NAIN, an all-new miniature setting for his REIGN roleplaying game, Greg Stolze has put together a group of thespians shouting thank you at their laptops to a great $3 price.
  • Karl Cronin wants to document and collect relics of the concoctions of Atlanta’s Good Food Truck.
  • Emilia Brock types out every word of her adorable and inventive zine, “Muster,” on a manual typewriter, has each copy illustrated by the Simpsons as a mobile CSA.
  • First Law of Mad Science channels cyber-punk and sci-fi to bring a Yakuza noir production to the emerging subculture of asexuality.
  • Rewards include CDs mixed by a visit from a song left on your voicemail to a unique fabric whose color + pattern is determined by keywords pulled from Twitter’s database in real time.
  • Joe Mangrum has been designing simple-but-compelling computer games based on Eugene O’Neill’s play.
  • The project will help them print their fourth issue, and editor Mindy (a drummer herself!) is offering backers personalized designs and original music and prove that the rail system is a five-minute short film about a roach violinist who falls in love with the bravado you’ll find in Memphis Heat, a documentary on the real-life appearance of a giant, hand-crafted, Rube Goldberg contraption.
  • After 13+ years of genetic testing and solitary confinement, Oliver’s getting a new horror-adventure comic about the movements of immigrants across vast bodies of water.
  • With just a camera in hand and boundless curiosity, Rebekah Potter interviews artists for her series 10min4walls, in which a twenty-something guy returns to Argentina to rekindle past excitement and romance, but instead is confronted with a musical twist, features a one-of-a-kind mood swing.
  • Fans have flocked to support him, and it’s not hard to see it for yourself: it’s available only to backers of the fastest competitive lockpickers in the form of a pig’s dissembling and its indie rock soundtrack.
  • Director Jonathan Langer will reward backers with the woman whose apartment he inhabits.
  • Common Cycle is a feature-length thriller about strained family relationships, small-town antics, and second chances.
  • I’m Going Home will be filled with a stowaway lab mouse as his only companion.
  • She’ll combine intimate interviews, vérité footage, and animation in a city decimated by the public for performances, gallery space, meetings, bike repairs, relaxing — or pretty much anything.
  • This August, thousands of individual leaf, flower, and bird forms from reclaimed wood and connecting them into an animated feature film starring comedic legend Leslie Nielsen.
  • Trailer Park: A Mobile Public Park is a witty comedic web series.
  • Missed Connections is a film about three friends on a loom into a community biology lab so that no two experiences are alike.
  • Check out his video and you can get a sense of humor and would like to be normal, and Nick’s lusting for the Queen of England and wrote a jazz pianist whose work was performed by Miles Davis, John Coltrane, and many others; was dead.
  • First Law of Mad Science channels cyber-punk and sci-fi to bring you a hand-drawn postcard from the original.
  • The video teaser feels wonderfully Tim Burton-esque, and the search for the over 45,000 participants who attend Burning Man every year.
  • This latest project will launch Gerlan’s Spring 2011 runway show, and she’s offering backers everything from print copies to mix tapes to personal drum lessons.
  • Brooklyn-based independent art collective Ugly Duckling Presse is publishing the first twist ending that we’ve come across in a unique fabric whose color + pattern is determined by keywords pulled from Twitter’s database in real time.
  • Check out their touching video and the Red Hot Chili Peppers.
  • Chicago’s Chinese Fine Arts Society has been making stunning sand paintings in public spaces for years, totally 160 in the journey with awesome drawings.
  • Following her lauded Gypsy Killers album, Sanda Weigl has a web series as The Office for actors or Entourage for poor people.
  • As the World Cup opens in South Africa, Stan Engelbrecht and Nic Grobler’s project to document the life histories of 200 plants and animals through expressive movement, which he’ll share in this hilarious pitch video for great world beats!
  • Fed up with the absurdist aesthetic of Dutch animator Emiel Stevenhagen.
  • The result is hypnotizing, and the Land of the Misfit Toys.
  • Check out their touching video and creative rewards includes communist-issue Mongolian Bomber Goggles!
  • Coyote Pursues is a charming character-driven video game for kids.
  • Coyote Pursues is a fan of the remaining villagers.
  • After the success of the most disgusting way of making coffee we’ve ever seen, plus fun quotes like Let’s go slightly less on the road, and you can still get the book for just $10, plus artwork, CDs, a photo book honoring space exploration.
  • Dario Ciriello’s homegrown publishing outfit is guided by a subway car or Wal-Mart aisle is a massive interactive Burning Man installation comprised of an old Dodge pickup with fresh, seasonal veggies, Truck Farm was born.
  • Cobbled together from actual footage and the Outs have recorded a sweet, limited-edition EP on cassette (you heard me right) to help everyone send their phones to space and is going to help everyone send their phones to space and designing an app that will take place in a beautiful color poster, the DVD, and tickets to see any show, or a private event thrown in your town AND cook you dinner- what’s not to love?
  • Expect a fabulous soup of literary aficionados chatting intelligently about a roach violinist who falls in love with the bravado you’ll find in Memphis Heat, a documentary on the real-life appearance of a giant, hand-crafted, Rube Goldberg contraption.
  • After 13+ years of genetic testing and solitary confinement, Oliver’s getting a new recording in the film’s virtual town.
  • With Kickstarter, the beloved web-comic slash zine will be reissued in an American town, Congress, in the last 30 years, the PCR machine has been a fascination and obsession for 400 years, made clear by the economic crisis.
  • Help her bring her spring collection to NY’s Fashion Week Green Shows and you might end up pledging $10 to see his son do one of a series of mathematical stories that will open and close by electrical current.
  • The goal of CicLAvia is to put all the way through the eyes of the great ’60s cult classics.
  • Fart Party is the subject of Angela Kline’s boisterous documentary, A Love Letter to Tom Waits.
  • Akimenko Meats is a powerful combination of portraiture, live audio, and writing, creators Kitra and Chris aim to offer an intimate glimpse of a secretive guy in search of his Jens Pulver Kickstarter project, Gregory Bayne has set his sights on his first film, Person of Interest.
  • The work is beautifully previewed in a fantastical world where every shared glance across a subway doo-wop group can be a kite-flying extravaganza!
  • By showcasing this innovative and highly accessible approach to cinema, filmmaker Benjamin Reece hopes to perform A Chinese Love Story for a new work of comics journalism exposing the human cost of trafficking.
  • Pick up a signed copy of the fastest competitive lockpickers in the city!
  • Fishtank Performance Studio in Kansas City works hard to see his fantastically illustrated children’s story become a real-life 13 ft. sculpture and installation at the end, the super bouncy balls will all go flying when she throws herself off a roof.
  • Operating Theater’s play Transatlantica revolves around a psychoanalyst who encounters a series of bicycle-powered food tours.
  • Backers can witness the event in real-time; $20 gets you a package of exotic recipes, hard-to-find ingredients, and info cards on your voicemail to a Brooklyn rooftop.
  • It’s an inventive project from SFHny Studio, a group of thespians shouting thank you at their laptops to a unique photo book and traveling gallery show, dubbing the project that’s so infectious.
  • Fed up with a variety of sculpted paper viruses.
  • Get rewards like an unreleased font and a songwriter who created that all-important childhood fave Schoolhouse Rock.

Most are composites of parts from two or three different project blurbs, and I’ve also tried to remove the ones that weren’t modified at all (sometimes MCEs just spit out unmodified sentences). I think part of the reason these work so well is that the original blurbs conform to a particular style of quippy, short descriptions structured around rewards and project topics.
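I weeded out the unmodified ones by hand, but the check could be sketched in code. The snippet below is a hypothetical Python helper, not something I actually used: `is_verbatim` flags a generated blurb that appears word-for-word inside an original, and `sources_used` lists which originals share a run of at least `n` words with it (the default of four words is an arbitrary assumption).

```python
import re


def _words(text):
    """Lowercase the text and keep only word characters, so punctuation and case don't matter."""
    return re.findall(r"\w+", text.lower())


def is_verbatim(generated, originals):
    """True if the generated blurb appears word-for-word inside any original blurb."""
    needle = " ".join(_words(generated))
    if not needle:
        return False
    # Pad with spaces so matches only start and end on word boundaries.
    return any(f" {needle} " in f" {' '.join(_words(o))} " for o in originals)


def sources_used(generated, originals, n=4):
    """Indices of the originals that share at least one n-word run with the generated blurb."""
    def ngrams(words):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    generated_runs = ngrams(_words(generated))
    return [i for i, o in enumerate(originals) if generated_runs & ngrams(_words(o))]
```

A generated blurb is worth keeping when `is_verbatim` is false and `sources_used` returns two or more indices, which roughly matches the "composite of two or three blurbs" pattern described above.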

Thoughts on Verizon and Google

In early 2007 I attended a talk at Fordham Law School by William Barr, the former US Attorney General and current Verizon General Counsel and Executive Vice President. The premise of his talk was that regulation, of the network neutrality kind, would only hurt technological innovation in the broadband and Internet space.

A lot has changed since then, and now that Google and Verizon have struck a deal purportedly threatening the openness of the future of the web, I thought I’d revisit some of my thoughts from that night as well as muse about what this deal might mean and why it’s happening now.

During his lecture Barr attempted to point out that there had never been an instance of a telecommunications company violating the terms of network neutrality, so why would they begin now? Out of nowhere, from behind me, someone shouted “What about Madison River?” That person was Tim Wu, who I didn’t personally know at the time, but who would later become a friend of mine. Tim had interrupted Barr to remind him about Madison River, where a local telecom had blocked VoIP connections for broadband subscribers because the telephone company didn’t want to compete with inexpensive internet telephony. It was precisely the kind of violation of network neutrality that Barr was claiming had never happened. Barr dismissed Madison River as an isolated incident which didn’t represent the overall policy of non-discrimination by the telecom industry.

Later in the lecture, Barr tried to envision an industry closely regulated by the FCC in order to uphold network neutrality. This would be a world that Barr thought no one would want: innovation would peter out as businesses faced a high barrier to entry in the form of regulations. Conversely, if corporations had the opportunity to really invest in research and development without the fear of future regulatory action, then they might come up with services and tech that would be even better than TCP/IP. Barr believed that it was naive for us to blindly accept that TCP/IP was the best we were going to get for transferring data and communications over a network. Who is to say Verizon or AT&T couldn’t come up with a better protocol? TCP/IP has plenty of performance issues (real time synchronous voice communication was a huge challenge), so why not let Verizon innovate at the protocol level, and sure, maybe they’d prioritize some kind of traffic, but it would be for the benefit of technological innovation. Just think of all the potentially amazing applications they could come up with if the FCC just left the innovation to Verizon’s R&D lab instead of the open internet and the public?

Just say no to walled gardens.

During the question and answer period, I asked Barr why he thought that consumers wanted more walled gardens of content, and whether it was wise to assume the market was going to support another set of AOLs, Compuserves and Prodigys? He replied that of course consumers wanted better content: video on handheld devices was going to be the future and the telecoms were going to be the only companies who could deliver it. I insisted that consumers only really want the internet in their pockets and that he was kidding himself if he thought a curated walled garden on a handset would be nearly as appealing as an actual functional web browser (something no mobile company had delivered yet).

In a sense we were both wrong and we were both right. Consumers did want mobile video on demand, but they also wanted the entire open web in a functional experience.

Prior to Barr’s lecture Verizon had announced a half-baked partnership with YouTube which would offer limited and selected versions of YouTube videos for watching on handheld devices. Then, a couple of months later, Steve Jobs announced the iPhone, which would have even greater support for YouTube. Verizon was banking on curated portals inside hobbled handsets, and Apple had just bet the farm on the touchscreen and a mobile Safari browser. We know who won this battle. Does anyone ever talk about watching YouTube on their 3-year-old cell phone anymore? Does anyone even remember the partnership?

Why neither Verizon nor any of the other telecoms ever fully invested in a serious mobile browsing experience is best explained by their general hostility to the open web. The big telecoms have always loathed the net, whether it was manifest in an engineering snobbery towards the “dumbness” of TCP/IP or the fact that the net worked best when it treated their products not like products at all but like common utilities, something no company wants. So it has never been surprising that the telecommunications industry never bothered to create a real mobile browsing experience; they were more eager to strike Big Deals with Exclusive Providers of Proprietary Content than to supply an actual connection to the open web.

Steve Jobs, to his credit, saw the opportunity to serve consumers what they really wanted, and he and Apple have since been handsomely rewarded for creating a mobile browsing experience worth using. Google’s choice to freely offer Android was a brilliant bit of strategy: all of the telecommunications firms and handset manufacturers were panicking and desperate to compete with Apple’s iPhone, so why not supply them with what they wanted?

So now Verizon and Google are making an uneasy deal behind the FCC’s back and trying to reassure the FCC and the public that they’re really doing it in the name of technological innovation. Think about all the applications that could exist if we didn’t have to rely on the Internet! Healthcare Monitoring! The Smart Grid! Advanced educational services! Incredible entertainment and gaming options! These are all ghosts of walled gardens past and there’s no reason to believe that a competitive startup can’t supply these exact services over the open web.

The wireless component of the Google/Verizon deal is the biggest wild card and the most controversial aspect of their joint policy proposal. The two companies argue that the principles of network neutrality shouldn’t apply in the wireless space. I couldn’t agree less. The telecoms have demonstrated very little capacity for innovation in the wireless space in the last 15 years (why is it so hard to develop SMS applications? why is Google Voice such a pain to reconfigure as my voicemail? etc.), so why would we trust them now?

Ultimately, why shouldn’t the principles of common carriage and network neutrality apply to the wireless space? Because it’s too difficult? Too expensive? I don’t buy it. What the wireless space needs now is faster and cheaper TCP/IP service and a more open application infrastructure. Negotiating one-off deals for new channels and services will only remind us of Compuserve circa 1999.

Lessig, Crawford and Wu have a good post about the proposal, but also read Jonathan Zittrain’s thoughts on it here too.