The Data Behind My Ideal Bookshelf

Thessaly La Force recently published a book with the artist Jane Mount called “My Ideal Bookshelf.” In it, Thessaly interviews over 100 people and Jane paints their bookshelves:

The books that we choose to keep –let alone read– can say a lot about who we are and how we see ourselves. In My Ideal Bookshelf, dozens of leading cultural figures share the books that matter to them most; books that define their dreams and ambitions and in many cases helped them find their way in the world. Contributors include Malcolm Gladwell, Thomas Keller, Michael Chabon, Alice Waters, and Tony Hawk among many others.

As I observed Jane and Thessaly compile the book over the last year, I couldn’t help but think about all the fun opportunities I could have exploring the data behind the shelves.

Contributor Neighbors

Each of the 101 contributors Thessaly interviewed picked as many books as they thought represented their ideal bookshelf, and I knew some of them would pick identical books.

So what would a taste graph linking contributors to each other using the books on their shelves look like?

Previously, I had worked with Cytoscape to render network graphs, but this seemed like a good opportunity to make something interactive and also a perfect first project to really use d3. I can’t wait to do more with it.

Hover over each node to see the contributor’s neighbors.

The graph is laid out with d3’s force-directed graph layout. Each node represents a contributor’s shelf and is colored by that contributor’s profession.

Each active link is then colored by the neighbor’s profession. The nodes at the center of the graph have the most neighbors and exert the most pull over the rest of the graph. Try clicking and dragging a node to get a good feel for its centrality.
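To make the idea of "neighbors" concrete, here is a minimal Python sketch (not the code behind the interactive graph). It assumes a small, made-up mapping of contributors to book titles, treats two contributors as neighbors when their shelves share at least one book, and ranks contributors by how many neighbors they have. The contributor names and the `build_neighbors` helper are placeholders for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: contributor -> set of book titles on their shelf.
shelves = {
    "Contributor A": {"Lolita", "Moby Dick", "Ulysses"},
    "Contributor B": {"Lolita", "Jesus' Son"},
    "Contributor C": {"Moby Dick", "Lolita"},
    "Contributor D": {"A Book Nobody Else Picked"},
}


def build_neighbors(shelves):
    """Map each contributor to the contributors who share at least one book."""
    neighbors = defaultdict(set)
    for a, b in combinations(shelves, 2):
        if shelves[a] & shelves[b]:  # non-empty intersection means a shared book
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Include contributors with no neighbors so every shelf gets a node.
    return {name: neighbors[name] for name in shelves}


neighbors = build_neighbors(shelves)

# Contributors with the most neighbors would sit nearest the center of the graph.
by_degree = sorted(neighbors, key=lambda name: len(neighbors[name]), reverse=True)
for name in by_degree:
    print(name, len(neighbors[name]))
```

Counting neighbors this way is plain degree centrality; the force layout itself does more than this, but the count is a reasonable first guess at which nodes will pull hardest.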

Genres and Professions

The distribution of contributors’ professions skews heavily toward writers, so the professions aren’t evenly represented: 41 of the 101 bookshelves were from writers, and there were a total of 35 unique professions represented.

In the graph above, I included only professions represented by two or more contributors. Each circle is sized in proportion to a genre’s share of the books chosen by that profession.

For example, fiction books make up the plurality — 31% — of books chosen by writers. Similarly, 55% of the books chosen by photographers were classified as photography.
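The shares quoted above boil down to counting genres within each profession and dividing by that profession's total. Here is a minimal Python sketch of that calculation, assuming one made-up (profession, genre) row per chosen book; the rows and the `genre_shares` helper are illustrative and are not the real dataset.

```python
from collections import Counter, defaultdict

# Hypothetical rows: one (profession, genre) pair per chosen book.
picks = [
    ("writer", "fiction"), ("writer", "fiction"),
    ("writer", "poetry"), ("writer", "history"),
    ("photographer", "photography"), ("photographer", "photography"),
    ("photographer", "photography"), ("photographer", "fiction"),
]


def genre_shares(picks):
    """Return {profession: {genre: share of that profession's books}}."""
    counts = defaultdict(Counter)
    for profession, genre in picks:
        counts[profession][genre] += 1

    shares = {}
    for profession, genre_counts in counts.items():
        total = sum(genre_counts.values())
        shares[profession] = {g: n / total for g, n in genre_counts.items()}
    return shares


shares = genre_shares(picks)
for profession, by_genre in shares.items():
    top_genre = max(by_genre, key=by_genre.get)
    print(f"{profession}: {top_genre} ({by_genre[top_genre]:.0%})")
```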

This next graph is a violin plot showing the distribution of page counts of the books chosen by each profession:

Excluding the Oxford English Dictionary, which comes in at 22,000 pages, the legal scholars (Larry Lessig & Jonathan Zittrain) had the highest average page count at 475, and all the books chosen by the two photographers had page counts under 500.
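Here is a minimal Python sketch of that kind of per-profession average with an outlier left out. The rows, the profession labels, and the 10,000-page cutoff are all assumptions made up for illustration; the post only says the OED was excluded, not how.

```python
from statistics import mean

# Hypothetical rows: (profession, title, page_count).
books = [
    ("legal scholar", "Book A", 400),
    ("legal scholar", "Book B", 560),
    ("photographer", "Book C", 120),
    ("photographer", "Book D", 300),
    ("other", "A Very Long Dictionary", 22000),
    ("other", "Book E", 200),
]

MAX_PAGES = 10_000  # assumed cutoff, chosen only to drop the one extreme outlier


def average_pages_by_profession(books, max_pages=MAX_PAGES):
    """Average page count per profession, skipping books above max_pages."""
    pages = {}
    for profession, _title, page_count in books:
        if page_count > max_pages:
            continue  # leave the outlier out of the average
        pages.setdefault(profession, []).append(page_count)
    return {profession: mean(counts) for profession, counts in pages.items()}


averages = average_pages_by_profession(books)
for profession, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{profession}: {avg:.0f} pages on average")
```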

Year vs. Page Count

Taking it a step further, here is almost every book (again, excluding the OED), plotted by the year it was published (X-axis) against the log10 of its page count (Y-axis).

The size of each point represents the number of ratings the book had accumulated on Google Books, and its color represents the position in which the contributor placed it on their shelf.

The darker the point, the farther to the left of the shelf the book was placed. Conversely, small teal dots represent books with relatively few ratings that were placed toward the right of the contributor’s shelf.

Unfortunately, not all the books had publishing dates that made sense — Google reported the dates of Shakespeare’s works anywhere from 1853 to 1970 to 2010, and when would you say the Bible was published? 1380? 1450?

Setting aside these erroneous points, I think the graph still works. Check out the cluster of books published in the early 20th century chosen by the writers in the upper left of the graph: they represent popular early 20th century books with page counts between 200 and 500.

Summary Stats

  • Number of shelves: 101
  • Number of books chosen: 1,645
  • Unique books according to Google’s API: 1,431
  • Average number of books chosen: 16.28
  • Average page count: 381.2
  • Average year of publication: 1992
  • Top 5 chosen books:
    1. Lolita (chosen by 8 contributors)
    2. Moby Dick (chosen by 7 contributors)
    3. Jesus’ Son (chosen by 5 contributors)
    4. The Wind-Up Bird Chronicle (chosen by 5 contributors)
    5. Ulysses (chosen by 5 contributors)
  • Top 5 authors:
    1. William Shakespeare (10 different books)
    2. Ernest Hemingway (7 different books)
    3. Graham Greene (7 different books)
    4. Anton Chekhov (6 different books)
    5. Edith Wharton (6 different books)
  • Contributor with the most books: James Franco
  • Contributor with the most shared books: James Franco
  • Longest book: The Oxford English Dictionary, Second Edition: Volume XX*, chosen by Stephin Merritt, 22,000 pages
  • Shortest book: Pac-Mastery: Observations and Critical Discourse by C.F. Gordon, chosen by Tauba Auerbach, 12 pages

* Jane was only able to paint one volume of Stephin’s OED for his shelf, but the authors agreed it could stand as a synecdoche for his choice of the entire edition.

Overall…

The data I cobbled together from My Ideal Bookshelf is far from perfect, but I think it does a good job of illustrating some of the larger themes and relationships behind the book.

For example, the two legal scholars picked, on average, some of the longest books in the set, and professionals tend to pick books related to their jobs (check out the large proportion of photography books chosen by photographers). Also: if you’re James Franco and pick a ton of books, you’re gonna have a lot of neighbors in a network graph.

I also discovered how skewed the dataset was towards the choices of writers — I jumped in expecting a diverse set of contributors which might be useful for representing an ideal ideal bookshelf — but that would have been a difficult case to make when 40% of the contributors were from one profession.

This might sound like an obvious observation (and something I should have known, having spent so much time thinking about the book), but it wasn’t something I could really see until I looked at a simple histogram of the contributors’ professions. So remember: it’s always worth thinking critically about whether your samples are representative of the underlying distribution, and simple exploratory data analysis can really help you out there.
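A histogram like that takes only a few lines. Here is a minimal Python sketch using made-up profession labels (the real charts in this post were made with R and ggplot2); the `text_histogram` helper is a name invented for this example.

```python
from collections import Counter

# Hypothetical profession labels, one per contributor.
professions = ["writer"] * 5 + ["chef"] * 2 + ["designer"] * 2 + ["photographer"]


def text_histogram(values):
    """Return one text bar per distinct value, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    lines = []
    for value, count in counts.most_common():
        share = count / total
        lines.append(f"{value:<14}{'#' * count} {count} ({share:.0%})")
    return lines


for line in text_histogram(professions):
    print(line)
```

Even this crude view makes a lopsided sample hard to miss: one bar dwarfs the rest.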

Bigger picture, I think this skew demonstrates the nature of coming up with an ideal list of anything: no matter who you ask, the task is essentially a subjective one. Here, it’s biased towards the network Thessaly and Jane were able to tap to make the book.

Cleaning and Reconciling the Data

While creating the book, Thessaly and Jane had carefully compiled an organized spreadsheet listing each contributor and their chosen books, but I knew there could be subtle typos here and there. These typos could throw off the larger analysis: if two entries for the same book differed even slightly in their titles, or one was missing some punctuation, then aggregating by title would be problematic. I also wanted additional data about the books (data which Thessaly and Jane didn’t record), such as the year the book was published, or the number of pages in the book.

So I figured I’d kill two birds with one stone and look for an API which I could automatically search using the title they had entered into the spreadsheet, and get a best-guess at the “true” book it represented. The API would also hopefully return a lot of useful metadata that I could use down the line.

It turns out the Google Books API is the perfect tool for such a job. I could send a book title to it and get back the book that Google thought I was looking for. This allowed me to lean heavily on Google’s excellent search technology to reconcile book titles that might have had typos, while also retrieving the individual book’s metadata. While I could have used something like Levenshtein distance to try to find book titles that were close to each other in the original dataset, I wouldn’t have been able to retrieve any additional metadata.
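For reference, here is roughly what the Levenshtein route would have looked like: a minimal Python sketch of the standard dynamic-programming edit distance, plus a small helper that flags two titles as likely the same book. The `normalized` and `likely_same_title` helpers and the threshold of 2 edits are assumptions for illustration, not something used in this analysis.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalized(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(title.lower().split())


def likely_same_title(a: str, b: str, max_distance: int = 2) -> bool:
    """Assumed rule of thumb: titles within max_distance edits are one book."""
    return levenshtein(normalized(a), normalized(b)) <= max_distance


print(likely_same_title("Moby Dick", "Moby-Dick"))
print(likely_same_title("Lolita", "Ulysses"))
```

A fixed threshold like this breaks down on short titles, where two edits can turn one real book into another, which is one more reason to lean on a search API instead.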

A quick side note for copyright nerds: the Google Books API played into the HathiTrust case recently, and I’d like to imagine use cases like this were part of the reasoning behind declaring Google’s use of copyrighted materials a fair use.

The Google Books API allows 1,000 queries a day, but since the list contained more than 1,600 books, I had to write in and ask for a quota extension. Thanks to whoever at Google granted that: I now get to hit it 10,000 times a day, which was enough to iterate on a script to compile the data.

Not surprisingly, Google’s API returns a ton of metadata: everything from a thumbnail of the book’s cover, to the number of pages, to the medium in which it was originally published, to its ISBN-10 and ISBN-13. The list goes on. I tried to choose fields that I knew would be interesting to aggregate on, but also ones that would help me uniquely identify the books.
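The original script was written in Ruby and isn't reproduced here. As a rough illustration of the lookup step, here is a minimal Python sketch that builds a query against the public volumes endpoint and pulls a handful of fields out of the top result. Which fields to keep, the output key names, and the choice to trust the first result are assumptions; the sample response at the bottom is invented so the parsing can be tried without a network call.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(title, max_results=1):
    """Build a search URL for a title as typed into the spreadsheet."""
    return f"{API_URL}?{urlencode({'q': title, 'maxResults': max_results})}"


def extract_fields(response):
    """Keep a few fields from the top search result, or None if no match."""
    items = response.get("items") or []
    if not items:
        return None
    top = items[0]  # assumption: trust Google's first result as the best guess
    info = top.get("volumeInfo", {})
    return {
        "google_book_id": top.get("id"),
        "google_title": info.get("title"),
        "authors": info.get("authors", []),
        "published_date": info.get("publishedDate"),
        "page_count": info.get("pageCount"),
        "categories": info.get("categories", []),  # often missing
        "ratings_count": info.get("ratingsCount"),
    }


def lookup(title):
    """Query the API for one title (makes a network call and uses quota)."""
    with urlopen(build_query_url(title)) as resp:
        return extract_fields(json.load(resp))


# Invented response with the same shape as a real one, for offline testing.
sample = {"items": [{"id": "abc123", "volumeInfo": {
    "title": "Moby Dick", "authors": ["Herman Melville"],
    "publishedDate": "1851", "pageCount": 500}}]}
print(extract_fields(sample))
```

Storing Google's volume ID alongside the title gives a stable key for deciding that two contributors chose the same book, even when their spreadsheet entries were spelled differently.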

One particular piece of metadata that was often missing was the genre of the book: only 28% of the books returned from Google had category information. One way to fill the gap would have been to set up a Mechanical Turk task asking humans to determine the books’ genres. Classifying books this way is actually a very difficult and somewhat subjective problem. Just think of how complicated the Dewey Decimal system is.

Finally, not all data is created equal. I’ve manually corrected a handful of incorrect classifications from Google where the search results clearly did not return the right book, but it’s certainly possible that not all books were recognized or reconciled properly.

The tools

Aside from d3 for the interactive plot at the top of this post, I used R and ggplot2 to create the static graphs.

The script I used to query the Google Books API was written in Ruby. It exported the data to a CSV, which I then loaded into MySQL and Google Docs to manually review and spot-check.

Here’s the query I used to generate the data necessary for the force-directed graph:

-- Self-join the table to pair up contributors who chose the same book.
-- Each pair appears twice, once in each direction.
SELECT
  ideal_bookshelf_one.contributor_id AS source,
  ideal_bookshelf_one.google_title AS book,
  ideal_bookshelf_two.contributor_id AS target
FROM
  ideal_bookshelf AS ideal_bookshelf_one,
  ideal_bookshelf AS ideal_bookshelf_two
WHERE
  -- same book (by Google's ID), different contributors
  ideal_bookshelf_one.google_book_id = ideal_bookshelf_two.google_book_id
  AND ideal_bookshelf_one.contributor_id != ideal_bookshelf_two.contributor_id

Sequel Pro’s “Copy as JSON” was extremely helpful here — it took relatively little effort to munge the SQL results into an array of nodes and links required by d3’s force layout.
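For anyone without Sequel Pro, the munging step can be sketched in a few lines of Python. This is an illustration, not the process used for this post: the rows are made up, and it assumes the force layout wants a `nodes` array plus `links` whose `source` and `target` are positions in that array (how d3's force layout accepted links at the time). The `rows_to_graph` helper is a name invented for this example.

```python
import json

# Hypothetical rows shaped like the query's output (source, book, target).
# The query returns every pair in both directions, so rows come mirrored.
rows = [
    {"source": 1, "book": "Lolita", "target": 2},
    {"source": 2, "book": "Lolita", "target": 1},
    {"source": 1, "book": "Moby Dick", "target": 3},
    {"source": 3, "book": "Moby Dick", "target": 1},
]


def rows_to_graph(rows):
    """Turn (source, book, target) rows into {"nodes": [...], "links": [...]}."""
    ids = sorted({r["source"] for r in rows} | {r["target"] for r in rows})
    index = {contributor_id: i for i, contributor_id in enumerate(ids)}
    nodes = [{"contributor_id": contributor_id} for contributor_id in ids]

    seen = set()
    links = []
    for r in rows:
        # Treat A->B and B->A on the same book as a single undirected link.
        key = (frozenset((r["source"], r["target"])), r["book"])
        if key in seen:
            continue
        seen.add(key)
        links.append({
            "source": index[r["source"]],
            "target": index[r["target"]],
            "book": r["book"],
        })
    return {"nodes": nodes, "links": links}


graph = rows_to_graph(rows)
print(json.dumps(graph, indent=2))
```

Dropping the mirrored rows is optional; keeping them just draws each link twice.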

If you liked this post…

Pick up a copy of Thessaly and Jane’s book today!
