One of the things I really enjoyed about the Internet Archive Open Library project was the software they used to attempt to determine whether works they were scanning were or were not under copyright. It was an elaborate set of questions and answers with access to some copyright databases. In contrast, unless I’m mistaken, Google Books just draws a line at 1923 and assumes everything after that date is in copyright. This includes government information which, as you know, is made with tax dollars and is generally in the public domain. So why does Google Book Search treat all post-1923 books as under copyright? Just over-cautious?
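To make the contrast concrete, here is a minimal sketch of the two approaches. It is not the Open Library's actual software or anything Google runs. The `Book` fields, the function names, and the simplified rules are all assumptions for illustration: US federal government works are treated as public domain, and a book published from 1923 through 1963 with no renewal record found is treated as likely public domain.

```python
from dataclasses import dataclass

CUTOFF_YEAR = 1923  # the line the post says Google Books draws


@dataclass
class Book:
    title: str
    year: int                         # year of first US publication
    us_government_work: bool = False  # hypothetical flag from catalog data
    renewal_found: bool = False       # hypothetical result of a renewal-record lookup


def cutoff_only(book: Book) -> str:
    """Date-line approach: anything from the cutoff year on is treated as in copyright."""
    if book.year < CUTOFF_YEAR:
        return "public domain"
    return "treat as in copyright"


def question_and_answer(book: Book) -> str:
    """Ask a few questions in order and stop at the first one that settles it."""
    if book.us_government_work:
        return "public domain"
    if book.year < CUTOFF_YEAR:
        return "public domain"
    # Simplified assumption: 1923-1963 books needed a renewal to stay in copyright.
    if book.year <= 1963 and not book.renewal_found:
        return "likely public domain (no renewal found)"
    return "treat as in copyright"


report = Book("Agency report", 1950, us_government_work=True)
print(cutoff_only(report))          # treat as in copyright
print(question_and_answer(report))  # public domain
```

The government report is the case from the paragraph above: a bare date line locks it up, while a couple of extra questions would release it.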
James Jacobs, the guy from diglet who had been writing to Google to try to get “find in a library” added to ALL Google Book Search results, went to see Daniel Clancy, the Engineering Director for the Google Book Search Project, speak at Stanford. While the talk wasn’t to librarians and wasn’t really about the social implications of the book search, James did learn a few things.
- Clancy mentioned that Google was NOT going for archival quality (indeed COULD not) in its scans and was OK with skipped pages, missing content and less than perfect OCR. He mentioned that the OCR process AVERAGED one word error per page of every book scanned.
- About 70% of the book project use was coming from India.
- 92% of the world’s books are not generating revenues for copyright holders or publishers.
If Google Book Search really interests you, you might also like to read The Google Library Project: Both Sides of the Story [pdf, today's library link o' the day], which discusses some of the misinformation and lawsuits surrounding the Google Library project and comes down on the side of Google’s fair use position.
University of Michigan President defends their relationship with the Google Book Search project (full speech pdf download). Intriguing comments below including Siva and a plug by librarians’ own SuperPatron on another way of harnessing the power of the Google Book Search for libraries. [thanks molly]
How come only some books in the Google Book Search have “find in a library” links next to them? Diglet asks, and gets an answer, sort of a lame one if you ask me. Update: Kevin mentioned in the comments that it would be great to see this for all books in Google Books. I went to bed thinking “Oh yeah, I should look into that….” and while I was sleeping, Superpatron, aka Ed Vielmetti, solved the crime, er, problem, and created a Greasemonkey script (a plug-in that you can run with Firefox) that does this for Ann Arbor and can be modified for any library.
Hey clue club, any Harvard or Boston area librarians want to solve the “what the heck is this” mystery alluded to in this blog post? It looks like a handwritten version of the poem printed in the book, but without page numbers or any other indication that it’s part of the book. Table of Contents is mum on what’s going on. Anyone know, or want to go check out the book at Harvard and see? [thanks chase]
You can use the date operator to browse public domain books in Google Books. I’m not entirely sure why the covers of some of these books remain under copyright. Any ideas? I’ve also noticed a few scanning errors and some pretty neat finds like this one which gives the name of every librarian in the US and Canada working in a library holding over 1,000 volumes. Google Books clearly uses keyword indexing to make these books searchable. How great would it be to have this one in a database? You can see a few images that I particularly liked over at Flickr.
Hi. This is the presentation that Andrea and I are watching right now in San Francisco. The Open Library. Brewster Kahle is talking now and doing a book scanning demonstration. I like how he says “librarians” a lot.
Vision of an Open Library
The Web is so post-1996, what about older content?
Everyone is part of it: Amazon helps “expand the bookstore” but we’re looking for inclusivity.
“A great library for the published works of humankind, accessible to all… everybody involved… libraries LIVE based on the publishing system, they will be involved.”
3 to 4 billion of the 12 billion that libraries spend every year goes to publishing. Let’s have more of that go to fairly compensating everyone.
“For the near term, we’re making books from books.” It’s hard to digitize a book so that it looks like the original; this is the proof that it can work.
1. Selection. Librarians choose books. Start with out of copyright materials, work towards in print, orphans next. “we’re not going to run out”
2. Scanning. 500 dpi “scribe system” 30-60 min per book. “we can read a 2 pt typeface, straight on” metadata, saved to archive
3. Cataloging. Use library data and coordinate between scanning centers using MetaFetch. Groups like RLG are coordinating.
4. Copyright. Copyright law is “a little confusing.” Evidence-based interface allows a Q&A “is this book under copyright” interrogation. Many books not re-registered copyright-wise. Already scanned copyright renewal records into a searchable database. Larry Lessig is bringing a suit re: orphan works and whether they can be in the virtual library. Other for-profits are working back the other way. It’s “tricky but doable”
5. Storage. 6 GB per book, hard to scale. Built a petabyte-scale machine “petabox” [I saw it] low power, runs cool, “set top boxes” not full computers with OSes etc. Object is not to have one box in an earthquake zone, but distributed system in flood zones & elsewhere.
6. Readers. Software. Check it out at openlibrary.org. UC librarians chose early set of books already scanned. Also looking into PDFs for printing. Also working with lulu.com for print on demand. Also, you can listen to these books.
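For a sense of scale, here is some back-of-envelope arithmetic using only the figures in the notes above (6 GB per book, 30-60 minutes per book on a scribe, petabyte-scale storage). The function names are made up for this sketch, and it assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes) with no allowance for redundant copies or derived formats.

```python
GB = 10**9   # assumption: decimal gigabyte
PB = 10**15  # assumption: decimal petabyte

BYTES_PER_BOOK = 6 * GB  # "6 GB per book" from the talk notes


def books_per_petabyte(bytes_per_book: int = BYTES_PER_BOOK) -> int:
    """Whole books that fit in one petabyte at the given size per book."""
    return PB // bytes_per_book


def scribe_hours(num_books: int, minutes_per_book: float) -> float:
    """Total scanning time in hours for one scribe station."""
    return num_books * minutes_per_book / 60


print(books_per_petabyte())    # 166666
print(scribe_hours(1000, 30))  # 500.0
print(scribe_hours(1000, 60))  # 1000.0
```

So one petabyte holds roughly 166,000 books at that size, and a thousand books is somewhere between 500 and 1,000 scribe-hours, which is why “hard to scale” got a mention.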
Other mentioned projects: ICDL, Internet Archive Bookmobile [buck a book!]. BookShare will use this content for access for the blind. $100 laptop will be integrating books from this project onto their laptops [big news!]. Open Content Alliance to create protocols and formats.
Brewster Kahle: “I don’t know what it will be like to have books from our libraries injected into our culture again, but I’d like to see it”
“Knowledge for the World” is the mantra of all the funders [on and off the podium, 30 seconds each: Smithsonian (museums/content), Yahoo, Sloan Foundation (funding), Johns Hopkins (content/tech), RLG (cataloging), Adobe (display/doc formatting), HP (scan), LizardTech (data compression), Lulu.com (printing), MSN Search (search/funding) etc]
Guy from Yahoo: “Finally a library I won’t get thrown out of” and “Find, use, share, and expand all human knowledge”
Andrea has more, including some links that I missed.
Google Print Library going on hold over copyright is big news in our world. Copyfight followed up on the story. Of particular interest are the comments, with people speculating on the copyright-kosherness of a publicly traded for-profit company freely scanning, copying and indexing content that is not owned by them without negotiating for rights. Other popular copyfighters Siva and Seth have worthwhile insights.
Siva: “Once again, I think we should recognize that unless we think copyright should not exist, copyright holders should be able to decide when to license their works to other companies. This is far from absolute. But it’s common sense and generally true. Only in unusual circumstances, such as when markets fail to provide an essential public good, should we consider radical moves. This is not one of those cases. The service is not an essential public good — just a cool idea. And the market was not failing. Publishers were at the table…. Google messed up by going all unilateral on the publishers. There was no market failure here. Transaction costs were not prohibitive. They were working out the deal. This was not the recording industry shunning Napster. This was how copyright is supposed to work.”
Seth: Why is Google doing this book-scanning project? It’s not because it’s just so cool (even if it is). While coolness may justify a small-scale promotional project, the scanning efforts are expensive. So Google, as a company, obviously sees some value in the effort. This is not wrong. But it’s also a direct conflict with the granted monopoly known as copyright. Whenever there is value, particularly commercial value, there is conflict over who should be able to receive it.