The Open Comprehensive Digital Library: A Vision Imperiled
The Library of Alexandria, the most famous library of antiquity, continues to capture our imagination. The sense of romance evoked by a library containing all of the world's knowledge is heightened by the sense of loss evoked by the library's destruction.
In an age in which we rely on the Internet to meet our information needs, it seems possible to recapture and even improve upon the Alexandrian ideal. While access to the ancient library was limited by geography and privilege, a digital reincarnation could be accessible anywhere to anyone with a mobile device.
Unfortunately, the vision of an open and comprehensive digital library will not be easily realized; great achievements often require a struggle. While technological barriers have been rolled back, restrictive copyright laws and outdated business models continue to pose significant obstacles.
Copyright and Digital Libraries
Effective copyright laws promote innovation and creativity by allowing authors to reap financial rewards for their work. Consequently, copyright promotes the public good. The framers of the U.S. Constitution recognized this by granting Congress the power “To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” The first U.S. copyright act, passed in 1790, remained true to the spirit of the Constitution, providing a limited copyright term of fourteen years, renewable for another fourteen. The law’s requirement for registration and renewal of copyright provided a paper trail that facilitated copyright research and verification.
Unfortunately, copyright has become an obstacle to digital library building. In the past four decades, Congress has dramatically lengthened the copyright term, which now extends to 70 years after the death of the author. Copyright has also become automatic; registration, renewal, and explicit copyright statements are no longer required. The length of the copyright term dramatically limits the body of works available for digitization. The elimination of any registration or renewal process makes it difficult to find rights holders. The removal of the requirement for an explicit copyright statement means that a work an author may have intended to share freely must nevertheless be treated as under copyright. In practice, the effect is to make mass digitization of books published from 1923 to the present impossible. Copyright laws, intended to promote the public good, have become so restrictive that they detract from the public good.
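The interaction of these rules can be sketched as a toy heuristic in Python. This is an illustration of the logic described above only, not a legal test: the function name is invented, and the 1923 cutoff and life-plus-70 term are the simplifications used in this article, ignoring real-world complications such as renewal records, publication without notice, and foreign works.

```python
# A deliberately simplified sketch of the U.S. copyright rules described
# above -- an illustration, NOT legal advice.

def likely_public_domain(publication_year, author_death_year=None,
                         current_year=2009):
    """Rough heuristic based on the rules sketched in the text.

    - Works published before 1923 are treated as public domain.
    - Later works are protected for 70 years after the author's death.
    - With no known death year (an "orphan" work), copyright is automatic,
      so the work must be presumed protected.
    """
    if publication_year < 1923:
        return True
    if author_death_year is None:
        # Registration is no longer required, so absent information
        # a digitizer must presume the work is protected.
        return False
    return current_year > author_death_year + 70

# A novel from 1910 is safely digitizable...
print(likely_public_domain(1910))        # True
# ...but a 1950 book whose rights holder cannot be found is not.
print(likely_public_domain(1950))        # False
```

Note how the orphan-work branch dominates in practice: because no registry records who holds the rights, the safe answer for most post-1923 books is "presumed protected," which is exactly the obstacle the article describes.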
Business Models and Digital Library Ethics
American brick-and-mortar libraries, especially in the academic and public sectors, have built mature collections and developed strong traditions around democratic access and privacy protection. The absence of a large-scale, publicly funded national digital library program leaves a vacuum in which library values and infrastructure are being defined on an ad hoc basis by non-profit and corporate initiatives.
In the non-profit arena, a plethora of cultural heritage institutions, research libraries, and charitable organizations have demonstrated their belief in the Alexandrian ideal by creating open digital collections. Project Gutenberg, founded in 1971, is one of the oldest efforts. The Project has a catalog of almost 30,000 online books, is registered with the IRS as a charity, and recruits volunteer proofreaders. The University Libraries of Boston College do their part by maintaining an active program to digitize library collections and by supporting eScholarship@BC, an open access repository for the scholarly output of the University.
Building the comprehensive digital library "accidentally" by using commercial search engines to retrieve content from independent non-profit digital initiatives is appealing on some level. It avoids the challenges of governing and financing a large-scale enterprise and eliminates the risk of placing too much power in the hands of a few. The accidental digital library, however, falls short on many fronts. Without a coordinated collection development strategy determining what is added to and removed from the collection, the scope of the collection is capricious. The library’s comprehensiveness and its stability are uncertain. Further, reliance on commercial search engines as the only discovery tool for the comprehensive digital library is problematic. Privacy is jeopardized. Research endeavors are influenced by search algorithms designed to generate advertising revenue. E-commerce sites are included in search results. Traditional library catalogs are designed to give similar works on the same topic an equal chance of retrieval. Such even-handedness cannot be expected from commercial search engines. When revenue is at stake and no standards apply, the incentive is to build a site that maximizes its chance of retrieval.
Non-Profit Digital Library Initiatives
- Harvard-Yenching Library: A cooperative project to digitize the library’s 51,500-volume Chinese Rare Book Collection will be financed primarily by the Chinese Government (Globe Article)
- American Memory: 9 million items from the collections at the Library of Congress that document U.S. history and culture, including manuscripts, prints, photographs, posters, maps, sound recordings, motion pictures, books, pamphlets, and sheet music.
- The Edgar Allan Poe Digital Collection: The Ransom Center, The University of Texas at Austin
The most consequential and effective advocate for a unified and comprehensive digital library is Brewster Kahle, an Internet executive who has sold companies to AOL and Amazon. In an attempt to realize the Alexandrian ideal, Kahle founded the non-profit Internet Archive (IA) in 1996. With over 1.6 million texts in its collection to date, the IA is the largest non-profit digital library effort in the world. The IA collaborates with hundreds of organizations around the world to make public domain material, both born digital and born analog, available at its site. A variety of formats, including text, audio, video, and software, are collected and preserved. The Archive is also building a collection, the "Wayback Machine," by archiving websites nominated by "memory institutions" around the world.
At the heart of the Archive’s collection building activities are the 20 regional scanning centers it supports in five countries. At these scanning centers, one of which is at the Boston Public Library, 1,000 public domain books from collaborating libraries, including our own University Libraries, are digitized each day. The non-profit scanning centers are highly efficient; the cost to collaborating libraries is only 10 cents per page to have a book scanned and made available online.
Boston College/Internet Archive Collaboration Turns to Irish Materials
In the past several months, Irish history has been the focus of the Boston College/Internet Archive Collaboration. 550 books from the fifth floor of O’Neill have been scanned at Internet Archive’s regional scanning center at the Boston Public Library. This latest effort brings the number of Boston College volumes in the Internet Archive to 775. Our earlier contributions to the Archive include Jesuitana and Medieval Philosophy texts. Our digital texts at the Internet Archive are cataloged in Quest. The entire collection can be viewed on the Archive’s Books from Boston College page.
Digital library building is an expensive proposition, requiring significant investments in staff, technology, and copyright research. As a result, the corpus of works available through non-profit digital libraries is still relatively small compared to the body of all books ever published. In the absence of a large-scale, publicly funded effort, we have a digital library vacuum which, if filled exclusively by the private sector, will fundamentally alter our understanding of what it means to be an American library.
For Profit Initiatives
In 2002, Google launched a secret "books" project with the goal of scanning every book in the world. Not only did Google have the technological and financial means to pursue such an ambitious goal, but its founders also viewed digital library building as essential to their business strategy. Google co-founder Sergey Brin explained this synergy in a 2007 interview for the New Yorker article "Google’s Moon Shot":
We really care about the comprehensiveness of a search. And comprehensiveness isn’t just about, you know, total number of words or bytes, or whatnot. But it’s about having the really high-quality information. You have thousands of years of human knowledge, and probably the highest-quality knowledge is captured in books. So not having that—it’s just too big an omission.
In 2004, Google began approaching major research libraries, offering to scan their collections and provide each library with digital copies in return. In July 2004, a pilot project was initiated to scan the 7 million-volume collection at the University of Michigan. As of 2007, Google estimated that scanning the entire collection would take only six years. Today, Google’s website boasts an international list of 20 library partners, including Harvard, Oxford, Stanford, the University Library of Lausanne, and Ghent University Library, as well as several of the great American public universities.
There is no doubt that Google search’s inclusion of texts from the world’s great libraries provides a public benefit. It is not, however, a substitute for libraries as we have come to know them: user privacy is not guaranteed, search results could be shaped to maximize advertising revenue, and there is no guarantee that the access to the books will remain free in perpetuity. This is wholly appropriate, for, after all, Google’s primary obligation as a public company is to maximize shareholder value, not to build free digital libraries for the common good.
All was well with the Google Books project until restrictive copyright laws put into motion a series of unfortunate events that threaten to destroy the Alexandrian ideal. As stated earlier, the recent increases in the length of the copyright term and the challenges of identifying copyright holders make it impractical for digital library initiatives to make available the vast majority of books published after 1923.
Google, however, took a risk and invested heavily in digitizing post-1923 books. In April 2009, an estimated 6 million of the 7 million books that Google had digitized were still in copyright. Of those 6 million in-copyright books, Pamela Samuelson, a professor at the UC Berkeley School of Law, estimates that 70 percent are out of print. This vast corpus of in-copyright, out-of-print books – a significant portion of the written record of the 20th century – is at the heart of the copyright problem. Prof. Samuelson explains in an ACM article:
Most of them are, for all practical purposes, "orphan works," that is, works for which it is virtually impossible to locate the appropriate rights holders to ask for permission to digitize them.
A broad consensus exists about the desirability of making orphan works more widely available. Yet, without a safe harbor against possible infringement lawsuits, digitization projects pose significant copyright risks.
Google’s risk was, not surprisingly, rewarded with a class action infringement lawsuit. In 2005, the Association of American Publishers and the Authors Guild filed suit against Google for copyright infringement. Despite the lawsuit, Google maintains that the manner in which it uses copyrighted books is legal and benefits copyright holders by including their works in its search results:
. . . some in the publishing community question whether any third party should be able to copy and index copyrighted works so that users can search through them, even if all a user sees is the bibliographic information and a few snippets of text, and even if the result is to make those books widely discoverable online and help the authors and publishers sell more of them.
Copyright law is supposed to ensure that authors and publishers have an incentive to create new work, not stop people from finding out that the work exists. By helping people find books, we believe we can increase the incentive to publish them. After all, if a book isn't discovered, it won't be bought.
However, rather than risk a court ruling that jeopardized its enormous digitization investment, Google negotiated a settlement with the plaintiffs in October 2008. The settlement is subject to approval by the United States District Court for the Southern District of New York following a final fairness hearing. The hearing was scheduled for earlier this month; however, concerns about the settlement are so widespread that the hearing has been postponed in order to amend the settlement. The primary problems with the settlement are as follows:
- The settlement addresses obstacles in the copyright law which should be addressed by the legislative branch rather than the judicial system.
- The settlement effectively gives Google a monopoly over the corpus of in-copyright but out-of-print books by granting Google the right to determine whether or not a book is commercially available, the right to sell copies of books that are not commercially available, and a "safe harbor" against infringement lawsuits should it make a mistake in its determinations.
- A proposed non-profit Book Rights Registry, which would be established to address the absence of copyright registration and renewal requirements in the current copyright law, would have a bias towards Google. The Registry is to be established with financial and technological assistance from Google. Its purpose is to allow copyright holders to register their rights. Registered rights holders could opt to have their books removed from Google’s corpus or to receive financial compensation for Google’s use of the book. While others would be able to consult the registry, the settlement manifests a clear Google bias, stating that the Registry "will respond in a timely manner to requests by Google."
- The plaintiffs in the suit are not diverse enough to be representative of all authors and publishers.
An amended settlement which may very well determine our digital future is due to be filed with the Court on November 9. Advocates of the Alexandrian ideal can monitor new developments at The Public Index and The Google Book Settlement site which is maintained by the Settlement Administrator with technical support from Google.