Ohio Web Library Mashup
[The following text, written by OPLIN staff Stephen Hedges, Karl Jendretzky, and Laura Solomon in October 2008, is also available as Chapter 20 in the book Library Mashups: Exploring New Ways to Deliver Library Data, edited by Nicole C. Engard and published by Information Today, Inc., ISBN 978-1-57387-372-7.]
Federated Database Search Mashup
Libraries spend a considerable amount of money each year on subscriptions to proprietary databases of online information, such as newspaper and magazine articles, authoritative encyclopedias, and reference resources. The current Gale Directory of Databases lists over 20,000 information databases and related products which libraries can purchase, including, of course, the Gale Directory of Databases. These databases are so strongly associated with the libraries that purchase them that they are often simply called "library databases"; they allow libraries to gather information that is not normally available on the World Wide Web.
While the vendors that sell this information devote a great deal of time and effort to making their searching interfaces smooth and efficient, many libraries that serve the general public try to offer a single "federated" search interface that pulls information from a variety of library databases simultaneously, so that users do not have to go to each individual database's interface and repeat the search. A handful of commercial vendors offer federated search tools, and the market for such tools has been growing over the last five years.1 Ideally, such an interface provides the library's users with a simple but effective "Google-like" search for accessing the library's entire collection of online resources. As Roy Tennant has written, at their best these tools are "...the correct solution for unifying access to a variety of information resources."2
If a federated searching tool is desirable for a single public library, consider how much more important it would be for a network of more than 250 public library systems. Such is the case for the Ohio Public Library Information Network (OPLIN).
Since 1996, OPLIN has been providing a basic collection of online databases to Ohio public libraries. Currently purchased in cooperation with the Ohio academic library information network (OhioLINK) and the school library network (INFOhio), with additional funds from an Institute of Museum and Library Services LSTA grant awarded by the State Library of Ohio, the total cost of this extensive collection tops $4 million annually. In 2004, the network partners began referring to this communal collection of databases as the "Ohio Web Library."
While a federated search of these valuable resources may not be as critical in an education environment, where students are often instructed to use a particular database, for public libraries the lack of a good federated search interface results in low use of the databases and therefore a high cost per search. Public library users are uncomfortable with being forced to select which database to search, often with no idea of what type of information the database provides, and to repeat their search in each individual database search interface. Confronted with this task, they simply turn to Google or some other web search engine to find online information and never get access to the information libraries purchase and provide.
OPLIN began offering a federated search interface to the Ohio Web Library in 2004, when a WebFeat Search Prism was installed. This search tool was based primarily on the search functions included in the Z39.50 information retrieval protocol (ISO 23950), which is the protocol most commonly used by federated search tools. The Library of Congress is the Maintenance Agency and Registration Authority for the Z39.50 standard, which specifies a client/server-based protocol for searching and retrieving information from remote databases. Z39.50 is a large standard, with lots of functionality, but usually an implementation does not support the complete standard. Instead, a subset of functions, called a "profile," is commonly used to meet specific requirements. A profile provides a specification for vendors to use when setting up their servers, so that search applications can interoperate. For example, the ATS profile defines a basic subset of services for Z39.50 support of public access library catalogs. The ATS profile is quite simple, requiring support for only three search attributes — author, title, and subject ("ATS") — and mandates support for MARC record data transfer. Other profiles are more complex, to fit more complex needs.3
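To make the idea of a profile more concrete, here is a minimal sketch of how a search application might restrict itself to the three ATS access points. It assumes the Bib-1 "use" attribute values usually given for author (1003), title (4), and subject (21), and it writes queries in the prefix query notation accepted by Index Data's tools. The function name and the quote handling are our own illustration, not part of the profile or of any vendor's software.

```python
# Bib-1 "use" attribute values assumed for the three ATS access points.
ATS_USE_ATTRIBUTES = {"author": 1003, "title": 4, "subject": 21}


def ats_query(field: str, term: str) -> str:
    """Build a prefix-notation query limited to author, title, or subject.

    Hypothetical helper for illustration only.
    """
    if field not in ATS_USE_ATTRIBUTES:
        raise ValueError(f"field {field!r} is outside the ATS subset")
    # Drop characters that would break the quoted term, then tidy whitespace.
    cleaned = term.replace('"', " ").replace("\\", " ")
    cleaned = " ".join(cleaned.split())
    return f'@attr 1={ATS_USE_ATTRIBUTES[field]} "{cleaned}"'


if __name__ == "__main__":
    print(ats_query("title", "moby dick"))
```

A client written this way can only ask the questions that every ATS-conforming server has promised to answer, which is the point of agreeing on a profile.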
While the WebFeat Search Prism provided a federated search using Z39.50, the return of search results from a database collection the size of the Ohio Web Library was slow, and results were presented in groupings by source, rather than in order of relevance. Nevertheless, it was judged by OPLIN to be the best federated search tool available at the time. By 2007, however, Ohio public librarians were voicing dissatisfaction with these limitations. In focus groups they repeatedly expressed a desire for a search that was "more like Google." OPLIN, for its part, had not made any significant changes to the federated search interface since it had been installed. Clearly it was time for a change.
In late 2007, OPLIN began the search for a new search. We first contacted WebFeat, the existing vendor, to explore newer versions of their product. They demonstrated WebFeat Express, their newest offering, which had significant improvements over the WebFeat product in use by OPLIN, including control over search results ranking. Ultimately, however, OPLIN rejected this possibility because it still did not allow us enough control over the product to be able to make rapid, incremental changes in response to user feedback, which we felt would be an important feature of any new search interface.
Next we investigated Google Custom Search Engine (CSE); after all, librarians had told us they wanted something "more like Google." Custom Search allows you to create your own Google search against a list of websites you specify. Once you have defined your search targets, Google gives you a simple piece of code for a search box to place on your site. Any search done through this search box starts a Google search, but the search is limited to the sites you have specified. At the time we looked at CSE, there was no way to tailor the ranking of search results; the results that ranked highest in Google's standard algorithm always came to the top. Google staff told us they were planning to release CSE with the capability for the host site to tailor rankings to reflect the host's preferences or area of expertise, and indeed, this capability was added a few months later.
The biggest shortfall of the Google CSE technology was the fact that the only sources that it could hit were the ones that were already indexed by Google. While this worked well for open content like Wikipedia, proprietary sources like EBSCO, which require user authentication to access, had no presence. We went through several iterations trying to devise a way to integrate proprietary content. These ranged from possibly using MARC records from EBSCO to build our own open website with the article descriptions and links in it for the Google bot to crawl, to having talks with Google themselves to see if there was any way of making the data available. In the end we found that the Google Custom Search just was not well suited for accessing the type of data we needed, and we moved on.
About this time, Index Data announced a demonstration of their MasterKey federated search. OPLIN staff were impressed by the speed and ease of this product. It had the additional advantage of being built around the Z39.50 protocol, which we knew would work with most of the databases we purchased. Initial discussions with Index Data indicated, however, that the MasterKey hosted search solution was not the best fit for our needs. Rather, they suggested we build a search around their open source "pazpar2" metasearching middleware, which is the engine within MasterKey.
Although we knew that neither Google CSE nor a hosted MasterKey web service would be our eventual federated search tool, we decided to use these two searches for some early user testing to guide our future decisions.
We built a Google Custom Search and targeted some open websites that had good content, such as www.NetWellness.org, Wikipedia, www.HowStuffWorks.com, and www.Gutenberg.org. We took that and the MasterKey demo to the ScanPath Usability Lab at the Kent State University School of Library and Information Science in November 2007 to observe how users reacted to the two searches. ScanPath uses high-tech software to measure and record a user's eye movements while using a web site, tracking how often the eye travels to certain areas of a page, how long it lingers, etc. ScanPath staff also interview users while they are using a page, asking them to verbalize their experience.
Our observations stunned us. Until that time, we had not appreciated how thoroughly Google has defined the online searching behavior of most people. Users expected results lists that looked like Google, down to the colors of the components of a search result. They were adept at judging the reliability of search results based on the URL; for example, URLs ending in .org were judged to be generally trustworthy, while URLs containing the word "blog" might contain interesting, but not necessarily reliable, information. They expected the links in search results to take them to another web site, not directly to an article. Clearly librarians were right; to meet user expectations libraries needed a federated search that was "more like Google."
While Index Data was working on their installation, OPLIN contracted with a local Ohio web design company, 361 Studios, to re-design the www.ohioweblibrary.org site. We specified a very simple, clean layout dominated by one search box. As much as possible, the design was to rely on a cascading style sheet (CSS) for the page look, and the entire design had to be compliant with accessibility standards. We also specified that the initial page load had to be less than 100 KB. Other than that, 361 Studios had a free hand in designing the graphics and navigation for the page.
It did not take very long for us to get a basic federated search against a few targets up and running for testing. Now we began to work on turning the basic search into something good enough for production. We set a production launch date of July 1, 2008.
One weakness of using Z39.50 as the protocol for retrieving information from commercial databases is the fact that not all database vendors operate Z39.50 servers. In 2000, Sebastian Hammer, co-founder of Index Data, wrote that, "Most major library-systems vendors now support Z39.50 to some extent."4 It is the words "most" and "to some extent" that would cause problems for OPLIN.
Several of the vendors supplying databases to the Ohio Web Library collection do not support Z39.50, including a couple of well-known vendors like World Book and Facts on File. How could we get their databases included in our federated search? The answer came indirectly from our previous federated search vendor, WebFeat. WebFeat received a patent in late 2004 on a method for managing the authentication and communication necessary to perform a search against a licensed database and convert the results into Z39.50, XML, or HTTP format. In other words, they created a tool that could access unstructured data and deliver it in a structured format. They called this access tool a "translator" and built thousands of translators for databases that do not support structured data. These translators are prominently used in the WebFeat federated search products.
OPLIN now needed to find a way to get translators from the very vendor whose search product we were abandoning. Fortunately, someone else had anticipated this problem. CARE Affiliates had developed a partnership with both WebFeat and Index Data to market "OpenTranslators," WebFeat translators that are hosted on Index Data servers for a fee to allow a federated search tool to get to unstructured databases, including open sites on the World Wide Web. OPLIN provided CARE Affiliates with a list of database and open web targets for which we wanted translators and they took care of having the translators built, hosted, and made available as targets for pazpar2 configuration files. Once these were all in place, we were able to do a federated search across all the Ohio Web Library databases as well as some open sites, such as Wikipedia, NetWellness, and OAIster.
Making It User-Friendly
Now we went back to the results of our early usability testing, included some feedback from test users of our new prototype, and started adding some amenities based on what our users might expect.
For one thing, we knew that users would try to do advanced "Google" searches, meaning using quotes around phrases, plus signs to make some words mandatory, etc. Most users seemed to assume that if Google search uses these conventions, every search uses them. While Z39.50 can handle some complex searching behavior, it does not use the Google notation conventions. Moreover, while many library information vendors may support Z39.50 "to some extent," quite a few do not support complex Z39.50 queries. We learned that including quotes and other Google notations in the search term often caused our search software to crash, so one of the first things we did was set up a parsing routine to strip these problem characters from users' search terms before passing the terms to pazpar2. We hope someday to find a work-around that will allow pseudo-phrase searching when terms are enclosed in quotes, but for now the quotes are ignored.
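A parsing routine of this kind can be quite small. The sketch below is an illustration in Python rather than the code running on the site. The function name and the exact set of characters removed are assumptions, chosen to cover the quote and plus/minus conventions described above.

```python
import re

# Straight and curly double quotes, as typed or pasted by users.
_QUOTES = re.compile(r'["\u201c\u201d]')
# A run of + or - at the start of a word (not a hyphen inside a word).
_LEADING_OPERATORS = re.compile(r"(?<!\S)[+\-]+")
_WHITESPACE = re.compile(r"\s+")


def sanitize_query(raw: str) -> str:
    """Strip Google-style notation before the terms go to the search engine.

    Hypothetical helper; the character list is an assumption.
    """
    text = _QUOTES.sub(" ", raw)            # quotes are simply ignored
    text = _LEADING_OPERATORS.sub("", text)  # drop +word / -word markers
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    print(sanitize_query('"civil war" +ohio -blog'))
```

Only operators at the start of a word are removed, so a hyphenated term such as "e-mail" passes through unchanged.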
We also added a spell checker, to suggest spellings for search terms that seemed to be misspelled. The problem with trying to implement search suggestion functionality is that for it to actually make good suggestions, it needs a large bank of data to draw from. We looked into just using dictionary files, but that type of setup would not automatically change with trends and we would have to manually update the dictionary. Next we took a look at the API made available by Google. Though this is something that they used to offer, they had changed what the API exposed, and search suggestions were not available at that time. The same path that led us to Google brought us to Yahoo! next. Yahoo! also offers an API, and using a simple JSON call we were able to retrieve search suggestions from them. These suggestions tend to be more accurate than anything we could construct in-house, since Yahoo! has a large database of search query information to pull from.
The display of the search results received a lot of attention as we developed the interface. The pazpar2 middleware does a good job of ranking results by relevancy, so that was not a problem. It also includes an option to rank results by date, although the effectiveness of this ranking is limited somewhat by the fact that not all vendors have their Z39.50 servers configured to return date information. But other than the ranking, there was the question of what information to display in the search results. We decided to display four items of information for each individual search result: the title, the source(s), the date, and a brief description. Not all Z39.50 servers returned all of this information, but at a minimum the search result displays title and source. In all cases, the title is a link to the article.
In most cases, the search results link to full-text articles, but some of the EBSCO databases included in the Ohio Web Library also contain just citations to articles, not the full text of the articles. In these cases, we found that in the Z39.50 data EBSCO makes available, no URL to the article is returned if only a citation exists. By parsing the returned data before it was displayed on the search results page, we were able to test for the presence of a URL. If the search result failed this test, we appended "(Citation only)" to the title of the article to warn the user.
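The test itself is simple. As a sketch only, and assuming each parsed record has been reduced to a title, a source, and an optional URL (the `SearchResult` class and `display_title` function are hypothetical names, not pazpar2 structures), the check might look like this in Python:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    """Simplified stand-in for one parsed search result (an assumption)."""
    title: str
    source: str
    url: Optional[str] = None


def display_title(result: SearchResult) -> str:
    """Return the title to show, flagging records that carry no article URL."""
    if result.url and result.url.strip():
        return result.title
    return f"{result.title} (Citation only)"


if __name__ == "__main__":
    print(display_title(SearchResult("Lake Erie Algae", "EBSCO")))
```

Treating an empty or whitespace-only URL the same as a missing one guards against records where the field is present but blank.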
Once a user finds an article of interest and clicks on the title, then the process of authenticating them begins. If the link points to an "open" article from the Open Archives Initiative, Wikipedia, NetWellness, etc., then no authentication is necessary, but if the source of the article is a proprietary database, then we have to check to make sure the user is covered by our license agreement with the vendor. OPLIN has used EZProxy for several years to handle this task, and still uses that system to handle Ohio Web Library user authentication. We acquired EZProxy before it was purchased by OCLC. Like pazpar2, EZProxy is middleware; it authenticates users against our local authentication system and provides remote access to our licensed content based on the user's authorization. In OPLIN's case, we use the IP addresses assigned to libraries as the basic test to authenticate a session. Traffic coming from within a library passes through the authentication with no interruption. Users accessing the system from outside a library can give their public library card number (or a username and password if they are associated with a K-12 school) to authenticate their session. Once they have been authenticated, they can access as many articles as they want without any further interruptions; the system remembers them until they close their browser.
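The decision described above can be sketched in a few lines of Python. This is not EZProxy configuration or code, only an illustration of the logic. The network ranges are reserved documentation addresses standing in for real library networks, and `needs_login` is a hypothetical name.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for the
# address blocks assigned to libraries.
LIBRARY_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def needs_login(client_ip: str, session_authenticated: bool = False) -> bool:
    """Decide whether to ask for a library card number or school login."""
    if session_authenticated:
        # Already verified earlier in this browser session.
        return False
    address = ipaddress.ip_address(client_ip)
    # Traffic from inside a library passes through without interruption.
    return not any(address in network for network in LIBRARY_NETWORKS)


if __name__ == "__main__":
    print(needs_login("192.0.2.15"))   # inside a library range
    print(needs_login("203.0.113.9"))  # outside every library range
```

Checking the session flag first means a remote user is asked for credentials only once per browser session.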
Finally, we included a "Live Help" link to the Ohio KnowItNow 24/7 reference service. If a user finds the whole searching process to be too difficult or unproductive, they have instant recourse to a live reference librarian to assist them. KnowItNow is a live online information service provided free of charge for the citizens of Ohio by the State Library of Ohio and local public libraries. Professional librarians are available 24 hours a day, seven days a week to answer reference questions and assist Ohioans in finding information.
Launching, re-launching, re-launching...
Not all of these changes were implemented on July 1, 2008, when the www.ohioweblibrary.org page was launched. Nor do we ever expect to have all of our changes implemented. Whenever we introduce the site to a librarian, we emphasize that the open source code that runs the site allows us to incorporate their suggestions to make the site better. Several of the changes mentioned above came about because of suggestions from librarians using the site. In most cases, we were able to make the changes within hours.
As mentioned earlier, our goal is to be continually making small, incremental changes to the search interface based on user feedback. Because so much of the code that runs the site is open source, we have exceptional ability to do just that. The www.ohioweblibrary.org search interface will never be "finished" until the day we take the site down. We intend to just keep mashing cool things into it as we find them.
To recap, these are the components of our mashup (so far):
- pazpar2 (with customizations)
- Yahoo! spell-checker API
- custom page design
- custom scripting
We're proud of what we've built, and we invite you to check it out at www.ohioweblibrary.org.
1. Alexis Linoski and Tine Walczyk, "Federated Search 101," Library Journal Net Connect 133 (Summer 2008): 2.
2. Roy Tennant, "The Right Solution: Federated Search Tools," Library Journal 128:11 (June 15, 2003): 28.
3. For a good general discussion of Z39.50 and profiles, see Mark Hinnebusch and Charles B. Lowry, "Z39.50 at ten years: How stands the standard?," Journal of Academic Librarianship 23:3 (May 1997): 217.
4. Sebastian Hammer, "Dr. Metadata, or: How I Learned to Stop Worrying and Love Z39.50," American Libraries 31:9 (Oct. 2000): 54.