Magazine Article | July 1, 1997

Imaging Technology Preserves Legacy Of Scientific Documents

Source: Field Technologies Magazine

A document imaging solution makes research documents from defunct government think tank available to scientists worldwide.

Integrated Solutions, July 1997
The Office of Technology Assessment (OTA), a U.S. Government think tank, enjoyed for many years a reputation for high-quality, unbiased research. Over that time it produced a large number of lengthy, invaluable documents. When, in 1995, Congress decided to close the OTA, the race was on to save this legacy. This is the story of the digital resurrection of an irreplaceable collection of 775 research reports.

All of the traditional methods of preservation and distribution were considered, but a brand-new technology made the project a resounding success. This is how the OTA Legacy Project found a way to preserve this unique collection of documents and distribute it to the world.

Dispersing Report Information
When the shutdown of the OTA was announced, requests came in from all over the globe for copies of the complete collection of documents. "We were able to scrounge together 85 complete sets of reports to send out," says Bill Creager, project director for the Office of Technology Assessment Legacy Project. Once those last sets were boxed up and shipped out, the OTA's legacy would be forever dispersed. The National Institute of Standards and Technology (NIST) normally provided an archive of such information on microfilm, which could then be duplicated and distributed to a very limited audience. The same fate was expected for the OTA files.

Evaluating Technology Solutions
The advantages of using microfilm are the proven longevity of the media and the mechanical simplicity of the optical viewers. The overwhelming disadvantage of microfilm is rooted in its physical, rather than digital, format. To distribute microfilm, it must be mailed or shipped. To duplicate microfilm, a roll of film must be mechanically reproduced on another roll. Large collections on microfilm tend to be very expensive.

But in today's world of new media, microfilm has been made obsolete for most purposes involving documents. By the most basic measure, the physical density of information, the same number of pages that would require several pounds of conventional microfilm can be placed on a single CD weighing a fraction of an ounce.

More importantly, access to the information on microfilm requires an external index or database to speed up the process of finding relevant frames. The new Internet business media is expected to grow to thirty times the size of the current user base in the next few years, and microfilm has no place at all in the new media. There's no such thing as an Internet microfilm viewer or printer. There is no way for microfilm to duplicate the infinitely quicker search and retrieval aspects of digital documents.

Instead, the entire contents of the OTA Legacy, which comprise 800 reports and 110,000 pages, were converted to Portable Document Format (PDF) and placed on five CD-ROM discs. The contents, as well as the document information index fields, are fully accessible via the Verity Topic search engine, which is included on the CDs.

Providing A Research Tool
The OTA Legacy Project offers many lessons in state-of-the-art document digitization. In contrast to the dead-end techniques of copying these archives to paper and film, Bill Creager and Peter Blair, assistant director for the Industry, Commerce and International Security Division of OTA, decided to use the best technologies available to do justice to this collection. "Our major objective was to preserve the legacy of OTA," declares Blair. "These are very long-shelf-life documents," Blair explains, "and our goal was to provide a research tool for our demanding users."

Accessing Document Resources
Adobe Systems, Inc. created the Acrobat environment explicitly for the world of digital documents. Ten years ago, PostScript became a universal standard by giving every user on every platform the opportunity to create attractive paper documents, no matter what printer they might have. Now, Acrobat gives today's user the ability to create a universal digital document that can be viewed and printed on any platform, including Windows, Mac, UNIX and DOS. These documents appear in the same format on both CD and the Web.

Unlike HTML, the Acrobat PDF format is fully functional on CD. The Acrobat search engine can be licensed and distributed on CD, so that a user on any platform can use the full text retrieval capabilities of the embedded Verity Topic search engine, or use the document info fields for a precise index search.

High Performance Scanners Needed For Project
"In our service bureau operations we have a mix of Fujitsu scanners," says John Solomon, v.p. of Input Solutions, Inc., which won the bid to perform the conversion. "Some of them have over one million scans, and we have had no major failures," he says. "We replace retard pads and rollers, but I don't know if we've ever even had to change a bulb."

"We had six misfeeds in a 110,000 page job," Solomon declares. "So, we were understandably impressed by the Fujitsu 3099." The documents in this project consisted of previously published reports that averaged 80 to 100 pages in length. The paper source documents are the equivalent of textbooks, in both soft and hardcover editions. These source documents were made ready for the scanner by using an electric guillotine to slice off the bound edge.

"Unlike some automatic document feeders (ADF), the Fujitsu 3099 can handle a full 500-page stack, as specified," Solomon points out. The double-sided scanning capability was a labor saver on this job. Since the vast majority of the OTA pages were double-sided, this capability offered enhanced efficiency in several ways. Obviously, paper handling was greatly reduced because the pages needed to be scanned only once, and very long books could be fed as a stack in the ADF.

Effective Batching Saves Money
Another advantage of double-sided scanning is that it enhances batch processing. To maintain rapid throughput of documents by up to ten QC operators on the network, clean-up procedures depended upon complete and orderly sets of pages. Since a single missing or out-of-order page would typically interrupt three or four people in QC, rescan and batch handling, the clean batches created by the Fujitsu 3099 offered substantial savings through cost avoidance.

Scanning Text & Images
Another benefit of the 3099 is Fujitsu's Enhancement Technology for creating optimum images. Most reports in this collection comprised both text and graphics, or so-called "compound documents." The graphics included everything from drawings and charts to photographs. The digital documents produced in this project are intended to look exactly like the original reports. The requirements of this application dictated the use of binary scans, because gray scale images would be too large for efficient delivery on the Web or even on CD.
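
The size argument for binary scans can be illustrated with a rough calculation. The sketch below is not from the project. It assumes US Letter pages (8.5 x 11 inches), 8-bit gray scale, and no compression; the 400 dpi resolution and the 110,000-page count are the figures reported for this job. Real files would be far smaller after compression, but the eight-to-one ratio between the two raw formats holds for any page size.

```python
# Rough, uncompressed size comparison: binary (1-bit) vs. gray scale (8-bit).
# Assumptions for illustration only: US Letter pages, 8-bit gray scale,
# no compression. The dpi and page count come from the article.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

DPI = 400
PAGES = 110_000

binary_bytes = raw_page_bytes(8.5, 11, DPI, bits_per_pixel=1)  # 1,870,000
gray_bytes = raw_page_bytes(8.5, 11, DPI, bits_per_pixel=8)    # 14,960,000

print(f"Binary page:     {binary_bytes / 1e6:.2f} MB")
print(f"Gray scale page: {gray_bytes / 1e6:.2f} MB")
print(f"Ratio:           {gray_bytes // binary_bytes}x")
print(f"Whole job, binary:     {binary_bytes * PAGES / 1e9:.1f} GB")
print(f"Whole job, gray scale: {gray_bytes * PAGES / 1e9:.1f} GB")
```

Under these assumptions a single gray scale page is eight times the size of its binary counterpart before any compression is applied.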

"We scanned at 400 dpi," Solomon reports, "because we obviously wanted to maximize the accuracy of the Acrobat Capture OCR. Some of the pages had halftone images, and the 3099 was able to smooth them."

Advanced Scanning Provides Image Quality
Input Solutions pushed the envelope of image quality by testing and adjusting the scanner during the job. "We tweaked the scanner for some of the books, after visually checking the contents," Solomon explains. "Sometimes there were chapter heading pages with a screened background, and we were able to completely remove the screen so it didn't interfere with OCR. We were able to run with the same settings for entire books and sets of books," Solomon says, expanding upon their successful procedures.

Bill Creager's Legacy Project is now available on the World Wide Web, and the first printing of the CD Edition has sold out. A second printing is on the way, an undeniable testimony to the demand for this 'value-added' form of information. Research across the entire body of the OTA's work, which would have taken years with the paper reports and documents, can now be conducted rapidly.

The files of the Office of Technology Assessment are available on the World Wide Web at http://www.wws.princeton.edu/~ota