Books from the Past is an on-line collection of books of national cultural interest which have long been out of print, and are unlikely to be reprinted by traditional means. The texts are available in two forms - images of the original book pages, together with a fully searchable electronic text which is also suitable for printing.
Developed by Culturenet Cymru and the Welsh Books Council, Books from the Past is a resource freely accessible to all. The web site will be developed and expanded over the coming years to include many more texts in both languages.
This project is currently at the pilot stage and feedback regarding its technical performance and suggestions of books to include in the future are welcomed.
In the course of the project, Culturenet have developed a software application that makes a collection of Welsh texts accessible through a single access point, with aggregated searching across every word in the collection, on a bilingual web site. Our staff have also developed a range of skills in digitisation, metadata, content management and web development. The technology and skills are also suited to delivering other types of text and graphical material over the internet.
As the project grows, issues of content selection become increasingly important. The Welsh Books Council have selected a further 200 books for possible inclusion. Digitisation of between 9 and 20 new books is planned for 2004. The books will be selected for their educational and cultural value. In order to push the limits of the technology still further, large books, series, and books with a lot of graphic material will be chosen.
Culturenet Cymru and the Welsh Books Council would like to broaden the partnership to include other heritage sector bodies. It is hoped that other digitisation projects will be able to use the Culturenet application. As the partnership base and the number of digitised texts grow, a number of issues will require clarification. For this reason, Culturenet will be engaging in a feasibility study.
The aim of the feasibility study is to illuminate the way forward for Culturenet members in the area of text digitisation and text encoding. Key issues to be investigated will be: standards, data interchange, technical solutions, sharing expertise and resources, ownership of content, editorial control, sources of funding and sustainability.
The feasibility study will deliver concrete proposals for the way forward. Hopefully a large scale, collaborative text digitisation programme will be the result.
This project was initiated by one of our members, The Welsh Books Council, who identified a need for access to out-of-print books of cultural significance. Culturenet Cymru exists to assist member organisations to put culture online. So, we started with a list of the Welsh Books Council’s aims, added a list of our own and then sought the best way to accomplish as many of them as possible.
The Welsh Books Council required the project to deliver:
Culturenet Cymru required the project to:
It was immediately clear from the aims of the partners that this project would need to be done rather differently to most digitisation projects. It is a fairly straightforward matter to digitise postcards, photographs and small artworks and manage the content. Culturenet Cymru had experience of this kind of digitisation with Gathering the Jewels. But complex objects with many related parts, such as books with many pages, require a different kind of digitisation, management and presentation. The digital imaging of books is not in itself too taxing a problem, apart from the preservation of the books during capture. The more difficult problem is the presentation of books on the web. Presenting thousands of pages of text as individual pages and in proper sequence, while maintaining ease of navigation through a larger collection of books as a whole, posed a significant challenge to Culturenet.
There are fairly well established projects world-wide which have successfully managed these technicalities. Text is presented on the web in at least three ways. The most ubiquitous is electronic text, usually in the form of HTML. PDF documents are also widely used. Some projects display text as images of pages of text. The latter option was quickly discounted as it satisfied very few of the partners' aims.
It was initially felt that Books from the Past should be a collection of PDF files. PDF solves many of the navigation problems and even offers page zoom and a rudimentary search function. But this solution has problems of its own, especially with long texts (download times) and as the archive of books grows. We felt that full text searching across an entire archive of books would be very difficult to achieve with PDF. It would also require users to download very large PDF files before being able to search or browse them. Users demand intuitive navigation that takes them to exactly the part of the book they want without having to download the entire text. They also need sharp, legible text, without sacrificing faithful reproduction of illustrations. The use of PDF alone for the delivery of web content may cause accessibility problems, and there are questions about long-term support for the PDF format.
Many of these problems can be addressed by converting books into electronic text and then offering users a number of formats including PDF for download. We felt that electronic text would allow full text searching, easy browsing and cross collection searching.
Conversion of text into electronic form is often done by optical character recognition (OCR) or by re-keying or a combination of the two. In general, re-keying is the more accurate method. Decisions over which is most appropriate to use have a lot to do with the characteristics of the source material and the web delivery mechanism. Some web delivery mechanisms rely on ‘quick and dirty’ OCR’ed text for word searches, but only present an image of the page to the user. These systems sometimes use ‘fuzzy’ searching which enables users to find search terms quite effectively without 100% accurate text. These systems have been widely used for newspapers and material which has a lot of visual interest. Other systems rely on very accurate, edited OCR’ed text or re-keyed text as they present the user with the electronic text itself. These are usually more appropriate for literary or linguistic texts.
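The idea of 'fuzzy' searching over imperfect OCR text can be illustrated with a minimal Python sketch. It is not the method used by any particular system mentioned here; the function name `fuzzy_find`, the similarity cutoff of 0.8 and the sample OCR output are all assumptions made for illustration.

```python
import difflib

def fuzzy_find(term, ocr_words, cutoff=0.8):
    """Return OCR words that closely resemble the search term.

    The cutoff (0.0 to 1.0) is an illustrative choice, not a recommended value.
    """
    lowered = [word.lower() for word in ocr_words]
    return difflib.get_close_matches(term.lower(), lowered, n=5, cutoff=cutoff)

# Invented OCR output containing typical recognition errors.
ocr_page = "Tbe histcry of Wales is a long and varied one".split()

matches = fuzzy_find("history", ocr_page)
print(matches)
```

Here the misrecognised word "histcry" is still found when the user searches for "history", which is why such systems can tolerate text that is not 100% accurate.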
There is no doubt that the most widely used and longest established standard for the latter system is the Text Encoding Initiative (TEI). TEI is an international and interdisciplinary standard for the representation of literary and linguistic texts online. It was launched in 1987 and represents our best hope against obsolescence. TEI enables the full electronic text of a book to be encoded along with supplementary metadata in a non-proprietary and open standard form. TEI also provides a rich and extensible set of markup tags, enabling the mark up of parts of the books for more intelligent searching and retrieval of content. Where this markup is used, users can search for lines of verse, names, geographical locations, chapter headings and so on. TEI also offers the flexibility to present texts with minimal markup and then add more markup at a later date to add value to the text.
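As a rough illustration of how such markup supports targeted retrieval, the following Python sketch searches a small TEI-style fragment for lines of verse, names and headings. The fragment is invented and omits the TEI header for brevity, and the helper `find_elements` is hypothetical; it is not part of the Books from the Past software.

```python
import xml.etree.ElementTree as ET

# Invented fragment using TEI Lite element names (head, lg, l, name).
TEI_SAMPLE = """<TEI.2>
  <text>
    <body>
      <div type="poem">
        <head>Example heading</head>
        <lg>
          <l>First invented line mentioning <name type="place">Aberystwyth</name></l>
          <l>Second invented line</l>
        </lg>
      </div>
    </body>
  </text>
</TEI.2>"""

def find_elements(tei_xml, tag):
    """Return the text content of every element with the given tag name."""
    root = ET.fromstring(tei_xml)
    return ["".join(el.itertext()).strip() for el in root.iter(tag)]

print(find_elements(TEI_SAMPLE, "l"))     # lines of verse
print(find_elements(TEI_SAMPLE, "name"))  # marked-up names
print(find_elements(TEI_SAMPLE, "head"))  # headings
```

A text with only minimal markup would still parse in the same way; searches for elements that have not yet been marked up would simply return nothing until that markup is added.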
While re-keyed electronic text in TEI would satisfy the searching and accuracy criteria for the project, it would not be sufficient (on its own) for the presentation of the original orthography and layout. The nature of the web means that layout and presentation of web pages is controlled by the user. Ideally we wanted to present to the user an image of each page and electronic text of each page. For this we required a metadata system which could link an image of each page to the encoded text for each page in a given TEI document. We also required a metadata system which could wrap up all the different parts which make up each electronic book. For this purpose we decided to use an emerging and already widely accepted standard called METS.
The METS schema is a standard for encoding different types of metadata regarding objects within a complex digital object (such as a book or a whole library of books) and is expressed using XML. It is being developed as an initiative of the Digital Library Federation. METS provides a hub document which draws together dispersed but related digital files and content. Our system would have one TEI file per book and hundreds of image files. METS provides the hub which draws these files together to form a digital entity which could make sense to users. METS provides a syntax for identifying the digital pieces that together comprise a digital entity, for specifying the location of these pieces, and for expressing the relationships between these digital pieces.
The heart of the METS file is the file section (fileSec) and the structure map section (structMap). The file section records information regarding all of the data files which comprise the digital library object. The structural map section defines the hierarchical arrangement of the source document being digitised. This hierarchy is encoded as a tree of divisions or ‘div’ elements.
The structure map contains the file pointer for each file associated with the corresponding page. For example for a text page there is an entry for the web image, the thumbnail and the associated XML file. This file pointer in the structure map section points to a unique file in the file section identified by its ID. The file section then points to the physical files stored on the file server.
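The relationship between the structure map and the file section can be sketched in Python as follows. The METS fragment is a simplified, invented example: the file IDs, paths and `USE` values are assumptions, and the function `files_for_page` is hypothetical rather than part of the project's code. It follows each file pointer (`fptr`) on a page division to the matching `file` entry and returns the location recorded there.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Invented, minimal METS document: one page with three associated files.
METS_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="web image">
      <mets:file ID="IMG0001"><mets:FLocat LOCTYPE="URL" xlink:href="images/p0001.jpg"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="thumbnail">
      <mets:file ID="THM0001"><mets:FLocat LOCTYPE="URL" xlink:href="thumbs/p0001.gif"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="text">
      <mets:file ID="TEI0001"><mets:FLocat LOCTYPE="URL" xlink:href="text/book.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="book">
      <mets:div TYPE="page" ORDER="1">
        <mets:fptr FILEID="IMG0001"/>
        <mets:fptr FILEID="THM0001"/>
        <mets:fptr FILEID="TEI0001"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

def files_for_page(mets_xml, order):
    """Resolve the file pointers of one page div to their file locations."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]

    # Index the file section: file ID -> location of the physical file.
    locations = {}
    for file_el in root.iterfind(".//mets:fileSec//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        locations[file_el.get("ID")] = flocat.get(href_attr)

    # Walk the structure map and follow each fptr of the requested page.
    for div in root.iterfind(".//mets:structMap//mets:div", NS):
        if div.get("ORDER") == str(order):
            return [locations[p.get("FILEID")] for p in div.findall("mets:fptr", NS)]
    return []

print(files_for_page(METS_SAMPLE, 1))
```

Keeping the file locations in the file section, and only IDs in the structure map, means a file can be moved on the server by changing a single entry without touching the description of the book's structure.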
On its own, METS cannot present files on the web. Neither can it provide an administrative tool for the organisation and editing of parts of the electronic books. For this, some sort of transformation is required. In some instances this may be achieved using XSLT files or proprietary content management systems which transform the METS and TEI for display on a web site. We decided to use an open source system called Greenstone.
Collections built with Greenstone offer effective full-text searching and metadata-based browsing facilities that are attractive and easy to use. Moreover, they are easily maintainable and can be augmented and rebuilt entirely automatically. The system is extensible and can have software plugins to accommodate different document and metadata types. Greenstone has its own XML DTD called Greenstone Archive Format (GAF). This is an XML style that marks documents into sections, and can hold metadata at the document or section level. In this regard it is very similar to METS. Culturenet has developed a plugin which allows automatic conversion of METS and TEI files into GAF format.
Image digitisation, text re-keying and encoding and web development were all out-sourced for this project. Firstly, the books are taken apart and the pages are scanned at 300dpi in 24bit colour. These are archived as master files for preservation and possible future re-use. A copy of each file is optimised for web transfer. Then the text is re-keyed and marked up using standard methods (TEI and METS) that describe the book and its structure in a way that can be understood by automatic processes (such as a web server). In many instances, double re-keying is used, in which two operators re-key the text independently while software checks for variances between them.
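The variance check used in double re-keying can be approximated with a short Python sketch. This is only an illustration of the principle, not the software used by the contractors; the function name `keying_variances` and the sample sentences are invented.

```python
import difflib

def keying_variances(first, second):
    """Compare two independently keyed versions word by word.

    Returns a list of (first_version, second_version) pairs, one for each
    stretch of text where the two operators disagree.
    """
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# Invented example: the second operator has transposed two letters.
operator_one = "The valley was quiet in the morning"
operator_two = "The valley was quite in the morning"

print(keying_variances(operator_one, operator_two))
```

Each reported pair can then be checked against the page image by a person. The approach relies on two operators being unlikely to make the same mistake in the same place.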
Culturenet staff check each image for the following:
Once the images and electronic text have been created, they are loaded into the web application using the admin utility. This publishes the books on a test web server. Staff at the Welsh Books Council check the electronic text against the image of the scanned page. Errors are reported and fixed.
HTML 4.01 transitional, W3C WAI level A
TEI, Text Encoding Initiative, (teixlite dtd)
METS, Metadata Encoding and Transmission Standard
GAF, Greenstone Archive Format
Master image files - TIFF
JPEG(SPIFF)/GIF
XML
Six work packages were identified. The work packages were divided between The Welsh Books Council, Culturenet Cymru and Milan Associates who successfully tendered for four of the work packages. Milan Associates sub-contracted to two companies based in India.
The bibliographic metadata required for each TEI header was supplied by the Welsh Books Council.
This was performed by Milan Associates along with its partner company in India.
Milan Associates along with its partner company in India did the web development, based on the Greenstone open source code.
The Welsh Books Council checked the accuracy of the electronic text and Culturenet Cymru checked the quality of the images.
Milan Associates performed the initial consultancy and managed the liaison between the companies in India and Culturenet Cymru.
The design of the web pages was by Culturenet Cymru.
The Welsh Books Council sourced copies of the books and managed copyright clearance.
Milan Associates Ltd was formed to deliver high quality software and services solutions to European clients, at offshore prices. Such high value could only be delivered through an innovative "DualShore" business model (see below), which binds us closer to the client's interest and, unlike other offshore providers, frees them of the need to have any offshore dealings.
The initial focus was on Banking and Finance, but thanks to increasing demand from other industries such as Pharma, Publishing and Bureaus, we now offer a broad spectrum of services to a wide clientele.
"Our mission is to deliver DualShore solutions using the best of local and offshore talent to deliver UK and European levels of Quality, at Offshore prices."

Our DualShore strategy means that we use a blended UK/offshore team to deliver solutions. The UK team looks after the client interface, project management, solution requirements (and very often the design) and the QA. The offshore team carries out detailed design, execution and testing, to pre-agreed QA standards. We work with carefully selected local and offshore partners who each bring specialist skills to the table.
This means that we leverage the strengths of both locations and still offer superb value. This fit for purpose approach means that each project is evaluated and the resource split is established accordingly. Thus many projects will lend themselves to a totally local solution and some to a mostly offshore solution. The vast majority of projects however, can benefit from this DualShore approach.
Address: Milan Associates Limited, 327 West Barnes Lane, New Malden, Surrey. KT3 6JE. United Kingdom.
Phone: +44 208 255 6088, Fax: +44 208 286 2341
Read more about Milan Associates at http://www.milanassociates.com/
Greenstone is a suite of software for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet or on CD-ROM. Greenstone is produced by the New Zealand Digital Library Project at the University of Waikato, and developed and distributed in cooperation with UNESCO and the Human Info NGO. It is open-source, multilingual software, issued under the terms of the GNU General Public License. Read more about Greenstone at http://www.greenstone.org/cgi-bin/library