Why scanning old books isn’t always as easy as it sounds
– 13 January 2008
When it comes to scanning in old books, how easy or difficult it will be depends on several generic factors.
First among these is the design of the scanning apparatus itself. At the more affordable price points at least, this may range from a general-purpose all-in-one office device that combines a flatbed scanner with a printer (these tend to be very slow at scanning), to a dedicated ordinary flatbed scanner (which should be faster but is still best suited to scanning individual sheets of paper rather than books), to a flatbed scanner specially adapted for book-scanning (which should be moderately fast and efficient at the task it is made for).
The three general classes of affordable devices outlined above differ not only in speed but also in the amount of physical effort that is typically required to hold a book down during pre-scanning and scanning so that it lies flat and stays still. Dedicated book scanners such as the Plustek Opticbook 3000 are supposed to minimise the physical effort needed to get a book to lie flat. That is the main reason (other than the huge speed advantage over Linn’s all-in-one Hewlett-Packard office device) why I bought one second-hand last summer from a recent graduate of Durham University who no longer needed his and was auctioning it off on eBay. However, as I’ll go on to explain below, even the Plustek book scanner is far from perfect in achieving its aims.
Second among the generic factors referred to previously is the required scan resolution. If you have good-quality optical character recognition (OCR) software and your only purpose is to recreate the text from the source material, without displaying it in its precise original format, a 300 dpi monochrome scan will probably suffice for the OCR software to work from, unless the font in the source is very old, and thus non-standard today and difficult for OCR software to interpret.
If, however, your intention is to create a scan that is going to be directly viewed and studied in its photographed form, and/or printed out by others, 600 dpi is really necessary. This is especially so for monochrome scans, which by their nature round all the detail at the periphery of characters into either black or white, and so tend to reduce the edge definition of the source material. A cruder resolution such as 300 dpi will therefore produce a monochrome scan that is noticeably less well defined around the edges of the letters and figures than a 600 dpi one.
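To put rough numbers on that loss of edge definition, here is a minimal sketch in Python. It is a deliberately simplified one-dimensional model and not how any real scanner driver works: it assumes a pixel turns black when at least half of it is covered by ink, and the stroke position and width are made-up values chosen only for illustration.

```python
MM_PER_INCH = 25.4


def thresholded_stroke_width_mm(left_mm, right_mm, dpi, threshold=0.5):
    """Width of a pen stroke after a 1-bit scan, in a simplified 1-D model.

    A pixel becomes black when the fraction of it covered by ink is at
    least `threshold` (an assumed rule, not taken from any real scanner).
    """
    pitch = MM_PER_INCH / dpi          # width of one pixel in millimetres
    first = int(left_mm // pitch)      # index of the first pixel touched
    last = int(right_mm // pitch)      # index of the last pixel touched
    black_pixels = 0
    for i in range(first, last + 1):
        px_start = i * pitch
        px_end = px_start + pitch
        # Length of ink that falls inside this pixel (never negative).
        covered = max(0.0, min(right_mm, px_end) - max(left_mm, px_start))
        if covered / pitch >= threshold:
            black_pixels += 1
    return black_pixels * pitch


if __name__ == "__main__":
    # Hypothetical stroke: 0.22 mm wide, starting 1.03 mm from the page edge.
    true_left, true_right = 1.03, 1.25
    for dpi in (300, 600):
        pitch = MM_PER_INCH / dpi
        width = thresholded_stroke_width_mm(true_left, true_right, dpi)
        print(f"{dpi} dpi: pixel pitch {pitch:.4f} mm, "
              f"stroke rendered as {width:.4f} mm (true width 0.2200 mm)")
```

In this model each edge of a stroke can be misplaced by up to half a pixel, so the rendered width can be off by up to one whole pixel. Since a pixel at 300 dpi is twice as wide as one at 600 dpi, the worst-case error doubles too.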
Suffice it to say that any scanner priding itself on its speed will be found to take at least twice as long to physically scan at 600 dpi as at 300 dpi. When it comes to the data processing that the scanner and subsequently the computer have to do, it will take at least four times as long, since doubling the resolution quadruples the number of pixels. This extra time can make the cycle between scans (even if they are done as a continuous batch with the help of software designed for this) very much slower. The processing will be slowest of all at 600 dpi if the scan is in full colour as opposed to monochrome, with greyscale somewhere between the two.
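The arithmetic behind that factor of four can be sketched in a few lines of Python. This is only an estimate of raw, uncompressed data volume: the 6 x 9 inch page size is a hypothetical example, the 8-bit figure for greyscale is my assumption, and real files will differ once compression and file-format overheads are involved.

```python
# Bits per pixel for each scan mode. The 1-bit and 48-bit figures are the
# extremes of the range of bit depths; 8-bit greyscale is an assumption.
BITS_PER_PIXEL = {"monochrome": 1, "greyscale": 8, "full colour": 48}


def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw size in bytes of one scanned page, before compression."""
    pixels_wide = round(width_in * dpi)
    pixels_high = round(height_in * dpi)
    total_bits = pixels_wide * pixels_high * bits_per_pixel
    return (total_bits + 7) // 8  # round up to a whole number of bytes


if __name__ == "__main__":
    page_width_in, page_height_in = 6, 9  # hypothetical page size in inches
    for mode, bits in BITS_PER_PIXEL.items():
        size_300 = uncompressed_scan_bytes(page_width_in, page_height_in, 300, bits)
        size_600 = uncompressed_scan_bytes(page_width_in, page_height_in, 600, bits)
        print(f"{mode}: {size_300 / 1e6:.2f} MB at 300 dpi, "
              f"{size_600 / 1e6:.2f} MB at 600 dpi "
              f"(ratio {size_600 / size_300:.0f}x)")
```

Whatever the bit depth, the 600 dpi page holds four times the data of the 300 dpi one, and the gap between one-bit monochrome and 48-bit colour at the same resolution is a further factor of 48.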
Third is the amount of complexity or otherwise involved in configuring the settings for each scan, which will include such considerations as the resolution, colour bit-depth of scan (ranging from 48-bit full colour to one-bit monochrome), area of the scanning bed that is to be selected for the particular scan, and other settings such as contrast and brightness (especially important to get right with each page on monochrome scans, or problems with shadow and thickened letters and lines or partially invisible letters will result).
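Why brightness matters so much for monochrome scans can be illustrated with a toy example in Python. It treats the brightness setting as a simple cut-off on 8-bit grey values (0 is black, 255 is white), which is an assumption about how the setting behaves rather than a description of any particular scanner software. The pixel values in the row are invented for illustration.

```python
# One hypothetical row of grey values across a page:
# paper (230), a bold stroke (60), a faint stroke (140), gutter shadow (175).
ROW = [230, 230, 60, 60, 230, 140, 140, 230, 175, 175]


def binarise(row, threshold):
    """Convert grey values to 1-bit: '#' for black, '.' for white.

    Any pixel darker than `threshold` becomes black.
    """
    return "".join("#" if value < threshold else "." for value in row)


if __name__ == "__main__":
    print(binarise(ROW, 100))  # ..##......  the faint stroke disappears
    print(binarise(ROW, 160))  # ..##.##...  both strokes kept, shadow dropped
    print(binarise(ROW, 200))  # ..##.##.##  the shadow turns solid black
```

Only the middle setting keeps both strokes while leaving the shadow white. Since ink density and shadow vary from page to page, the usable range shifts with them, which is why the setting may need revisiting for each page.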
Fourth is the physical quality of the book itself, particularly with regard to its binding and the width of its unprinted inner margins. The narrower the inner margins, the less well constructed the hinges (if any), and the further in from the spine any immovable structural parts of the binding are found, the more difficult (and in some cases impossible) it is going to be to obtain a scan of the entire page, no matter what scanner is being used, Plustek Opticbook included.
Let’s take the 1598 second edition of Fabian Wither’s translation of Claude Dariot’s magnum opus on astrology as an example, since I’ve been working on it for the last couple of days and know from bitter experience just what an awkward beast it can be to handle. The copy I have is in what remains of its original vellum binding. This binding is heavily torn at the covers, exposing the book to numerous dog-ears, but at the spine it is extremely strong. It is held together by thick string ties that run through the book, connecting the pages together, at least a centimetre into the book from the outer spine. In practice this means that even when the book is fully opened, the inner margins are very narrow throughout, and on some pages much narrower than on others, to the extent that there is printing right up to the visible inner edge of some pages. Since the book has no proper hinges, and is thus inclined to a gradual curvature between the spine and the pages if it is laid down flat, the only way to get it to open as fully as possible, and flat right across the printed width of the page, is to press down very hard on the spine. This cannot be achieved simply by closing the lid, since the spine protrudes by about a centimetre outwards from the ties that hold the pages together, and any attempt to close the lid upon it without a steadying hand is doomed to cause the book to move out of place and the pages to lie far from flat. Thus, the only solution is manual pressure applied downwards on the spine, so as to reduce the natural gradual bend of the pages into the spine (when the book is open at 90 degrees) to as sharp and sudden a one as possible, and thus to get as much as possible of the printed part of the page to lie flat for the scan.
Although the Plustek Opticbook 3000 is supposed to avoid the need for this kind of hard work, in practice it does not unless there is a significant inner margin to hold over the scanning edge of the machine. And I do mean extremely physically hard work: applying sustained strong pressure at a constant, steady rate, so that the book does not slip yet the pages remain flat, for long enough for both pre-scan and scan to be executed, for each page in turn, is completely exhausting when working at a resolution like 600 dpi, which by its nature makes each scan a time-consuming business. It must be on a par with lifting weights and holding the lift until a very slow judge blows his whistle, I imagine; not that I’ve ever done any serious weight-lifting myself, but you get the picture, I hope!
The scanner’s key selling point is its ability to scan up to the edge, so that a book can be held open at a right angle over the edge, giving a perfectly flat lie to the page being scanned without manual pressure on the spine. In reality it doesn’t quite manage this, for two reasons. Firstly, there is a structural strip of plastic about two or three millimetres wide just before the scanning glass begins, and inward from this about another millimetre of width is occupied by the top of the bracket that holds the moving scanning part in place as it glides up and down beneath the glass. So even if it were possible to place a book perfectly across and down the near edge of the scanner as directed, at least the first three millimetres of the inner margin of the page lying flat would not be scanned. Secondly, even when a book is open at 90 degrees rather than 180 degrees, there is still a natural tendency for the pages to curve upwards to meet the spine, and in some types of binding, such as that of the Dariot book I’ve been working on, this tendency is much more pronounced than in others. So even with the Plustek Opticbook 3000, extreme amounts of physical pressure have to be applied to allow printed content very close to the inner binding ties to show up at all when scanning.
It’s fortunate from a conservation perspective that the Dariot’s vellum binding, which has lasted for well over four hundred years, is so strong that it can withstand this amount of pressure without the ties breaking. But it is really no fun at all having to hold one’s breath and stay absolutely still while pushing with immense force for the lengths of time required for the scanner to complete its task on the most difficult pages, and I found myself needing frequent breaks just to get my strength back while working on this book today and yesterday. I must say I’m glad I’ve finished this arduous task, and am looking forward to other books being a lot easier than this one!
Philip