Making pdf Documents
© Brooke Clarke 2006 - 2007
Background
Scanning
Manual or Auto Feed
Bleed Through
File Format
File Naming
Dots Per Inch
Black & White, Gray scale or
Color
Frame Size
Scanning Blank Pages
Post Processing
Bleed Through
Portrait or Landscape
Angle Correction Rotation
Clean
Stitch
Vector Graphics
Adding Color
Acrobat Processing
Navigation
Bookmarks
What's wrong with most Bookmarks
Good Bookmarks
Links
OCR
Page Numbers
Photos
Links
Background
This came about because I wanted to
make
pdf versions of military
and test instrument Technical Manuals that range in size from
dozens to hundreds of pages. So here are some of the things I've
learned in the past few years. There are three main steps to
making a great manual. Scanning, Post Processing and Acrobat
processing. The following is based on the idea that a CD-ROM or
DVD will be used as the distribution medium, not an on line
document. When a document is to be on line the file size needs to
be minimized both to reduce the storage requirement and to shorten the
download time at the expense of quality.
Scanning
The first, but for most also
essentially the last, step is to scan the document.
Manual or Auto Feed
There are two kinds of scanners,
manual and auto feed. I use a manual flat bed scanner and it has the
advantage that when individual pages are scanned they are aligned and
not rotated. An auto feeder saves on labor, but creates a number
of problems. Since there needs to be some clearance between the
edges of the paper and the feed slot, say it's 0.1 inches, then the
paper can rotate by some small amount (0.1"/8.5" = 0.6 degrees).
0.6 degrees is a very noticeable amount of rotation and typically many
pages have some rotation. The feed rollers sometimes grab two
sheets either skipping a sheet, making some combination of two sheets,
or distorting a single sheet by smearing the letters.
Bleed through
When copying double sided sheets the back side image bleeds through and
shows up in the scan of the front side. The main reason for bleed
through is that the scanner lid has a white lining. This is about
as bad as a mirror and reflects light back through the paper. In
my opinion the lid should be painted flat black, or what I do is tape a
sheet of flat black paper to the lid. Now light is only reflected
by the front of the page knocking the bleed through down a lot.
Histogram
The four images below show what bleed through looks like on a
histogram. The upper left "Exposure Adjustment" window shows the
classical bleed through peak on the right. In the image below it
you can see the word "INDEX" in the bleed through.
The upper right "Exposure Adjustment" window shows the highlight cursor
has been moved from 233 to 169. 169 was choosen because it's at
the left toe of the bleed through curve. The image below shows
the output.
The only change made was to the highlight cursor. Both images are the raw .bmp files directly from the HP 6200 scanner.
The next step in eliminating bleed through is to set the white
threshold using the histogram. On the
HP
6200C flat bed scanner, when doing gray scale or color scans, you
can adjust the black threshold, white threshold and the gamma (plus
stuff with the 3 color channels). The threshold controls are
directly below the histogram and move cursors on the histogram.
So by placing the white (right hand side) cursor just to the left of
the toe of the hump that's the bleed through you eliminate it
completely. Note this is a trade off since you are also cutting
out some of the highlight detail in the image.
Note that if the page is Black and White (no gray) then by scanning it
in gray scale and setting the scanner controls you can completely
eliminate bleed though. But it there is a photograph or other
gray scale on the page eliminating bleed through and the quality of the
image are a tradeoff and the histogram gets to be very important.
When scanning a bound book insert a sheet of black paper behind the
page being scanned.
I tried the HP 8400 flat bed scanner and although they "show" a
histogram, there's no way it can be used as described above since the
controls were somewhere else and there were no cursors. I turned
it back and stuck with the 6200.
See
Post Processing Bleed Through below for how to fix bleed through in an existing image file.
File Format
I expect that what most people do is scan in jpg or pdf format.
This is a mistake if you're going to do any post processing since these
are lossy formats and degrade each time there's a new generation.
A non lossy format like Bit Map (.bmp) is a better choice to maintain
high quality. Bit map also includes the physical size of the image which is not the case with Tagged Image Format (TIF).
File Naming
When working with hundreds of pages there will be mistakes and you will
need to rescan a page or two. So it's a very good idea if your
file naming scheme somehow will allow you match to the actual
book. Most of the TMs that I scan use a chapter-page system where
the first page in chapter 4 is called 4-1.
Also the file name should be such that the computer file manager will
alphabetize them into the same order that they appear in the
book. Otherwise you will need to do a lot of manual work to get
the pages in order.
The answer for me has been a file name like nnn-mmm.bmp. Where
nnn is the chapter number starting with 000 for stuff prior to chapter
1 and after the last chapter keep using the next number, so if appendix
A comes after chapter 9 then it's 010-mmm.bmp. Where mmm is the
page number. Note nnn and mmm are always say 3 digits so the
front cover is 001-001, not 1-1. This is needed to keep the file
sort order correct.
A schematic might be 004-037L.bmp for the left side and 004-037R.bmp
for the right side. If there are more than two scans you can use
A, B, C etc. This way when making multiple scans of what is
really one page number you don't use up page numbers that are needed
for other actual pages. For a huge book you may need to provide for a 4 digit page number like
nnn-mmmm.
When scanning you don't need to manually enter all of the file
name. When the save file button is pressed the default file name
is the last name stored and you can just place the cursor in front of
the digit that needs to be changed and type: delete, the new digit,
enter.
Dots Per Inch (DPI)
This has a lot to do with the source material. If the source is
line art or text made prior to laser printers then 300 DPI is
very good. But if there are schematic diagrams with very small
print size (like a C size drawing that has been photo reduced by 4x)
then 600 DPI is needed. Photos are discussed separately.
Black & White, Gray scale or Color
If the source document has color then the scan should be in
color. When scanning very old books where the pages have yellowed
sometimes using color will make the post processing easier. For
everything else I use gray scale. Black and White has an
advantage
when you are trying for the smallest file size, but for me it's too
much of a quality reduction. Note that even though you are
scanning a black and white document you need gray scale so that when
half a pixel sees black and the other half sees white it can make a
gray. If B&W was used in this situation there would be a 1
pixel error either into the black or into the white.
Frame Size
For most documents you can set the frame size to just a little smaller
than the page image then there will not be black borders. When
working with schematics it's good to expand the frame size to capture
as much of the schematic as possible so that when stitching you will
have more choice of where to place the seam. But remember to put
it back for text pages.
Scanning Blank Pages
Books are laid out so that new chapters always start on a right side
(odd numbered) page. This is a good thing to do for a physical
book since it allows thumbing for new chapters, but has no advantage in
an electronic only book. If making a pdf where it's planned to
print all of it then scanning blank pages will maintain the odd page on
the right concept.
Post Processing
This is where a number of things get
fixed and the advantages of an electronic manual start to show
up. This is done using a photo editing package like
Photoshop. These packages process image files and although they
can do some text that's not their main use.
Bleed Through
Recently I received images of pages scanned by someone else that had
noticeable bleed through on many pages. But in Photoshop you can
Image\Adjust\Levels and on the histogram move the right cursor to the
left so that it's on the toe of the bleed through curve. Also
moving the left cursor to the right makes the blacks blacker and
the whole page look better. This is much better than trying to
use the "magic wand" to get rid of the bleed though. This works
so well because the bleed through is in the form of light gray images
not the black images that are desired.
Portrait or Landscape
In a hard bound book all the pages must be in the same orientation, but
that's not the case for a pdf document. So if there's a diagram
or table that's better viewed in landscape mode the page should be
rotated into landscape format. Note that if the document is later
printed Adobe will automatically rotate it.
Angle Correction Rotation
If an auto feeder was used in any of the prior generations then there
will be pages with rotations typically less than 2 degrees that will
need to be rotated to within about 0.3 degrees of true. 0.7
degrees is very noticeable and anything over 1 degree is really
noticeable. If it's a schematic that's going to be stitched all
the
pages need to be the same rotation. This means that you can
stitch a couple of pates where they are both 0.6 degrees, but not if
one is 0.0 and the other is 0.6.
Clean

The idea is erasing
things that are not wanted. Binder and staple
holes are an example. Older copy processes have the tendency to
leave small black specks much like finely ground pepper. Older
books where the pages have yellowed have a grainy background. If
you have set the frame size too big or the page got rotated there may
be black borders that need to be erased. Some copy machines leave
streaks, like there was a scratch on the drum. The fold lines on
a schematic are another thing that can be erased. A properly made
scan of a clean page may not need any cleaning. An antique book
may need an hour of cleaning for each page.
Sometimes rather than erasing to white you need to use a copy and paste
method, like for eliminating the binder holes in a color cover sheet.
The image at the left comes from the
1928
K&E catalog. Many hours of cleaning were required to get
it looking like this. The photo is a reduced resolution image, at
full size it's even more impressive. Note that like all the
illustrations in the catalog this is hand drawn using K&E drafting
supplies not a photo.
Stitch
Fold out pages need to be stitched together. This makes it very
easy to look at a schematic on the computer screen. When a 2, 3 or
4 page fold out schematic is broken up into seperate pages it's
almost impossible to work with it on a computer. What's most
commonly done is to get out the tape and scissors and make a hard copy.
Most schematics that were drawn prior to laser printers were done
either by hand or a plotting machine, but in either case there was a
pen or pencil used that could not draw anything finer than about
0.3 mm. This is a much wider line than a laser printer can
draw. So if you have a schematic that's on a B (2 x letter size)
or C (4 x letter size) sheet you can stitch it together. Do NOT
shrink the page size, leave it as is since the Adobe printing
default will shrink the page to fit the printer. This way the end
user has the Adobe option to use tiling (cut and tape) to get a full
size print. There was a time when photo reductions were used
typically to move a drawing to the next smaller sheet size. For
these you need to go up to 600 DPI when scanning to maintain the
annotations.
When stitching you can place the stitch anywhere in the area where both
images overlap. Rather than just take all of the second image
it's good to look for a place where there's a minimal amount of text that will
cross the stitch. It's common that there's a small scale and
rotation difference between the two images that are being joined so
even if you pick a good stitch line you may still need to determine
where on the line the best match will occur and as you go farther away
the match gets poorer.
Vector Graphics
Line art, like schematics, can be
stored as either an image or vector file. Photoshop, Paint, etc.
are image processing programs. Autocad and the old HP ME
(Mechanical Engineering) are vector processing programs. A vector
based pdf "D" size drawing fits into a few hundred kilo bytes, but if
the same file with the same resolution is converted to an image format
it will be a few hundred Mega bytes, i.e. about 1,000 times
larger. This was made real to me when making the web page for the
HP E1938 OCXO where the complete drawing package was very small as a vector pdf but huge when done as an image file.
I haven't found a free image to vector converter application. If you know of one please
let me know.
Often in schematics a box is drawn
around part of the circuit to define some function. Since these
lines look very similar to the trace lines they add confusion.
But they can be manually erased and replaced by either a colored line
or a gray line. Greatly adding to the understanding of the
circuit.
The trace lines can also be made much more understandable by doing
things like making all the ground lines wider and solid black.
The Vcc lines can be made red and the signal lines some other
color. This goes even further in making the schematic easier to
understand at a glance.
Some colors, like yellow, look good on the computer screen, but do not
show up when a page is printed. So when choosing colors be sure
they have at least 15% of red, blue and green components, so there's
some gray to print. Better is to make a trial print to test each
new color.
Acrobat Processing
In Acrobat 7 there is a "make pdf from
multiple files" option and a browse function. So if you have the
files named as described above it's just a few clicks and you will have
a single document that combines all the pages. I put this step in
the intro to the Acrobat processing sections because it's just he
beginning, not the end of what's needed.
Navigation
An electronic document is different from a physical document and how
you find what you want is different. You can not "thumb" an
electronic document like you can a book. But a book does not have
the instant access that you get with an electronic document. When
pages are numbered using the chapter-page method there's no way to
correlate that with the pdf document page number.
Bookmarks
I
think good bookmarks are by far the best way of navigating a pdf
document. A pdf document without bookmarks is next to worthless
for use at a computer, all you can do is print it and use the hard
copy, what a waste!
It's an art to name the bookmarks to keep them both short and
meaningful. Nesting folders is part of keeping the length of the
bookmark names short and also logically dividing the document.
For documents of about 25 pages and up bookmarks make a world of
difference in how easy it is to find something.
When the bookmarks are setup like the table of contents and List of
Illustrations and List of Tables (TOC, LOI and LOT) you have all of
these handy no matter what page you are on, just click on the Bookmarks
tab, open a folder or two and your on a new page. It's very fast
and convenient.
What's wrong with
most Bookmarks

I overlooked the use of bookmarks for
some time. Note that all the free TMs on LOGSA have bookmarks and
also note that they are useless. This means that when you get a
CD-ROM with a bunch of TMs it's also probably the case that the
bookmarks are useless. I think that someone that knew about good
bookmarks wrote the mil spec for how a TM is to be made and the spec
has a paragraph saying that there will be a bookmark for each chapter,
paragraph, figure, and table and sure enough that's what they all
have. The problem is that the bookmark names are worthless.
For example "Chapter 3" is the name of the bookmark for "Ch 3
Operation" . The bookmark name for a paragraph may be "Chapter 4-
Section 3 -Paragraph 4.1.4". This has two problems, one - It does
not tell you what's in this paragraph and two it's too long.
Bookmarks are in a collapsible frame to the left of the main page view
frame. You can click on the "Bookmarks" tab on the left to open
them and you can click on the button at the center of the divider bar(8
little bumps in bottom right of the illustration)
to close them. The divider bar can also be grabbed and
moved. So you can see that good bookmarks both tell you what you
will get when you click on them and also are as short as possible.
In the illustration they use "CHAPTER 1" instead of "Ch 1". All
capitals is like someone is SHOUTING, not pleasent. Also they use
up 8 spaces when 4 work better.
Another problem with the LOGSA TMs is that the bookmarks depend on a
logical order for the paragraph numbers. If there's a paragraph
number typo caused
by the OCR then then all the bookmarks for the rest of that
chapter are missing.
When someone makes a pdf document without bookmarks and then locks out
any changes, which includes the ability for the user to add bookmarks,
then they have really made a useless document.
Some vendors use pdf documents for their data sheets, which in some
cases are really books of 50 or more pages. I have helped one
change from using the worthless type of bookmark to using better ones.
Good Bookmarks

A good bookmark gives you a good idea
of what you will get if you click on it. It's also as short as
possible. Since bookmarks can be nested a good way to eliminate
the "Chapter 4- Section 3 -Paragraph 41.4" length problem is to have a
folder for "Ch 4 Maint" and a sub folder for "DS Maint" and then have a
bookmark for P4 "P4-Cal Adj" and a sub bookmark for "P4.1 VFO" and a
sub sub bookmark for P4.1.4 VFO Max Freq Adj". This way the
folder a bookmark tells you it's context. There can then be many
bookmarks called "Scope" but each is in a different section.
This is from TM11-5820-667-35 for the PRC-77. I think you can see
good bookmarks allow you to find what you want very quickly.
Notice in the illustration "Sec II Schematics & Block dia". By
making a bookmark for each section all the indented bookmarks under
that section no longer need to carry any of the section title thus
making them shorter. Another example is where there are a lot of
bookmarks that all relate to the same item in a longer list of
different items. Adding a new bookmark-folder allows
removing the common name from all the sub bookmarks, i.e.
bookmarks are context sensitive and each one does not need to spell out
all the higher level names.
Under the Figures in the above illustration I should have appended sch
for schematics or blk for block diagrams.
Links
Links work like web page links.
They can be placed on about anything in a pdf document. Most, not
all, LOGSA manuals come with links on the Table of Contents, List
of Illustrations and List of Tables entries, so you can use a
conventional book navigation approach. They also typically have
links in the body of the document whenever there's a reference to some
other part of the document. For example referenced to Figures are
linked as are references to other paragraphs. Some documents even
have every Index entry inked to the referenced pages. If the
bookmarks are good as described above there's less need for links, but
they are still very handy, i.e. one click and you're at the linked
page. And using the BIG back arrow (not the previous page left
arrow) you can go back to the page prior to the link, making it easy to
have a look at where the link is pointing and then return.
OCR
Optical Character Recognition allows
different types of pdf documents. Most of, but not all, the LOGSA
documents have each letter of the text as a letter. This not only
allows searching the text but also allows correcting typos. When
an antique book is scanned you can leave the image of the book to
appear in the pdf document and hide the OCR text behind the
image. This allows searching but you can not change the
appearance. Without bookmarks OCR allows finding things, but with
a good set of bookmarks it's not as important. With Acrobat
7 you can just click a button and add OCR for the whole document
(although it takes some time and memory).
I used Omni Page Pro 11 for some time. You have quite a bit of
control over what it does and the file format for the output. The
three main windows are a list of the source pages, the active page
being worked where it brings up questionable conversions and asks for
your input, and the output window running the application suitable for
the app, like Word. One problem is that it may make a mistake and
not ask you what to do. Another is assigning different fonts to
similar text or making some text bold and some not. Omni Page has
ZERO on line support and the quality of the phone support leaves
something to be desired.
Acrobat 7 has built in OCR capability but I have not figured out how to
really use it. So far I just click and let it run. But have
not been able to exercise much control of what it will do or correct
what it has done. If you know about Acrobat 7 OCR
let me know.
Page Numbers
I don't use page number links since
good bookmarks work so well. If a document has poor or non
existent bookmarks, links or OCR then adding bookmarks where the name
and target is a page number would allow translating a body text, TOC or
Index reference to a page number into a way to get to that page
number. Note typically there is NO way in an electronic
document to get to any given page number since the pdf file page number
almost never correlates with the number printed at the bottom of the
page.
Photos
An area where a electronic document is
very different from a printed one is the case of photographs. A
pdf document allows the user to change the size of the displayed
image. I find this very useful since my reading vision is not as
good as it once was, but it's fantastic to be able to zoom in on a high
resolution color photo to the point that you are seeing macroscopic
detail.
Taking a high resolution color
photo takes
some skill and is the subject of many books and college level
classes. When using a digital camera if at all possible use the
raw file format (the one that makes the largest file size). Note
that a color scanner can make a 30 Mega byte file at only 300 DPI and
HUGE files at higher DPI values like 600 or 1200. These provide
macroscopic views or even microscopic views when enlarged. You
can see way more in one of these images than you can with a magnifying
glass.
I frequently see things in my photos that I did not see with my eyes.
Making a high resolution color photo into a pdf does not result in much
file size reduction and may even make it larger, I feel it's the right
thing to do.
Links
Back to Brooke's Manuals
Scanned by Brooke, Home, Products for Sale web pages
02875 hits since May 04 2007 page created 22 May 2005.