How to digitise a book in 2025.
In this HOWTO I want to digitise a book in 2025, with minimal effort.
In short, it involves the following:
- Scan a book into TIFF or PNG.
- Fix scan deformities using scantailor.
- Compress image files until they have a sensible size.
- (Optional) Make different versions of the book for different use-cases, e.g. colour and greyscale.
- Combine the images into a single PDF using LaTeX or ocrmypdf.
- Add a text search layer (OCR) to a file using ocrmypdf.
- Add “roman page numbers” using python/mupdf/PyMuPDF/fitz or LaTeX.
- Add a Table of Contents (TOC), using python/mupdf/PyMuPDF/fitz or LaTeX.
- (Optional) Try to convert a pdf book to epub.
- Upload it where you need it to be available.
This task is a bit of a pain, but if you use a book often, it is worth it.
Throughout this HOWTO I assume that this is your own book.
1. Body
1.1. Scan a book into TIFF or PNG.
This is the easiest if you have a stack of paper sheets, and you can put them into a scanner which supports scanning stacks. However, this is often not the case.
In this HOWTO we have a slightly worse case: a webcam is focused well on the book pages and the book is fixed to the table, but you need to turn the pages manually.
On virtual desktop 4, a webcam program shows the current page of the book. The script below takes a screenshot of it every 5 seconds, so that is how long you have to turn each page. Increase the delay if you need more time.
pkill -f screensaver   # we do not want to screenshot the lockscreen
sleep 3
xdotool key Super+4    # switch to desktop 4
sleep 3
for (( i=1 ; i<100 ; i++ )) ; do
  sleep 5
  scrot -a 70,80,925,1500 -f page-$(printf '%04d' $i).png   # page dimensions
  xdotool key Right    # prevent X11 from sleeping
done
xdotool key Super+3
xscreensaver-demo &    # restore screensaver
You might think that asking the webcam program itself to take the picture is better, and it may be in your case, but in my case a screenshot was perfectly fine.
As a result you will have a directory of page scans.
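Since pages are turned by hand against a timer, it may be worth checking that no page number is missing before going further. The following is a minimal sketch, assuming the page-NNNN.png naming used by the script above. The function names and the choice of the current directory are illustrative, not part of the original workflow.

```python
import re
from pathlib import Path

# File names as produced by the scanning script: page-0001.png, page-0002.png, ...
PAGE_RE = re.compile(r"page-(\d{4})\.png$")


def page_numbers(filenames):
    """Return the sorted page numbers found in names like page-0007.png."""
    numbers = []
    for name in filenames:
        match = PAGE_RE.search(name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def missing_pages(filenames):
    """Return page numbers absent between the lowest and highest number seen."""
    numbers = page_numbers(filenames)
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]


if __name__ == "__main__":
    # "." is an assumption: run this from the directory holding the scans.
    names = [p.name for p in Path(".").glob("page-*.png")]
    print(f"{len(names)} page files found, missing: {missing_pages(names)}")
```

This only detects gaps in the numbering. It cannot tell whether a page was photographed twice or mid-turn, so a quick visual pass over the images is still useful.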
1.2. Fix deformities using scantailor or unpaper.
This step is short, because you are better off trying scantailor-advanced yourself to build an intuition for it.
You can also try “unpaper” if you want a fully automatic method.
1.3. Compress the images.
# nfiles: the number of page files, assumed to be set earlier
printf "Converting to Black-and-White\n"

TMP=temp-bw-png
rm -rf "$TMP"
mkdir "$TMP"
time seq -w 0000 $((nfiles-1)) | parallel -j $(nproc) convert -colorspace Gray scans/page-{}.png $TMP/page-{}.png

TMP=temp-bw-jpg
rm -rf "$TMP"
mkdir "$TMP"
time seq -w 0000 $((nfiles-1)) | parallel -j $(nproc) convert -colorspace Gray -quality 90 scans/page-{}.png $TMP/page-{}.jpg
parallel runs the conversions in parallel to use all CPU cores.
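To judge whether the images have reached a sensible size, it helps to compare the total size of each directory. The following is a minimal sketch, assuming the directory names scans, temp-bw-png and temp-bw-jpg from the commands above. The helper names are illustrative.

```python
from pathlib import Path


def directory_size(directory):
    """Total size in bytes of the regular files directly inside directory."""
    return sum(p.stat().st_size for p in Path(directory).iterdir() if p.is_file())


def human_readable(num_bytes):
    """Format a byte count with binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    # Directory names are taken from the conversion commands above.
    for name in ("scans", "temp-bw-png", "temp-bw-jpg"):
        if Path(name).is_dir():
            print(f"{name}: {human_readable(directory_size(name))}")
        else:
            print(f"{name}: not found")
```

If the greyscale JPEG directory is not noticeably smaller than the originals, lowering the -quality value is the obvious next thing to try.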
1.4. Combine the pages into a single PDF.
There are two ways of doing so:
- ocrmypdf
- LaTeX
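ocrmypdf is easier. It takes a single input file, so first combine the images with a tool such as img2pdf and pipe the result in, with something like
img2pdf scans/*.png | ocrmypdf - index.pdf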
But LaTeX gives more control over pages, allows adding stuff, and in general is more flexible.
The tricky thing here is the colors dictionary: it needs to be measured from the images by hand.
You can use kcolorpicker or anything else, or even try to extract it from the pages using imagemagick.
It is actually quite necessary, because without a good background tesseract fails to perform OCR.
declare -A colors
colors[scans]="A5A099"
colors[temp-bw-png]="9A9A9A"
colors[temp-bw-jpg]="9A9A9A"
dirs=(scans temp-bw-png temp-bw-jpg)

for t in "${dirs[@]}" ; do
  FILENAME="index-$t"
  FULLPATH="$WORKINGDIR"/"${FILENAME}.tex"
  cat > "$FULLPATH" <<EOF
\documentclass[a4paper]{article}
\usepackage{hyperref} % roman numbering and TOC
\usepackage[dvipsnames]{xcolor} % background
% taken from the file with colorpicker
% greatly improves OCR
\definecolor{Mycolor2}{HTML}{${colors[$t]}}
\usepackage[margin=0in]{geometry} % no margins
\usepackage{graphicx} % for image inclusion
\begin{document}
\pagecolor{Mycolor2}
\pagenumbering{Alph} % front cover is neither roman nor arabic page
\begin{center}
EOF
  i=0
  for p in "$t"/* ; do
    i=$((i+1))
    # if (( i > 55 )) ; then break ; fi
    if (( i == 3 )) ; then   # roman numbering starts at the third image
      printf '%s' "\\pagenumbering{roman}" >> "$FULLPATH"
    fi
    printf '%s' "\\includegraphics[height=\\paperheight,keepaspectratio]{$(readlink -f "$p")}" >> "$FULLPATH"
    if (( i == 39 )) ; then  # arabic numbering starts after image 39
      printf '%s\n' '\pagenumbering{arabic} ' >> "$FULLPATH"
    fi
    if (( i != nfiles )) ; then
      printf '%s\n' '\newpage' >> "$FULLPATH"
    fi
  done
  cat >> "$FULLPATH" <<EOF
\end{center}
\end{document}
EOF
done

for t in "${dirs[@]}" ; do
  FILENAME="index-$t"
  (cd "$WORKINGDIR"
   printf "Building pdf for %s\n" "$FILENAME"
   time lualatex "$FILENAME.tex" > /dev/null
  )
  printf "Non-OCRed file is %s\n" "$WORKINGDIR"/"$FILENAME.pdf"
  ls -lh "$WORKINGDIR"/"$FILENAME.pdf"
done
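The values in the colors dictionary above were measured by hand. As a rough alternative, the background can be estimated as the most common colour on a page. The following is a minimal sketch of that idea in plain Python. It assumes you already have the page's pixels as (r, g, b) tuples, for example from an image library such as Pillow. The function name and the step parameter are illustrative, and this is not the method used for the values above.

```python
from collections import Counter


def dominant_colour_hex(pixels, step=8):
    """Return the most common colour as an HTML hex string such as 'A5A099'.

    pixels is an iterable of (r, g, b) tuples with 0-255 channels.
    Channels are rounded down to multiples of step, so that sensor noise
    does not split the paper colour into many nearly identical values.
    """
    buckets = Counter(
        (r // step * step, g // step * step, b // step * step)
        for r, g, b in pixels
    )
    if not buckets:
        raise ValueError("no pixels given")
    (r, g, b), _count = buckets.most_common(1)[0]
    return f"{r:02X}{g:02X}{b:02X}"


if __name__ == "__main__":
    # Synthetic page: mostly paper-coloured pixels plus some dark "ink".
    paper = [(165, 160, 153)] * 900
    ink = [(20, 20, 20)] * 100
    print(dominant_colour_hex(paper + ink, step=1))  # prints A5A099
```

Rounding down with a larger step makes the result slightly darker than the true paper colour, so it is worth comparing the output against a hand-picked value on one page before trusting it.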
1.5. Add a text search layer (OCR) to a file using ocrmypdf.
for t in "${dirs[@]}" ; do
  FILENAME="index-$t"
  (cd "$WORKINGDIR"
   printf "OCRing the pdf %s\n" "$FILENAME"
   time ocrmypdf --force-ocr "$FILENAME.pdf" "$FILENAME.ocr.pdf"
  )
  printf "Your ready to use file is %s\n" "$WORKINGDIR"/"$FILENAME.ocr.pdf"
  ls -lh "$WORKINGDIR"/"$FILENAME.ocr.pdf"
done
Nothing really to comment here, because it is self-evident.
Note that ocrmypdf can, in principle, do more than just OCR: look at ocrmypdf --help.
In particular, it can call unpaper to clean up the pages, compress pages into a PDF file without you having to perform the parallel step, et cetera.
1.6. Add “roman page numbers” and add a Table of Contents.
You have to write a TOC manually, because there is no fully-automatic algorithm to do it.
However, not all is lost: if the OCR process succeeded, you can copy at least some of the data from the recognised table of contents in the PDF.
import fitz  # PyMuPDF


def add_toc_to_pdf(input_pdf, output_pdf, toc_entries):
    """
    Example of toc_entries:
    [
        [1, 'Chapter 1', 1],
        [2, 'Section 1.1', 2],
        [2, 'Section 1.2', 3],
        [1, 'Chapter 2', 4],
        [2, 'Section 2.1', 5]
    ]
    """
    pdf_document = fitz.open(input_pdf)
    pdf_document.set_toc(toc_entries)
    # pdf_document.set_page_labels([{'startpage': 1, 'style': 'r', 'firstpagenum': 1},
    #                               {'startpage': 39, 'style': 'D', 'firstpagenum': 1}])
    pdf_document.save(output_pdf)
    print(f"Table of contents added and saved to {output_pdf}")


toc_entries = [
    [1, 'Chapter 1', 1],
    [2, 'Section 1.1', 2],
    [2, 'Section 1.2', 3],
    [1, 'Chapter 2', 4],
    [2, 'Section 2.1', 5]
]

add_toc_to_pdf(
    "bookmaker-temp/index-scans.ocr.pdf",
    "bookmaker-temp/index-scans.ocr.withtoc.pdf",
    toc_entries)
Look at the commented part: pdf_document.set_page_labels.
Style D is “digits”, and r is “roman numbers”.
In theory you can add roman numbering right in this script, and dispense with LaTeX.
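If the OCR recognised the book's printed table of contents, part of the typing can be avoided by parsing the copied text into the toc_entries format used above. The following is a minimal sketch, assuming contents lines of the form "1.2 Title ..... 17". The regular expression, the function name and the page_offset parameter are illustrative, and real OCR output will likely need manual correction afterwards.

```python
import re

# Matches lines such as "1.2 Section name ........ 17" or "2. Methods 10".
TOC_LINE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.+?)[\s.]+(\d+)\s*$")


def parse_toc(text, page_offset=0):
    """Convert copied table-of-contents text into [level, title, page] entries.

    page_offset is added to every printed page number to obtain the page
    position in the PDF, e.g. to skip the cover and the roman-numbered pages.
    """
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if not match:
            continue  # skip headings, blank lines and OCR noise
        number, title, page = match.groups()
        level = number.count(".") + 1  # "1" -> level 1, "1.2" -> level 2
        entries.append([level, f"{number} {title}", int(page) + page_offset])
    return entries


if __name__ == "__main__":
    sample = """
Contents
1 Introduction ........ 1
1.1 Motivation ........ 2
2. Methods 10
"""
    # The offset of 38 is a made-up value for this demonstration.
    for entry in parse_toc(sample, page_offset=38):
        print(entry)
```

The resulting list can be passed to add_toc_to_pdf as toc_entries. It is worth checking by eye that the first entry has level 1 and that no level jumps by more than one, since outline formats tend to be strict about hierarchy.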
1.7. (Optional) Try to convert a pdf book to epub.
Extract the text.
pdfgrep '' index-scans.ocr.withtoc.pdf > booktext.txt
Tesseract would give you a decent, but “stupid” conversion, without recognising page structure, so you might want to try some more advanced tools or services.
(I tried Mathpix, but it is up to you.)
From the text and the TOC you can try and make the epub, but it is far more work than everything above combined.
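Part of that work is undoing the page layout in the extracted text, because every printed line ends with a line break and words are split by hyphens. The following is a minimal sketch of a first clean-up pass, assuming paragraphs in booktext.txt are separated by blank lines. The function name is illustrative, and the heuristic is deliberately crude.

```python
import re


def rejoin_paragraphs(text):
    """Turn line-wrapped OCR text into one line per paragraph."""
    # Glue words split across lines: "digi-\ntise" -> "digitise".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines are treated as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse the remaining line breaks and runs of spaces in each paragraph.
    cleaned = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "How to digi-\ntise a\nbook.\n\nSecond para-\ngraph.\n"
    print(rejoin_paragraphs(sample))
```

Note that this also merges genuinely hyphenated words that happen to break at a line end (for example "well-" followed by "known"), so the output still needs proofreading.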
1.8. Enjoy yourself.
Upload your book where you want, say, on your website, and feel the inner glow of joy overwhelming you.