How to digitise a book in 2025.

In this HOWTO I want to digitise a book in 2025, with minimal effort.

In short, it involves the following:

  1. Scan a book into TIFF or PNG.
  2. Fix scan deformities using scantailor.
  3. Compress image files until they have a sensible size.
    1. (Optional) Make different versions of the book for different use-cases, e.g. colour and greyscale.
  4. Combine the images into a single PDF using LaTeX or ocrmypdf.
  5. Add a text search layer (OCR) to a file using ocrmypdf.
  6. Add “roman page numbers” using python/mupdf/PyMuPDF/fitz or LaTeX.
  7. Add a Table of Contents (TOC), using python/mupdf/PyMuPDF/fitz or LaTeX.
  8. (Optional) Try to convert a pdf book to epub.
  9. Upload it where you need it to be available.

This task is a bit of a pain, but if you use a book often, it is worth it.

Throughout this HOWTO I assume that this is your own book.

1. Body

1.1. Scan a book into TIFF or PNG.

This is the easiest if you have a stack of paper sheets, and you can put them into a scanner which supports scanning stacks. However, this is often not the case.

In this HOWTO we have a slightly worse case: a webcam is focused well on the book pages and the book is fixed on the table, but you need to turn the pages manually.

On desktop 4 there is a webcam program which shows the book’s page. The script takes a screenshot every 5 seconds, so you have 5 seconds to turn each page. Increase the delay if you need more time.

pkill -f screensaver # we do not want to screenshot the lockscreen
sleep 3
xdotool key Super+4 # switch to desktop 4
sleep 3

for (( i=1 ; i<100 ; i++ )) ; do
  sleep 5
  scrot -a 70,80,925,1500 -f page-$(printf '%04d' $i).png # page dimensions
  xdotool key Right # prevent X11 from sleeping
done

xdotool key Super+3
xscreensaver-demo & # restore screensaver

You might think that asking the webcam program itself to save each frame is better than taking a screenshot, and it may be in your case, but in my case a screenshot was perfectly fine.

As a result you will have a directory of page scans. The later steps assume that it is called scans.
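
If bad captures were deleted or the loop was interrupted and restarted, the numbering may have gaps. A minimal Python sketch to list them, assuming the page-NNNN.png naming from the loop above (the function name missing_pages is made up for illustration):

```python
import re
from pathlib import Path

# Matches the page-0001.png naming used by the capture loop.
PAGE_RE = re.compile(r"page-(\d+)\.png$")


def missing_pages(directory):
    """Return the page numbers absent between the lowest and highest capture."""
    numbers = sorted(
        int(m.group(1))
        for p in Path(directory).iterdir()
        if (m := PAGE_RE.match(p.name))
    )
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]
```

Calling missing_pages("scans") returns an empty list when the sequence is complete. Files that do not match the naming pattern are ignored.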

1.2. Fix deformities using scantailor or unpaper.

This section is short, because the best way to learn scantailor-advanced is to try it yourself and build up some intuition.

You can also try “unpaper” if you want a fully automatic method.

1.3. Compress the images.

nfiles=$(ls scans/page-*.png | wc -l) # number of scanned pages

printf "Converting to greyscale PNG\n"
TMP=temp-bw-png
rm -rf "$TMP"
mkdir "$TMP"
# {} is the input file, {/.} is its basename without the extension
time parallel -j "$(nproc)" convert {} -colorspace Gray "$TMP"/{/.}.png ::: scans/page-*.png

printf "Converting to greyscale JPEG\n"
TMP=temp-bw-jpg
rm -rf "$TMP"
mkdir "$TMP"
time parallel -j "$(nproc)" convert {} -colorspace Gray -quality 90 "$TMP"/{/.}.jpg ::: scans/page-*.png

GNU parallel runs the conversions concurrently, one job per CPU core (-j $(nproc)), to use all CPU resources.
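
To judge whether the result has a sensible size, it helps to compare the total size of each image directory. A small Python sketch, assuming the directory names used above (the helper names are hypothetical):

```python
from pathlib import Path


def directory_size_mib(directory):
    """Total size of the regular files directly inside directory, in MiB."""
    total = sum(p.stat().st_size for p in Path(directory).iterdir() if p.is_file())
    return total / (1024 * 1024)


def size_report(directories):
    """Map each existing directory name to its size in MiB, skipping missing ones."""
    return {d: directory_size_mib(d) for d in directories if Path(d).is_dir()}
```

For example, size_report(["scans", "temp-bw-png", "temp-bw-jpg"]) gives one number per version of the book, which is roughly what the final PDFs will weigh.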

1.4. Combine the pages into a single PDF.

There are two ways of doing so.

  1. ocrmypdf
  2. LaTeX

ocrmypdf is easier. It takes a single input file, so first join the images with img2pdf and pipe the result in:

img2pdf scans/*.png | ocrmypdf - index.pdf

LaTeX, however, gives more control over the pages, lets you add extra material, and is more flexible in general.

The tricky part here is the colors array: the background colour of each image set has to be measured from the images by hand. You can use kcolorpicker or any other colour picker, or even try to extract it from the pages with imagemagick. This step matters, because without a good page background tesseract fails to perform OCR.

WORKINGDIR=bookmaker-temp # generated .tex and .pdf files go here
mkdir -p "$WORKINGDIR"

declare -A colors # background colour (HTML hex) measured for each image set
colors[scans]="A5A099"
colors[temp-bw-png]="9A9A9A"
colors[temp-bw-jpg]="9A9A9A"
dirs=(scans temp-bw-png temp-bw-jpg)
for t in "${dirs[@]}" ; do
  FILENAME="index-$t"
  FULLPATH="$WORKINGDIR/$FILENAME.tex"
  pages=("$t"/*)
  nfiles=${#pages[@]} # pages in this directory
  # the heredoc terminator (EOF) must start at the beginning of its line
  cat > "$FULLPATH" <<EOF
\documentclass[a4paper]{article}
\usepackage{hyperref} % roman numbering and TOC
\usepackage[dvipsnames]{xcolor} % background
% taken from the file with colorpicker
% greatly improves OCR
\definecolor{Mycolor2}{HTML}{${colors[$t]}}
\usepackage[margin=0in]{geometry} % no margins
\usepackage{graphicx} % for image inclusion
\begin{document}
\pagecolor{Mycolor2}
\pagenumbering{Alph} % front cover is neither roman nor arabic page
\begin{center}
EOF
  i=0
  for p in "${pages[@]}" ; do
    i=$((i+1))
    # uncomment to build only the first 55 pages while testing
    #  if (( i > 55 )) ; then break ; fi
    # in this book the roman-numbered front matter starts on page 3
    if (( i == 3 )) ; then
      printf '%s' "\\pagenumbering{roman}" >> "$FULLPATH"
    fi
    printf '%s' "\\includegraphics[height=\\paperheight,keepaspectratio]{$(readlink -f "$p")}" >> "$FULLPATH"
    # in this book the arabic numbering starts at page 39
    if (( i == 39 )) ; then
      printf '%s\n' '\pagenumbering{arabic}' >> "$FULLPATH"
    fi
    # no page break after the last image
    if (( i != nfiles )) ; then
      printf '%s\n' '\newpage' >> "$FULLPATH"
    fi
  done
  cat >> "$FULLPATH" <<EOF

\end{center}
\end{document}
EOF
done
for t in "${dirs[@]}" ; do
  FILENAME="index-$t"
  (cd "$WORKINGDIR"
   printf "Building pdf for %s\n" "$FILENAME"
   # nonstopmode keeps lualatex from waiting for input on an error we cannot see
   time lualatex -interaction=nonstopmode "$FILENAME.tex" > /dev/null
  )
  printf "Non-OCRed file is %s\n" "$WORKINGDIR/$FILENAME.pdf"
  ls -lh "$WORKINGDIR/$FILENAME.pdf"
done
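
The background colours in the colors array could also be estimated instead of picked by eye. The following Python sketch is one possible approach, not the method used for this book: it counts coarse colour buckets and returns the most common one. The function name and the step parameter are made up for illustration.

```python
from collections import Counter


def dominant_color_hex(pixels, step=8):
    """Return the most common colour as an HTML hex string (no leading '#').

    pixels: iterable of (r, g, b) tuples with values 0..255.
    step:   bucket width; nearby shades are counted together so that sensor
            noise does not split the background across many exact colours.
    """
    buckets = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    if not buckets:
        raise ValueError("no pixels given")
    (r, g, b), _ = buckets.most_common(1)[0]
    # Use the centre of the winning bucket as the representative colour.
    centre = step // 2
    return "".join(f"{min(255, c * step + centre):02X}" for c in (r, g, b))
```

The pixel tuples can come from any image library, for example from the pixel data of Pillow's Image.open(path).convert("RGB"), assuming Pillow is installed. On a page that is mostly paper, the winning bucket should be the paper colour, but check the result against a colour picker before trusting it.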

1.5. Add a text search layer (OCR) to a file using ocrmypdf.

for t in "${dirs[@]}" ; do
  FILENAME="index-$t"
  (cd "$WORKINGDIR"
   printf "OCRing the pdf %s\n" "$FILENAME"
   time ocrmypdf --force-ocr "$FILENAME.pdf" "$FILENAME.ocr.pdf"
  )
  printf "Your ready-to-use file is %s\n" "$WORKINGDIR/$FILENAME.ocr.pdf"
  ls -lh "$WORKINGDIR/$FILENAME.ocr.pdf"
done

There is little to comment on here: ocrmypdf runs tesseract over every page and writes a copy of the PDF with a searchable text layer.

Note that ocrmypdf can, in principle, do more than just OCR; see ocrmypdf --help. In particular, it can call unpaper to clean up the pages (--clean) and compress the images inside the PDF (--optimize), so you may not need the separate parallel step at all.

1.6. Add “roman page numbers” and add a Table of Contents.

You have to write the TOC manually, because there is no fully automatic way to do it.

However, not all is lost: if the OCR process succeeded, you can copy at least some of the data from the recognised table of contents in the PDF.

import fitz  # PyMuPDF


def add_toc_to_pdf(input_pdf, output_pdf, toc_entries):
    """
    Example of toc_entries:
    [
        [1, 'Chapter 1', 1],
        [2, 'Section 1.1', 2],
        [2, 'Section 1.2', 3],
        [1, 'Chapter 2', 4],
        [2, 'Section 2.1', 5]
    ]
    """
    pdf_document = fitz.open(input_pdf)
    # each entry is [level, title, page]; page numbers are 1-based
    pdf_document.set_toc(toc_entries)
    # optional page labels; startpage is a 0-based page index
#    pdf_document.set_page_labels([{'startpage': 1, 'style': 'r', 'firstpagenum': 1},
#                                  {'startpage': 39, 'style': 'D', 'firstpagenum': 1}])
    pdf_document.save(output_pdf)
    pdf_document.close()
    print(f"Table of contents added and saved to {output_pdf}")

toc_entries =     [
        [1, 'Chapter 1', 1],
        [2, 'Section 1.1', 2],
        [2, 'Section 1.2', 3],
        [1, 'Chapter 2', 4],
        [2, 'Section 2.1', 5]
    ]

add_toc_to_pdf( "bookmaker-temp/index-scans.ocr.pdf", "bookmaker-temp/index-scans.ocr.withtoc.pdf" , toc_entries)
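
A first draft of toc_entries can be generated from the OCRed contents page. The sketch below is only a starting point: it assumes each line looks like "1.2 Title ..... 17" (optional section number, title, dot leaders or spaces, printed page number), and parse_toc_text and page_offset are hypothetical names. page_offset is the difference between the PDF page and the printed page; with arabic numbering starting on PDF page 39, as in the LaTeX script above, it would be 38.

```python
import re

TOC_LINE = re.compile(
    r"^\s*(?P<num>\d+(?:\.\d+)*)?\.?\s*"  # optional section number such as 1 or 1.2
    r"(?P<title>\D.*?)"                   # title text
    r"[\s.]+(?P<page>\d+)\s*$"            # dot leaders or spaces, then the printed page
)


def parse_toc_text(text, page_offset=0):
    """Turn OCRed contents-page text into [level, title, pdf_page] entries."""
    entries = []
    for line in text.splitlines():
        m = TOC_LINE.match(line)
        if not m:
            continue  # blank lines, headings, roman-numbered entries, OCR noise
        num = m.group("num")
        title = m.group("title").strip()
        page = int(m.group("page"))
        # "1" is level 1, "1.2" is level 2, and so on; unnumbered lines are level 1
        level = num.count(".") + 1 if num else 1
        label = f"{num} {title}" if num else title
        entries.append([level, label, page + page_offset])
    return entries
```

The output still needs proofreading: OCR errors in page numbers pass through unchanged, entries with roman page numbers are skipped, and the levels are not checked for consistency before being handed to set_toc.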

Look at the commented part: pdf_document.set_page_labels. Style D is “digits”, and r is “lowercase roman numbers”. Each rule applies from its startpage up to the next rule.

In theory you can add roman numbering right in this script, and dispense with LaTeX.
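
Before writing page labels into the PDF, it can help to preview which label each page would get. This Python sketch does not use PyMuPDF; it only interprets rule dictionaries with the same keys as the commented-out set_page_labels call. The function names are made up, and only the styles r, R and D are handled.

```python
def to_roman(n):
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def preview_label(page_index, rules):
    """Return the label the rules describe for a 0-based page index.

    Pages before the first rule get an empty label in this sketch.
    """
    active = None
    for rule in sorted(rules, key=lambda r: r["startpage"]):
        if rule["startpage"] <= page_index:
            active = rule  # the last rule starting at or before this page wins
    if active is None:
        return ""
    number = active.get("firstpagenum", 1) + page_index - active["startpage"]
    if active["style"] == "r":
        return to_roman(number)
    if active["style"] == "R":
        return to_roman(number).upper()
    return str(number)  # treat everything else as plain digits
```

With the two rules from the commented-out call, page index 4 is labelled iv and page index 39 is labelled 1.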

1.7. (Optional) Try to convert a pdf book to epub.

Extract the text.

pdfgrep '' index-scans.ocr.withtoc.pdf > booktext.txt

Tesseract gives you a decent but “stupid” conversion, which does not recognise the page structure, so you might want to try some more advanced tools or services. (I tried Mathpix, but it is up to you.)

From the text and the TOC you can try and make the epub, but it is far more work than everything above combined.
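
As a very rough first step towards an epub, the extracted text can be cut into chapters at the headings known from the TOC. The Python sketch below assumes that each heading appears on a line of its own and was recognised exactly by the OCR, which is often not the case; split_by_headings is a hypothetical helper name.

```python
def split_by_headings(text, headings):
    """Split text into (heading, body) pairs at lines that equal a known heading."""
    wanted = {h.strip().lower() for h in headings}
    sections = [("", [])]  # text before the first heading gets an empty title
    for line in text.splitlines():
        if line.strip().lower() in wanted:
            sections.append((line.strip(), []))
        else:
            sections[-1][1].append(line)
    return [(title, "\n".join(body).strip()) for title, body in sections]
```

Each resulting pair could then become one chapter file of the epub. Headers, footers and page numbers that the OCR picked up still have to be cleaned out by hand.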

1.8. Enjoy yourself.

Upload your book where you want, say, on your website, and feel the inner glow of joy overwhelming you.