Ian suggested I share the highly involved build process for my doctoral dissertation, which I submitted for examination earlier this year. Beyond just compiling a PDF from Markdown and LaTeX sources, there are just two, simple-seeming goals: produce a PDF that passes PDF/A validation, for long term archival, and replace the second page with a scanned copy of the page after it was signed by the examiners. Achieving these two things reproducibly turned out to require a lot of complexity.

First we build dissertation1.tex out of a number of LaTeX and Markdown files, and a Pandoc metadata.yml, using Pandoc in a Debian sid chroot. I had to do the latter because I needed a more recent Pandoc than was available in Debian stable at the time, and didn’t dare upgrade anything else. Indeed, after switching to the newer Pandoc, I carefully diff’d dissertation1.tex to ensure nothing other than what I needed had changed.

dissertation1.tex: preamble.tex \
 citeproc-preamble.tex \
 committee.tex \
 acknowledgements.tex \
 dedication.tex \
 contents.tex \
 abbreviations.tex \
 abstract.tex \
 metadata.yaml \
 template.latex \
 philos.csl \
 philos.bib \
 ch1.md ch1_appA.md ch2.md ch3.md ch3_appB.md ch4.md ch5.md
    schroot -c melete-sid -- pandoc -s -N -C -H preamble.tex \
        --template=template.latex -B committee.tex \
        -B acknowledgements.tex -B dedication.tex \
        -B contents.tex -B abbreviations.tex -B abstract.tex \
        ch1.md ch1_appA.md ch2.md ch3.md ch3_appB.md ch4.md ch5.md \
        citeproc-preamble.tex metadata.yaml -o $@

With hindsight, I think that I should have eschewed Pandoc in favour of plain LaTeX for a project as large as this was. Pandoc is good for journal submissions, where one is responsible for the content but not really the presentation. However, one typesets one’s own dissertation, without anyone else’s help. I decided to commit dissertation1.tex to git, because Pandoc’s LaTeX generation is not too stable.

We then compile a first PDF. My Makefile comments say that pdfx.sty requires this particular xelatex invocation. pdfx.sty is supposed to make the PDF satisfy the PDF/A-2B long term archival standard … but dissertation1.pdf doesn’t actually pass PDF/A validation. We instead rely on GhostScript to produce a valid PDF/A-2B, at the final step. But we have to include pdfx.sty at this stage to ensure that the hyperlinks in the PDF are PDF/A-compatible – without pdfx.sty, GhostScript rejects hyperref’s links.

dissertation1.pdf: \
 dissertation1.tex dissertation1.xmpdata committee_watermark.png
    xelatex -shell-escape -output-driver="xdvipdfmx -z 0" $<
    xelatex -shell-escape -output-driver="xdvipdfmx -z 0" $<
    xelatex -shell-escape -output-driver="xdvipdfmx -z 0" $<

As I said, the second page of the PDF needs to be replaced with a scanned version of the page after it was signed by the examiners. The usual tool to stitch PDFs together is pdftk. But pdftk loses the PDF’s metadata. For the true, static metadata like the title, author and keywords, it would be no problem to add them back. But the metadata that’s lost includes the PDF’s table of contents, which PDF readers display in a sidebar, with clickable links to chapters, and the sections within those. This information is not static because each time any of the source Markdown and LaTeX files change, there is the potential for the table of contents to change. So we have to extract all the metadata from dissertation1.pdf and save it to one side, before we stitch in the scanned page. We also have to hack the metadata to ensure that the second page will have the correct orientation.

SED = /^PageMediaNumber: 2$$/ { n; s/0/90/; n; s/612 792/792 612/ }
KEYWORDS = virtue ethics, virtue, happiness, eudaimonism, good lives, final ends
dissertation1_meta.txt: dissertation1.pdf
    printf "InfoBegin\nInfoKey: Keywords\nInfoValue: %s\n%s\n" \
        "${KEYWORDS}" "$$(pdftk $^ dump_data)" \
        | sed "${SED}" >$@

Now we can stitch in the signed page, and then put the metadata back. You can’t do this in one invocation of pdftk, so far as I could see.

dissertation1_stitched_updated.pdf: \
 dissertation1_stitched.pdf dissertation1_meta.txt
    pdftk dissertation1_stitched.pdf \
        update_info dissertation1_meta.txt output $@

dissertation1_stitched.pdf: dissertation1.pdf
    pdftk A=$^ \
        B=$$HOME/annex/philos/Dissertation/committee_signed.pdf \
        cat A1 B1 A3-end output $@

Finally, we use GhostScript to reprocess the PDF into two valid PDF/A-2Bs, one optimised for the web. This requires supplying a colour profile, a PDFA_def.ps postscript file, a whole sequence of GhostScript options, and some raw postscript on the command line, which gives the PDF reader some display hints.

    -sColorConversionStrategy=UseDeviceIndependentColor \
    -dEmbedAllFonts=true -dPrinted=false -dPDFA=2 \
    -dPDFACompatibilityPolicy=1 -dDetectDuplicateImages \
    -dPDFSETTINGS=/printer -sOutputFile=$@
GS_OPTS2 = PDFA_def.ps dissertation1_stitched_updated.pdf \
    -c "[ /PageMode /UseOutlines \
        /Page 1 /View [/XYZ null null 1] \
        /PageLayout /SinglePage /DOCVIEW pdfmark"

all: Whitton_dissert_web.pdf Whitton_dissert_gradcol.pdf
Whitton_dissert_gradcol.pdf: \
 PDFA_def.ps dissertation1_stitched_updated.pdf srgb.icc
    gs ${GS_OPTS1} ${GS_OPTS2}
Whitton_dissert_web.pdf: \
 PDFA_def.ps dissertation1_stitched_updated.pdf srgb.icc
    gs ${GS_OPTS1} -dFastWebView=true ${GS_OPTS2}

And here’s PDFA_def.ps, based on a sample in the GhostScript docs:

% Define an ICC profile :
/ICCProfile (srgb.icc)

[/_objdef {icc_PDFA} /type /stream /OBJ pdfmark

/N 3
>> /PUT pdfmark
[{icc_PDFA} ICCProfile (r) file /PUT pdfmark

% Define the output intent dictionary :
[/_objdef {OutputIntent_PDFA} /type /dict /OBJ pdfmark
[{OutputIntent_PDFA} <<
  /Type /OutputIntent               % Must be so (the standard requires).
  /S /GTS_PDFA1                     % Must be so (the standard requires).
  /DestOutputProfile {icc_PDFA}     % Must be so (see above).
  /OutputConditionIdentifier (sRGB)
>> /PUT pdfmark
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ]>> /PUT pdfmark


Some comments on Hacker News

Posted Tue 05 Dec 2023 14:46:36 UTC