You can use [[pdftk]] to fill out [[PDF]] forms (thanks for the
inspiration, [Joe Rothweiler][JR]).  The syntax is simple:

    $ pdftk input.pdf fill_form data.fdf output output.pdf

where `input.pdf` is the input PDF containing the form, `data.fdf` is
an [FDF][] or [XFDF][] file containing your data, and `output.pdf` is
the name of the PDF you're creating.  The tricky part is figuring out
what to put in `data.fdf`.  There's a useful comparison of the Forms
Data Format (FDF) and its XML version (XFDF) in the [XFDF
specification][XFDF-specs].  XFDF only covers a subset of FDF, so I
won't worry about it here.  FDF is defined in section 12.7.7 of [ISO
32000-1:2008][ISO32000], the PDF 1.7 specification, and it has been in
PDF specifications since version 1.2.

Forms Data Format (FDF)
=======================

FDF files are basically stripped-down PDFs (§12.7.7.1).  A simple FDF
file will look something like:

    %FDF-1.2
    1 0 obj<</FDF<</Fields[
    <</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
    <</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
    …
    ] >> >>
    endobj
    trailer
    <</Root 1 0 R>>
    %%EOF

Broken down into the lingo of ISO 32000, we have a header
(§12.7.7.2.2):

    %FDF-1.2

followed by a body with a single object (§12.7.7.2.3):

    1 0 obj<</FDF<</Fields[
    <</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
    <</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
    …
    ] >> >>
    endobj

followed by a trailer (§12.7.7.2.4):

    trailer
    <</Root 1 0 R>>
    %%EOF

Despite the claims in §12.7.7.2.1 that the trailer is optional, pdftk
choked on files without it:

    $ cat no-trailer.fdf
    %FDF-1.2
    1 0 obj<</FDF<</Fields[
    <</T(Name)/V(Trevor)>>
    <</T(Date)/V(2012-09-20)>>
    ] >> >>
    endobj
    $ pdftk input.pdf fill_form no-trailer.fdf output output.pdf
    Error: Failed to open form data file:
       no-trailer.fdf
       No output created.

Trailers are easy to add, since all they require is a reference to the
root of the FDF catalog dictionary.  If you only have one dictionary,
you can always use the simple trailer I gave above.
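
If you'd rather script this than type it by hand, here's a minimal
Python sketch that writes the same structure (header, one-object body,
and the simple trailer) from a mapping of field names to values.  The
`build_fdf` and `escape_pdf_string` names are my own, chosen for
illustration, and the sketch assumes ASCII-only names and values.
Parentheses and backslashes get backslash-escaped because they are
special inside PDF literal strings.

```python
def escape_pdf_string(text):
    """Backslash-escape the characters that are special in a PDF literal string."""
    return (text.replace("\\", "\\\\")   # escape backslashes first
                .replace("(", "\\(")
                .replace(")", "\\)"))


def build_fdf(fields):
    """Return the text of a minimal FDF file for a {name: value} mapping.

    Assumption: names and values are plain ASCII strings.
    """
    lines = ["%FDF-1.2", "1 0 obj<</FDF<</Fields["]
    for name, value in fields.items():
        lines.append("<</T({})/V({})>>".format(
            escape_pdf_string(name), escape_pdf_string(value)))
    # Close the /Fields array, the FDF dictionary, and the catalog,
    # then add the simple trailer pointing back at object 1.
    lines += ["] >> >>", "endobj", "trailer", "<</Root 1 0 R>>", "%%EOF"]
    return "\n".join(lines) + "\n"
```

Writing the result of `build_fdf({'Name': 'Trevor', 'Date':
'2012-09-20'})` to `data.fdf` should give you a file with the same
layout as the example above, ready to hand to `fill_form`.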

FDF Catalog
-----------

The meat of the FDF file is the catalog (§12.7.7.3).  Let's take a
closer look at the catalog structure:

    1 0 obj<</FDF<</Fields[
    …
    ] >> >>

This defines a new object (the FDF catalog) which contains one key
(`/FDF`) whose value is the FDF dictionary.  The FDF dictionary
contains one key (`/Fields`) and its associated array of fields.  Then
we close the `/Fields` array (`]`), close the FDF dictionary (`>>`) and
close the FDF catalog (`>>`).

There are a number of interesting entries that you can add to the FDF
dictionary (§12.7.7.3.1, table 243), some of which require a more
advanced FDF version.  You can add the `/Version` key to the FDF
catalog (§12.7.7.3.1, table 242) to specify the FDF version of the
data in the dictionary:

    1 0 obj<</Version/1.3/FDF<</Fields[…

Now you can extend the dictionary using table 243.  Let's set things up
to use [UTF-8][] for the field values (`/V`) or options (`/Opt`):

    1 0 obj<</Version/1.3/FDF<</Encoding/utf_8/Fields[
    <</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
    <</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
    …
    ] >> >>
    endobj

pdftk understands raw text in the specified encoding (`(…)`), raw
UTF-16 strings starting with a [BOM][] (`(\xFE\xFF…)`), or UTF-16BE
strings encoded as ASCII hex (`<FEFF…>`).  You can use
[[pdf-merge.py|PDF_bookmarks_with_Ghostscript/pdf-merge.py]] and its
`--unicode` option to find the latter.  Support for the `/utf_8`
encoding in pdftk is new.  I mailed a
[[patch|0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch]]
to pdftk's Sid Steward and posted a [patch request][utf-8-patch] to
the underlying iText library.  Until those get accepted, you're stuck
with the less convenient encodings.
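
To make the ASCII hex form concrete, here's a small Python sketch that
turns a value into a `<FEFF…>` string: the big-endian BOM followed by
the UTF-16BE bytes, written out as hex digits.  `fdf_hex_string` is a
hypothetical helper name of mine, not something provided by pdftk or
pdf-merge.py.

```python
import codecs


def fdf_hex_string(text):
    """Encode text as a BOM-prefixed UTF-16BE PDF hex string."""
    data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    # PDF hex strings are delimited by angle brackets.
    return "<" + data.hex().upper() + ">"
```

For example, `fdf_hex_string('é')` returns `<FEFF00E9>`, which can
stand in for a `(…)` value inside a field dictionary.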

Fonts
-----

Say you fill in some Unicode values, but your PDF reader is having
trouble rendering some funky glyphs.  Maybe it doesn't have access to
the right font?  You can see which fonts are embedded in a given PDF
using [pdffonts][].

    $ pdffonts input.pdf
    name                                 type              emb sub uni object ID
    ------------------------------------ ----------------- --- --- --- ---------
    MMXQDQ+UniversalStd-NewswithCommPi   CID Type 0C       yes yes yes   1738  0
    MMXQDQ+ZapfDingbatsStd               CID Type 0C       yes yes yes   1749  0
    MMXQDQ+HelveticaNeueLTStd-Roman      Type 1C           yes yes no    1737  0
    CPZITK+HelveticaNeueLTStd-BlkCn      Type 1C           yes yes no    1739  0
    …

If you don't have the right font for your new data, you can add it
[using current versions of iText][TextFieldFonts.java].  However,
pdftk uses an older version, so I'm not sure how to translate this
idea for pdftk.

FDF templates and field names
-----------------------------

You can use pdftk itself to create an FDF template, which it does with
embedded UTF-16BE (you can see the FE FF BOMs at the start of each
string value).

    $ pdftk input.pdf generate_fdf output template.fdf
    $ hexdump -C template.fdf | head
    00000000  25 46 44 46 2d 31 2e 32  0a 25 e2 e3 cf d3 0a 31  |%FDF-1.2.%.....1|
    00000010  20 30 20 6f 62 6a 20 0a  3c 3c 0a 2f 46 44 46 20  | 0 obj .<<./FDF |
    00000020  0a 3c 3c 0a 2f 46 69 65  6c 64 73 20 5b 0a 3c 3c  |.<<./Fields [.<<|
    00000030  0a 2f 56 20 28 fe ff 29  0a 2f 54 20 28 fe ff 00  |./V (..)./T (...|
    00000040  50 00 6f 00 73 00 74 00  65 00 72 00 4f 00 72 00  |P.o.s.t.e.r.O.r.|
    …
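
If you want to read those strings without squinting at a hex dump,
something like the following Python sketch should do.  It assumes you
have already pulled out the raw bytes between the parentheses and
undone any backslash escapes.  Strings without a BOM are decoded as
Latin-1, which is only an approximation of PDFDocEncoding, and
`decode_fdf_string` is a name I made up for this example.

```python
import codecs


def decode_fdf_string(raw):
    """Decode the raw bytes of an FDF string value to text."""
    bom = codecs.BOM_UTF16_BE  # b'\xfe\xff'
    if raw.startswith(bom):
        # Drop the BOM and decode the rest as big-endian UTF-16.
        return raw[len(bom):].decode("utf-16-be")
    # Assumption: treat BOM-less strings as Latin-1, which only
    # approximates PDFDocEncoding.
    return raw.decode("latin-1")
```

Feeding it the bytes `fe ff 00 50 00 6f 00 73 00 74 00 65 00 72` from
the dump above gives `Poster`, and the empty `/V` value (`fe ff` on its
own) decodes to an empty string.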

You can also dump a more human-friendly version of the PDF's fields
(without any default data):

    $ pdftk input.pdf dump_data_fields_utf8 output data.txt
    $ cat data.txt
    ---
    FieldType: Text
    FieldName: Name
    FieldNameAlt: Name:
    FieldFlags: 0
    FieldJustification: Left
    ---
    FieldType: Text
    FieldName: Date
    FieldNameAlt: Date:
    FieldFlags: 0
    FieldJustification: Left
    ---
    FieldType: Text
    FieldName: Advisor
    FieldNameAlt: Advisor:
    FieldFlags: 0
    FieldJustification: Left
    ---
    …
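
That dump is regular enough to parse with a few lines of Python.
Here's a rough sketch that splits it into one dictionary per field.
`parse_data_fields` is a hypothetical helper, and it assumes each key
appears at most once per record (a repeated key would simply overwrite
the earlier value).

```python
def parse_data_fields(text):
    """Split pdftk's field dump into a list of {key: value} dicts."""
    fields = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            # A separator starts a new record.
            current = {}
            fields.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so values such as
            # "Name:" keep their own trailing colon.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    # Drop the empty record left behind by a trailing separator.
    return [field for field in fields if field]
```

Running it over the contents of `data.txt` and collecting each
record's `FieldName` gives you the list of names to use for `/T` in
your FDF file.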

If the fields are poorly named, you may have to fill the entire form
with unique values and then see which values appeared where in the
output PDF (for an example, see codehero's
[identify_pdf_fields.js][]).
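
The unique-values trick is easy to sketch in Python.  This is my own
illustration of the idea rather than a port of codehero's script, and
`unique_probe_values` and its `prefix` parameter are made-up names.  It
gives every field a short, distinct marker that you can look for in the
filled-in PDF.

```python
def unique_probe_values(field_names, prefix="F"):
    """Map each field name to a short, unique marker such as 'F001'."""
    return {
        name: "{}{:03d}".format(prefix, index)
        for index, name in enumerate(field_names, start=1)
    }
```

With the three fields from the dump above, you get `F001` for `Name`,
`F002` for `Date`, and `F003` for `Advisor`.  Short markers are more
likely to fit in small form fields than the field names themselves.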

Conclusions
===========

This would be so much easier if people just used [YAML][] or [JSON][]
instead of bothering with PDFs ;).

[JR]: http://www.myown1.com/linux/pdf_formfill.shtml
[FDF]: http://en.wikipedia.org/wiki/Forms_Data_Format#Forms_Data_Format_.28FDF.29
[XFDF]: http://en.wikipedia.org/wiki/Forms_Data_Format#XML_Forms_Data_Format_.28XFDF.29
[XFDF-specs]: http://partners.adobe.com/public/developer/en/xml/xfdf_2.0.pdf
[ISO32000]: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
[UTF-8]: http://en.wikipedia.org/wiki/UTF-8
[BOM]: http://en.wikipedia.org/wiki/Byte_order_mark
[utf-8-patch]: https://sourceforge.net/p/itext/patches/101/
[pdffonts]: http://poppler.freedesktop.org/
[TextFieldFonts.java]: http://itextpdf.com/examples/iia.php?id=158
[identify_pdf_fields.js]: https://github.com/codehero/OpenTaxFormFiller/blob/master/script/identify_pdf_fields.js
[YAML]: http://www.yaml.org/
[JSON]: http://www.json.org/

[[!tag tags/tools]]