posts:pdf_forms: add PDF forms post (FDF and pdftk).

author W. Trevor King <wking@tremily.us>

Thu, 20 Sep 2012 20:42:02 +0000 (16:42 -0400)

committer W. Trevor King <wking@tremily.us>

Thu, 20 Sep 2012 20:56:15 +0000 (16:56 -0400)
diff --git a/posts/Bugs.mdwn b/posts/Bugs.mdwn
index 3ea1260e863730208938a8eff1a4b8bea46780b1..1238789d72eefc0ce1414448656ae048d7ff4d80 100644
--- a/posts/Bugs.mdwn
+++ b/posts/Bugs.mdwn
@@ -193,6 +193,11 @@ GSL
  
  * [Cannot build without doc/version.texi](http://savannah.gnu.org/bugs/?31390).
  
+iText
+=====
+
+* [Add support for /Encoding/utf_8 to the FDF
+  reader](https://sourceforge.net/p/itext/patches/101/).
  
  libiphone
  =========
diff --git a/posts/PDF_forms.mdwn b/posts/PDF_forms.mdwn
new file mode 100644
index 0000000..5d748c6
--- /dev/null
+++ b/posts/PDF_forms.mdwn
@@ -0,0 +1,206 @@
+You can use [[pdftk]] to fill out [[PDF]] forms (thanks for the
+inspiration, [Joe Rothweiler][JR]).  The syntax is simple:
+
+    $ pdftk input.pdf fill_form data.fdf output output.pdf
+
+where `input.pdf` is the input PDF containing the form, `data.fdf` is
+an [FDF][] or [XFDF][] file containing your data, and `output.pdf` is
+the name of the PDF you're creating.  The tricky part is figuring out
+what to put in `data.fdf`.  There's a useful comparison of the Forms
+Data Format (FDF) and its XML version (XFDF) in the [XFDF
+specification][XFDF-spec].  XFDF only covers a subset of FDF, so I
+won't worry about it here.  FDF is defined in section 12.7.7 of [ISO
+32000-1:2008][ISO32000], the PDF 1.7 specification, and it has been in
+PDF specifications since version 1.2.
+
+Forms Data Format (FDF)
+=======================
+
+FDF files are basically stripped down PDFs (§12.7.7.1).  A simple FDF
+file will look something like:
+
+    %FDF-1.2
+    1 0 obj<</FDF<</Fields[
+    <</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
+    <</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
+    …
+    ] >> >>
+    endobj
+    trailer
+    <</Root 1 0 R>>
+    %%EOF
+
+Broken down into the lingo of ISO 32000, we have a header
+(§12.7.7.2.2):
+
+    %FDF-1.2
+
+followed by a body with a single object (§12.7.7.2.3):
+
+    1 0 obj<</FDF<</Fields[
+    <</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
+    <</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
+    …
+    ] >> >>
+    endobj
+
+followed by a trailer (§12.7.7.2.4):
+
+    trailer
+    <</Root 1 0 R>>
+    %%EOF
+
+Despite the claims in §12.7.7.2.1 that the trailer is optional, pdftk
+choked on files without it:
+
+    $ cat no-trailer.fdf
+    %FDF-1.2
+    1 0 obj<</FDF<</Fields[
+    <</T(Name)/V(Trevor)>>
+    <</T(Date)/V(2012-09-20)>>
+    ] >> >>
+    endobj
+    $ pdftk input.pdf fill_form no-trailer.fdf output output.pdf
+    Error: Failed to open form data file:
+       no-trailer.fdf
+       No output created.
+
+Trailers are easy to add, since all they require is a reference to the
+root of the FDF catalog dictionary.  If you only have one dictionary,
+you can always use the simple trailer I gave above.
+
+FDF Catalog
+-----------
+
+The meat of the FDF file is the catalog (§12.7.7.3).  Let's take a
+closer look at the catalog structure:
+
+    1 0 obj<</FDF<</Fields[
+    …
+    ] >> >>
+
+This defines a new object (the FDF catalog) which contains one key
+(the `/FDF` dictionary).  The FDF dictionary contains one key
+(`/Fields`) and its associated array of fields.  Then we close the
+`/Fields` array (`]`), close the FDF dictionary (`>>`) and close the
+FDF catalog (`>>`).
+
+There are a number of interesting entries that you can add to the FDF
+dictionary (§12.7.7.3.1, table 243), some of which require a more
+advanced FDF version.  You can use the `/Version` key in the FDF
+catalog (§12.7.7.3.1, table 242) to specify the version of the data in
+the dictionary:
+
+    1 0 obj<</Version/1.3/FDF<</Fields[…
+
+Now you can extend the dictionary using table 243.  Let's set things up
+to use [UTF-8][] for the field values (`/V`) or options (`/Opt`):
+
+    1 0 obj<</Version/1.3/FDF<</Encoding/utf_8/Fields[
+    <</T(FIELD1_NAME)/V(FIELD1_VALUE)>>
+    <</T(FIELD2_NAME)/V(FIELD2_VALUE)>>
+    …
+    ] >> >>
+    endobj
+
+pdftk understands raw text in the specified encoding (`(…)`), raw
+UTF-16 strings starting with a [BOM][] (`(\xFE\xFF…)`), or UTF-16BE
+strings encoded as ASCII hex (`<FEFF…>`).  You can use
+[[pdf-merge.py|PDF_bookmarks_with_Ghostscript/pdf-merge.py]] and its
+`--unicode` option to find the latter.  Support for the `/utf_8`
+encoding in pdftk is new.  I mailed a
+[[patch|0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch]]
+to pdftk's Sid Steward and posted a [patch request][utf-8-patch] to
+the underlying iText library.  Until those get accepted, you're stuck
+with the less convenient encodings.
+
+Fonts
+-----
+
+Say you fill in some Unicode values, but your PDF reader is having
+trouble rendering some funky glyphs.  Maybe it doesn't have access to
+the right font?  You can see which fonts are embedded in a given PDF
+using [pdffonts][].
+
+    $ pdffonts input.pdf
+    name                                 type              emb sub uni object ID
+    ------------------------------------ ----------------- --- --- --- ---------
+    MMXQDQ+UniversalStd-NewswithCommPi   CID Type 0C       yes yes yes   1738  0
+    MMXQDQ+ZapfDingbatsStd               CID Type 0C       yes yes yes   1749  0
+    MMXQDQ+HelveticaNeueLTStd-Roman      Type 1C           yes yes no    1737  0
+    CPZITK+HelveticaNeueLTStd-BlkCn      Type 1C           yes yes no    1739  0
+    …
+
+If you don't have the right font for your new data, you should
+complain to whoever generated the PDF that you're trying to fill out,
+because I can't figure out how to attach a new font to an
+already-generated PDF for use with your new data.
+
+FDF templates and field names
+-----------------------------
+
+You can use pdftk itself to create an FDF template, which it does with
+embedded UTF-16BE (you can see the FE FF BOMs at the start of each
+string value).
+
+    $ pdftk input.pdf generate_fdf output template.fdf
+    $ hexdump -C template.fdf  | head
+    00000000  25 46 44 46 2d 31 2e 32  0a 25 e2 e3 cf d3 0a 31  |%FDF-1.2.%.....1|
+    00000010  20 30 20 6f 62 6a 20 0a  3c 3c 0a 2f 46 44 46 20  | 0 obj .<<./FDF |
+    00000020  0a 3c 3c 0a 2f 46 69 65  6c 64 73 20 5b 0a 3c 3c  |.<<./Fields [.<<|
+    00000030  0a 2f 56 20 28 fe ff 29  0a 2f 54 20 28 fe ff 00  |./V (..)./T (...|
+    00000040  50 00 6f 00 73 00 74 00  65 00 72 00 4f 00 72 00  |P.o.s.t.e.r.O.r.|
+    …
+
+You can also dump a more human-friendly version of the PDF's fields
+(without any default data):
+
+    $ pdftk input.pdf dump_data_fields_utf8 output data.txt
+    $ cat data.txt
+    ---
+    FieldType: Text
+    FieldName: Name
+    FieldNameAlt: Name:
+    FieldFlags: 0
+    FieldJustification: Left
+    ---
+    FieldType: Text
+    FieldName: Date
+    FieldNameAlt: Date:
+    FieldFlags: 0
+    FieldJustification: Left
+    ---
+    FieldType: Text
+    FieldName: Advisor
+    FieldNameAlt: Advisor:
+    FieldFlags: 0
+    FieldJustification: Left
+    ---
+    …
+
+If the fields are poorly named, you may have to fill the entire form
+with unique values and then see which values appeared where in the
+output PDF (for an example, see codehero's
+[identify_pdf_fields.js][]).
+
+Conclusions
+===========
+
+This would be so much easier if people just used [YAML][] or [JSON][]
+instead of bothering with PDFs ;).
+
+
+[JR]: http://www.myown1.com/linux/pdf_formfill.shtml
+[FDF]: http://en.wikipedia.org/wiki/Forms_Data_Format#Forms_Data_Format_.28FDF.29
+[XFDF]: http://en.wikipedia.org/wiki/Forms_Data_Format#XML_Forms_Data_Format_.28XFDF.29
+[XFDF-spec]: http://partners.adobe.com/public/developer/en/xml/xfdf_2.0.pdf
+[ISO32000]: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
+[UTF-8]: http://en.wikipedia.org/wiki/UTF-8
+[BOM]: http://en.wikipedia.org/wiki/Byte_order_mark
+[utf-8-patch]: https://sourceforge.net/p/itext/patches/101/
+[pdffonts]: http://poppler.freedesktop.org/
+[identify_pdf_fields.js]: https://github.com/codehero/OpenTaxFormFiller/blob/master/script/identify_pdf_fields.js
+[YAML]: http://www.yaml.org/
+[JSON]: http://www.json.org/
+
+[[!tag tags/tools]]
diff --git a/posts/PDF_forms/0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch b/posts/PDF_forms/0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch
new file mode 100644
index 0000000..09770f2
--- /dev/null
+++ b/posts/PDF_forms/0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch
@@ -0,0 +1,56 @@
+From fe83cb12c2c275ccd922adc63d54b5b6c0604a2d Mon Sep 17 00:00:00 2001
+Message-Id: <fe83cb12c2c275ccd922adc63d54b5b6c0604a2d.1348172464.git.wking@tremily.us>
+From: "W. Trevor King" <wking@tremily.us>
+Date: Thu, 20 Sep 2012 16:01:19 -0400
+Subject: [PATCH] Add support for /Encoding/utf_8 to the FDF reader.
+
+From PDF 32000-1:2008, section 12.7.7.3.1, table 243 (Entries in the
+FDF dictionary), on page 459:
+
+  Key: Encoding
+  Type: name
+  Value:
+
+  (Optional; PDF 1.3) The encoding that shall be used for any FDF
+  field value or option (V or Opt in the field dictionary; see Table
+  246) or field name that is a string and does not begin with the
+  Unicode prefix U+FEFF.
+
+  Default value: PDFDocEncoding.
+
+  Other allowed values include Shift_JIS, BigFive, GBK, UHC, utf_8,
+  utf_16
+---
+ java/com/lowagie/text/pdf/FdfReader.java | 2 ++
+ java/com/lowagie/text/pdf/PdfName.java   | 2 ++
+ 2 files changed, 4 insertions(+)
+
+diff --git a/java/com/lowagie/text/pdf/FdfReader.java b/java/com/lowagie/text/pdf/FdfReader.java
+index f8776ab..94b432e 100644
+--- a/java/com/lowagie/text/pdf/FdfReader.java
++++ b/java/com/lowagie/text/pdf/FdfReader.java
+@@ -188,6 +188,8 @@ public class FdfReader extends PdfReader {
+                     return new String(b, "GBK");
+                 else if (encoding.equals(PdfName.BIGFIVE))
+                     return new String(b, "Big5");
++                else if (encoding.equals(PdfName.UTF_8))
++                    return new String(b, "UTF8");
+             }
+             catch (Exception e) {
+             }
+diff --git a/java/com/lowagie/text/pdf/PdfName.java b/java/com/lowagie/text/pdf/PdfName.java
+index bd4aaeb..50d3704 100644
+--- a/java/com/lowagie/text/pdf/PdfName.java
++++ b/java/com/lowagie/text/pdf/PdfName.java
+@@ -903,6 +903,8 @@ public class PdfName extends PdfObject implements Comparable{
+     /** A name */
+     public static final PdfName USETHUMBS = new PdfName("UseThumbs");
+     /** A name */
++    public static final PdfName UTF_8 = new PdfName("utf_8");
++    /** A name */
+     public static final PdfName V = new PdfName("V");
+     /** A name */
+     public static final PdfName VERISIGN_PPKVS = new PdfName("VeriSign.PPKVS");
+-- 
+1.7.12.176.g3fc0e4c.dirty
+
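
For readers who would rather generate `data.fdf` than write it by hand, here is a minimal Python sketch of the FDF layout described in `posts/PDF_forms.mdwn` above. It is not part of this commit. The function names `pdf_hex_string` and `build_fdf` are hypothetical, and the sketch assumes that field names (`/T`) can be written as BOM-prefixed UTF-16BE hex strings in the same way the post describes for values (`/V`).

```python
def pdf_hex_string(text):
    """Return text as a PDF hex string: UTF-16BE with a leading FEFF BOM."""
    return "<FEFF" + text.encode("utf-16-be").hex().upper() + ">"


def build_fdf(fields):
    """Return the bytes of a minimal FDF file for a {name: value} mapping.

    The layout follows the post: header, one catalog object holding the
    /Fields array, and the simple trailer pointing at object 1.
    """
    lines = ["%FDF-1.2", "1 0 obj<</FDF<</Fields["]
    for name, value in fields.items():
        # Assumption: /T accepts the same hex-string form as /V.
        lines.append(
            "<</T{}/V{}>>".format(pdf_hex_string(name), pdf_hex_string(value))
        )
    lines += ["] >> >>", "endobj", "trailer", "<</Root 1 0 R>>", "%%EOF", ""]
    # Everything is ASCII because all text went through the hex encoding.
    return "\n".join(lines).encode("ascii")
```

Writing the result to `data.fdf` (for example with `open("data.fdf", "wb").write(build_fdf({"Name": "Trevor", "Date": "2012-09-20"}))`) should give a file suitable for the `pdftk input.pdf fill_form data.fdf output output.pdf` command from the post. I have not checked this against every pdftk version.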