From ee36367a62a3d0635c266e167d3cf5e8ffe59ebf Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 20 Sep 2012 16:42:02 -0400
Subject: [PATCH] posts:pdf_forms: add PDF forms post (FDF and pdftk).

---
 posts/Bugs.mdwn                               |   5 +
 posts/PDF_forms.mdwn                          | 206 ++++++++++++++++++
 ...for-Encoding-utf_8-to-the-FDF-reader.patch |  56 +++++
 3 files changed, 267 insertions(+)
 create mode 100644 posts/PDF_forms.mdwn
 create mode 100644 posts/PDF_forms/0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch

diff --git a/posts/Bugs.mdwn b/posts/Bugs.mdwn
index 3ea1260..1238789 100644
--- a/posts/Bugs.mdwn
+++ b/posts/Bugs.mdwn
@@ -193,6 +193,11 @@ GSL
 * [Cannot build without
   doc/version.texi](http://savannah.gnu.org/bugs/?31390).
 
+iText
+=====
+
+* [Add support for /Encoding/utf_8 to the FDF
+  reader](https://sourceforge.net/p/itext/patches/101/).
 
 libiphone
 =========
diff --git a/posts/PDF_forms.mdwn b/posts/PDF_forms.mdwn
new file mode 100644
index 0000000..5d748c6
--- /dev/null
+++ b/posts/PDF_forms.mdwn
@@ -0,0 +1,206 @@
+You can use [[pdftk]] to fill out [[PDF]] forms (thanks for the
+inspiration, [Joe Rothweiler][JR]). The syntax is simple:
+
+    $ pdftk input.pdf fill_form data.fdf output output.pdf
+
+where `input.pdf` is the input PDF containing the form, `data.fdf` is
+an [FDF][] or [XFDF][] file containing your data, and `output.pdf` is
+the name of the PDF you're creating. The tricky part is figuring out
+what to put in `data.fdf`. There's a useful comparison of the Forms
+Data Format (FDF) and its XML version (XFDF) in the [XFDF
+specification][XFDF-spec]. XFDF only covers a subset of FDF, so I
+won't worry about it here. FDF is defined in section 12.7.7 of [ISO
+32000-1:2008][ISO32000], the PDF 1.7 specification, and it has been in
+PDF specifications since version 1.2.
+
+Forms Data Format (FDF)
+=======================
+
+FDF files are basically stripped-down PDFs (§12.7.7.1).
+A simple FDF file will look something like:
+
+    %FDF-1.2
+    1 0 obj<</FDF<</Fields[
+    <</T(…)/V(…)>>
+    …
+    ] >> >>
+    endobj
+    trailer
+    <</Root 1 0 R>>
+    %%EOF
+
+Broken down into the lingo of ISO 32000, we have a header
+(§12.7.7.2.2):
+
+    %FDF-1.2
+
+followed by a body with a single object (§12.7.7.2.3):
+
+    1 0 obj<</FDF<</Fields[
+    <</T(…)/V(…)>>
+    …
+    ] >> >>
+    endobj
+
+followed by a trailer (§12.7.7.2.4):
+
+    trailer
+    <</Root 1 0 R>>
+    %%EOF
+
+Despite the claims in §12.7.7.2.1 that the trailer is optional, pdftk
+choked on files without it:
+
+    $ cat no-trailer.fdf
+    %FDF-1.2
+    1 0 obj<</FDF<</Fields[
+    <</T(…)/V(…)>>
+    ] >> >>
+    endobj
+    $ pdftk input.pdf fill_form no-trailer.fdf output output.pdf
+    Error: Failed to open form data file:
+       data.fdf
+    No output created.
+
+Trailers are easy to add, since all they require is a reference to the
+root of the FDF catalog dictionary. If you only have one dictionary,
+you can always use the simple trailer I gave above.
+
+FDF Catalog
+-----------
+
+The meat of the FDF file is the catalog (§12.7.7.3). Let's take a
+closer look at the catalog structure:
+
+    1 0 obj<</FDF<</Fields[…] >> >>
+
+This defines a new object (the FDF catalog) which contains one key
+(the `/FDF` dictionary). The FDF dictionary contains one key
+(`/Fields`) and its associated array of fields. Then we close the
+`/Fields` array (`]`), close the FDF dictionary (`>>`) and close the
+FDF catalog (`>>`).
+
+There are a number of interesting entries that you can add to the FDF
+dictionary (§12.7.7.3.1, table 243), some of which require a more
+advanced FDF version. You can use the `/Version` key in the FDF
+catalog (§12.7.7.3.1, table 242) to specify the FDF version, and the
+`/Encoding` key in the FDF dictionary to specify the encoding of data
+in the dictionary:
+
+    1 0 obj<</Version/1.3/FDF<</Encoding/utf_8/Fields[
+    <</T(…)/V(…)>>
+    …
+    ] >> >>
+    endobj
+
+pdftk understands raw text in the specified encoding (`(…)`), raw
+UTF-16 strings starting with a [BOM][] (`(\xFE\xFF…)`), or UTF-16BE
+strings encoded as ASCII hex (`<FEFF…>`). You can use
+[[pdf-merge.py|PDF_bookmarks_with_Ghostscript/pdf-merge.py]] and its
+`--unicode` option to find the latter. Support for the `/utf_8`
+encoding in pdftk is new.
+I mailed a [[patch|0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch]]
+to pdftk's Sid Steward and posted a [patch request][utf-8-patch] to
+the underlying iText library. Until those get accepted, you're stuck
+with the less convenient encodings.
+
+Fonts
+-----
+
+Say you fill in some Unicode values, but your PDF reader is having
+trouble rendering some funky glyphs. Maybe it doesn't have access to
+the right font? You can see which fonts are embedded in a given PDF
+using [pdffonts][].
+
+    $ pdffonts input.pdf
+    name                                 type              emb sub uni object ID
+    ------------------------------------ ----------------- --- --- --- ---------
+    MMXQDQ+UniversalStd-NewswithCommPi   CID Type 0C       yes yes yes   1738  0
+    MMXQDQ+ZapfDingbatsStd               CID Type 0C       yes yes yes   1749  0
+    MMXQDQ+HelveticaNeueLTStd-Roman      Type 1C           yes yes no    1737  0
+    CPZITK+HelveticaNeueLTStd-BlkCn      Type 1C           yes yes no    1739  0
+    …
+
+If you don't have the right font for your new data, you should
+complain to whoever generated the PDF that you're trying to fill out,
+because I can't figure out how to attach a new font to an
+already-generated PDF for use with your new data.
+
+FDF templates and field names
+-----------------------------
+
+You can use pdftk itself to create an FDF template, which it does with
+embedded UTF-16BE (you can see the FE FF BOMs at the start of each
+string value).
+
+    $ pdftk input.pdf generate_fdf output template.fdf
+    $ hexdump -C template.fdf | head
+    00000000  25 46 44 46 2d 31 2e 32  0a 25 e2 e3 cf d3 0a 31  |%FDF-1.2.%.....1|
+    00000010  20 30 20 6f 62 6a 20 0a  3c 3c 0a 2f 46 44 46 20  | 0 obj .<<./FDF |
+    00000020  0a 3c 3c 0a 2f 46 69 65  6c 64 73 20 5b 0a 3c 3c  |.<<./Fields [.<<|
+    00000030  0a 2f 56 20 28 fe ff 29  0a 2f 54 20 28 fe ff 00  |./V (..)./T (...|
+    00000040  50 00 6f 00 73 00 74 00  65 00 72 00 4f 00 72 00  |P.o.s.t.e.r.O.r.|
+    …
+
+You can also dump a more human-friendly version of the PDF's fields
+(without any default data):
+
+    $ pdftk input.pdf dump_data_fields_utf8 output data.txt
+    $ cat data.txt
+    ---
+    FieldType: Text
+    FieldName: Name
+    FieldNameAlt: Name:
+    FieldFlags: 0
+    FieldJustification: Left
+    ---
+    FieldType: Text
+    FieldName: Date
+    FieldNameAlt: Date:
+    FieldFlags: 0
+    FieldJustification: Left
+    ---
+    FieldType: Text
+    FieldName: Advisor
+    FieldNameAlt: Advisor:
+    FieldFlags: 0
+    FieldJustification: Left
+    ---
+    …
+
+If the fields are poorly named, you may have to fill the entire form
+with unique values and then see which values appeared where in the
+output PDF (for an example, see codehero's
+[identify_pdf_fields.js][]).
+
+Conclusions
+===========
+
+This would be so much easier if people just used [YAML][] or [JSON][]
+instead of bothering with PDFs ;).
+
+
+[JR]: http://www.myown1.com/linux/pdf_formfill.shtml
+[FDF]: http://en.wikipedia.org/wiki/Forms_Data_Format#Forms_Data_Format_.28FDF.29
+[XFDF]: http://en.wikipedia.org/wiki/Forms_Data_Format#XML_Forms_Data_Format_.28XFDF.29
+[XFDF-spec]: http://partners.adobe.com/public/developer/en/xml/xfdf_2.0.pdf
+[ISO32000]: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
+[UTF-8]: http://en.wikipedia.org/wiki/UTF-8
+[BOM]: http://en.wikipedia.org/wiki/Byte_order_mark
+[utf-8-patch]: https://sourceforge.net/p/itext/patches/101/
+[pdffonts]: http://poppler.freedesktop.org/
+[identify_pdf_fields.js]: https://github.com/codehero/OpenTaxFormFiller/blob/master/script/identify_pdf_fields.js
+[YAML]: http://www.yaml.org/
+[JSON]: http://www.json.org/
+
+[[!tag tags/tools]]
diff --git a/posts/PDF_forms/0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch b/posts/PDF_forms/0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch
new file mode 100644
index 0000000..09770f2
--- /dev/null
+++ b/posts/PDF_forms/0001-Add-support-for-Encoding-utf_8-to-the-FDF-reader.patch
@@ -0,0 +1,56 @@
+From fe83cb12c2c275ccd922adc63d54b5b6c0604a2d Mon Sep 17 00:00:00 2001
+Message-Id:
+From: "W. Trevor King"
+Date: Thu, 20 Sep 2012 16:01:19 -0400
+Subject: [PATCH] Add support for /Encoding/utf_8 to the FDF reader.
+
+From PDF 32000-1:2008, section 12.7.7.3.1, table 243 (Entries in the
+FDF dictionary), on page 459:
+
+  Key: Encoding
+  Type: name
+  Value:
+
+    (Optional; PDF 1.3) The encoding that shall be used for any FDF
+    field value or option (V or Opt in the field dictionary; see Table
+    246) or field name that is a string and does not begin with the
+    Unicode prefix U+FEFF.
+
+    Default value: PDFDocEncoding.
+
+    Other allowed values include Shift_JIS, BigFive, GBK, UHC, utf_8,
+    utf_16
+---
+ java/com/lowagie/text/pdf/FdfReader.java | 2 ++
+ java/com/lowagie/text/pdf/PdfName.java   | 2 ++
+ 2 files changed, 4 insertions(+)
+
+diff --git a/java/com/lowagie/text/pdf/FdfReader.java b/java/com/lowagie/text/pdf/FdfReader.java
+index f8776ab..94b432e 100644
+--- a/java/com/lowagie/text/pdf/FdfReader.java
++++ b/java/com/lowagie/text/pdf/FdfReader.java
+@@ -188,6 +188,8 @@ public class FdfReader extends PdfReader {
+                     return new String(b, "GBK");
+                 else if (encoding.equals(PdfName.BIGFIVE))
+                     return new String(b, "Big5");
++                else if (encoding.equals(PdfName.UTF_8))
++                    return new String(b, "UTF8");
+             }
+             catch (Exception e) {
+             }
+diff --git a/java/com/lowagie/text/pdf/PdfName.java b/java/com/lowagie/text/pdf/PdfName.java
+index bd4aaeb..50d3704 100644
+--- a/java/com/lowagie/text/pdf/PdfName.java
++++ b/java/com/lowagie/text/pdf/PdfName.java
+@@ -903,6 +903,8 @@ public class PdfName extends PdfObject implements Comparable{
+     /** A name */
+     public static final PdfName USETHUMBS = new PdfName("UseThumbs");
+     /** A name */
++    public static final PdfName UTF_8 = new PdfName("utf_8");
++    /** A name */
+     public static final PdfName V = new PdfName("V");
+     /** A name */
+     public static final PdfName VERISIGN_PPKVS = new PdfName("VeriSign.PPKVS");
+-- 
+1.7.12.176.g3fc0e4c.dirty
+
-- 
2.26.2