git.tremily.us Git - mw2txt.git/commit

author	W. Trevor King <wking@drexel.edu>
	Wed, 8 Feb 2012 01:51:09 +0000 (20:51 -0500)
committer	W. Trevor King <wking@drexel.edu>
	Wed, 8 Feb 2012 14:55:13 +0000 (09:55 -0500)
commit	8ff215a148d6a3b327a8fc821d8a7236f6855d80
tree	f0020d60c9741a363802589d4fb7438ec097b140
parent	deddd0095e51511a2e4d417511d8862ed38b9f7c

Handle Unicode strings in pdf-merge.py.

For information on Unicode strings in PDFs, see `§7.3.4 String
Objects` and `§7.9.2.2 Text String Type` in the PDF reference [1] and
`Table 2.3 (p21)`, `Table 2.5 (p25)`, etc. in the pdfmark reference
[2].
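
As a rough illustration of the text string rules in those sections, here is a
minimal sketch in Python 3. It is not the code from `pdf-merge.py`; the
function name `pdf_text_string` is hypothetical. It assumes that printable
ASCII can be written as a literal string, and that anything else is written
as a hexadecimal string holding UTF-16BE bytes with a leading byte order mark:

```python
import codecs


def pdf_text_string(text):
    """Return a PDF string object (as source text) for `text`.

    Hypothetical helper, for illustration only.
    """
    if all(0x20 <= ord(char) <= 0x7E for char in text):
        # Printable ASCII: emit a literal string, escaping the
        # backslash and parentheses, which are special inside (...).
        escaped = (text.replace('\\', '\\\\')
                       .replace('(', '\\(')
                       .replace(')', '\\)'))
        return '(' + escaped + ')'
    # Anything else: UTF-16BE with a leading byte order mark (FE FF),
    # written as a hexadecimal string so the output stays pure ASCII.
    raw = codecs.BOM_UTF16_BE + text.encode('utf-16-be')
    return '<' + raw.hex().upper() + '>'
```

With this sketch, `pdf_text_string('Hello')` gives `(Hello)`, while a title
containing `é` is emitted as a hex string that starts with `FEFF`.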

Note that there are Ghostscript bugs [3] that can lead to errors like:

  Entity: line 5: parser error : xmlParseCharRef: invalid xmlChar value 1
  <rdf:Description rdf:about='e7674657-8a09-11ec-0000-cfd67fe5d10' xmlns:pdf='

and:

  Entity: line 9: parser error : Input is not proper UTF-8, indicate encoding !
  Bytes: 0xAC 0x26 0x23 0x32
  dobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:3792d85d-8a0b-11ec-0000-cfd67fe5d10

when you open your generated PDF in `evince`.  These bugs were fixed
in Ghostscript 9.06.

The encoding of command-line arguments is not well standardized [4],
so I supply an `--argv-encoding` option to override the locale if
necessary.
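
One way such an override could work is sketched below in Python 3. This is an
assumption-laden illustration, not the implementation in `pdf-merge.py`: only
the `--argv-encoding` option name comes from this commit, while `decode_argv`,
`parse_args`, and the `--title` option are hypothetical. It recovers the raw
argument bytes with `os.fsencode` and decodes them with the requested
encoding, falling back to the locale's preferred encoding:

```python
import argparse
import locale
import os


def decode_argv(argv, encoding=None):
    """Re-decode command-line arguments using `encoding`.

    Python 3 has already decoded the arguments to `str`;
    `os.fsencode` recovers the original bytes so they can be decoded
    again.  Falls back to the locale's preferred encoding.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return [os.fsencode(arg).decode(encoding) for arg in argv]


def parse_args(argv):
    """Parse `argv`, honoring an optional --argv-encoding override."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--argv-encoding', metavar='ENCODING', default=None,
        help='encoding of the command-line arguments '
             '(default: taken from the locale)')
    # Hypothetical option that may carry non-ASCII text.
    parser.add_argument('--title', default='')
    args = parser.parse_args(argv)
    args.title = decode_argv([args.title], args.argv_encoding)[0]
    return args
```

A caller would pass `sys.argv[1:]` to `parse_args`. Decoding after parsing
keeps the option names themselves ASCII-only, so only the free-text values
depend on the chosen encoding.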

[1] Document management — Portable document format — Part 1: PDF 1.7
  (PDF 32000-1:2008, July 2008
   http://www.adobe.com/devnet/pdf/pdf_reference.html)
[2] pdfmark Reference
  (Edition 1.0, November 2006
   http://www.adobe.com/devnet/acrobat/pdfs/pdfmark_reference.pdf)
[3] http://bugs.ghostscript.com/show_bug.cgi?id=692422
[4] http://stackoverflow.com/questions/5408730/what-is-the-encoding-of-argv