7c/77303c51600f536be5b341809b4adf06b0e1cb

   1 Return-Path: <rlb@defaultvalue.org>\r
   2 X-Original-To: notmuch@notmuchmail.org\r
   3 Delivered-To: notmuch@notmuchmail.org\r
   4 Received: from localhost (localhost [127.0.0.1])\r
   5  by arlo.cworth.org (Postfix) with ESMTP id 66BC46DE1512\r
   6  for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:31 -0700 (PDT)\r
   7 X-Virus-Scanned: Debian amavisd-new at cworth.org\r
   8 X-Spam-Flag: NO\r
   9 X-Spam-Score: 0.134\r
  10 X-Spam-Level: \r
  11 X-Spam-Status: No, score=0.134 tagged_above=-999 required=5 tests=[AWL=0.684, \r
  12  RP_MATCHES_RCVD=-0.55] autolearn=disabled\r
  13 Received: from arlo.cworth.org ([127.0.0.1])\r
  14  by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)\r
  15  with ESMTP id fmevZxWHrujW for <notmuch@notmuchmail.org>;\r
  16  Sun, 30 Aug 2015 09:26:28 -0700 (PDT)\r
  17 X-Greylist: delayed 309 seconds by postgrey-1.35 at arlo;\r
  18  Sun, 30 Aug 2015 09:26:28 PDT\r
  19 Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])\r
  20  by arlo.cworth.org (Postfix) with ESMTP id 0C8116DE14FD\r
  21  for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:28 -0700 (PDT)\r
  22 Received: from trouble.defaultvalue.org (localhost [127.0.0.1])\r
  23  (Authenticated sender: rlb@defaultvalue.org)\r
  24  by defaultvalue.org (Postfix) with ESMTPSA id 93EC820235\r
  25  for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 11:21:17 -0500 (CDT)\r
  26 Received: by trouble.defaultvalue.org (Postfix, from userid 1000)\r
  27  id 1B1DE14E0F9; Sun, 30 Aug 2015 11:21:16 -0500 (CDT)\r
  28 From: Rob Browning <rlb@defaultvalue.org>\r
  29 To: notmuch@notmuchmail.org\r
  30 Subject: [PATCH 1/1] Store and search for canonical Unicode text [WIP]\r
  31 Date: Sun, 30 Aug 2015 11:21:16 -0500\r
  32 Message-Id: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>\r
  33 X-Mailer: git-send-email 2.5.0\r
  34 MIME-Version: 1.0\r
  35 Content-Type: text/plain; charset=UTF-8\r
  36 Content-Transfer-Encoding: 8bit\r
  37 X-BeenThere: notmuch@notmuchmail.org\r
  38 X-Mailman-Version: 2.1.18\r
  39 Precedence: list\r
  40 List-Id: "Use and development of the notmuch mail system."\r
  41  <notmuch.notmuchmail.org>\r
  42 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
  43  <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
  44 List-Archive: <http://notmuchmail.org/pipermail/notmuch/>\r
  45 List-Post: <mailto:notmuch@notmuchmail.org>\r
  46 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
  47 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
  48  <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
  49 X-List-Received-Date: Sun, 30 Aug 2015 16:26:31 -0000\r
  50 \r
  51 WARNING: this version is very preliminary, and might eat your data.\r
  52 \r
  53 Unicode has multiple sequences representing what should normally be\r
  54 considered the same text.  For example here's a combining Á and a\r
  55 noncombining Á.\r
  56 \r
  57 Depending on the way you view this, you may or may not see a\r
  58 difference, but the former is the canonical form, and is represented\r
  59 by two Unicode code points: a capital A (U+0041) followed by a\r
  60 "combining acute accent" (U+0301); the latter is the single code\r
  61 point (U+00C1), which is probably what most people would type.\r
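
To make the difference concrete, here is a small C sketch (illustration only, not part of the patch) that encodes both spellings to UTF-8 so the bytes can be compared.  The names encode_utf8 and encode_sequence are hypothetical helpers made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): encode one code point as
 * UTF-8.  out must have room for 4 bytes.  Returns the number of bytes
 * written.  Surrogates are not rejected, to keep the sketch short. */
size_t
encode_utf8 (uint32_t cp, unsigned char *out)
{
    assert (cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
}

/* Encode n code points back to back; returns the total byte count. */
size_t
encode_sequence (const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++)
        len += encode_utf8 (cps[i], out + len);
    return len;
}
```

Encoding { 0x0041, 0x0301 } gives the three bytes 41 cc 81, while { 0x00C1 } gives the two bytes c3 81, so a byte-wise comparison of the two spellings fails even though they render the same.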
  62 \r
  63 Before this change, notmuch would index two strings that differ only\r
  64 with respect to canonicalization, like tóken and tóken, as separate\r
  65 terms, even though they may be visually indistinguishable, and do (for\r
  66 most purposes) represent the same text.  After indexing, searching for\r
  67 one would not find the other, and which one you present to notmuch\r
  68 when you search depends on your tools.  See test/T570-normalization.sh\r
  69 for a working example.\r
  70 \r
  71 Since we're talking about differing representations that one wouldn't\r
  72 normally want to distinguish, this patch unifies the various\r
  73 representations by converting all incoming text to its canonical form\r
  74 before indexing, and canonicalizing all query strings.\r
  75 \r
  76 Up to now, notmuch has let Xapian handle converting the incoming bytes\r
  77 to UTF-8.  Xapian treats any byte sequence as UTF-8, and interprets\r
  78 any invalid UTF-8 bytes as Latin-1.  This patch maintains the existing\r
  79 behavior (excepting the new canonicalization) by using Xapian's\r
  80 Utf8Iterator to handle the initial Unicode character parsing.\r
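
As a rough picture of the fallback described above, the following C sketch decodes one code point at a time and passes any byte that does not begin a well-formed sequence through as Latin-1.  This is an assumption-laden simplification written for illustration (decode_lenient is a made-up name), not Xapian's code; in particular Xapian's exact validity checks may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified sketch, not Xapian's implementation: decode one code point
 * from p[0..left).  A structurally well-formed multi-byte UTF-8 sequence
 * is decoded normally; any other byte is returned unchanged, i.e. it is
 * interpreted as Latin-1.  The number of bytes consumed is stored in
 * *used.  Overlong 3/4-byte forms and surrogates are not rejected. */
uint32_t
decode_lenient (const unsigned char *p, size_t left, size_t *used)
{
    assert (left > 0);
    unsigned char b = p[0];
    size_t need = 0;            /* continuation bytes expected */
    uint32_t cp = 0;

    if (b >= 0xC2 && b <= 0xDF) {
        need = 1;
        cp = b & 0x1F;
    } else if (b >= 0xE0 && b <= 0xEF) {
        need = 2;
        cp = b & 0x0F;
    } else if (b >= 0xF0 && b <= 0xF4) {
        need = 3;
        cp = b & 0x07;
    }

    if (need > 0 && need < left) {
        size_t i;
        for (i = 1; i <= need; i++) {
            if ((p[i] & 0xC0) != 0x80)
                break;          /* not a continuation byte */
            cp = (cp << 6) | (p[i] & 0x3F);
        }
        if (i > need) {
            *used = need + 1;
            return cp;
        }
    }

    /* ASCII, or not valid UTF-8: pass the byte through as Latin-1. */
    *used = 1;
    return b;
}
```

With this sketch the UTF-8 bytes c3 81 and the lone Latin-1 byte c1 both decode to U+00C1, which is why the two input encodings end up indexed as the same character.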
  81 \r
  82 Note that the parsing approach in this patch is not particularly\r
  83 efficient.  First, it traverses the incoming bytes three times:\r
  84 \r
  85    - once to determine how long the input is (currently the iterator\r
  86      can't directly handle null-terminated char*'s),\r
  87 \r
  88    - once to determine how long the final UTF-8 allocation needs to\r
  89      be,\r
  90 \r
  91    - and once for the conversion.\r
  92 \r
  93 Second, when the input is already UTF-8, it just blindly converts\r
  94 from UTF-8 to Unicode code points, and then back to UTF-8 (after\r
  95 canonicalization), during each pass.  There are certainly\r
  96 opportunities to optimize, though it may be worth discussing the\r
  97 detection of data encodings more broadly first.\r
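
The measure-then-fill structure behind those traversals can be sketched in plain C as follows.  This is an illustration only: toy_decompose is a two-entry stand-in for a real decomposition routine (the patch itself uses g_unichar_fully_decompose), put_utf8 and decompose_to_utf8 are hypothetical names, and malloc stands in for talloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a real decomposition routine: it only knows two
 * characters.  Writes 1 or 2 code points to out, returns the count. */
size_t
toy_decompose (uint32_t cp, uint32_t out[2])
{
    switch (cp) {
    case 0x00C1: out[0] = 0x0041; out[1] = 0x0301; return 2; /* A + acute */
    case 0x00F3: out[0] = 0x006F; out[1] = 0x0301; return 2; /* o + acute */
    default:     out[0] = cp; return 1;
    }
}

/* Return the UTF-8 length of cp and, if out is non-NULL, also write the
 * bytes there.  Limited to U+0000..U+FFFF to keep the sketch short. */
size_t
put_utf8 (uint32_t cp, char *out)
{
    assert (cp <= 0xFFFF);
    if (cp < 0x80) {
        if (out)
            out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        if (out) {
            out[0] = (char) (0xC0 | (cp >> 6));
            out[1] = (char) (0x80 | (cp & 0x3F));
        }
        return 2;
    }
    if (out) {
        out[0] = (char) (0xE0 | (cp >> 12));
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
    }
    return 3;
}

/* Pass 1 measures the decomposed UTF-8 length, pass 2 fills the buffer.
 * Returns a NUL-terminated string the caller must free, or NULL. */
char *
decompose_to_utf8 (const uint32_t *cps, size_t n)
{
    uint32_t dc[2];
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        size_t dn = toy_decompose (cps[i], dc);
        for (size_t j = 0; j < dn; j++)
            len += put_utf8 (dc[j], NULL);
    }

    char *result = malloc (len + 1);
    if (result == NULL)
        return NULL;

    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        size_t dn = toy_decompose (cps[i], dc);
        for (size_t j = 0; j < dn; j++)
            pos += put_utf8 (dc[j], result + pos);
    }
    assert (pos == len);
    result[pos] = '\0';
    return result;
}
```

Every character is decomposed twice here, once per pass, which mirrors the cost noted above; a single pass into a growable buffer would avoid that at the price of possible reallocation.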
  98 \r
  99 FIXME: document current encoding behavior clearly in\r
 100 new/insert/search-terms.\r
 101 \r
 102 FIXME: what about existing indexed text?\r
 103 ---\r
 104 \r
 105  Posted for preliminary discussion, and as a milestone (it appears to\r
 106  mostly work now).  Though I doubt I'm handling things correctly\r
 107  everywhere notmuch-wise, wrt talloc, etc.\r
 108 \r
 109  lib/Makefile.local         |  1 +\r
 110  lib/database.cc            | 17 ++++++++--\r
 111  lib/message.cc             | 51 +++++++++++++++++++---------\r
 112  lib/notmuch.h              |  3 ++\r
 113  lib/query.cc               |  6 ++--\r
 114  lib/text-util.cc           | 82 ++++++++++++++++++++++++++++++++++++++++++++++\r
 115  test/Makefile.local        | 10 ++++--\r
 116  test/T150-tagging.sh       | 54 +++++++++++++++++++++++-------\r
 117  test/T240-dump-restore.sh  |  4 +--\r
 118  test/T480-hex-escaping.sh  |  4 +--\r
 119  test/T570-normalization.sh | 28 ++++++++++++++++\r
 120  test/corpus/cur/52:2,      |  6 ++--\r
 121  test/to-utf8.c             | 44 +++++++++++++++++++++++++\r
 122  13 files changed, 267 insertions(+), 43 deletions(-)\r
 123  create mode 100644 lib/text-util.cc\r
 124  create mode 100755 test/T570-normalization.sh\r
 125  create mode 100644 test/to-utf8.c\r
 126 \r
 127 diff --git a/lib/Makefile.local b/lib/Makefile.local\r
 128 index 3a07090..41fd1e1 100644\r
 129 --- a/lib/Makefile.local\r
 130 +++ b/lib/Makefile.local\r
 131 @@ -48,6 +48,7 @@ libnotmuch_cxx_srcs =         \\r
 132         $(dir)/index.cc         \\r
 133         $(dir)/message.cc       \\r
 134         $(dir)/query.cc         \\r
 135 +       $(dir)/text-util.cc     \\r
 136         $(dir)/thread.cc\r
 137  \r
 138  libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)\r
 139 diff --git a/lib/database.cc b/lib/database.cc\r
 140 index 6a15174..7a01f95 100644\r
 141 --- a/lib/database.cc\r
 142 +++ b/lib/database.cc\r
 143 @@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, unsigned doc_id)\r
 144  char *\r
 145  _notmuch_message_id_compressed (void *ctx, const char *message_id)\r
 146  {\r
 147 +    // Assumes message_id is normalized utf-8.\r
 148      char *sha1, *compressed;\r
 149  \r
 150      sha1 = _notmuch_sha1_of_string (message_id);\r
 151 @@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t *notmuch,\r
 152      if (message_ret == NULL)\r
 153         return NOTMUCH_STATUS_NULL_POINTER;\r
 154  \r
 155 -    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)\r
 156 -       message_id = _notmuch_message_id_compressed (notmuch, message_id);\r
 157 +    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);\r
 158 +\r
 159 +    // Is strlen still appropriate?\r
 160 +    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX)\r
 161 +    {\r
 162 +       message_id = _notmuch_message_id_compressed (notmuch, u8_id);\r
 163 +       talloc_free ((char *) u8_id);\r
 164 +    } else\r
 165 +       message_id = u8_id;\r
 166  \r
 167      try {\r
 168         status = _notmuch_database_find_unique_doc_id (notmuch, "id",\r
 169                                                        message_id, &doc_id);\r
 170 +       talloc_free ((char *) message_id);\r
 171  \r
 172         if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND)\r
 173             *message_ret = NULL;\r
 174 @@ -1910,6 +1919,7 @@ _notmuch_database_generate_thread_id (notmuch_database_t *notmuch)\r
 175  static char *\r
 176  _get_metadata_thread_id_key (void *ctx, const char *message_id)\r
 177  {\r
 178 +    // Assumes message_id is normalized utf-8.\r
 179      if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)\r
 180         message_id = _notmuch_message_id_compressed (ctx, message_id);\r
 181  \r
 182 @@ -2011,7 +2021,8 @@ _resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,\r
 183       * generate a new thread ID and store it there.\r
 184       */\r
 185      db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);\r
 186 -    metadata_key = _get_metadata_thread_id_key (ctx, message_id);\r
 187 +    const char *mid = notmuch_message_get_message_id (message);\r
 188 +    metadata_key = _get_metadata_thread_id_key (ctx, mid);\r
 189      thread_id_string = notmuch->xapian_db->get_metadata (metadata_key);\r
 190  \r
 191      if (thread_id_string.empty()) {\r
 192 diff --git a/lib/message.cc b/lib/message.cc\r
 193 index 1ddce3c..afd0264 100644\r
 194 --- a/lib/message.cc\r
 195 +++ b/lib/message.cc\r
 196 @@ -225,20 +225,28 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,\r
 197      unsigned int doc_id;\r
 198      char *term;\r
 199  \r
 200 -    *status_ret = (notmuch_private_status_t) notmuch_database_find_message (notmuch,\r
 201 -                                                                           message_id,\r
 202 -                                                                           &message);\r
 203 -    if (message)\r
 204 +    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);\r
 205 +    *status_ret =\r
 206 +       (notmuch_private_status_t) notmuch_database_find_message (notmuch,\r
 207 +                                                                 u8_id,\r
 208 +                                                                 &message);\r
 209 +    if (message) {\r
 210 +       talloc_free ((char *) u8_id);\r
 211         return talloc_steal (notmuch, message);\r
 212 -    else if (*status_ret)\r
 213 +    } else if (*status_ret) {\r
 214 +       talloc_free ((char *) u8_id);\r
 215         return NULL;\r
 216 +    }\r
 217  \r
 218      /* If the message ID is too long, substitute its sha1 instead. */\r
 219 -    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)\r
 220 -       message_id = _notmuch_message_id_compressed (message, message_id);\r
 221 -\r
 222 -    term = talloc_asprintf (NULL, "%s%s",\r
 223 -                           _find_prefix ("id"), message_id);\r
 224 +    // Strlen still OK?\r
 225 +    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX) {\r
 226 +       message_id = _notmuch_message_id_compressed (message, u8_id);\r
 227 +       talloc_free ((char *) u8_id);\r
 228 +    } else\r
 229 +       message_id = u8_id;\r
 230 +\r
 231 +    term = talloc_asprintf (NULL, "%s%s", _find_prefix ("id"), message_id);\r
 232      if (term == NULL) {\r
 233         *status_ret = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY;\r
 234         return NULL;\r
 235 @@ -252,6 +260,7 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,\r
 236         talloc_free (term);\r
 237  \r
 238         doc.add_value (NOTMUCH_VALUE_MESSAGE_ID, message_id);\r
 239 +       talloc_free ((char *) message_id);\r
 240  \r
 241         doc_id = _notmuch_database_generate_doc_id (notmuch);\r
 242      } catch (const Xapian::Error &error) {\r
 243 @@ -1109,13 +1118,14 @@ _notmuch_message_gen_terms (notmuch_message_t *message,\r
 244      if (text == NULL)\r
 245         return NOTMUCH_PRIVATE_STATUS_NULL_POINTER;\r
 246  \r
 247 +    const char *u8_text = notmuch_bytes_to_utf8 (NULL, text, -1);\r
 248      term_gen->set_document (message->doc);\r
 249  \r
 250      if (prefix_name) {\r
 251         const char *prefix = _find_prefix (prefix_name);\r
 252  \r
 253         term_gen->set_termpos (message->termpos);\r
 254 -       term_gen->index_text (text, 1, prefix);\r
 255 +       term_gen->index_text (u8_text, 1, prefix);\r
 256         /* Create a gap between this an the next terms so they don't\r
 257          * appear to be a phrase. */\r
 258         message->termpos = term_gen->get_termpos () + 100;\r
 259 @@ -1124,10 +1134,11 @@ _notmuch_message_gen_terms (notmuch_message_t *message,\r
 260      }\r
 261  \r
 262      term_gen->set_termpos (message->termpos);\r
 263 -    term_gen->index_text (text);\r
 264 +    term_gen->index_text (u8_text);\r
 265      /* Create a term gap, as above. */\r
 266      message->termpos = term_gen->get_termpos () + 100;\r
 267  \r
 268 +    talloc_free ((char *) u8_text);\r
 269      return NOTMUCH_PRIVATE_STATUS_SUCCESS;\r
 270  }\r
 271  \r
 272 @@ -1184,10 +1195,14 @@ notmuch_message_add_tag (notmuch_message_t *message, const char *tag)\r
 273      if (tag == NULL)\r
 274         return NOTMUCH_STATUS_NULL_POINTER;\r
 275  \r
 276 -    if (strlen (tag) > NOTMUCH_TAG_MAX)\r
 277 +    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);\r
 278 +    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {\r
 279 +       talloc_free ((char *) u8_tag);\r
 280         return NOTMUCH_STATUS_TAG_TOO_LONG;\r
 281 +    }\r
 282  \r
 283 -    private_status = _notmuch_message_add_term (message, "tag", tag);\r
 284 +    private_status = _notmuch_message_add_term (message, "tag", u8_tag);\r
 285 +    talloc_free ((char *) u8_tag);\r
 286      if (private_status) {\r
 287         INTERNAL_ERROR ("_notmuch_message_add_term return unexpected value: %d\n",\r
 288                         private_status);\r
 289 @@ -1212,10 +1227,14 @@ notmuch_message_remove_tag (notmuch_message_t *message, const char *tag)\r
 290      if (tag == NULL)\r
 291         return NOTMUCH_STATUS_NULL_POINTER;\r
 292  \r
 293 -    if (strlen (tag) > NOTMUCH_TAG_MAX)\r
 294 +    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);\r
 295 +    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {\r
 296 +       talloc_free ((char *) u8_tag);\r
 297         return NOTMUCH_STATUS_TAG_TOO_LONG;\r
 298 +    }\r
 299  \r
 300 -    private_status = _notmuch_message_remove_term (message, "tag", tag);\r
 301 +    private_status = _notmuch_message_remove_term (message, "tag", u8_tag);\r
 302 +    talloc_free ((char *) u8_tag);\r
 303      if (private_status) {\r
 304         INTERNAL_ERROR ("_notmuch_message_remove_term return unexpected value: %d\n",\r
 305                         private_status);\r
 306 diff --git a/lib/notmuch.h b/lib/notmuch.h\r
 307 index b1f5bfa..6e13eb1 100644\r
 308 --- a/lib/notmuch.h\r
 309 +++ b/lib/notmuch.h\r
 310 @@ -1759,6 +1759,9 @@ notmuch_filenames_move_to_next (notmuch_filenames_t *filenames);\r
 311  void\r
 312  notmuch_filenames_destroy (notmuch_filenames_t *filenames);\r
 313  \r
 314 +char *\r
 315 +notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len);\r
 316 +\r
 317  /* @} */\r
 318  \r
 319  NOTMUCH_END_DECLS\r
 320 diff --git a/lib/query.cc b/lib/query.cc\r
 321 index 5275b5a..e48f06a 100644\r
 322 --- a/lib/query.cc\r
 323 +++ b/lib/query.cc\r
 324 @@ -86,7 +86,7 @@ notmuch_query_create (notmuch_database_t *notmuch,\r
 325  \r
 326      query->notmuch = notmuch;\r
 327  \r
 328 -    query->query_string = talloc_strdup (query, query_string);\r
 329 +    query->query_string = notmuch_bytes_to_utf8 (query, query_string, -1);\r
 330  \r
 331      query->sort = NOTMUCH_SORT_NEWEST_FIRST;\r
 332  \r
 333 @@ -125,7 +125,9 @@ notmuch_query_get_sort (notmuch_query_t *query)\r
 334  void\r
 335  notmuch_query_add_tag_exclude (notmuch_query_t *query, const char *tag)\r
 336  {\r
 337 -    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), tag);\r
 338 +    const char *u8_tag = notmuch_bytes_to_utf8 (query, tag, -1);\r
 339 +    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), u8_tag);\r
 340 +    talloc_free ((char *) u8_tag);\r
 341      _notmuch_string_list_append (query->exclude_terms, term);\r
 342  }\r
 343  \r
 344 diff --git a/lib/text-util.cc b/lib/text-util.cc\r
 345 new file mode 100644\r
 346 index 0000000..9dfd31f\r
 347 --- /dev/null\r
 348 +++ b/lib/text-util.cc\r
 349 @@ -0,0 +1,82 @@\r
 350 +/* text-util.cc - notmuch text processing utility functions\r
 351 + *\r
 352 + * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>\r
 353 + *\r
 354 + * This program is free software: you can redistribute it and/or modify\r
 355 + * it under the terms of the GNU General Public License as published by\r
 356 + * the Free Software Foundation, either version 3 of the License, or\r
 357 + * (at your option) any later version.\r
 358 + *\r
 359 + * This program is distributed in the hope that it will be useful,\r
 360 + * but WITHOUT ANY WARRANTY; without even the implied warranty of\r
 361 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r
 362 + * GNU General Public License for more details.\r
 363 + *\r
 364 + * You should have received a copy of the GNU General Public License\r
 365 + * along with this program.  If not, see http://www.gnu.org/licenses/ .\r
 366 + *\r
 367 + * Author: Rob Browning <rlb@defaultvalue.org>\r
 368 + *\r
 369 + */\r
 370 +\r
 371 +#include "notmuch.h"\r
 372 +\r
 373 +#include <assert.h>\r
 374 +#include <glib.h>\r
 375 +#include <string.h>\r
 376 +#include <talloc.h>\r
 377 +#include <xapian.h>\r
 378 +\r
 379 +static gsize\r
 380 +_notmuch_decompose_to_utf8 (const gunichar uc, gchar *out)\r
 381 +{\r
 382 +    gunichar dc[G_UNICHAR_MAX_DECOMPOSITION_LENGTH];\r
 383 +    // This currently performs canonical decomposition.\r
 384 +    const gsize dcn =\r
 385 +       g_unichar_fully_decompose (uc, FALSE, dc,\r
 386 +                                  G_UNICHAR_MAX_DECOMPOSITION_LENGTH);\r
 387 +    gsize utf8_len = 0;\r
 388 +    for (gsize i = 0; i < dcn; i++)\r
 389 +    {\r
 390 +       const gint dc_bytes = g_unichar_to_utf8 (dc[i], out);\r
 391 +       utf8_len += dc_bytes;\r
 392 +       if (out != NULL)\r
 393 +           out += dc_bytes;\r
 394 +    }\r
 395 +    return utf8_len;\r
 396 +}\r
 397 +\r
 398 +/* Convert a sequence of bytes to UTF-8, handling input encodings as\r
 399 + * Xapian does, but produce the canonical encoding.\r
 400 + */\r
 401 +char *\r
 402 +notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len)\r
 403 +{\r
 404 +    // FIXME: try/catch to convert to error status messages?  Can the\r
 405 +    // iterator throw?\r
 406 +    Xapian::Utf8Iterator it;\r
 407 +    gsize u8_len = 0;\r
 408 +\r
 409 +    // Compute the utf-8 length\r
 410 +    if (len == (size_t) -1)\r
 411 +       it.assign (bytes, strlen(bytes));\r
 412 +    else\r
 413 +       it.assign (bytes, len);\r
 414 +    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it)\r
 415 +       u8_len += _notmuch_decompose_to_utf8 (uc, NULL);\r
 416 +\r
 417 +    // Convert to utf-8\r
 418 +    if (len == (size_t) -1)\r
 419 +       it.assign (bytes, strlen(bytes));\r
 420 +    else\r
 421 +       it.assign (bytes, len);\r
 422 +    char *result = talloc_array (ctx, char, u8_len + 1);\r
 423 +    gsize u8_i = 0;\r
 424 +    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it) {\r
 425 +       const gsize dc_bytes = _notmuch_decompose_to_utf8 (uc, &(result[u8_i]));\r
 426 +       u8_i += dc_bytes;\r
 427 +    }\r
 428 +    assert (u8_i == u8_len);\r
 429 +    result[u8_i] = '\0';\r
 430 +    return result;\r
 431 +}\r
 432 diff --git a/test/Makefile.local b/test/Makefile.local\r
 433 index 2331ceb..fd6d06d 100644\r
 434 --- a/test/Makefile.local\r
 435 +++ b/test/Makefile.local\r
 436 @@ -15,8 +15,11 @@ smtp_dummy_modules = $(smtp_dummy_srcs:.c=.o)\r
 437  $(dir)/arg-test: $(dir)/arg-test.o command-line-arguments.o util/libutil.a\r
 438         $(call quiet,CC) $^ -o $@ $(LDFLAGS)\r
 439  \r
 440 -$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o util/libutil.a\r
 441 -       $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS)\r
 442 +$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o lib/libnotmuch.a util/libutil.a\r
 443 +       $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)\r
 444 +\r
 445 +$(dir)/to-utf8: $(dir)/to-utf8.o command-line-arguments.o lib/libnotmuch.a util/libutil.a\r
 446 +       $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)\r
 447  \r
 448  random_corpus_deps =  $(dir)/random-corpus.o  $(dir)/database-test.o \\r
 449                         notmuch-config.o command-line-arguments.o \\r
 450 @@ -46,7 +49,8 @@ test_main_srcs=$(dir)/arg-test.c \\r
 451               $(dir)/parse-time.c \\r
 452               $(dir)/smtp-dummy.c \\r
 453               $(dir)/symbol-test.cc \\r
 454 -             $(dir)/make-db-version.cc \\r
 455 +             $(dir)/to-utf8.c \\r
 456 +             $(dir)/make-db-version.cc\r
 457  \r
 458  test_srcs=$(test_main_srcs) $(dir)/database-test.c\r
 459  \r
 460 diff --git a/test/T150-tagging.sh b/test/T150-tagging.sh\r
 461 index 821d393..d983fe0 100755\r
 462 --- a/test/T150-tagging.sh\r
 463 +++ b/test/T150-tagging.sh\r
 464 @@ -2,6 +2,14 @@\r
 465  test_description='"notmuch tag"'\r
 466  . ./test-lib.sh || exit 1\r
 467  \r
 468 +canonicalize_encoding()\r
 469 +{\r
 470 +  local decoded u8\r
 471 +  decoded=$($TEST_DIRECTORY/hex-xcode --direction=decode "$1") || return 1\r
 472 +  u8=$($TEST_DIRECTORY/to-utf8 "$decoded") || return 1\r
 473 +  $TEST_DIRECTORY/hex-xcode --direction=encode "$u8"\r
 474 +}\r
 475 +\r
 476  add_message '[subject]=One'\r
 477  add_message '[subject]=Two'\r
 478  \r
 479 @@ -191,23 +199,45 @@ test_expect_equal_file EXPECTED OUTPUT\r
 480  test_begin_subtest '--batch: unicode tags'\r
 481  notmuch dump --format=batch-tag > BACKUP\r
 482  \r
 483 +# FIXME: test canonical and non-canonical output?\r
 484 +\r
 485 +enctag1='%2a@%7d%cf%b5%f4%85%80%adO3%da%a7'\r
 486 +enctag2='=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d'\r
 487 +enctag3='A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27'\r
 488 +enctag4='%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6'\r
 489 +enctag5='%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d'\r
 490 +enctag6='L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1'\r
 491 +enctag7='P%c4%98%2f'\r
 492 +enctag8='%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d'\r
 493 +enctag9='%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b'\r
 494 +\r
 495  notmuch tag --batch <<EOF\r
 496 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 -- One\r
 497 -+=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d -- One\r
 498 -+A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 -- One\r
 499 ++$enctag1 -- One\r
 500 ++$enctag2 -- One\r
 501 ++$enctag3 -- One\r
 502  +R -- One\r
 503 -+%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 -- One\r
 504 -+%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- One\r
 505 -+L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 -- One\r
 506 -+P%c4%98%2f -- One\r
 507 -+%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d -- One\r
 508 -+%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- One\r
 509 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7  +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d  +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27  +R  +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6  +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d  +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1  +P%c4%98%2f  +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d  +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- Two\r
 510 ++$enctag4 -- One\r
 511 ++$enctag5 -- One\r
 512 ++$enctag6 -- One\r
 513 ++$enctag7 -- One\r
 514 ++$enctag8 -- One\r
 515 ++$enctag9 -- One\r
 516 ++$enctag1  +$enctag2  +$enctag3  +R  +$enctag4  +$enctag5  +$enctag6  +$enctag7  +$enctag8  +$enctag9 -- Two\r
 517  EOF\r
 518  \r
 519 +# FIXME: double-check that we need all of these, or do we want to do everything?\r
 520 +cetag1=$(canonicalize_encoding "$enctag1") || exit 1\r
 521 +cetag2=$(canonicalize_encoding "$enctag2") || exit 1\r
 522 +cetag4=$(canonicalize_encoding "$enctag4") || exit 1\r
 523 +cetag5=$(canonicalize_encoding "$enctag5") || exit 1\r
 524 +cetag6=$(canonicalize_encoding "$enctag6") || exit 1\r
 525 +cetag7=$(canonicalize_encoding "$enctag7") || exit 1\r
 526 +cetag8=$(canonicalize_encoding "$enctag8") || exit 1\r
 527 +cetag9=$(canonicalize_encoding "$enctag9") || exit 1\r
 528 +\r
 529  cat <<EOF > EXPECTED\r
 530 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag4 +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-002@notmuch-test-suite\r
 531 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-001@notmuch-test-suite\r
 532 ++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag4 +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-002@notmuch-test-suite\r
 533 ++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-001@notmuch-test-suite\r
 534  EOF\r
 535  \r
 536  notmuch dump --format=batch-tag | sort > OUTPUT\r
 537 diff --git a/test/T240-dump-restore.sh b/test/T240-dump-restore.sh\r
 538 index e6976ff..37722fb 100755\r
 539 --- a/test/T240-dump-restore.sh\r
 540 +++ b/test/T240-dump-restore.sh\r
 541 @@ -164,7 +164,7 @@ enc1=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag1")\r
 542  tag2=$(printf 'this\n tag\t has\n spaces')\r
 543  enc2=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag2")\r
 544  \r
 545 -enc3='%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a'\r
 546 +enc3='N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82'\r
 547  tag3=$($TEST_DIRECTORY/hex-xcode --direction=decode $enc3)\r
 548  \r
 549  notmuch dump --format=batch-tag > BACKUP\r
 550 @@ -218,7 +218,7 @@ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
 551  \r
 552  test_begin_subtest 'format=batch-tag, checking encoded output'\r
 553  notmuch dump --format=batch-tag -- from:cworth |\\r
 554 -        awk "{ print \"+$enc1 +$enc2 +$enc3 -- \" \$5 }" > EXPECTED.$test_count\r
 555 +        awk "{ print \"+$enc3 +$enc1 +$enc2 -- \" \$5 }" > EXPECTED.$test_count\r
 556  notmuch dump --format=batch-tag -- from:cworth  > OUTPUT.$test_count\r
 557  test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
 558  \r
 559 diff --git a/test/T480-hex-escaping.sh b/test/T480-hex-escaping.sh\r
 560 index 10527b1..b9c5eac 100755\r
 561 --- a/test/T480-hex-escaping.sh\r
 562 +++ b/test/T480-hex-escaping.sh\r
 563 @@ -19,7 +19,7 @@ $TEST_DIRECTORY/hex-xcode --direction=encode  < EXPECTED.$test_count |\\r
 564  test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
 565  \r
 566  test_begin_subtest "round trip 8bit chars"\r
 567 -echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count\r
 568 +echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count\r
 569  $TEST_DIRECTORY/hex-xcode --direction=decode  < EXPECTED.$test_count |\\r
 570      $TEST_DIRECTORY/hex-xcode --direction=encode > OUTPUT.$test_count\r
 571  test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
 572 @@ -42,7 +42,7 @@ $TEST_DIRECTORY/hex-xcode --in-place --direction=encode  < EXPECTED.$test_count\r
 573  test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
 574  \r
 575  test_begin_subtest "round trip 8bit chars (in-place)"\r
 576 -echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count\r
 577 +echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count\r
 578  $TEST_DIRECTORY/hex-xcode --in-place --direction=decode  < EXPECTED.$test_count |\\r
 579      $TEST_DIRECTORY/hex-xcode --in-place --direction=encode > OUTPUT.$test_count\r
 580  test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
 581 diff --git a/test/T570-normalization.sh b/test/T570-normalization.sh\r
 582 new file mode 100755\r
 583 index 0000000..ee3fa94\r
 584 --- /dev/null\r
 585 +++ b/test/T570-normalization.sh\r
 586 @@ -0,0 +1,28 @@\r
 587 +#!/usr/bin/env bash\r
 588 +\r
 589 +test_description="text normalization"\r
 590 +\r
 591 +. ./test-lib.sh || exit 1\r
 592 +\r
 593 +combining_a='Á'\r
 594 +noncombining_a='Á'\r
 595 +\r
 596 +# FIXME: these are extraneous/vestigial, remove from the final patch if still\r
 597 +# unneeded.\r
 598 +combining_o='ó' # should be U+006f U+0301\r
 599 +noncombining_o='ó' # U+00f3 latin small letter o with acute\r
 600 +# utf-8:\r
 601 +#   combining: o b11001100 b10000001 (o 0xcc 0x81)\r
 602 +#   non-combining: b11000011 b10110011 (0xc3 0xb3)\r
 603 +combining_token='tóken' # should be U+006f U+0301\r
 604 +normalized_token='tóken' # should be U+00f3\r
 605 +\r
 606 +test_begin_subtest "Term with combining characters"\r
 607 +add_message '[content-type]="text/plain; charset=unknown-8bit"' \\r
 608 +           '[subject]="reproduc$noncombining_a"' \\r
 609 +           '[body]="reproduc$noncombining_a"'\r
 610 +output=$(notmuch count "reproduc$combining_a" 2>&1 | notmuch_show_sanitize_all)\r
 611 +\r
 612 +test_expect_equal "$output" 1\r
 613 +\r
 614 +test_done\r
 615 diff --git a/test/corpus/cur/52:2, b/test/corpus/cur/52:2,\r
 616 index 6028340..852e2bd 100644\r
 617 --- a/test/corpus/cur/52:2,\r
 618 +++ b/test/corpus/cur/52:2,\r
 619 @@ -12,8 +12,8 @@ Content-Type: text/plain; charset=ISO-8859-1\r
 620  Content-Transfer-Encoding: 8bit\r
 621  Subject: Re: [aur-general] Guidelines: cp, mkdir vs install\r
 622  \r
 623 -Le 29/12/2011 11:13, Allan McRae a écrit :\r
 624 -> On 29/12/11 19:56, François Boulogne wrote:\r
 625 +Le 29/12/2011 11:13, Allan McRae a écrit :\r
 626 +> On 29/12/11 19:56, François Boulogne wrote:\r
 627  >> Hi,\r
 628  >>\r
 629  >> Looking to improve the quality of my packages, I read again the guidelines.\r
 630 @@ -35,5 +35,5 @@ Thank you Allan\r
 631  \r
 632  \r
 633  -- \r
 634 -François Boulogne.\r
 635 +François Boulogne.\r
 636  https://www.sciunto.org\r
 637 diff --git a/test/to-utf8.c b/test/to-utf8.c\r
 638 new file mode 100644\r
 639 index 0000000..17bf40d\r
 640 --- /dev/null\r
 641 +++ b/test/to-utf8.c\r
 642 @@ -0,0 +1,44 @@\r
 643 +/* to-utf8.c - convert bytes to UTF-8 as notmuch would\r
 644 + *\r
 645 + * usage:\r
 646 + * to-utf8 [bytes ...]\r
 647 + *\r
 648 + * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>\r
 649 + *\r
 650 + * This program is free software: you can redistribute it and/or modify\r
 651 + * it under the terms of the GNU General Public License as published by\r
 652 + * the Free Software Foundation, either version 3 of the License, or\r
 653 + * (at your option) any later version.\r
 654 + *\r
 655 + * This program is distributed in the hope that it will be useful,\r
 656 + * but WITHOUT ANY WARRANTY; without even the implied warranty of\r
 657 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r
 658 + * GNU General Public License for more details.\r
 659 + *\r
 660 + * You should have received a copy of the GNU General Public License\r
 661 + * along with this program.  If not, see http://www.gnu.org/licenses/ .\r
 662 + *\r
 663 + * Author: Rob Browning <rlb@defaultvalue.org>\r
 664 + *\r
 665 + */\r
 666 +\r
 667 +#include "notmuch.h"\r
 668 +\r
 669 +#include <stdio.h>\r
 670 +#include <stdlib.h>\r
 671 +#include <talloc.h>\r
 672 +\r
 673 +int\r
 674 +main (int argc, char **argv)\r
 675 +{\r
 676 +    void *ctx = talloc_new (NULL);\r
 677 +\r
 678 +    for (int i = 1; i < argc; i++) {\r
 679 +       char *u8 = notmuch_bytes_to_utf8(ctx, argv[i], -1);\r
 680 +       fputs (u8, stdout);\r
 681 +       talloc_free (u8);\r
 682 +    }\r
 683 +\r
 684 +    talloc_free (ctx);\r
 685 +    return 0;\r
 686 +}\r
 687 -- \r
 688 2.5.0\r
 689 \r