[PATCH 1/1] Store and search for canonical Unicode text [WIP]

author Rob Browning <rlb@defaultvalue.org>

Sun, 30 Aug 2015 16:21:16 +0000 (11:21 -0500)

committer W. Trevor King <wking@tremily.us>

Sat, 20 Aug 2016 21:49:28 +0000 (14:49 -0700)
diff --git a/7c/77303c51600f536be5b341809b4adf06b0e1cb b/7c/77303c51600f536be5b341809b4adf06b0e1cb

new file mode 100644

index 0000000..29d385e
--- /dev/null
+++ b/7c/77303c51600f536be5b341809b4adf06b0e1cb
@@ -0,0 +1,689 @@
+Return-Path: <rlb@defaultvalue.org>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+ by arlo.cworth.org (Postfix) with ESMTP id 66BC46DE1512\r
+ for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:31 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at cworth.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: 0.134\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=0.134 tagged_above=-999 required=5 tests=[AWL=0.684, \r
+ RP_MATCHES_RCVD=-0.55] autolearn=disabled\r
+Received: from arlo.cworth.org ([127.0.0.1])\r
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)\r
+ with ESMTP id fmevZxWHrujW for <notmuch@notmuchmail.org>;\r
+ Sun, 30 Aug 2015 09:26:28 -0700 (PDT)\r
+X-Greylist: delayed 309 seconds by postgrey-1.35 at arlo;\r
+ Sun, 30 Aug 2015 09:26:28 PDT\r
+Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])\r
+ by arlo.cworth.org (Postfix) with ESMTP id 0C8116DE14FD\r
+ for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:28 -0700 (PDT)\r
+Received: from trouble.defaultvalue.org (localhost [127.0.0.1])\r
+ (Authenticated sender: rlb@defaultvalue.org)\r
+ by defaultvalue.org (Postfix) with ESMTPSA id 93EC820235\r
+ for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 11:21:17 -0500 (CDT)\r
+Received: by trouble.defaultvalue.org (Postfix, from userid 1000)\r
+ id 1B1DE14E0F9; Sun, 30 Aug 2015 11:21:16 -0500 (CDT)\r
+From: Rob Browning <rlb@defaultvalue.org>\r
+To: notmuch@notmuchmail.org\r
+Subject: [PATCH 1/1] Store and search for canonical Unicode text [WIP]\r
+Date: Sun, 30 Aug 2015 11:21:16 -0500\r
+Message-Id: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>\r
+X-Mailer: git-send-email 2.5.0\r
+MIME-Version: 1.0\r
+Content-Type: text/plain; charset=UTF-8\r
+Content-Transfer-Encoding: 8bit\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.18\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+ <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch/>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Sun, 30 Aug 2015 16:26:31 -0000\r
+\r
+WARNING: this version is very preliminary, and might eat your data.\r
+\r
+Unicode has multiple sequences representing what should normally be\r
+considered the same text.  For example here's a combining Á and a\r
+noncombining Á.\r
+\r
+Depending on the way you view this, you may or may not see a\r
+difference, but the former is the canonical form, and is represented\r
+by two Unicode code points: a capital A (U+0041) followed by a\r
+"combining acute accent" (U+0301); the latter is the single code\r
+point (U+00C1), which is probably what most people would type.\r
+\r
+Before this change, notmuch would index two strings that differ only\r
+with respect to canonicalization, like tóken and tóken, as separate\r
+terms, even though they may be visually indistinguishable, and do (for\r
+most purposes) represent the same text.  After indexing, searching for\r
+one would not find the other, and which one you present to notmuch\r
+when you search depends on your tools.  See test/T570-normalization.sh\r
+for a working example.\r
+\r
+Since we're talking about differing representations that one wouldn't\r
+normally want to distinguish, this patch unifies the various\r
+representations by converting all incoming text to its canonical form\r
+before indexing, and canonicalizing all query strings.\r
+\r
+Up to now, notmuch has let Xapian handle converting the incoming bytes\r
+to UTF-8.  Xapian treats any byte sequence as UTF-8, and interprets\r
+any invalid UTF-8 bytes as Latin-1.  This patch maintains the existing\r
+behavior (excepting the new canonicalization) by using Xapian's\r
+Utf8Iterator to handle the initial Unicode character parsing.\r
+\r
+Note that the parsing approach in this patch is not particularly\r
+efficient.  First, it traverses the incoming bytes three times:\r
+\r
+   - once to determine how long the input is (currently the iterator\r
+     can't directly handle null-terminated char*'s),\r
+\r
+   - once to determine how long the final UTF-8 allocation needs to\r
+     be,\r
+\r
+   - and once for the conversion.\r
+\r
+Second, even when the input is already UTF-8, it blindly converts\r
+from UTF-8 to Unicode code points, and then back to UTF-8 (after\r
+canonicalization), during each pass.  There are certainly\r
+opportunities to optimize, though it may be worth discussing the\r
+detection of data encodings more broadly first.\r
+\r
+FIXME: document current encoding behavior clearly in\r
+new/insert/search-terms.\r
+\r
+FIXME: what about existing indexed text?\r
+---\r
+\r
+ Posted for preliminary discussion, and as a milestone (it appears to\r
+ mostly work now), though I doubt I'm handling things correctly\r
+ everywhere notmuch-wise, with respect to talloc, etc.\r
+\r
+ lib/Makefile.local         |  1 +\r
+ lib/database.cc            | 17 ++++++++--\r
+ lib/message.cc             | 51 +++++++++++++++++++---------\r
+ lib/notmuch.h              |  3 ++\r
+ lib/query.cc               |  6 ++--\r
+ lib/text-util.cc           | 82 ++++++++++++++++++++++++++++++++++++++++++++++\r
+ test/Makefile.local        | 10 ++++--\r
+ test/T150-tagging.sh       | 54 +++++++++++++++++++++++-------\r
+ test/T240-dump-restore.sh  |  4 +--\r
+ test/T480-hex-escaping.sh  |  4 +--\r
+ test/T570-normalization.sh | 28 ++++++++++++++++\r
+ test/corpus/cur/52:2,      |  6 ++--\r
+ test/to-utf8.c             | 44 +++++++++++++++++++++++++\r
+ 13 files changed, 267 insertions(+), 43 deletions(-)\r
+ create mode 100644 lib/text-util.cc\r
+ create mode 100755 test/T570-normalization.sh\r
+ create mode 100644 test/to-utf8.c\r
+\r
+diff --git a/lib/Makefile.local b/lib/Makefile.local\r
+index 3a07090..41fd1e1 100644\r
+--- a/lib/Makefile.local\r
++++ b/lib/Makefile.local\r
+@@ -48,6 +48,7 @@ libnotmuch_cxx_srcs =                \\r
+       $(dir)/index.cc         \\r
+       $(dir)/message.cc       \\r
+       $(dir)/query.cc         \\r
++      $(dir)/text-util.cc     \\r
+       $(dir)/thread.cc\r
+ \r
+ libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)\r
+diff --git a/lib/database.cc b/lib/database.cc\r
+index 6a15174..7a01f95 100644\r
+--- a/lib/database.cc\r
++++ b/lib/database.cc\r
+@@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, unsigned doc_id)\r
+ char *\r
+ _notmuch_message_id_compressed (void *ctx, const char *message_id)\r
+ {\r
++    // Assumes message_id is normalized utf-8.\r
+     char *sha1, *compressed;\r
+ \r
+     sha1 = _notmuch_sha1_of_string (message_id);\r
+@@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t *notmuch,\r
+     if (message_ret == NULL)\r
+       return NOTMUCH_STATUS_NULL_POINTER;\r
+ \r
+-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)\r
+-      message_id = _notmuch_message_id_compressed (notmuch, message_id);\r
++    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);\r
++\r
++    // Is strlen still appropriate?\r
++    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX)\r
++    {\r
++      message_id = _notmuch_message_id_compressed (notmuch, u8_id);\r
++      talloc_free ((char *) u8_id);\r
++    } else\r
++      message_id = u8_id;\r
+ \r
+     try {\r
+       status = _notmuch_database_find_unique_doc_id (notmuch, "id",\r
+                                                      message_id, &doc_id);\r
++      talloc_free ((char *) message_id);\r
+ \r
+       if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND)\r
+           *message_ret = NULL;\r
+@@ -1910,6 +1919,7 @@ _notmuch_database_generate_thread_id (notmuch_database_t *notmuch)\r
+ static char *\r
+ _get_metadata_thread_id_key (void *ctx, const char *message_id)\r
+ {\r
++    // Assumes message_id is normalized utf-8.\r
+     if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)\r
+       message_id = _notmuch_message_id_compressed (ctx, message_id);\r
+ \r
+@@ -2011,7 +2021,8 @@ _resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,\r
+      * generate a new thread ID and store it there.\r
+      */\r
+     db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);\r
+-    metadata_key = _get_metadata_thread_id_key (ctx, message_id);\r
++    const char *mid = notmuch_message_get_message_id (message);\r
++    metadata_key = _get_metadata_thread_id_key (ctx, mid);\r
+     thread_id_string = notmuch->xapian_db->get_metadata (metadata_key);\r
+ \r
+     if (thread_id_string.empty()) {\r
+diff --git a/lib/message.cc b/lib/message.cc\r
+index 1ddce3c..afd0264 100644\r
+--- a/lib/message.cc\r
++++ b/lib/message.cc\r
+@@ -225,20 +225,28 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,\r
+     unsigned int doc_id;\r
+     char *term;\r
+ \r
+-    *status_ret = (notmuch_private_status_t) notmuch_database_find_message (notmuch,\r
+-                                                                          message_id,\r
+-                                                                          &message);\r
+-    if (message)\r
++    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);\r
++    *status_ret =\r
++      (notmuch_private_status_t) notmuch_database_find_message (notmuch,\r
++                                                                u8_id,\r
++                                                                &message);\r
++    if (message) {\r
++      talloc_free ((char *) u8_id);\r
+       return talloc_steal (notmuch, message);\r
+-    else if (*status_ret)\r
++    } else if (*status_ret) {\r
++      talloc_free ((char *) u8_id);\r
+       return NULL;\r
++    }\r
+ \r
+     /* If the message ID is too long, substitute its sha1 instead. */\r
+-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)\r
+-      message_id = _notmuch_message_id_compressed (message, message_id);\r
+-\r
+-    term = talloc_asprintf (NULL, "%s%s",\r
+-                          _find_prefix ("id"), message_id);\r
++    // Strlen still OK?\r
++    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX) {\r
++      message_id = _notmuch_message_id_compressed (message, u8_id);\r
++      talloc_free ((char *) u8_id);\r
++    } else\r
++      message_id = u8_id;\r
++\r
++    term = talloc_asprintf (NULL, "%s%s", _find_prefix ("id"), message_id);\r
+     if (term == NULL) {\r
+       *status_ret = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY;\r
+       return NULL;\r
+@@ -252,6 +260,7 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,\r
+       talloc_free (term);\r
+ \r
+       doc.add_value (NOTMUCH_VALUE_MESSAGE_ID, message_id);\r
++      talloc_free ((char *) message_id);\r
+ \r
+       doc_id = _notmuch_database_generate_doc_id (notmuch);\r
+     } catch (const Xapian::Error &error) {\r
+@@ -1109,13 +1118,14 @@ _notmuch_message_gen_terms (notmuch_message_t *message,\r
+     if (text == NULL)\r
+       return NOTMUCH_PRIVATE_STATUS_NULL_POINTER;\r
+ \r
++    const char *u8_text = notmuch_bytes_to_utf8(NULL, text, -1);\r
+     term_gen->set_document (message->doc);\r
+ \r
+     if (prefix_name) {\r
+       const char *prefix = _find_prefix (prefix_name);\r
+ \r
+       term_gen->set_termpos (message->termpos);\r
+-      term_gen->index_text (text, 1, prefix);\r
++      term_gen->index_text (u8_text, 1, prefix);\r
+       /* Create a gap between this an the next terms so they don't\r
+        * appear to be a phrase. */\r
+       message->termpos = term_gen->get_termpos () + 100;\r
+@@ -1124,10 +1134,11 @@ _notmuch_message_gen_terms (notmuch_message_t *message,\r
+     }\r
+ \r
+     term_gen->set_termpos (message->termpos);\r
+-    term_gen->index_text (text);\r
++    term_gen->index_text (u8_text);\r
+     /* Create a term gap, as above. */\r
+     message->termpos = term_gen->get_termpos () + 100;\r
+ \r
++    talloc_free ((char *) u8_text);\r
+     return NOTMUCH_PRIVATE_STATUS_SUCCESS;\r
+ }\r
+ \r
+@@ -1184,10 +1195,14 @@ notmuch_message_add_tag (notmuch_message_t *message, const char *tag)\r
+     if (tag == NULL)\r
+       return NOTMUCH_STATUS_NULL_POINTER;\r
+ \r
+-    if (strlen (tag) > NOTMUCH_TAG_MAX)\r
++    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);\r
++    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {\r
++      talloc_free ((char *) u8_tag);\r
+       return NOTMUCH_STATUS_TAG_TOO_LONG;\r
++    }\r
+ \r
+-    private_status = _notmuch_message_add_term (message, "tag", tag);\r
++    private_status = _notmuch_message_add_term (message, "tag", u8_tag);\r
++    talloc_free ((char *) u8_tag);\r
+     if (private_status) {\r
+       INTERNAL_ERROR ("_notmuch_message_add_term return unexpected value: %d\n",\r
+                       private_status);\r
+@@ -1212,10 +1227,14 @@ notmuch_message_remove_tag (notmuch_message_t *message, const char *tag)\r
+     if (tag == NULL)\r
+       return NOTMUCH_STATUS_NULL_POINTER;\r
+ \r
+-    if (strlen (tag) > NOTMUCH_TAG_MAX)\r
++    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);\r
++    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {\r
++      talloc_free ((char *) u8_tag);\r
+       return NOTMUCH_STATUS_TAG_TOO_LONG;\r
++    }\r
+ \r
+-    private_status = _notmuch_message_remove_term (message, "tag", tag);\r
++    private_status = _notmuch_message_remove_term (message, "tag", u8_tag);\r
++    talloc_free ((char *) u8_tag);\r
+     if (private_status) {\r
+       INTERNAL_ERROR ("_notmuch_message_remove_term return unexpected value: %d\n",\r
+                       private_status);\r
+diff --git a/lib/notmuch.h b/lib/notmuch.h\r
+index b1f5bfa..6e13eb1 100644\r
+--- a/lib/notmuch.h\r
++++ b/lib/notmuch.h\r
+@@ -1759,6 +1759,9 @@ notmuch_filenames_move_to_next (notmuch_filenames_t *filenames);\r
+ void\r
+ notmuch_filenames_destroy (notmuch_filenames_t *filenames);\r
+ \r
++char *\r
++notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len);\r
++\r
+ /* @} */\r
+ \r
+ NOTMUCH_END_DECLS\r
+diff --git a/lib/query.cc b/lib/query.cc\r
+index 5275b5a..e48f06a 100644\r
+--- a/lib/query.cc\r
++++ b/lib/query.cc\r
+@@ -86,7 +86,7 @@ notmuch_query_create (notmuch_database_t *notmuch,\r
+ \r
+     query->notmuch = notmuch;\r
+ \r
+-    query->query_string = talloc_strdup (query, query_string);\r
++    query->query_string = notmuch_bytes_to_utf8 (query, query_string, -1);\r
+ \r
+     query->sort = NOTMUCH_SORT_NEWEST_FIRST;\r
+ \r
+@@ -125,7 +125,9 @@ notmuch_query_get_sort (notmuch_query_t *query)\r
+ void\r
+ notmuch_query_add_tag_exclude (notmuch_query_t *query, const char *tag)\r
+ {\r
+-    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), tag);\r
++    const char *u8_tag = notmuch_bytes_to_utf8 (query, tag, -1);\r
++    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), u8_tag);\r
++    talloc_free ((char *) u8_tag);\r
+     _notmuch_string_list_append (query->exclude_terms, term);\r
+ }\r
+ \r
+diff --git a/lib/text-util.cc b/lib/text-util.cc\r
+new file mode 100644\r
+index 0000000..9dfd31f\r
+--- /dev/null\r
++++ b/lib/text-util.cc\r
+@@ -0,0 +1,82 @@\r
++/* text-util.cc - notmuch text processing utility functions\r
++ *\r
++ * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>\r
++ *\r
++ * This program is free software: you can redistribute it and/or modify\r
++ * it under the terms of the GNU General Public License as published by\r
++ * the Free Software Foundation, either version 3 of the License, or\r
++ * (at your option) any later version.\r
++ *\r
++ * This program is distributed in the hope that it will be useful,\r
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of\r
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r
++ * GNU General Public License for more details.\r
++ *\r
++ * You should have received a copy of the GNU General Public License\r
++ * along with this program.  If not, see http://www.gnu.org/licenses/ .\r
++ *\r
++ * Author: Rob Browning <rlb@defaultvalue.org>\r
++ *\r
++ */\r
++\r
++#include "notmuch.h"\r
++\r
++#include <assert.h>\r
++#include <glib.h>\r
++#include <string.h>\r
++#include <talloc.h>\r
++#include <xapian.h>\r
++\r
++static gsize\r
++_notmuch_decompose_to_utf8 (const gunichar uc, gchar *out)\r
++{\r
++    gunichar dc[G_UNICHAR_MAX_DECOMPOSITION_LENGTH];\r
++    // This currently performs canonical decomposition.\r
++    const gsize dcn =\r
++      g_unichar_fully_decompose (uc, FALSE, dc,\r
++                                 G_UNICHAR_MAX_DECOMPOSITION_LENGTH);\r
++    gsize utf8_len = 0;\r
++    for (gsize i = 0; i < dcn; i++)\r
++    {\r
++      const gint dc_bytes = g_unichar_to_utf8 (dc[i], out);\r
++      utf8_len += dc_bytes;\r
++      if (out != NULL)\r
++          out += dc_bytes;\r
++    }\r
++    return utf8_len;\r
++}\r
++\r
++/* Convert a sequence of bytes to UTF-8, handling input encodings as\r
++ * Xapian does, but produce the canonical encoding.\r
++ */\r
++char *\r
++notmuch_bytes_to_utf8(const void *ctx, const char *bytes, const size_t len)\r
++{\r
++    // FIXME: try/catch to convert to error status messages?  Can the\r
++    // iterator throw?\r
++    Xapian::Utf8Iterator it;\r
++    gsize u8_len = 0;\r
++\r
++    // Compute the utf-8 length\r
++    if (len == (size_t) -1)\r
++      it.assign (bytes, strlen(bytes));\r
++    else\r
++      it.assign (bytes, len);\r
++    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it)\r
++      u8_len += _notmuch_decompose_to_utf8 (uc, NULL);\r
++\r
++    // Convert to utf-8\r
++    if (len == (size_t) -1)\r
++      it.assign (bytes, strlen(bytes));\r
++    else\r
++      it.assign (bytes, len);\r
++    char *result = talloc_array (ctx, char, u8_len + 1);\r
++    gsize u8_i = 0;\r
++    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it) {\r
++      const gsize dc_bytes = _notmuch_decompose_to_utf8 (uc, &(result[u8_i]));\r
++      u8_i += dc_bytes;\r
++    }\r
++    assert (u8_i == u8_len);\r
++    result[u8_i] = '\0';\r
++    return result;\r
++}\r
+diff --git a/test/Makefile.local b/test/Makefile.local\r
+index 2331ceb..fd6d06d 100644\r
+--- a/test/Makefile.local\r
++++ b/test/Makefile.local\r
+@@ -15,8 +15,11 @@ smtp_dummy_modules = $(smtp_dummy_srcs:.c=.o)\r
+ $(dir)/arg-test: $(dir)/arg-test.o command-line-arguments.o util/libutil.a\r
+       $(call quiet,CC) $^ -o $@ $(LDFLAGS)\r
+ \r
+-$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o util/libutil.a\r
+-      $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS)\r
++$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o lib/libnotmuch.a util/libutil.a\r
++      $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)\r
++\r
++$(dir)/to-utf8: $(dir)/to-utf8.o command-line-arguments.o lib/libnotmuch.a util/libutil.a\r
++      $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)\r
+ \r
+ random_corpus_deps =  $(dir)/random-corpus.o  $(dir)/database-test.o \\r
+                       notmuch-config.o command-line-arguments.o \\r
+@@ -46,7 +49,8 @@ test_main_srcs=$(dir)/arg-test.c \\r
+             $(dir)/parse-time.c \\r
+             $(dir)/smtp-dummy.c \\r
+             $(dir)/symbol-test.cc \\r
+-            $(dir)/make-db-version.cc \\r
++            $(dir)/to-utf8.c \\r
++            $(dir)/make-db-version.cc\r
+ \r
+ test_srcs=$(test_main_srcs) $(dir)/database-test.c\r
+ \r
+diff --git a/test/T150-tagging.sh b/test/T150-tagging.sh\r
+index 821d393..d983fe0 100755\r
+--- a/test/T150-tagging.sh\r
++++ b/test/T150-tagging.sh\r
+@@ -2,6 +2,14 @@\r
+ test_description='"notmuch tag"'\r
+ . ./test-lib.sh || exit 1\r
+ \r
++canonicalize_encoding()\r
++{\r
++  local decoded u8\r
++  decoded=$($TEST_DIRECTORY/hex-xcode --direction=decode "$1") || return 1\r
++  u8=$($TEST_DIRECTORY/to-utf8 "$decoded") || return 1\r
++  $TEST_DIRECTORY/hex-xcode --direction=encode "$u8"\r
++}\r
++\r
+ add_message '[subject]=One'\r
+ add_message '[subject]=Two'\r
+ \r
+@@ -191,23 +199,45 @@ test_expect_equal_file EXPECTED OUTPUT\r
+ test_begin_subtest '--batch: unicode tags'\r
+ notmuch dump --format=batch-tag > BACKUP\r
+ \r
++# FIXME: test canonical and non-canonical output?\r
++\r
++enctag1='%2a@%7d%cf%b5%f4%85%80%adO3%da%a7'\r
++enctag2='=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d'\r
++enctag3='A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27'\r
++enctag4='%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6'\r
++enctag5='%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d'\r
++enctag6='L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1'\r
++enctag7='P%c4%98%2f'\r
++enctag8='%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d'\r
++enctag9='%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b'\r
++\r
+ notmuch tag --batch <<EOF\r
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 -- One\r
+-+=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d -- One\r
+-+A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 -- One\r
+++$enctag1 -- One\r
+++$enctag2 -- One\r
+++$enctag3 -- One\r
+ +R -- One\r
+-+%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 -- One\r
+-+%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- One\r
+-+L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 -- One\r
+-+P%c4%98%2f -- One\r
+-+%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d -- One\r
+-+%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- One\r
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7  +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d  +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27  +R  +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6  +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d  +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1  +P%c4%98%2f  +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d  +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- Two\r
+++$enctag4 -- One\r
+++$enctag5 -- One\r
+++$enctag6 -- One\r
+++$enctag7 -- One\r
+++$enctag8 -- One\r
+++$enctag9 -- One\r
+++$enctag1  +$enctag2  +$enctag3  +R  +$enctag4  +$enctag5  +$enctag6  +$enctag7  +$enctag8  +$enctag9 -- Two\r
+ EOF\r
+ \r
++# FIXME: double-check that we need all of these, or do we want to do everything?\r
++cetag1=$(canonicalize_encoding "$enctag1") || exit 1\r
++cetag2=$(canonicalize_encoding "$enctag2") || exit 1\r
++cetag4=$(canonicalize_encoding "$enctag4") || exit 1\r
++cetag5=$(canonicalize_encoding "$enctag5") || exit 1\r
++cetag6=$(canonicalize_encoding "$enctag6") || exit 1\r
++cetag7=$(canonicalize_encoding "$enctag7") || exit 1\r
++cetag8=$(canonicalize_encoding "$enctag8") || exit 1\r
++cetag9=$(canonicalize_encoding "$enctag9") || exit 1\r
++\r
+ cat <<EOF > EXPECTED\r
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag4 +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-002@notmuch-test-suite\r
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-001@notmuch-test-suite\r
+++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag4 +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-002@notmuch-test-suite\r
+++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-001@notmuch-test-suite\r
+ EOF\r
+ \r
+ notmuch dump --format=batch-tag | sort > OUTPUT\r
+diff --git a/test/T240-dump-restore.sh b/test/T240-dump-restore.sh\r
+index e6976ff..37722fb 100755\r
+--- a/test/T240-dump-restore.sh\r
++++ b/test/T240-dump-restore.sh\r
+@@ -164,7 +164,7 @@ enc1=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag1")\r
+ tag2=$(printf 'this\n tag\t has\n spaces')\r
+ enc2=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag2")\r
+ \r
+-enc3='%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a'\r
++enc3='N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82'\r
+ tag3=$($TEST_DIRECTORY/hex-xcode --direction=decode $enc3)\r
+ \r
+ notmuch dump --format=batch-tag > BACKUP\r
+@@ -218,7 +218,7 @@ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
+ \r
+ test_begin_subtest 'format=batch-tag, checking encoded output'\r
+ notmuch dump --format=batch-tag -- from:cworth |\\r
+-       awk "{ print \"+$enc1 +$enc2 +$enc3 -- \" \$5 }" > EXPECTED.$test_count\r
++       awk "{ print \"+$enc3 +$enc1 +$enc2 -- \" \$5 }" > EXPECTED.$test_count\r
+ notmuch dump --format=batch-tag -- from:cworth  > OUTPUT.$test_count\r
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
+ \r
+diff --git a/test/T480-hex-escaping.sh b/test/T480-hex-escaping.sh\r
+index 10527b1..b9c5eac 100755\r
+--- a/test/T480-hex-escaping.sh\r
++++ b/test/T480-hex-escaping.sh\r
+@@ -19,7 +19,7 @@ $TEST_DIRECTORY/hex-xcode --direction=encode  < EXPECTED.$test_count |\\r
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
+ \r
+ test_begin_subtest "round trip 8bit chars"\r
+-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count\r
++echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count\r
+ $TEST_DIRECTORY/hex-xcode --direction=decode  < EXPECTED.$test_count |\\r
+     $TEST_DIRECTORY/hex-xcode --direction=encode > OUTPUT.$test_count\r
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
+@@ -42,7 +42,7 @@ $TEST_DIRECTORY/hex-xcode --in-place --direction=encode  < EXPECTED.$test_count\r
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
+ \r
+ test_begin_subtest "round trip 8bit chars (in-place)"\r
+-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count\r
++echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count\r
+ $TEST_DIRECTORY/hex-xcode --in-place --direction=decode  < EXPECTED.$test_count |\\r
+     $TEST_DIRECTORY/hex-xcode --in-place --direction=encode > OUTPUT.$test_count\r
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count\r
+diff --git a/test/T570-normalization.sh b/test/T570-normalization.sh\r
+new file mode 100755\r
+index 0000000..ee3fa94\r
+--- /dev/null\r
++++ b/test/T570-normalization.sh\r
+@@ -0,0 +1,28 @@\r
++#!/usr/bin/env bash\r
++\r
++test_description="text normalization"\r
++\r
++. ./test-lib.sh || exit 1\r
++\r
++combining_a='Á'\r
++noncombining_a='Á'\r
++\r
++# FIXME: these are extraneous/vestigial, remove from the final patch if still\r
++# unneeded.\r
++combining_o='ó' # should be U+006f U+0301\r
++noncombining_o='ó' # U+00f3 latin small letter o with acute\r
++# utf-8:\r
++#   combining: o b11001100 b10000001 (o 0xcc 0x81)\r
++#   non-combining: b11000011 b10110011 (0xc3 0xb3)\r
++combining_token='tóken' # should be U+006f U+0301\r
++normalized_token='tóken' # should be U+00f3\r
++\r
++test_begin_subtest "Term with combining characters"\r
++add_message '[content-type]="text/plain; charset=unknown-8bit"' \\r
++          '[subject]="reproduc$noncombining_a"' \\r
++          '[body]="reproduc$noncombining_a"'\r
++output=$(notmuch count "reproduc$combining_a" 2>&1 | notmuch_show_sanitize_all)\r
++\r
++test_expect_equal "$output" 1\r
++\r
++test_done\r
+diff --git a/test/corpus/cur/52:2, b/test/corpus/cur/52:2,\r
+index 6028340..852e2bd 100644\r
+--- a/test/corpus/cur/52:2,\r
++++ b/test/corpus/cur/52:2,\r
+@@ -12,8 +12,8 @@ Content-Type: text/plain; charset=ISO-8859-1\r
+ Content-Transfer-Encoding: 8bit\r
+ Subject: Re: [aur-general] Guidelines: cp, mkdir vs install\r
+ \r
+-Le 29/12/2011 11:13, Allan McRae a écrit :\r
+-> On 29/12/11 19:56, François Boulogne wrote:\r
++Le 29/12/2011 11:13, Allan McRae a écrit :\r
++> On 29/12/11 19:56, François Boulogne wrote:\r
+ >> Hi,\r
+ >>\r
+ >> Looking to improve the quality of my packages, I read again the guidelines.\r
+@@ -35,5 +35,5 @@ Thank you Allan\r
+ \r
+ \r
+ -- \r
+-François Boulogne.\r
++François Boulogne.\r
+ https://www.sciunto.org\r
+diff --git a/test/to-utf8.c b/test/to-utf8.c\r
+new file mode 100644\r
+index 0000000..17bf40d\r
+--- /dev/null\r
++++ b/test/to-utf8.c\r
+@@ -0,0 +1,44 @@\r
++/* to-utf8.c - convert bytes to UTF-8 as notmuch would\r
++ *\r
++ * usage:\r
++ * to-utf8 [bytes ...]\r
++ *\r
++ * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>\r
++ *\r
++ * This program is free software: you can redistribute it and/or modify\r
++ * it under the terms of the GNU General Public License as published by\r
++ * the Free Software Foundation, either version 3 of the License, or\r
++ * (at your option) any later version.\r
++ *\r
++ * This program is distributed in the hope that it will be useful,\r
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of\r
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\r
++ * GNU General Public License for more details.\r
++ *\r
++ * You should have received a copy of the GNU General Public License\r
++ * along with this program.  If not, see http://www.gnu.org/licenses/ .\r
++ *\r
++ * Author: Rob Browning <rlb@defaultvalue.org>\r
++ *\r
++ */\r
++\r
++#include "notmuch.h"\r
++\r
++#include <stdio.h>\r
++#include <stdlib.h>\r
++#include <talloc.h>\r
++\r
++int\r
++main (int argc, char **argv)\r
++{\r
++    void *ctx = talloc_new (NULL);\r
++\r
++    for (int i = 1; i < argc; i++) {\r
++      char *u8 = notmuch_bytes_to_utf8(ctx, argv[i], -1);\r
++      fputs (u8, stdout);\r
++      talloc_free (u8);\r
++    }\r
++\r
++    talloc_free (ctx);\r
++    return 0;\r
++}\r
+-- \r
+2.5.0\r
+\r
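Illustrative sketch (not part of the patch): the commit message describes treating invalid UTF-8 bytes as Latin-1, and sizing the output in one pass before filling it in another. The C below is a minimal, simplified approximation of that idea. It does not use Xapian, GLib, or talloc, and it performs no canonical decomposition. The names encode_utf8, decode_lenient, and bytes_to_utf8_lenient are hypothetical, and the validity checks are a simplification that may well differ from what Xapian's Utf8Iterator actually accepts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Encode cp as UTF-8.  Writes only when out is non-NULL and always
 * returns the byte count, so one routine can both measure and fill. */
static size_t
encode_utf8 (unsigned cp, char *out)
{
    if (cp < 0x80) {
        if (out)
            out[0] = (char) cp;
        return 1;
    }
    if (cp < 0x800) {
        if (out) {
            out[0] = (char) (0xC0 | (cp >> 6));
            out[1] = (char) (0x80 | (cp & 0x3F));
        }
        return 2;
    }
    if (cp < 0x10000) {
        if (out) {
            out[0] = (char) (0xE0 | (cp >> 12));
            out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
            out[2] = (char) (0x80 | (cp & 0x3F));
        }
        return 3;
    }
    if (out) {
        out[0] = (char) (0xF0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char) (0x80 | (cp & 0x3F));
    }
    return 4;
}

/* Decode one code point from s[0..len).  A byte that does not begin a
 * well-formed sequence is returned as its Latin-1 value (assumed
 * fallback; surrogates are not rejected in this sketch). */
static unsigned
decode_lenient (const unsigned char *s, size_t len, size_t *used)
{
    static const unsigned min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    unsigned char b = s[0];
    size_t need = 0;
    unsigned cp = 0;

    *used = 1;
    if (b < 0x80)
        return b;
    if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; }
    if (need == 0 || need >= len)
        return b;               /* stray or truncated: Latin-1 */
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return b;           /* bad continuation byte: Latin-1 */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min_cp[need] || cp > 0x10FFFF)
        return b;               /* overlong or out of range: Latin-1 */
    *used = need + 1;
    return cp;
}

/* Return a malloc'd, NUL-terminated UTF-8 copy of bytes[0..len). */
char *
bytes_to_utf8_lenient (const char *bytes, size_t len)
{
    const unsigned char *s = (const unsigned char *) bytes;
    size_t used = 1, out_len = 0, pos = 0;

    /* Pass 1: measure the output. */
    for (size_t i = 0; i < len; i += used)
        out_len += encode_utf8 (decode_lenient (s + i, len - i, &used), NULL);

    char *result = malloc (out_len + 1);
    if (result == NULL)
        return NULL;

    /* Pass 2: fill the output. */
    for (size_t i = 0; i < len; i += used)
        pos += encode_utf8 (decode_lenient (s + i, len - i, &used),
                            result + pos);
    assert (pos == out_len);
    result[pos] = '\0';
    return result;
}
```

Under these assumptions, the Latin-1 bytes "caf\xE9" and the UTF-8 bytes "caf\xC3\xA9" both come out as "caf\xC3\xA9". A decomposed "A\xCC\x81" and a precomposed "\xC3\x81" still come out as different byte strings, which is the gap the decomposition step in the patch is meant to close.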