From: Rob Browning <rlb@defaultvalue.org>
Date: Sun, 30 Aug 2015 16:21:16 +0000 (11:21 -0500)
Subject: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=a3f498012499ac7260a16be8ab4ef9c86165236e;p=notmuch-archives.git

[PATCH 1/1] Store and search for canonical Unicode text [WIP]
---

diff --git a/7c/77303c51600f536be5b341809b4adf06b0e1cb b/7c/77303c51600f536be5b341809b4adf06b0e1cb
new file mode 100644
index 000000000..29d385e74
--- /dev/null
+++ b/7c/77303c51600f536be5b341809b4adf06b0e1cb
@@ -0,0 +1,689 @@
+Return-Path: <rlb@defaultvalue.org>
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by arlo.cworth.org (Postfix) with ESMTP id 66BC46DE1512
+ for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:31 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at cworth.org
+X-Spam-Flag: NO
+X-Spam-Score: 0.134
+X-Spam-Level: 
+X-Spam-Status: No, score=0.134 tagged_above=-999 required=5 tests=[AWL=0.684, 
+ RP_MATCHES_RCVD=-0.55] autolearn=disabled
+Received: from arlo.cworth.org ([127.0.0.1])
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id fmevZxWHrujW for <notmuch@notmuchmail.org>;
+ Sun, 30 Aug 2015 09:26:28 -0700 (PDT)
+X-Greylist: delayed 309 seconds by postgrey-1.35 at arlo;
+ Sun, 30 Aug 2015 09:26:28 PDT
+Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])
+ by arlo.cworth.org (Postfix) with ESMTP id 0C8116DE14FD
+ for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:28 -0700 (PDT)
+Received: from trouble.defaultvalue.org (localhost [127.0.0.1])
+ (Authenticated sender: rlb@defaultvalue.org)
+ by defaultvalue.org (Postfix) with ESMTPSA id 93EC820235
+ for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 11:21:17 -0500 (CDT)
+Received: by trouble.defaultvalue.org (Postfix, from userid 1000)
+ id 1B1DE14E0F9; Sun, 30 Aug 2015 11:21:16 -0500 (CDT)
+From: Rob Browning <rlb@defaultvalue.org>
+To: notmuch@notmuchmail.org
+Subject: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
+Date: Sun, 30 Aug 2015 11:21:16 -0500
+Message-Id: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
+X-Mailer: git-send-email 2.5.0
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.18
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+ <notmuch.notmuchmail.org>
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
+List-Archive: <http://notmuchmail.org/pipermail/notmuch/>
+List-Post: <mailto:notmuch@notmuchmail.org>
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
+X-List-Received-Date: Sun, 30 Aug 2015 16:26:31 -0000
+
+WARNING: this version is very preliminary, and might eat your data.
+
+Unicode has multiple sequences representing what should normally be
+considered the same text.  For example here's a combining Á and a
+noncombining Á.
+
+Depending on the way you view this, you may or may not see a
+difference, but the former is the canonical form, and is represented
+by two Unicode code points: a capital A (U+0041) followed by a
+"combining acute accent" (U+0301); the latter is the single code
+point (U+00C1), which is probably what most people would type.
+
+Before this change, notmuch would index two strings that differ only
+with respect to canonicalization, like tóken and tóken, as separate
+terms, even though they may be visually indistinguishable, and do (for
+most purposes) represent the same text.  After indexing, searching for
+one would not find the other, and which one you present to notmuch
+when you search depends on your tools.  See test/T570-normalization.sh
+for a working example.
+
+Since we're talking about differing representations that one wouldn't
+normally want to distinguish, this patch unifies the various
+representations by converting all incoming text to its canonical form
+before indexing, and canonicalizing all query strings.
+
+Up to now, notmuch has let Xapian handle converting the incoming bytes
+to UTF-8.  Xapian treats any byte sequence as UTF-8, and interprets
+any invalid UTF-8 bytes as Latin-1.  This patch maintains the existing
+behavior (excepting the new canonicalization) by using Xapian's
+Utf8Iterator to handle the initial Unicode character parsing.
+
+Note that the parsing approach in this patch is not particularly
+efficient.  First, it traverses the incoming bytes three times:
+
+   - once to determine how long the input is (currently the iterator
+     can't directly handle null terminated char*'s),
+
+   - once to determine how long the final UTF-8 allocation needs to
+     be,
+
+   - and once for the conversion.
+
+Second, when the input is already UTF-8, it just blindly converts
+from UTF-8 to Unicode code points, and then back to UTF-8 (after
+canonicalization), during each pass.  There are certainly
+opportunities to optimize, though it may be worth discussing the
+detection of data encodings more broadly first.
+
+FIXME: document current encoding behavior clearly in
+new/insert/search-terms.
+
+FIXME: what about existing indexed text?
+---
+
+ Posted for preliminary discussion, and as a milestone (it appears to
+ mostly work now), though I doubt I'm handling things correctly
+ everywhere notmuch-wise, wrt talloc, etc.
+
+ lib/Makefile.local         |  1 +
+ lib/database.cc            | 17 ++++++++--
+ lib/message.cc             | 51 +++++++++++++++++++---------
+ lib/notmuch.h              |  3 ++
+ lib/query.cc               |  6 ++--
+ lib/text-util.cc           | 82 ++++++++++++++++++++++++++++++++++++++++++++++
+ test/Makefile.local        | 10 ++++--
+ test/T150-tagging.sh       | 54 +++++++++++++++++++++++-------
+ test/T240-dump-restore.sh  |  4 +--
+ test/T480-hex-escaping.sh  |  4 +--
+ test/T570-normalization.sh | 28 ++++++++++++++++
+ test/corpus/cur/52:2,      |  6 ++--
+ test/to-utf8.c             | 44 +++++++++++++++++++++++++
+ 13 files changed, 267 insertions(+), 43 deletions(-)
+ create mode 100644 lib/text-util.cc
+ create mode 100755 test/T570-normalization.sh
+ create mode 100644 test/to-utf8.c
+
+diff --git a/lib/Makefile.local b/lib/Makefile.local
+index 3a07090..41fd1e1 100644
+--- a/lib/Makefile.local
++++ b/lib/Makefile.local
+@@ -48,6 +48,7 @@ libnotmuch_cxx_srcs =		\
+ 	$(dir)/index.cc		\
+ 	$(dir)/message.cc	\
+ 	$(dir)/query.cc		\
++	$(dir)/text-util.cc	\
+ 	$(dir)/thread.cc
+ 
+ libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
+diff --git a/lib/database.cc b/lib/database.cc
+index 6a15174..7a01f95 100644
+--- a/lib/database.cc
++++ b/lib/database.cc
+@@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, unsigned doc_id)
+ char *
+ _notmuch_message_id_compressed (void *ctx, const char *message_id)
+ {
++    // Assumes message_id is normalized utf-8.
+     char *sha1, *compressed;
+ 
+     sha1 = _notmuch_sha1_of_string (message_id);
+@@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t *notmuch,
+     if (message_ret == NULL)
+ 	return NOTMUCH_STATUS_NULL_POINTER;
+ 
+-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
+-	message_id = _notmuch_message_id_compressed (notmuch, message_id);
++    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
++
++    // Is strlen still appropriate?
++    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX)
++    {
++	message_id = _notmuch_message_id_compressed (notmuch, u8_id);
++	talloc_free ((char *) u8_id);
++    } else
++	message_id = u8_id;
+ 
+     try {
+ 	status = _notmuch_database_find_unique_doc_id (notmuch, "id",
+ 						       message_id, &doc_id);
++	talloc_free ((char *) message_id);
+ 
+ 	if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND)
+ 	    *message_ret = NULL;
+@@ -1910,6 +1919,7 @@ _notmuch_database_generate_thread_id (notmuch_database_t *notmuch)
+ static char *
+ _get_metadata_thread_id_key (void *ctx, const char *message_id)
+ {
++    // Assumes message_id is normalized utf-8.
+     if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
+ 	message_id = _notmuch_message_id_compressed (ctx, message_id);
+ 
+@@ -2011,7 +2021,8 @@ _resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,
+      * generate a new thread ID and store it there.
+      */
+     db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
+-    metadata_key = _get_metadata_thread_id_key (ctx, message_id);
++    const char *mid = notmuch_message_get_message_id (message);
+    metadata_key = _get_metadata_thread_id_key (ctx, mid);
+     thread_id_string = notmuch->xapian_db->get_metadata (metadata_key);
+ 
+     if (thread_id_string.empty()) {
+diff --git a/lib/message.cc b/lib/message.cc
+index 1ddce3c..afd0264 100644
+--- a/lib/message.cc
++++ b/lib/message.cc
+@@ -225,20 +225,28 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
+     unsigned int doc_id;
+     char *term;
+ 
+-    *status_ret = (notmuch_private_status_t) notmuch_database_find_message (notmuch,
+-									    message_id,
+-									    &message);
+-    if (message)
++    const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
++    *status_ret =
++	(notmuch_private_status_t) notmuch_database_find_message (notmuch,
++								  u8_id,
++								  &message);
++    if (message) {
++	talloc_free ((char *) u8_id);
+ 	return talloc_steal (notmuch, message);
+-    else if (*status_ret)
++    } else if (*status_ret) {
++	talloc_free ((char *) u8_id);
+ 	return NULL;
++    }
+ 
+     /* If the message ID is too long, substitute its sha1 instead. */
+-    if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
+-	message_id = _notmuch_message_id_compressed (message, message_id);
+-
+-    term = talloc_asprintf (NULL, "%s%s",
+-			    _find_prefix ("id"), message_id);
++    // Strlen still OK?
++    if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX) {
++	message_id = _notmuch_message_id_compressed (message, u8_id);
++	talloc_free ((char *) u8_id);
++    } else
++	message_id = u8_id;
++
++    term = talloc_asprintf (NULL, "%s%s", _find_prefix ("id"), message_id);
+     if (term == NULL) {
+ 	*status_ret = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY;
+ 	return NULL;
+@@ -252,6 +260,7 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
+ 	talloc_free (term);
+ 
+ 	doc.add_value (NOTMUCH_VALUE_MESSAGE_ID, message_id);
++	talloc_free ((char *) message_id);
+ 
+ 	doc_id = _notmuch_database_generate_doc_id (notmuch);
+     } catch (const Xapian::Error &error) {
+@@ -1109,13 +1118,14 @@ _notmuch_message_gen_terms (notmuch_message_t *message,
+     if (text == NULL)
+ 	return NOTMUCH_PRIVATE_STATUS_NULL_POINTER;
+ 
+    const char *u8_text = notmuch_bytes_to_utf8 (NULL, text, -1);
+     term_gen->set_document (message->doc);
+ 
+     if (prefix_name) {
+ 	const char *prefix = _find_prefix (prefix_name);
+ 
+ 	term_gen->set_termpos (message->termpos);
+-	term_gen->index_text (text, 1, prefix);
++	term_gen->index_text (u8_text, 1, prefix);
+ 	/* Create a gap between this an the next terms so they don't
+ 	 * appear to be a phrase. */
+ 	message->termpos = term_gen->get_termpos () + 100;
+@@ -1124,10 +1134,11 @@ _notmuch_message_gen_terms (notmuch_message_t *message,
+     }
+ 
+     term_gen->set_termpos (message->termpos);
+-    term_gen->index_text (text);
++    term_gen->index_text (u8_text);
+     /* Create a term gap, as above. */
+     message->termpos = term_gen->get_termpos () + 100;
+ 
++    talloc_free ((char *) u8_text);
+     return NOTMUCH_PRIVATE_STATUS_SUCCESS;
+ }
+ 
+@@ -1184,10 +1195,14 @@ notmuch_message_add_tag (notmuch_message_t *message, const char *tag)
+     if (tag == NULL)
+ 	return NOTMUCH_STATUS_NULL_POINTER;
+ 
+-    if (strlen (tag) > NOTMUCH_TAG_MAX)
++    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);
++    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {
++	talloc_free ((char *) u8_tag);
+ 	return NOTMUCH_STATUS_TAG_TOO_LONG;
++    }
+ 
+-    private_status = _notmuch_message_add_term (message, "tag", tag);
++    private_status = _notmuch_message_add_term (message, "tag", u8_tag);
++    talloc_free ((char *) u8_tag);
+     if (private_status) {
+ 	INTERNAL_ERROR ("_notmuch_message_add_term return unexpected value: %d\n",
+ 			private_status);
+@@ -1212,10 +1227,14 @@ notmuch_message_remove_tag (notmuch_message_t *message, const char *tag)
+     if (tag == NULL)
+ 	return NOTMUCH_STATUS_NULL_POINTER;
+ 
+-    if (strlen (tag) > NOTMUCH_TAG_MAX)
++    const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);
++    if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {
++	talloc_free ((char *) u8_tag);
+ 	return NOTMUCH_STATUS_TAG_TOO_LONG;
++    }
+ 
+-    private_status = _notmuch_message_remove_term (message, "tag", tag);
++    private_status = _notmuch_message_remove_term (message, "tag", u8_tag);
++    talloc_free ((char *) u8_tag);
+     if (private_status) {
+ 	INTERNAL_ERROR ("_notmuch_message_remove_term return unexpected value: %d\n",
+ 			private_status);
+diff --git a/lib/notmuch.h b/lib/notmuch.h
+index b1f5bfa..6e13eb1 100644
+--- a/lib/notmuch.h
++++ b/lib/notmuch.h
+@@ -1759,6 +1759,9 @@ notmuch_filenames_move_to_next (notmuch_filenames_t *filenames);
+ void
+ notmuch_filenames_destroy (notmuch_filenames_t *filenames);
+ 
++char *
++notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len);
++
+ /* @} */
+ 
+ NOTMUCH_END_DECLS
+diff --git a/lib/query.cc b/lib/query.cc
+index 5275b5a..e48f06a 100644
+--- a/lib/query.cc
++++ b/lib/query.cc
+@@ -86,7 +86,7 @@ notmuch_query_create (notmuch_database_t *notmuch,
+ 
+     query->notmuch = notmuch;
+ 
+-    query->query_string = talloc_strdup (query, query_string);
++    query->query_string = notmuch_bytes_to_utf8 (query, query_string, -1);
+ 
+     query->sort = NOTMUCH_SORT_NEWEST_FIRST;
+ 
+@@ -125,7 +125,9 @@ notmuch_query_get_sort (notmuch_query_t *query)
+ void
+ notmuch_query_add_tag_exclude (notmuch_query_t *query, const char *tag)
+ {
+-    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), tag);
++    const char *u8_tag = notmuch_bytes_to_utf8 (query, tag, -1);
++    char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), u8_tag);
++    talloc_free ((char *) u8_tag);
+     _notmuch_string_list_append (query->exclude_terms, term);
+ }
+ 
+diff --git a/lib/text-util.cc b/lib/text-util.cc
+new file mode 100644
+index 0000000..9dfd31f
+--- /dev/null
++++ b/lib/text-util.cc
+@@ -0,0 +1,82 @@
++/* text-util.cc - notmuch text processing utility functions
++ *
++ * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>
++ *
++ * This program is free software: you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License as published by
++ * the Free Software Foundation, either version 3 of the License, or
++ * (at your option) any later version.
++ *
++ * This program is distributed in the hope that it will be useful,
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
++ * GNU General Public License for more details.
++ *
++ * You should have received a copy of the GNU General Public License
++ * along with this program.  If not, see http://www.gnu.org/licenses/ .
++ *
++ * Author: Rob Browning <rlb@defaultvalue.org>
++ *
++ */
++
++#include "notmuch.h"
++
++#include <assert.h>
++#include <glib.h>
++#include <string.h>
++#include <talloc.h>
++#include <xapian.h>
++
++static gsize
++_notmuch_decompose_to_utf8 (const gunichar uc, gchar *out)
++{
++    gunichar dc[G_UNICHAR_MAX_DECOMPOSITION_LENGTH];
++    // This currently performs canonical decomposition.
++    const gsize dcn =
++	g_unichar_fully_decompose (uc, FALSE, dc,
++				   G_UNICHAR_MAX_DECOMPOSITION_LENGTH);
++    gsize utf8_len = 0;
++    for (gsize i = 0; i < dcn; i++)
++    {
++	const gint dc_bytes = g_unichar_to_utf8 (dc[i], out);
++	utf8_len += dc_bytes;
++	if (out != NULL)
++	    out += dc_bytes;
++    }
++    return utf8_len;
++}
++
++/* Convert a sequence of bytes to UTF-8, handling input encodings as
++ * Xapian does, but produce the canonical encoding.
++ */
++char *
+notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len)
++{
++    // FIXME: try/catch to convert to error status messages?  Can the
++    // iterator throw?
++    Xapian::Utf8Iterator it;
++    gsize u8_len = 0;
++
++    // Compute the utf-8 length
++    if (len == (size_t) -1)
++	it.assign (bytes, strlen(bytes));
++    else
++	it.assign (bytes, len);
++    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it)
++	u8_len += _notmuch_decompose_to_utf8 (uc, NULL);
++
++    // Convert to utf-8
++    if (len == (size_t) -1)
++	it.assign (bytes, strlen(bytes));
++    else
++	it.assign (bytes, len);
++    char *result = talloc_array (ctx, char, u8_len + 1);
++    gsize u8_i = 0;
++    for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it) {
++	const gsize dc_bytes = _notmuch_decompose_to_utf8 (uc, &(result[u8_i]));
++	u8_i += dc_bytes;
++    }
++    assert (u8_i == u8_len);
++    result[u8_i] = '\0';
++    return result;
++}
+diff --git a/test/Makefile.local b/test/Makefile.local
+index 2331ceb..fd6d06d 100644
+--- a/test/Makefile.local
++++ b/test/Makefile.local
+@@ -15,8 +15,11 @@ smtp_dummy_modules = $(smtp_dummy_srcs:.c=.o)
+ $(dir)/arg-test: $(dir)/arg-test.o command-line-arguments.o util/libutil.a
+ 	$(call quiet,CC) $^ -o $@ $(LDFLAGS)
+ 
+-$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o util/libutil.a
+-	$(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS)
++$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o lib/libnotmuch.a util/libutil.a
++	$(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)
++
++$(dir)/to-utf8: $(dir)/to-utf8.o command-line-arguments.o lib/libnotmuch.a util/libutil.a
++	$(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)
+ 
+ random_corpus_deps =  $(dir)/random-corpus.o  $(dir)/database-test.o \
+ 			notmuch-config.o command-line-arguments.o \
+@@ -46,7 +49,8 @@ test_main_srcs=$(dir)/arg-test.c \
+ 	      $(dir)/parse-time.c \
+ 	      $(dir)/smtp-dummy.c \
+ 	      $(dir)/symbol-test.cc \
+-	      $(dir)/make-db-version.cc \
++	      $(dir)/to-utf8.c \
++	      $(dir)/make-db-version.cc
+ 
+ test_srcs=$(test_main_srcs) $(dir)/database-test.c
+ 
+diff --git a/test/T150-tagging.sh b/test/T150-tagging.sh
+index 821d393..d983fe0 100755
+--- a/test/T150-tagging.sh
++++ b/test/T150-tagging.sh
+@@ -2,6 +2,14 @@
+ test_description='"notmuch tag"'
+ . ./test-lib.sh || exit 1
+ 
++canonicalize_encoding()
++{
++  local decoded u8
++  decoded=$($TEST_DIRECTORY/hex-xcode --direction=decode "$1") || return 1
++  u8=$($TEST_DIRECTORY/to-utf8 "$decoded") || return 1
++  $TEST_DIRECTORY/hex-xcode --direction=encode "$u8"
++}
++
+ add_message '[subject]=One'
+ add_message '[subject]=Two'
+ 
+@@ -191,23 +199,45 @@ test_expect_equal_file EXPECTED OUTPUT
+ test_begin_subtest '--batch: unicode tags'
+ notmuch dump --format=batch-tag > BACKUP
+ 
++# FIXME: test canonical and non-canonical output?
++
++enctag1='%2a@%7d%cf%b5%f4%85%80%adO3%da%a7'
++enctag2='=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d'
++enctag3='A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27'
++enctag4='%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6'
++enctag5='%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d'
++enctag6='L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1'
++enctag7='P%c4%98%2f'
++enctag8='%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d'
++enctag9='%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b'
++
+ notmuch tag --batch <<EOF
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 -- One
+-+=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d -- One
+-+A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 -- One
+++$enctag1 -- One
+++$enctag2 -- One
+++$enctag3 -- One
+ +R -- One
+-+%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 -- One
+-+%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- One
+-+L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 -- One
+-+P%c4%98%2f -- One
+-+%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d -- One
+-+%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- One
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7  +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d  +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27  +R  +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6  +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d  +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1  +P%c4%98%2f  +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d  +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- Two
+++$enctag4 -- One
+++$enctag5 -- One
+++$enctag6 -- One
+++$enctag7 -- One
+++$enctag8 -- One
+++$enctag9 -- One
+++$enctag1  +$enctag2  +$enctag3  +R  +$enctag4  +$enctag5  +$enctag6  +$enctag7  +$enctag8  +$enctag9 -- Two
+ EOF
+ 
++# FIXME: double-check that we need all of these, or do we want to do everything?
++cetag1=$(canonicalize_encoding "$enctag1") || exit 1
++cetag2=$(canonicalize_encoding "$enctag2") || exit 1
++cetag4=$(canonicalize_encoding "$enctag4") || exit 1
++cetag5=$(canonicalize_encoding "$enctag5") || exit 1
++cetag6=$(canonicalize_encoding "$enctag6") || exit 1
++cetag7=$(canonicalize_encoding "$enctag7") || exit 1
++cetag8=$(canonicalize_encoding "$enctag8") || exit 1
++cetag9=$(canonicalize_encoding "$enctag9") || exit 1
++
+ cat <<EOF > EXPECTED
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag4 +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-002@notmuch-test-suite
+-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-001@notmuch-test-suite
+++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag4 +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-002@notmuch-test-suite
+++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-001@notmuch-test-suite
+ EOF
+ 
+ notmuch dump --format=batch-tag | sort > OUTPUT
+diff --git a/test/T240-dump-restore.sh b/test/T240-dump-restore.sh
+index e6976ff..37722fb 100755
+--- a/test/T240-dump-restore.sh
++++ b/test/T240-dump-restore.sh
+@@ -164,7 +164,7 @@ enc1=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag1")
+ tag2=$(printf 'this\n tag\t has\n spaces')
+ enc2=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag2")
+ 
+-enc3='%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a'
++enc3='N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82'
+ tag3=$($TEST_DIRECTORY/hex-xcode --direction=decode $enc3)
+ 
+ notmuch dump --format=batch-tag > BACKUP
+@@ -218,7 +218,7 @@ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
+ 
+ test_begin_subtest 'format=batch-tag, checking encoded output'
+ notmuch dump --format=batch-tag -- from:cworth |\
+-	 awk "{ print \"+$enc1 +$enc2 +$enc3 -- \" \$5 }" > EXPECTED.$test_count
++	 awk "{ print \"+$enc3 +$enc1 +$enc2 -- \" \$5 }" > EXPECTED.$test_count
+ notmuch dump --format=batch-tag -- from:cworth  > OUTPUT.$test_count
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
+ 
+diff --git a/test/T480-hex-escaping.sh b/test/T480-hex-escaping.sh
+index 10527b1..b9c5eac 100755
+--- a/test/T480-hex-escaping.sh
++++ b/test/T480-hex-escaping.sh
+@@ -19,7 +19,7 @@ $TEST_DIRECTORY/hex-xcode --direction=encode  < EXPECTED.$test_count |\
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
+ 
+ test_begin_subtest "round trip 8bit chars"
+-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count
++echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count
+ $TEST_DIRECTORY/hex-xcode --direction=decode  < EXPECTED.$test_count |\
+     $TEST_DIRECTORY/hex-xcode --direction=encode > OUTPUT.$test_count
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
+@@ -42,7 +42,7 @@ $TEST_DIRECTORY/hex-xcode --in-place --direction=encode  < EXPECTED.$test_count
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
+ 
+ test_begin_subtest "round trip 8bit chars (in-place)"
+-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count
++echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count
+ $TEST_DIRECTORY/hex-xcode --in-place --direction=decode  < EXPECTED.$test_count |\
+     $TEST_DIRECTORY/hex-xcode --in-place --direction=encode > OUTPUT.$test_count
+ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
+diff --git a/test/T570-normalization.sh b/test/T570-normalization.sh
+new file mode 100755
+index 0000000..ee3fa94
+--- /dev/null
++++ b/test/T570-normalization.sh
+@@ -0,0 +1,28 @@
++#!/usr/bin/env bash
++
++test_description="text normalization"
++
++. ./test-lib.sh || exit 1
++
+combining_a='Á'
+noncombining_a='Á'
++
++# FIXME: these are extraneous/vestigial, remove from the final patch if still
++# unneeded.
+combining_o='ó' # should be U+006f U+0301
+noncombining_o='ó' # U+00f3 latin small letter o with acute
++# utf-8:
++#   combining: o b11001100 b10000001 (o 0xcc 0x81)
++#   non-combining: b11000011 b10110011 (0xc3 0xb3)
+combining_token='tóken' # should be U+006f U+0301
+normalized_token='tóken' # should be U+00f3
++
++test_begin_subtest "Term with combining characters"
++add_message '[content-type]="text/plain; charset=unknown-8bit"' \
++	    '[subject]="reproduc$noncombining_a"' \
++	    '[body]="reproduc$noncombining_a"'
++output=$(notmuch count "reproduc$combining_a" 2>&1 | notmuch_show_sanitize_all)
++
++test_expect_equal "$output" 1
++
++test_done
+diff --git a/test/corpus/cur/52:2, b/test/corpus/cur/52:2,
+index 6028340..852e2bd 100644
+--- a/test/corpus/cur/52:2,
++++ b/test/corpus/cur/52:2,
+@@ -12,8 +12,8 @@ Content-Type: text/plain; charset=ISO-8859-1
+ Content-Transfer-Encoding: 8bit
+ Subject: Re: [aur-general] Guidelines: cp, mkdir vs install
+ 
+-Le 29/12/2011 11:13, Allan McRae a écrit :
+-> On 29/12/11 19:56, François Boulogne wrote:
+Le 29/12/2011 11:13, Allan McRae a écrit :
+> On 29/12/11 19:56, François Boulogne wrote:
+ >> Hi,
+ >>
+ >> Looking to improve the quality of my packages, I read again the guidelines.
+@@ -35,5 +35,5 @@ Thank you Allan
+ 
+ 
+ -- 
+-François Boulogne.
+François Boulogne.
+ https://www.sciunto.org
+diff --git a/test/to-utf8.c b/test/to-utf8.c
+new file mode 100644
+index 0000000..17bf40d
+--- /dev/null
++++ b/test/to-utf8.c
+@@ -0,0 +1,44 @@
+/* to-utf8.c - convert bytes to UTF-8 as notmuch would
++ *
++ * usage:
++ * to-utf8 [bytes ...]
++ *
++ * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>
++ *
++ * This program is free software: you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License as published by
++ * the Free Software Foundation, either version 3 of the License, or
++ * (at your option) any later version.
++ *
++ * This program is distributed in the hope that it will be useful,
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
++ * GNU General Public License for more details.
++ *
++ * You should have received a copy of the GNU General Public License
++ * along with this program.  If not, see http://www.gnu.org/licenses/ .
++ *
++ * Author: Rob Browning <rlb@defaultvalue.org>
++ *
++ */
++
++#include "notmuch.h"
++
++#include <stdio.h>
++#include <stdlib.h>
++#include <talloc.h>
++
++int
++main (int argc, char **argv)
++{
++    void *ctx = talloc_new (NULL);
++
++    for (int i = 1; i < argc; i++) {
++	char *u8 = notmuch_bytes_to_utf8(ctx, argv[i], -1);
++	fputs (u8, stdout);
++	talloc_free (u8);
++    }
++
++    talloc_free (ctx);
++    return 0;
++}
+-- 
+2.5.0
+
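The commit message's point that a combining Á and a precomposed Á are distinct byte sequences can be checked directly in a shell.  This is a minimal sketch and not part of the patch: the variable names and the hex_bytes helper are hypothetical, and octal escapes spell out the UTF-8 bytes (0x41 0xcc 0x81 versus 0xc3 0x81) so the result does not depend on how the text is displayed.

```shell
#!/bin/sh
# Two encodings of what most readers would call the same text.
combining=$(printf 'A\314\201')     # U+0041 U+0301 -> 0x41 0xcc 0x81
precomposed=$(printf '\303\201')    # U+00C1        -> 0xc3 0x81

# Hypothetical helper: render a string as a run of hex byte values.
hex_bytes () {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

combining_hex=$(hex_bytes "$combining")
precomposed_hex=$(hex_bytes "$precomposed")

echo "combining:   $combining_hex"
echo "precomposed: $precomposed_hex"

# A plain string comparison works on bytes, so the two forms differ
# even though they may render identically.
if [ "$combining" = "$precomposed" ]; then
    bytes_equal=yes
else
    bytes_equal=no
fi
echo "byte-wise equal: $bytes_equal"
```

An index keyed on raw UTF-8 terms compares bytes in the same way, so the two spellings end up as separate terms unless both the indexed text and the query are normalized first, which is the behavior the patch sets out to change.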