From a3f498012499ac7260a16be8ab4ef9c86165236e Mon Sep 17 00:00:00 2001 From: Rob Browning Date: Mon, 31 Aug 2015 11:21:16 +1900 Subject: [PATCH] [PATCH 1/1] Store and search for canonical Unicode text [WIP] --- 7c/77303c51600f536be5b341809b4adf06b0e1cb | 689 ++++++++++++++++++++++ 1 file changed, 689 insertions(+) create mode 100644 7c/77303c51600f536be5b341809b4adf06b0e1cb diff --git a/7c/77303c51600f536be5b341809b4adf06b0e1cb b/7c/77303c51600f536be5b341809b4adf06b0e1cb new file mode 100644 index 000000000..29d385e74 --- /dev/null +++ b/7c/77303c51600f536be5b341809b4adf06b0e1cb @@ -0,0 +1,689 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by arlo.cworth.org (Postfix) with ESMTP id 66BC46DE1512 + for ; Sun, 30 Aug 2015 09:26:31 -0700 (PDT) +X-Virus-Scanned: Debian amavisd-new at cworth.org +X-Spam-Flag: NO +X-Spam-Score: 0.134 +X-Spam-Level: +X-Spam-Status: No, score=0.134 tagged_above=-999 required=5 tests=[AWL=0.684, + RP_MATCHES_RCVD=-0.55] autolearn=disabled +Received: from arlo.cworth.org ([127.0.0.1]) + by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id fmevZxWHrujW for ; + Sun, 30 Aug 2015 09:26:28 -0700 (PDT) +X-Greylist: delayed 309 seconds by postgrey-1.35 at arlo; + Sun, 30 Aug 2015 09:26:28 PDT +Received: from defaultvalue.org (defaultvalue.org [70.85.129.156]) + by arlo.cworth.org (Postfix) with ESMTP id 0C8116DE14FD + for ; Sun, 30 Aug 2015 09:26:28 -0700 (PDT) +Received: from trouble.defaultvalue.org (localhost [127.0.0.1]) + (Authenticated sender: rlb@defaultvalue.org) + by defaultvalue.org (Postfix) with ESMTPSA id 93EC820235 + for ; Sun, 30 Aug 2015 11:21:17 -0500 (CDT) +Received: by trouble.defaultvalue.org (Postfix, from userid 1000) + id 1B1DE14E0F9; Sun, 30 Aug 2015 11:21:16 -0500 (CDT) +From: Rob Browning +To: notmuch@notmuchmail.org +Subject: [PATCH 1/1] Store and search for canonical Unicode text [WIP] 
+Date: Sun, 30 Aug 2015 11:21:16 -0500 +Message-Id: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org> +X-Mailer: git-send-email 2.5.0 +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.18 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Sun, 30 Aug 2015 16:26:31 -0000 + +WARNING: this version is very preliminary, and might eat your data. + +Unicode has multiple sequences representing what should normally be +considered the same text. For example here's a combining Á and a +noncombining Á. + +Depending on the way you view this, you may or may not see a +difference, but the former is the canonical form, and is represented +by two Unicode code points: a capital A (U+0041) followed by a +"combining acute accent" (U+0301); the latter is the single code +point (U+00C1), which is probably what most people would type. + +Before this change, notmuch would index two strings that differ only +with respect to canonicalization, like tóken and tóken, as separate +terms, even though they may be visually indistinguishable, and do (for +most purposes) represent the same text. After indexing, searching for +one would not find the other, and which one you present to notmuch +when you search depends on your tools. See test/T570-normalization.sh +for a working example. + +Since we're talking about differing representations that one wouldn't +normally want to distinguish, this patch unifies the various +representations by converting all incoming text to its canonical form +before indexing, and canonicalizing all query strings. + +Up to now, notmuch has let Xapian handle converting the incoming bytes +to UTF-8. Xapian treats any byte sequence as UTF-8, and interprets +any invalid UTF-8 bytes as Latin-1. 
This patch maintains the existing
+behavior (excepting the new canonicalization) by using Xapian's
+Utf8Iterator to handle the initial Unicode character parsing.
+
+The parsing approach in this patch is not particularly efficient, for
+two reasons. First, it traverses the incoming bytes three times:
+
+  - once to determine how long the input is (currently the iterator
+    can't directly handle NUL-terminated char* strings),
+
+  - once to determine how long the final UTF-8 allocation needs to
+    be,
+
+  - and once for the conversion.
+
+Second, when the input is already UTF-8, it just blindly converts
+from UTF-8 to Unicode code points, and then back to UTF-8 (after
+canonicalization), during each pass.  There are certainly
+opportunities to optimize, though it may be worth discussing the
+detection of data encodings more broadly first.
+
+FIXME: document current encoding behavior clearly in
+new/insert/search-terms.
+
+FIXME: what about existing indexed text?
+---
+
+ Posted for preliminary discussion, and as a milestone (it appears to
+ mostly work now), though I doubt I'm handling things correctly
+ everywhere notmuch-wise, wrt talloc, etc.
+ + lib/Makefile.local | 1 + + lib/database.cc | 17 ++++++++-- + lib/message.cc | 51 +++++++++++++++++++--------- + lib/notmuch.h | 3 ++ + lib/query.cc | 6 ++-- + lib/text-util.cc | 82 ++++++++++++++++++++++++++++++++++++++++++++++ + test/Makefile.local | 10 ++++-- + test/T150-tagging.sh | 54 +++++++++++++++++++++++------- + test/T240-dump-restore.sh | 4 +-- + test/T480-hex-escaping.sh | 4 +-- + test/T570-normalization.sh | 28 ++++++++++++++++ + test/corpus/cur/52:2, | 6 ++-- + test/to-utf8.c | 44 +++++++++++++++++++++++++ + 13 files changed, 267 insertions(+), 43 deletions(-) + create mode 100644 lib/text-util.cc + create mode 100755 test/T570-normalization.sh + create mode 100644 test/to-utf8.c + +diff --git a/lib/Makefile.local b/lib/Makefile.local +index 3a07090..41fd1e1 100644 +--- a/lib/Makefile.local ++++ b/lib/Makefile.local +@@ -48,6 +48,7 @@ libnotmuch_cxx_srcs = \ + $(dir)/index.cc \ + $(dir)/message.cc \ + $(dir)/query.cc \ ++ $(dir)/text-util.cc \ + $(dir)/thread.cc + + libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o) +diff --git a/lib/database.cc b/lib/database.cc +index 6a15174..7a01f95 100644 +--- a/lib/database.cc ++++ b/lib/database.cc +@@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, unsigned doc_id) + char * + _notmuch_message_id_compressed (void *ctx, const char *message_id) + { ++ // Assumes message_id is normalized utf-8. + char *sha1, *compressed; + + sha1 = _notmuch_sha1_of_string (message_id); +@@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t *notmuch, + if (message_ret == NULL) + return NOTMUCH_STATUS_NULL_POINTER; + +- if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX) +- message_id = _notmuch_message_id_compressed (notmuch, message_id); ++ const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1); ++ ++ // Is strlen still appropriate? 
++ if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX) ++ { ++ message_id = _notmuch_message_id_compressed (notmuch, u8_id); ++ talloc_free ((char *) u8_id); ++ } else ++ message_id = u8_id; + + try { + status = _notmuch_database_find_unique_doc_id (notmuch, "id", + message_id, &doc_id); ++ talloc_free ((char *) message_id); + + if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND) + *message_ret = NULL; +@@ -1910,6 +1919,7 @@ _notmuch_database_generate_thread_id (notmuch_database_t *notmuch) + static char * + _get_metadata_thread_id_key (void *ctx, const char *message_id) + { ++ // Assumes message_id is normalized utf-8. + if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX) + message_id = _notmuch_message_id_compressed (ctx, message_id); + +@@ -2011,7 +2021,8 @@ _resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch, + * generate a new thread ID and store it there. + */ + db = static_cast (notmuch->xapian_db); +- metadata_key = _get_metadata_thread_id_key (ctx, message_id); ++ const char *mid = notmuch_message_get_message_id (message); ++ metadata_key =_get_metadata_thread_id_key (ctx, mid); + thread_id_string = notmuch->xapian_db->get_metadata (metadata_key); + + if (thread_id_string.empty()) { +diff --git a/lib/message.cc b/lib/message.cc +index 1ddce3c..afd0264 100644 +--- a/lib/message.cc ++++ b/lib/message.cc +@@ -225,20 +225,28 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch, + unsigned int doc_id; + char *term; + +- *status_ret = (notmuch_private_status_t) notmuch_database_find_message (notmuch, +- message_id, +- &message); +- if (message) ++ const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1); ++ *status_ret = ++ (notmuch_private_status_t) notmuch_database_find_message (notmuch, ++ u8_id, ++ &message); ++ if (message) { ++ talloc_free ((char *) u8_id); + return talloc_steal (notmuch, message); +- else if (*status_ret) ++ } else if (*status_ret) { ++ talloc_free ((char *) u8_id); + return NULL; ++ } + + /* If the 
message ID is too long, substitute its sha1 instead. */ +- if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX) +- message_id = _notmuch_message_id_compressed (message, message_id); +- +- term = talloc_asprintf (NULL, "%s%s", +- _find_prefix ("id"), message_id); ++ // Strlen still OK? ++ if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX) { ++ message_id = _notmuch_message_id_compressed (message, u8_id); ++ talloc_free ((char *) u8_id); ++ } else ++ message_id = u8_id; ++ ++ term = talloc_asprintf (NULL, "%s%s", _find_prefix ("id"), message_id); + if (term == NULL) { + *status_ret = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY; + return NULL; +@@ -252,6 +260,7 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch, + talloc_free (term); + + doc.add_value (NOTMUCH_VALUE_MESSAGE_ID, message_id); ++ talloc_free ((char *) message_id); + + doc_id = _notmuch_database_generate_doc_id (notmuch); + } catch (const Xapian::Error &error) { +@@ -1109,13 +1118,14 @@ _notmuch_message_gen_terms (notmuch_message_t *message, + if (text == NULL) + return NOTMUCH_PRIVATE_STATUS_NULL_POINTER; + ++ const char *u8_text = notmuch_bytes_to_utf8(NULL, text, -1); + term_gen->set_document (message->doc); + + if (prefix_name) { + const char *prefix = _find_prefix (prefix_name); + + term_gen->set_termpos (message->termpos); +- term_gen->index_text (text, 1, prefix); ++ term_gen->index_text (u8_text, 1, prefix); + /* Create a gap between this an the next terms so they don't + * appear to be a phrase. */ + message->termpos = term_gen->get_termpos () + 100; +@@ -1124,10 +1134,11 @@ _notmuch_message_gen_terms (notmuch_message_t *message, + } + + term_gen->set_termpos (message->termpos); +- term_gen->index_text (text); ++ term_gen->index_text (u8_text); + /* Create a term gap, as above. 
*/ + message->termpos = term_gen->get_termpos () + 100; + ++ talloc_free ((char *) u8_text); + return NOTMUCH_PRIVATE_STATUS_SUCCESS; + } + +@@ -1184,10 +1195,14 @@ notmuch_message_add_tag (notmuch_message_t *message, const char *tag) + if (tag == NULL) + return NOTMUCH_STATUS_NULL_POINTER; + +- if (strlen (tag) > NOTMUCH_TAG_MAX) ++ const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1); ++ if (strlen (u8_tag) > NOTMUCH_TAG_MAX) { ++ talloc_free ((char *) u8_tag); + return NOTMUCH_STATUS_TAG_TOO_LONG; ++ } + +- private_status = _notmuch_message_add_term (message, "tag", tag); ++ private_status = _notmuch_message_add_term (message, "tag", u8_tag); ++ talloc_free ((char *) u8_tag); + if (private_status) { + INTERNAL_ERROR ("_notmuch_message_add_term return unexpected value: %d\n", + private_status); +@@ -1212,10 +1227,14 @@ notmuch_message_remove_tag (notmuch_message_t *message, const char *tag) + if (tag == NULL) + return NOTMUCH_STATUS_NULL_POINTER; + +- if (strlen (tag) > NOTMUCH_TAG_MAX) ++ const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1); ++ if (strlen (u8_tag) > NOTMUCH_TAG_MAX) { ++ talloc_free ((char *) u8_tag); + return NOTMUCH_STATUS_TAG_TOO_LONG; ++ } + +- private_status = _notmuch_message_remove_term (message, "tag", tag); ++ private_status = _notmuch_message_remove_term (message, "tag", u8_tag); ++ talloc_free ((char *) u8_tag); + if (private_status) { + INTERNAL_ERROR ("_notmuch_message_remove_term return unexpected value: %d\n", + private_status); +diff --git a/lib/notmuch.h b/lib/notmuch.h +index b1f5bfa..6e13eb1 100644 +--- a/lib/notmuch.h ++++ b/lib/notmuch.h +@@ -1759,6 +1759,9 @@ notmuch_filenames_move_to_next (notmuch_filenames_t *filenames); + void + notmuch_filenames_destroy (notmuch_filenames_t *filenames); + ++char * ++notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len); ++ + /* @} */ + + NOTMUCH_END_DECLS +diff --git a/lib/query.cc b/lib/query.cc +index 5275b5a..e48f06a 100644 +--- a/lib/query.cc 
++++ b/lib/query.cc +@@ -86,7 +86,7 @@ notmuch_query_create (notmuch_database_t *notmuch, + + query->notmuch = notmuch; + +- query->query_string = talloc_strdup (query, query_string); ++ query->query_string = notmuch_bytes_to_utf8 (query, query_string, -1); + + query->sort = NOTMUCH_SORT_NEWEST_FIRST; + +@@ -125,7 +125,9 @@ notmuch_query_get_sort (notmuch_query_t *query) + void + notmuch_query_add_tag_exclude (notmuch_query_t *query, const char *tag) + { +- char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), tag); ++ const char *u8_tag = notmuch_bytes_to_utf8 (query, tag, -1); ++ char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), u8_tag); ++ talloc_free ((char *) u8_tag); + _notmuch_string_list_append (query->exclude_terms, term); + } + +diff --git a/lib/text-util.cc b/lib/text-util.cc +new file mode 100644 +index 0000000..9dfd31f +--- /dev/null ++++ b/lib/text-util.cc +@@ -0,0 +1,82 @@ ++/* text-util.cc - notmuch text processing utility functions ++ * ++ * Copyright (C) 2015 Rob Browning ++ * ++ * This program is free software: you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License as published by ++ * the Free Software Foundation, either version 3 of the License, or ++ * (at your option) any later version. ++ * ++ * This program is distributed in the hope that it will be useful, ++ * but WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ * GNU General Public License for more details. ++ * ++ * You should have received a copy of the GNU General Public License ++ * along with this program. If not, see http://www.gnu.org/licenses/ . 
++ * ++ * Author: Rob Browning ++ * ++ */ ++ ++#include "notmuch.h" ++ ++#include ++#include ++#include ++#include ++#include ++ ++static gsize ++_notmuch_decompose_to_utf8 (const gunichar uc, gchar *out) ++{ ++ gunichar dc[G_UNICHAR_MAX_DECOMPOSITION_LENGTH]; ++ // This currently performs canonical decomposition. ++ const gsize dcn = ++ g_unichar_fully_decompose (uc, FALSE, dc, ++ G_UNICHAR_MAX_DECOMPOSITION_LENGTH); ++ gsize utf8_len = 0; ++ for (gsize i = 0; i < dcn; i++) ++ { ++ const gint dc_bytes = g_unichar_to_utf8 (dc[i], out); ++ utf8_len += dc_bytes; ++ if (out != NULL) ++ out += dc_bytes; ++ } ++ return utf8_len; ++} ++ ++/* Convert a sequence of bytes to UTF-8, handling input encodings as ++ * Xapian does, but produce the canonical encoding. ++ */ ++char * ++notmuch_bytes_to_utf8(const void *ctx, const char *bytes, const size_t len) ++{ ++ // FIXME: try/catch to convert to error status messages? Can the ++ // iterator throw? ++ Xapian::Utf8Iterator it; ++ gsize u8_len = 0; ++ ++ // Compute the utf-8 length ++ if (len == (size_t) -1) ++ it.assign (bytes, strlen(bytes)); ++ else ++ it.assign (bytes, len); ++ for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it) ++ u8_len += _notmuch_decompose_to_utf8 (uc, NULL); ++ ++ // Convert to utf-8 ++ if (len == (size_t) -1) ++ it.assign (bytes, strlen(bytes)); ++ else ++ it.assign (bytes, len); ++ char *result = talloc_array (ctx, char, u8_len + 1); ++ gsize u8_i = 0; ++ for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it) { ++ const gsize dc_bytes = _notmuch_decompose_to_utf8 (uc, &(result[u8_i])); ++ u8_i += dc_bytes; ++ } ++ assert (u8_i == u8_len); ++ result[u8_i] = '\0'; ++ return result; ++} +diff --git a/test/Makefile.local b/test/Makefile.local +index 2331ceb..fd6d06d 100644 +--- a/test/Makefile.local ++++ b/test/Makefile.local +@@ -15,8 +15,11 @@ smtp_dummy_modules = $(smtp_dummy_srcs:.c=.o) + $(dir)/arg-test: $(dir)/arg-test.o command-line-arguments.o util/libutil.a + $(call quiet,CC) $^ -o 
$@ $(LDFLAGS) + +-$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o util/libutil.a +- $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) ++$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o lib/libnotmuch.a util/libutil.a ++ $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS) ++ ++$(dir)/to-utf8: $(dir)/to-utf8.o command-line-arguments.o lib/libnotmuch.a util/libutil.a ++ $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS) + + random_corpus_deps = $(dir)/random-corpus.o $(dir)/database-test.o \ + notmuch-config.o command-line-arguments.o \ +@@ -46,7 +49,8 @@ test_main_srcs=$(dir)/arg-test.c \ + $(dir)/parse-time.c \ + $(dir)/smtp-dummy.c \ + $(dir)/symbol-test.cc \ +- $(dir)/make-db-version.cc \ ++ $(dir)/to-utf8.c \ ++ $(dir)/make-db-version.cc + + test_srcs=$(test_main_srcs) $(dir)/database-test.c + +diff --git a/test/T150-tagging.sh b/test/T150-tagging.sh +index 821d393..d983fe0 100755 +--- a/test/T150-tagging.sh ++++ b/test/T150-tagging.sh +@@ -2,6 +2,14 @@ + test_description='"notmuch tag"' + . ./test-lib.sh || exit 1 + ++canonicalize_encoding() ++{ ++ local decoded u8 ++ decoded=$($TEST_DIRECTORY/hex-xcode --direction=decode "$1") || return 1 ++ u8=$($TEST_DIRECTORY/to-utf8 "$decoded") || return 1 ++ $TEST_DIRECTORY/hex-xcode --direction=encode "$u8" ++} ++ + add_message '[subject]=One' + add_message '[subject]=Two' + +@@ -191,23 +199,45 @@ test_expect_equal_file EXPECTED OUTPUT + test_begin_subtest '--batch: unicode tags' + notmuch dump --format=batch-tag > BACKUP + ++# FIXME: test canonical and non-canonical output? 
++ ++enctag1='%2a@%7d%cf%b5%f4%85%80%adO3%da%a7' ++enctag2='=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d' ++enctag3='A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27' ++enctag4='%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6' ++enctag5='%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d' ++enctag6='L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1' ++enctag7='P%c4%98%2f' ++enctag8='%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d' ++enctag9='%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b' ++ + notmuch tag --batch < EXPECTED +-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag4 +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-002@notmuch-test-suite +-+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-001@notmuch-test-suite 
+++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag4 +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-002@notmuch-test-suite +++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-001@notmuch-test-suite + EOF + + notmuch dump --format=batch-tag | sort > OUTPUT +diff --git a/test/T240-dump-restore.sh b/test/T240-dump-restore.sh +index e6976ff..37722fb 100755 +--- a/test/T240-dump-restore.sh ++++ b/test/T240-dump-restore.sh +@@ -164,7 +164,7 @@ enc1=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag1") + tag2=$(printf 'this\n tag\t has\n spaces') + enc2=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag2") + +-enc3='%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' ++enc3='N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' + tag3=$($TEST_DIRECTORY/hex-xcode --direction=decode $enc3) + + notmuch dump --format=batch-tag > BACKUP +@@ -218,7 +218,7 @@ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count + + test_begin_subtest 'format=batch-tag, checking encoded output' + notmuch dump --format=batch-tag -- from:cworth |\ +- awk "{ print \"+$enc1 +$enc2 +$enc3 -- \" \$5 }" > EXPECTED.$test_count ++ awk "{ print \"+$enc3 +$enc1 +$enc2 -- \" \$5 }" > EXPECTED.$test_count + notmuch dump --format=batch-tag -- from:cworth > OUTPUT.$test_count + test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count + +diff --git a/test/T480-hex-escaping.sh b/test/T480-hex-escaping.sh +index 10527b1..b9c5eac 100755 +--- a/test/T480-hex-escaping.sh ++++ b/test/T480-hex-escaping.sh +@@ -19,7 +19,7 @@ $TEST_DIRECTORY/hex-xcode --direction=encode < EXPECTED.$test_count |\ + test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count + + test_begin_subtest "round trip 8bit chars" +-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count ++echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count + $TEST_DIRECTORY/hex-xcode 
--direction=decode < EXPECTED.$test_count |\ + $TEST_DIRECTORY/hex-xcode --direction=encode > OUTPUT.$test_count + test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count +@@ -42,7 +42,7 @@ $TEST_DIRECTORY/hex-xcode --in-place --direction=encode < EXPECTED.$test_count + test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count + + test_begin_subtest "round trip 8bit chars (in-place)" +-echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count ++echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count + $TEST_DIRECTORY/hex-xcode --in-place --direction=decode < EXPECTED.$test_count |\ + $TEST_DIRECTORY/hex-xcode --in-place --direction=encode > OUTPUT.$test_count + test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count +diff --git a/test/T570-normalization.sh b/test/T570-normalization.sh +new file mode 100755 +index 0000000..ee3fa94 +--- /dev/null ++++ b/test/T570-normalization.sh +@@ -0,0 +1,28 @@ ++#!/usr/bin/env bash ++ ++test_description="text normalization" ++ ++. ./test-lib.sh || exit 1 ++ ++combining_a='Á' ++noncombining_a='Á' ++ ++# FIXME: these are extraneous/vestigial, remove from the final patch if still ++# unneeded. 
++combining_o='ó' # should be U+006f U+0301 ++noncombining_o='ó' # U+00f3 latin small letter o with acute ++# utf-8: ++# combining: o b11001100 b10000001 (o 0xcc 0x81) ++# non-combining: b11000011 b10110011 (0xc3 0xb3) ++combining_token='tóken' # should be U+006f U+0301 ++normalized_token='tóken' # should be U+0243 ++ ++test_begin_subtest "Term with combining characters" ++add_message '[content-type]="text/plain; charset=unknown-8bit"' \ ++ '[subject]="reproduc$noncombining_a"' \ ++ '[body]="reproduc$noncombining_a"' ++output=$(notmuch count "reproduc$combining_a" 2>&1 | notmuch_show_sanitize_all) ++ ++test_expect_equal "$output" 1 ++ ++test_done +diff --git a/test/corpus/cur/52:2, b/test/corpus/cur/52:2, +index 6028340..852e2bd 100644 +--- a/test/corpus/cur/52:2, ++++ b/test/corpus/cur/52:2, +@@ -12,8 +12,8 @@ Content-Type: text/plain; charset=ISO-8859-1 + Content-Transfer-Encoding: 8bit + Subject: Re: [aur-general] Guidelines: cp, mkdir vs install + +-Le 29/12/2011 11:13, Allan McRae a écrit : +-> On 29/12/11 19:56, François Boulogne wrote: ++Le 29/12/2011 11:13, Allan McRae a écrit : ++> On 29/12/11 19:56, François Boulogne wrote: + >> Hi, + >> + >> Looking to improve the quality of my packages, I read again the guidelines. +@@ -35,5 +35,5 @@ Thank you Allan + + + -- +-François Boulogne. ++François Boulogne. + https://www.sciunto.org +diff --git a/test/to-utf8.c b/test/to-utf8.c +new file mode 100644 +index 0000000..17bf40d +--- /dev/null ++++ b/test/to-utf8.c +@@ -0,0 +1,44 @@ ++/* to-utf8.cc - convert bytes to UTF-8 as notmuch would ++ * ++ * usage: ++ * to-utf8 [bytes ...] ++ * ++ * Copyright (C) 2015 Rob Browning ++ * ++ * This program is free software: you can redistribute it and/or modify ++ * it under the terms of the GNU General Public License as published by ++ * the Free Software Foundation, either version 3 of the License, or ++ * (at your option) any later version. 
++ * ++ * This program is distributed in the hope that it will be useful, ++ * but WITHOUT ANY WARRANTY; without even the implied warranty of ++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ++ * GNU General Public License for more details. ++ * ++ * You should have received a copy of the GNU General Public License ++ * along with this program. If not, see http://www.gnu.org/licenses/ . ++ * ++ * Author: Rob Browning ++ * ++ */ ++ ++#include "notmuch.h" ++ ++#include ++#include ++#include ++ ++int ++main (int argc, char **argv) ++{ ++ void *ctx = talloc_new (NULL); ++ ++ for (int i = 1; i < argc; i++) { ++ char *u8 = notmuch_bytes_to_utf8(ctx, argv[i], -1); ++ fputs (u8, stdout); ++ talloc_free (u8); ++ } ++ ++ talloc_free (ctx); ++ return 0; ++} +-- +2.5.0 + -- 2.26.2