1 Return-Path: <rlb@defaultvalue.org>
\r
2 X-Original-To: notmuch@notmuchmail.org
\r
3 Delivered-To: notmuch@notmuchmail.org
\r
4 Received: from localhost (localhost [127.0.0.1])
\r
5 by arlo.cworth.org (Postfix) with ESMTP id 66BC46DE1512
\r
6 for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:31 -0700 (PDT)
\r
7 X-Virus-Scanned: Debian amavisd-new at cworth.org
\r
11 X-Spam-Status: No, score=0.134 tagged_above=-999 required=5 tests=[AWL=0.684,
\r
12 RP_MATCHES_RCVD=-0.55] autolearn=disabled
\r
13 Received: from arlo.cworth.org ([127.0.0.1])
\r
14 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
\r
15 with ESMTP id fmevZxWHrujW for <notmuch@notmuchmail.org>;
\r
16 Sun, 30 Aug 2015 09:26:28 -0700 (PDT)
\r
17 X-Greylist: delayed 309 seconds by postgrey-1.35 at arlo;
\r
18 Sun, 30 Aug 2015 09:26:28 PDT
\r
19 Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])
\r
20 by arlo.cworth.org (Postfix) with ESMTP id 0C8116DE14FD
\r
21 for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 09:26:28 -0700 (PDT)
\r
22 Received: from trouble.defaultvalue.org (localhost [127.0.0.1])
\r
23 (Authenticated sender: rlb@defaultvalue.org)
\r
24 by defaultvalue.org (Postfix) with ESMTPSA id 93EC820235
\r
25 for <notmuch@notmuchmail.org>; Sun, 30 Aug 2015 11:21:17 -0500 (CDT)
\r
26 Received: by trouble.defaultvalue.org (Postfix, from userid 1000)
\r
27 id 1B1DE14E0F9; Sun, 30 Aug 2015 11:21:16 -0500 (CDT)
\r
28 From: Rob Browning <rlb@defaultvalue.org>
\r
29 To: notmuch@notmuchmail.org
\r
30 Subject: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
\r
31 Date: Sun, 30 Aug 2015 11:21:16 -0500
\r
32 Message-Id: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
\r
33 X-Mailer: git-send-email 2.5.0
\r
35 Content-Type: text/plain; charset=UTF-8
\r
36 Content-Transfer-Encoding: 8bit
\r
37 X-BeenThere: notmuch@notmuchmail.org
\r
38 X-Mailman-Version: 2.1.18
\r
40 List-Id: "Use and development of the notmuch mail system."
\r
41 <notmuch.notmuchmail.org>
\r
42 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
\r
43 <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
\r
44 List-Archive: <http://notmuchmail.org/pipermail/notmuch/>
\r
45 List-Post: <mailto:notmuch@notmuchmail.org>
\r
46 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
\r
47 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
\r
48 <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
\r
49 X-List-Received-Date: Sun, 30 Aug 2015 16:26:31 -0000
\r
WARNING: this version is very preliminary, and might eat your data.

Unicode has multiple sequences representing what should normally be
considered the same text.  For example, here's a combining Á and a
noncombining Á.

Depending on the way you view this, you may or may not see a
difference, but the former is the canonical (decomposed) form, and is
represented by two Unicode code points: a capital A (U+0041) followed
by a "combining acute accent" (U+0301); the latter is the single code
point (U+00C1), which is probably what most people would type.
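To make the difference concrete at the byte level, here is a minimal C++ sketch. It is not part of the patch; `encode_utf8`, `decomposed_a_acute` and `precomposed_a_acute` are names made up for illustration. It encodes the code points mentioned above as UTF-8 and shows that the two forms are different byte strings.

```cpp
#include <string>

// Illustrative helper (not from the patch): encode one Unicode code
// point, assumed valid and <= U+10FFFF, as UTF-8.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Decomposed form: U+0041 followed by U+0301 (three bytes in UTF-8).
std::string decomposed_a_acute() {
    return encode_utf8(0x0041) + encode_utf8(0x0301);
}

// Precomposed form: the single code point U+00C1 (two bytes in UTF-8).
std::string precomposed_a_acute() {
    return encode_utf8(0x00C1);
}
```

A byte-wise comparison of the two results fails even though both render as the same glyph, which is why an index keyed on raw bytes treats them as unrelated terms.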
\r
Before this change, notmuch would index two strings that differ only
with respect to canonicalization, like tóken and tóken, as separate
terms, even though they may be visually indistinguishable, and do (for
most purposes) represent the same text.  After indexing, searching for
one would not find the other, and which one you present to notmuch
when you search depends on your tools.  See test/T570-normalization.sh
for a working example.
\r
Since we're talking about differing representations that one wouldn't
normally want to distinguish, this patch unifies the various
representations by converting all incoming text to its canonical form
before indexing, and canonicalizing all query strings.
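The idea of canonicalizing on both the indexing side and the query side can be sketched as follows. This is a toy, not the patch's code: `canonicalize` is a hypothetical stand-in that knows only two precomposed characters (a real implementation would use a full decomposition routine), and `TermIndex` is an invented in-memory set rather than a Xapian database.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for real canonical decomposition. It maps just
// two precomposed characters (as UTF-8 bytes) to their decomposed forms
// and copies every other byte through unchanged.
std::string canonicalize(const std::string &utf8) {
    static const std::map<std::string, std::string> table = {
        {"\xC3\x81", "A\xCC\x81"},  // U+00C1 -> U+0041 U+0301
        {"\xC3\xB3", "o\xCC\x81"},  // U+00F3 -> U+006F U+0301
    };
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        bool replaced = false;
        if (i + 2 <= utf8.size()) {
            const auto hit = table.find(utf8.substr(i, 2));
            if (hit != table.end()) {
                out += hit->second;
                i += 2;
                replaced = true;
            }
        }
        if (!replaced)
            out += utf8[i++];
    }
    return out;
}

// Toy index: terms are canonicalized when stored and when looked up,
// so either representation of a term finds the other.
class TermIndex {
public:
    void add(const std::string &term) { terms_.insert(canonicalize(term)); }
    bool contains(const std::string &query) const {
        return terms_.count(canonicalize(query)) > 0;
    }
private:
    std::set<std::string> terms_;
};
```

With this arrangement, a term added in precomposed form is found by a query typed in decomposed form and vice versa, because both sides pass through the same function before comparison.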
\r
Up to now, notmuch has let Xapian handle converting the incoming bytes
to UTF-8.  Xapian treats any byte sequence as UTF-8, and interprets
any invalid UTF-8 bytes as Latin-1.  This patch maintains the existing
behavior (excepting the new canonicalization) by using Xapian's
Utf8Iterator to handle the initial Unicode character parsing.
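For readers unfamiliar with that fallback, the following is a simplified C++ sketch of "decode as UTF-8, but read stray bytes as Latin-1". It is an assumption-laden illustration, not Xapian's implementation: `decode_lenient` is an invented name, and the sketch does not reject overlong three- or four-byte sequences or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: decode bytes into code points, accepting
// well-formed multi-byte UTF-8 sequences and treating any byte that
// does not start one as a Latin-1 character (byte value == code point).
std::vector<char32_t> decode_lenient(const std::string &bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = b0;
        if (b0 >= 0xC2 && b0 <= 0xDF) { extra = 1; cp = b0 & 0x1F; }
        else if (b0 >= 0xE0 && b0 <= 0xEF) { extra = 2; cp = b0 & 0x0F; }
        else if (b0 >= 0xF0 && b0 <= 0xF4) { extra = 3; cp = b0 & 0x07; }

        // The sequence is usable only if every continuation byte exists
        // and has the 10xxxxxx bit pattern.
        bool valid = i + extra < bytes.size();
        for (std::size_t k = 1; valid && k <= extra; k++) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                valid = false;
            else
                cp = (cp << 6) | (b & 0x3F);
        }

        if (valid && extra > 0) {
            out.push_back(cp);
            i += extra + 1;
        } else {
            // Plain ASCII, or a stray byte read as Latin-1.
            out.push_back(b0);
            i += 1;
        }
    }
    return out;
}
```

Under these assumptions, the UTF-8 bytes `c3 a9` and the lone Latin-1 byte `e9` both decode to U+00E9, so text in either encoding ends up as the same code points before canonicalization.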
\r
Note that the parsing approach in this patch is not particularly
efficient, for two reasons.  First, it traverses the incoming bytes
three times:

  - once to determine how long the input is (currently the iterator
    can't directly handle null terminated char*'s),

  - once to determine how long the final UTF-8 allocation needs to
    be,

  - and once for the conversion.

Second, when the input is already UTF-8, it just blindly converts
from UTF-8 to Unicode code points, and then back to UTF-8 (after
canonicalization), during each pass.  There are certainly
opportunities to optimize, though it may be worth discussing the
detection of data encodings more broadly first.
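The traversal cost described above can be illustrated with a toy C++ sketch. Nothing here is from the patch: `expand` is a hypothetical per-character step (it turns '&' into "and" purely so that output length differs from input length), standing in for decomposition, and both converter names are invented. Like the patch's helper, `expand` can be called with a null output pointer to measure only.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical per-character step. Writes the expansion of c to out
// when out is non-null, and always returns the expansion's length.
std::size_t expand(char c, char *out) {
    if (c != '&') {
        if (out) *out = c;
        return 1;
    }
    if (out) std::memcpy(out, "and", 3);
    return 3;
}

// Measure-then-fill: strlen, a sizing pass, then a conversion pass.
std::string convert_three_pass(const char *bytes) {
    const std::size_t len = std::strlen(bytes);          // traversal 1
    std::size_t total = 0;
    for (std::size_t i = 0; i < len; i++)                // traversal 2
        total += expand(bytes[i], nullptr);
    std::string out(total, '\0');                        // exact allocation
    std::size_t pos = 0;
    for (std::size_t i = 0; i < len; i++)                // traversal 3
        pos += expand(bytes[i], &out[pos]);
    return out;
}

// Single traversal: append to a growable buffer as each character is
// converted, trading exact allocation for possible reallocation.
std::string convert_one_pass(const char *bytes) {
    std::string out;
    char tmp[3];  // large enough for the longest expansion above
    for (const char *p = bytes; *p != '\0'; p++)
        out.append(tmp, expand(*p, tmp));
    return out;
}
```

Both functions produce the same output; whether a single pass with a growable buffer is actually preferable depends on how the result must be allocated (for example, under a talloc context), which this sketch does not model.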
\r
FIXME: document current encoding behavior clearly in
new/insert/search-terms.

FIXME: what about existing indexed text?

Posted for preliminary discussion, and as a milestone (it appears to
mostly work now), though I doubt I'm handling things correctly
everywhere notmuch-wise, wrt talloc, etc.
\r
109 lib/Makefile.local | 1 +
\r
110 lib/database.cc | 17 ++++++++--
\r
111 lib/message.cc | 51 +++++++++++++++++++---------
\r
112 lib/notmuch.h | 3 ++
\r
113 lib/query.cc | 6 ++--
\r
114 lib/text-util.cc | 82 ++++++++++++++++++++++++++++++++++++++++++++++
\r
115 test/Makefile.local | 10 ++++--
\r
116 test/T150-tagging.sh | 54 +++++++++++++++++++++++-------
\r
117 test/T240-dump-restore.sh | 4 +--
\r
118 test/T480-hex-escaping.sh | 4 +--
\r
119 test/T570-normalization.sh | 28 ++++++++++++++++
\r
120 test/corpus/cur/52:2, | 6 ++--
\r
121 test/to-utf8.c | 44 +++++++++++++++++++++++++
\r
122 13 files changed, 267 insertions(+), 43 deletions(-)
\r
123 create mode 100644 lib/text-util.cc
\r
124 create mode 100755 test/T570-normalization.sh
\r
125 create mode 100644 test/to-utf8.c
\r
127 diff --git a/lib/Makefile.local b/lib/Makefile.local
\r
128 index 3a07090..41fd1e1 100644
\r
129 --- a/lib/Makefile.local
\r
130 +++ b/lib/Makefile.local
\r
131 @@ -48,6 +48,7 @@ libnotmuch_cxx_srcs = \
\r
133 $(dir)/message.cc \
\r
135 + $(dir)/text-util.cc \
\r
138 libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
\r
139 diff --git a/lib/database.cc b/lib/database.cc
\r
140 index 6a15174..7a01f95 100644
\r
141 --- a/lib/database.cc
\r
142 +++ b/lib/database.cc
\r
143 @@ -436,6 +436,7 @@ find_document_for_doc_id (notmuch_database_t *notmuch, unsigned doc_id)
\r
145 _notmuch_message_id_compressed (void *ctx, const char *message_id)
\r
147 + // Assumes message_id is normalized utf-8.
\r
148 char *sha1, *compressed;
\r
150 sha1 = _notmuch_sha1_of_string (message_id);
\r
151 @@ -457,12 +458,20 @@ notmuch_database_find_message (notmuch_database_t *notmuch,
\r
152 if (message_ret == NULL)
\r
153 return NOTMUCH_STATUS_NULL_POINTER;
\r
155 - if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
\r
156 - message_id = _notmuch_message_id_compressed (notmuch, message_id);
\r
157 + const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
\r
159 + // Is strlen still appropriate?
\r
160 + if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX)
\r
162 + message_id = _notmuch_message_id_compressed (notmuch, u8_id);
\r
163 + talloc_free ((char *) u8_id);
\r
165 + message_id = u8_id;
\r
168 status = _notmuch_database_find_unique_doc_id (notmuch, "id",
\r
169 message_id, &doc_id);
\r
170 + talloc_free ((char *) message_id);
\r
172 if (status == NOTMUCH_PRIVATE_STATUS_NO_DOCUMENT_FOUND)
\r
173 *message_ret = NULL;
\r
174 @@ -1910,6 +1919,7 @@ _notmuch_database_generate_thread_id (notmuch_database_t *notmuch)
\r
176 _get_metadata_thread_id_key (void *ctx, const char *message_id)
\r
178 + // Assumes message_id is normalized utf-8.
\r
179 if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
\r
180 message_id = _notmuch_message_id_compressed (ctx, message_id);
\r
182 @@ -2011,7 +2021,8 @@ _resolve_message_id_to_thread_id_old (notmuch_database_t *notmuch,
\r
183 * generate a new thread ID and store it there.
\r
185 db = static_cast <Xapian::WritableDatabase *> (notmuch->xapian_db);
\r
186 - metadata_key = _get_metadata_thread_id_key (ctx, message_id);
\r
187 + const char *mid = notmuch_message_get_message_id (message);
\r
188 + metadata_key = _get_metadata_thread_id_key (ctx, mid);
\r
189 thread_id_string = notmuch->xapian_db->get_metadata (metadata_key);
\r
191 if (thread_id_string.empty()) {
\r
192 diff --git a/lib/message.cc b/lib/message.cc
\r
193 index 1ddce3c..afd0264 100644
\r
194 --- a/lib/message.cc
\r
195 +++ b/lib/message.cc
\r
196 @@ -225,20 +225,28 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
\r
197 unsigned int doc_id;
\r
200 - *status_ret = (notmuch_private_status_t) notmuch_database_find_message (notmuch,
\r
204 + const char *u8_id = notmuch_bytes_to_utf8 (notmuch, message_id, -1);
\r
206 + (notmuch_private_status_t) notmuch_database_find_message (notmuch,
\r
210 + talloc_free ((char *) u8_id);
\r
211 return talloc_steal (notmuch, message);
\r
212 - else if (*status_ret)
\r
213 + } else if (*status_ret) {
\r
214 + talloc_free ((char *) u8_id);
\r
218 /* If the message ID is too long, substitute its sha1 instead. */
\r
219 - if (strlen (message_id) > NOTMUCH_MESSAGE_ID_MAX)
\r
220 - message_id = _notmuch_message_id_compressed (message, message_id);
\r
222 - term = talloc_asprintf (NULL, "%s%s",
\r
223 - _find_prefix ("id"), message_id);
\r
224 + // Strlen still OK?
\r
225 + if (strlen (u8_id) > NOTMUCH_MESSAGE_ID_MAX) {
\r
226 + message_id = _notmuch_message_id_compressed (message, u8_id);
\r
227 + talloc_free ((char *) u8_id);
\r
229 + message_id = u8_id;
\r
231 + term = talloc_asprintf (NULL, "%s%s", _find_prefix ("id"), message_id);
\r
232 if (term == NULL) {
\r
233 *status_ret = NOTMUCH_PRIVATE_STATUS_OUT_OF_MEMORY;
\r
235 @@ -252,6 +260,7 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
\r
236 talloc_free (term);
\r
238 doc.add_value (NOTMUCH_VALUE_MESSAGE_ID, message_id);
\r
239 + talloc_free ((char *) message_id);
\r
241 doc_id = _notmuch_database_generate_doc_id (notmuch);
\r
242 } catch (const Xapian::Error &error) {
\r
243 @@ -1109,13 +1118,14 @@ _notmuch_message_gen_terms (notmuch_message_t *message,
\r
245 return NOTMUCH_PRIVATE_STATUS_NULL_POINTER;
\r
247 + const char *u8_text = notmuch_bytes_to_utf8(NULL, text, -1);
\r
248 term_gen->set_document (message->doc);
\r
251 const char *prefix = _find_prefix (prefix_name);
\r
253 term_gen->set_termpos (message->termpos);
\r
254 - term_gen->index_text (text, 1, prefix);
\r
255 + term_gen->index_text (u8_text, 1, prefix);
\r
256 /* Create a gap between this an the next terms so they don't
\r
257 * appear to be a phrase. */
\r
258 message->termpos = term_gen->get_termpos () + 100;
\r
259 @@ -1124,10 +1134,11 @@ _notmuch_message_gen_terms (notmuch_message_t *message,
\r
262 term_gen->set_termpos (message->termpos);
\r
263 - term_gen->index_text (text);
\r
264 + term_gen->index_text (u8_text);
\r
265 /* Create a term gap, as above. */
\r
266 message->termpos = term_gen->get_termpos () + 100;
\r
268 + talloc_free ((char *) u8_text);
\r
269 return NOTMUCH_PRIVATE_STATUS_SUCCESS;
\r
272 @@ -1184,10 +1195,14 @@ notmuch_message_add_tag (notmuch_message_t *message, const char *tag)
\r
274 return NOTMUCH_STATUS_NULL_POINTER;
\r
276 - if (strlen (tag) > NOTMUCH_TAG_MAX)
\r
277 + const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);
\r
278 + if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {
\r
279 + talloc_free ((char *) u8_tag);
\r
280 return NOTMUCH_STATUS_TAG_TOO_LONG;
\r
283 - private_status = _notmuch_message_add_term (message, "tag", tag);
\r
284 + private_status = _notmuch_message_add_term (message, "tag", u8_tag);
\r
285 + talloc_free ((char *) u8_tag);
\r
286 if (private_status) {
\r
287 INTERNAL_ERROR ("_notmuch_message_add_term return unexpected value: %d\n",
\r
289 @@ -1212,10 +1227,14 @@ notmuch_message_remove_tag (notmuch_message_t *message, const char *tag)
\r
291 return NOTMUCH_STATUS_NULL_POINTER;
\r
293 - if (strlen (tag) > NOTMUCH_TAG_MAX)
\r
294 + const char *u8_tag = notmuch_bytes_to_utf8 (message, tag, -1);
\r
295 + if (strlen (u8_tag) > NOTMUCH_TAG_MAX) {
\r
296 + talloc_free ((char *) u8_tag);
\r
297 return NOTMUCH_STATUS_TAG_TOO_LONG;
\r
300 - private_status = _notmuch_message_remove_term (message, "tag", tag);
\r
301 + private_status = _notmuch_message_remove_term (message, "tag", u8_tag);
\r
302 + talloc_free ((char *) u8_tag);
\r
303 if (private_status) {
\r
304 INTERNAL_ERROR ("_notmuch_message_remove_term return unexpected value: %d\n",
\r
306 diff --git a/lib/notmuch.h b/lib/notmuch.h
\r
307 index b1f5bfa..6e13eb1 100644
\r
308 --- a/lib/notmuch.h
\r
309 +++ b/lib/notmuch.h
\r
310 @@ -1759,6 +1759,9 @@ notmuch_filenames_move_to_next (notmuch_filenames_t *filenames);
\r
312 notmuch_filenames_destroy (notmuch_filenames_t *filenames);
\r
315 +notmuch_bytes_to_utf8 (const void *ctx, const char *bytes, const size_t len);
\r
320 diff --git a/lib/query.cc b/lib/query.cc
\r
321 index 5275b5a..e48f06a 100644
\r
324 @@ -86,7 +86,7 @@ notmuch_query_create (notmuch_database_t *notmuch,
\r
326 query->notmuch = notmuch;
\r
328 - query->query_string = talloc_strdup (query, query_string);
\r
329 + query->query_string = notmuch_bytes_to_utf8 (query, query_string, -1);
\r
331 query->sort = NOTMUCH_SORT_NEWEST_FIRST;
\r
333 @@ -125,7 +125,9 @@ notmuch_query_get_sort (notmuch_query_t *query)
\r
335 notmuch_query_add_tag_exclude (notmuch_query_t *query, const char *tag)
\r
337 - char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), tag);
\r
338 + const char *u8_tag = notmuch_bytes_to_utf8 (query, tag, -1);
\r
339 + char *term = talloc_asprintf (query, "%s%s", _find_prefix ("tag"), u8_tag);
\r
340 + talloc_free ((char *) u8_tag);
\r
341 _notmuch_string_list_append (query->exclude_terms, term);
\r
344 diff --git a/lib/text-util.cc b/lib/text-util.cc
\r
345 new file mode 100644
\r
346 index 0000000..9dfd31f
\r
348 +++ b/lib/text-util.cc
\r
350 +/* text-util.cc - notmuch text processing utility functions
\r
352 + * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>
\r
354 + * This program is free software: you can redistribute it and/or modify
\r
355 + * it under the terms of the GNU General Public License as published by
\r
356 + * the Free Software Foundation, either version 3 of the License, or
\r
357 + * (at your option) any later version.
\r
359 + * This program is distributed in the hope that it will be useful,
\r
360 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
\r
361 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
\r
362 + * GNU General Public License for more details.
\r
364 + * You should have received a copy of the GNU General Public License
\r
365 + * along with this program. If not, see http://www.gnu.org/licenses/ .
\r
367 + * Author: Rob Browning <rlb@defaultvalue.org>
\r
371 +#include "notmuch.h"
\r
373 +#include <assert.h>
\r
375 +#include <string.h>
\r
376 +#include <talloc.h>
\r
377 +#include <xapian.h>
\r
380 +_notmuch_decompose_to_utf8 (const gunichar uc, gchar *out)
\r
382 + gunichar dc[G_UNICHAR_MAX_DECOMPOSITION_LENGTH];
\r
383 + // This currently performs canonical decomposition.
\r
384 + const gsize dcn =
\r
385 + g_unichar_fully_decompose (uc, FALSE, dc,
\r
386 + G_UNICHAR_MAX_DECOMPOSITION_LENGTH);
\r
387 + gsize utf8_len = 0;
\r
388 + for (gsize i = 0; i < dcn; i++)
\r
390 + const gint dc_bytes = g_unichar_to_utf8 (dc[i], out);
\r
391 + utf8_len += dc_bytes;
\r
398 +/* Convert a sequence of bytes to UTF-8, handling input encodings as
\r
399 + * Xapian does, but produce the canonical encoding.
\r
402 +notmuch_bytes_to_utf8(const void *ctx, const char *bytes, const size_t len)
\r
404 + // FIXME: try/catch to convert to error status messages? Can the
\r
405 + // iterator throw?
\r
406 + Xapian::Utf8Iterator it;
\r
407 + gsize u8_len = 0;
\r
409 + // Compute the utf-8 length
\r
410 + if (len == (size_t) -1)
\r
411 + it.assign (bytes, strlen(bytes));
\r
413 + it.assign (bytes, len);
\r
414 + for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it)
\r
415 + u8_len += _notmuch_decompose_to_utf8 (uc, NULL);
\r
417 + // Convert to utf-8
\r
418 + if (len == (size_t) -1)
\r
419 + it.assign (bytes, strlen(bytes));
\r
421 + it.assign (bytes, len);
\r
422 + char *result = talloc_array (ctx, char, u8_len + 1);
\r
424 + for (unsigned uc = *it; uc != unsigned(-1); it++, uc = *it) {
\r
425 + const gsize dc_bytes = _notmuch_decompose_to_utf8 (uc, &(result[u8_i]));
\r
426 + u8_i += dc_bytes;
\r
428 + assert (u8_i == u8_len);
\r
429 + result[u8_i] = '\0';
\r
432 diff --git a/test/Makefile.local b/test/Makefile.local
\r
433 index 2331ceb..fd6d06d 100644
\r
434 --- a/test/Makefile.local
\r
435 +++ b/test/Makefile.local
\r
436 @@ -15,8 +15,11 @@ smtp_dummy_modules = $(smtp_dummy_srcs:.c=.o)
\r
437 $(dir)/arg-test: $(dir)/arg-test.o command-line-arguments.o util/libutil.a
\r
438 $(call quiet,CC) $^ -o $@ $(LDFLAGS)
\r
440 -$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o util/libutil.a
\r
441 - $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS)
\r
442 +$(dir)/hex-xcode: $(dir)/hex-xcode.o command-line-arguments.o lib/libnotmuch.a util/libutil.a
\r
443 + $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)
\r
445 +$(dir)/to-utf8: $(dir)/to-utf8.o command-line-arguments.o lib/libnotmuch.a util/libutil.a
\r
446 + $(call quiet,CC) $^ -o $@ $(LDFLAGS) $(TALLOC_LDFLAGS) $(CONFIGURE_LDFLAGS)
\r
448 random_corpus_deps = $(dir)/random-corpus.o $(dir)/database-test.o \
\r
449 notmuch-config.o command-line-arguments.o \
\r
450 @@ -46,7 +49,8 @@ test_main_srcs=$(dir)/arg-test.c \
\r
451 $(dir)/parse-time.c \
\r
452 $(dir)/smtp-dummy.c \
\r
453 $(dir)/symbol-test.cc \
\r
454 - $(dir)/make-db-version.cc \
\r
455 + $(dir)/to-utf8.c \
\r
456 + $(dir)/make-db-version.cc
\r
458 test_srcs=$(test_main_srcs) $(dir)/database-test.c
\r
460 diff --git a/test/T150-tagging.sh b/test/T150-tagging.sh
\r
461 index 821d393..d983fe0 100755
\r
462 --- a/test/T150-tagging.sh
\r
463 +++ b/test/T150-tagging.sh
\r
465 test_description='"notmuch tag"'
\r
466 . ./test-lib.sh || exit 1
\r
468 +canonicalize_encoding()
\r
471 + decoded=$($TEST_DIRECTORY/hex-xcode --direction=decode "$1") || return 1
\r
472 + u8=$($TEST_DIRECTORY/to-utf8 "$decoded") || return 1
\r
473 + $TEST_DIRECTORY/hex-xcode --direction=encode "$u8"
\r
476 add_message '[subject]=One'
\r
477 add_message '[subject]=Two'
\r
479 @@ -191,23 +199,45 @@ test_expect_equal_file EXPECTED OUTPUT
\r
480 test_begin_subtest '--batch: unicode tags'
\r
481 notmuch dump --format=batch-tag > BACKUP
\r
483 +# FIXME: test canonical and non-canonical output?
\r
485 +enctag1='%2a@%7d%cf%b5%f4%85%80%adO3%da%a7'
\r
486 +enctag2='=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d'
\r
487 +enctag3='A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27'
\r
488 +enctag4='%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6'
\r
489 +enctag5='%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d'
\r
490 +enctag6='L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1'
\r
491 +enctag7='P%c4%98%2f'
\r
492 +enctag8='%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d'
\r
493 +enctag9='%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b'
\r
495 notmuch tag --batch <<EOF
\r
496 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 -- One
\r
497 -+=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d -- One
\r
498 -+A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 -- One
\r
503 -+%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 -- One
\r
504 -+%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- One
\r
505 -+L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 -- One
\r
506 -+P%c4%98%2f -- One
\r
507 -+%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d -- One
\r
508 -+%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- One
\r
509 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +R +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b -- Two
\r
516 ++$enctag1 +$enctag2 +$enctag3 +R +$enctag4 +$enctag5 +$enctag6 +$enctag7 +$enctag8 +$enctag9 -- Two
\r
519 +# FIXME: double-check that we need all of these, or do we want to do everything?
\r
520 +cetag1=$(canonicalize_encoding "$enctag1") || exit 1
\r
521 +cetag2=$(canonicalize_encoding "$enctag2") || exit 1
\r
522 +cetag4=$(canonicalize_encoding "$enctag4") || exit 1
\r
523 +cetag5=$(canonicalize_encoding "$enctag5") || exit 1
\r
524 +cetag6=$(canonicalize_encoding "$enctag6") || exit 1
\r
525 +cetag7=$(canonicalize_encoding "$enctag7") || exit 1
\r
526 +cetag8=$(canonicalize_encoding "$enctag8") || exit 1
\r
527 +cetag9=$(canonicalize_encoding "$enctag9") || exit 1
\r
529 cat <<EOF > EXPECTED
\r
530 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag4 +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-002@notmuch-test-suite
\r
531 -+%2a@%7d%cf%b5%f4%85%80%adO3%da%a7 +=%e0%ac%95%c8%b3+%ef%aa%95%c8%a64w%c7%9d%c9%a2%cf%b3%d6%82%24B%c4%a9%c5%a1UX%ee%99%b0%27E7%ca%a4%d0%8b%5d +A%e1%a0%bc%de%8b%d5%b2V%d9%9b%f3%b5%a2%a3M%d8%a1u@%f0%a0%ac%948%7e%f0%ab%86%af%27 +L%df%85%ef%a1%a5m@%d3%96%c2%ab%d4%9f%ca%b8%f3%b3%a2%bf%c7%b1_u%d7%b4%c7%b1 +P%c4%98%2f +R +inbox +tag5 +unread +%7e%d1%8b%25%ec%a0%ae%d1%a0M%3b%e3%b6%b7%e9%a4%87%3c%db%9a%cc%a8%e1%96%9d +%c4%bf7%c7%ab9H%c4%99k%ea%91%bd%c3%8ck%e2%b3%8dk%c5%952V%e4%99%b2%d9%b3%e4%8b%bda%5b%24%c7%9b +%da%88=f%cc%b9I%ce%af%7b%c9%97%e3%b9%8bH%cb%92X%d2%8c6 +%dc%9crh%d2%86B%e5%97%a2%22t%ed%99%82d -- id:msg-001@notmuch-test-suite
\r
532 ++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag4 +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-002@notmuch-test-suite
\r
533 ++$cetag1 +$cetag2 +$enctag3 +$cetag6 +$cetag7 +R +inbox +tag5 +unread +$cetag8 +$cetag9 +$cetag4 +$cetag5 -- id:msg-001@notmuch-test-suite
\r
536 notmuch dump --format=batch-tag | sort > OUTPUT
\r
537 diff --git a/test/T240-dump-restore.sh b/test/T240-dump-restore.sh
\r
538 index e6976ff..37722fb 100755
\r
539 --- a/test/T240-dump-restore.sh
\r
540 +++ b/test/T240-dump-restore.sh
\r
541 @@ -164,7 +164,7 @@ enc1=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag1")
\r
542 tag2=$(printf 'this\n tag\t has\n spaces')
\r
543 enc2=$($TEST_DIRECTORY/hex-xcode --direction=encode "$tag2")
\r
545 -enc3='%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a'
\r
546 +enc3='N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82'
\r
547 tag3=$($TEST_DIRECTORY/hex-xcode --direction=decode $enc3)
\r
549 notmuch dump --format=batch-tag > BACKUP
\r
550 @@ -218,7 +218,7 @@ test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
\r
552 test_begin_subtest 'format=batch-tag, checking encoded output'
\r
553 notmuch dump --format=batch-tag -- from:cworth |\
\r
554 - awk "{ print \"+$enc1 +$enc2 +$enc3 -- \" \$5 }" > EXPECTED.$test_count
\r
555 + awk "{ print \"+$enc3 +$enc1 +$enc2 -- \" \$5 }" > EXPECTED.$test_count
\r
556 notmuch dump --format=batch-tag -- from:cworth > OUTPUT.$test_count
\r
557 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
\r
559 diff --git a/test/T480-hex-escaping.sh b/test/T480-hex-escaping.sh
\r
560 index 10527b1..b9c5eac 100755
\r
561 --- a/test/T480-hex-escaping.sh
\r
562 +++ b/test/T480-hex-escaping.sh
\r
563 @@ -19,7 +19,7 @@ $TEST_DIRECTORY/hex-xcode --direction=encode < EXPECTED.$test_count |\
\r
564 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
\r
566 test_begin_subtest "round trip 8bit chars"
\r
567 -echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count
\r
568 +echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count
\r
569 $TEST_DIRECTORY/hex-xcode --direction=decode < EXPECTED.$test_count |\
\r
570 $TEST_DIRECTORY/hex-xcode --direction=encode > OUTPUT.$test_count
\r
571 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
\r
572 @@ -42,7 +42,7 @@ $TEST_DIRECTORY/hex-xcode --in-place --direction=encode < EXPECTED.$test_count
\r
573 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
\r
575 test_begin_subtest "round trip 8bit chars (in-place)"
\r
576 -echo '%c3%91%c3%a5%c3%b0%c3%a3%c3%a5%c3%a9-%c3%8f%c3%8a' > EXPECTED.$test_count
\r
577 +echo 'N%cc%83a%cc%8a%c3%b0a%cc%83a%cc%8ae%cc%81-I%cc%88E%cc%82' > EXPECTED.$test_count
\r
578 $TEST_DIRECTORY/hex-xcode --in-place --direction=decode < EXPECTED.$test_count |\
\r
579 $TEST_DIRECTORY/hex-xcode --in-place --direction=encode > OUTPUT.$test_count
\r
580 test_expect_equal_file EXPECTED.$test_count OUTPUT.$test_count
\r
581 diff --git a/test/T570-normalization.sh b/test/T570-normalization.sh
\r
582 new file mode 100755
\r
583 index 0000000..ee3fa94
\r
585 +++ b/test/T570-normalization.sh
\r
587 +#!/usr/bin/env bash
\r
589 +test_description="text normalization"
\r
591 +. ./test-lib.sh || exit 1
\r
594 +noncombining_a='Á'
\r
596 +# FIXME: these are extraneous/vestigial, remove from the final patch if still
\r
598 +combining_o='ó' # should be U+006f U+0301
\r
599 +noncombining_o='ó' # U+00f3 latin small letter o with acute
\r
601 +# combining: o b11001100 b10000001 (o 0xcc 0x81)
\r
602 +# non-combining: b11000011 b10110011 (0xc3 0xb3)
\r
603 +combining_token='tóken' # should be U+006f U+0301
\r
604 +normalized_token='tóken' # should be U+00f3
\r
606 +test_begin_subtest "Term with combining characters"
\r
607 +add_message '[content-type]="text/plain; charset=unknown-8bit"' \
\r
608 + '[subject]="reproduc$noncombining_a"' \
\r
609 + '[body]="reproduc$noncombining_a"'
\r
610 +output=$(notmuch count "reproduc$combining_a" 2>&1 | notmuch_show_sanitize_all)
\r
612 +test_expect_equal "$output" 1
\r
615 diff --git a/test/corpus/cur/52:2, b/test/corpus/cur/52:2,
\r
616 index 6028340..852e2bd 100644
\r
617 --- a/test/corpus/cur/52:2,
\r
618 +++ b/test/corpus/cur/52:2,
\r
619 @@ -12,8 +12,8 @@ Content-Type: text/plain; charset=ISO-8859-1
\r
620 Content-Transfer-Encoding: 8bit
\r
621 Subject: Re: [aur-general] Guidelines: cp, mkdir vs install
\r
623 -Le 29/12/2011 11:13, Allan McRae a écrit :
\r
624 -> On 29/12/11 19:56, François Boulogne wrote:
\r
625 +Le 29/12/2011 11:13, Allan McRae a écrit :
\r
626 +> On 29/12/11 19:56, François Boulogne wrote:
\r
629 >> Looking to improve the quality of my packages, I read again the guidelines.
\r
630 @@ -35,5 +35,5 @@ Thank you Allan
\r
634 -François Boulogne.
\r
635 +François Boulogne.
\r
636 https://www.sciunto.org
\r
637 diff --git a/test/to-utf8.c b/test/to-utf8.c
\r
638 new file mode 100644
\r
639 index 0000000..17bf40d
\r
641 +++ b/test/to-utf8.c
\r
643 +/* to-utf8.c - convert bytes to UTF-8 as notmuch would
\r
646 + * to-utf8 [bytes ...]
\r
648 + * Copyright (C) 2015 Rob Browning <rlb@defaultvalue.org>
\r
650 + * This program is free software: you can redistribute it and/or modify
\r
651 + * it under the terms of the GNU General Public License as published by
\r
652 + * the Free Software Foundation, either version 3 of the License, or
\r
653 + * (at your option) any later version.
\r
655 + * This program is distributed in the hope that it will be useful,
\r
656 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
\r
657 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
\r
658 + * GNU General Public License for more details.
\r
660 + * You should have received a copy of the GNU General Public License
\r
661 + * along with this program. If not, see http://www.gnu.org/licenses/ .
\r
663 + * Author: Rob Browning <rlb@defaultvalue.org>
\r
667 +#include "notmuch.h"
\r
669 +#include <stdio.h>
\r
670 +#include <stdlib.h>
\r
671 +#include <talloc.h>
\r
674 +main (int argc, char **argv)
\r
676 + void *ctx = talloc_new (NULL);
\r
678 + for (int i = 1; i < argc; i++) {
\r
679 + char *u8 = notmuch_bytes_to_utf8(ctx, argv[i], -1);
\r
680 + fputs (u8, stdout);
\r
681 + talloc_free (u8);
\r
684 + talloc_free (ctx);
\r