1 Return-Path: <amdragon@mit.edu>
\r
2 X-Original-To: notmuch@notmuchmail.org
\r
3 Delivered-To: notmuch@notmuchmail.org
\r
4 Received: from localhost (localhost [127.0.0.1])
\r
5 by olra.theworths.org (Postfix) with ESMTP id D4F5F40DEF2
\r
6 for <notmuch@notmuchmail.org>; Wed, 17 Nov 2010 11:28:41 -0800 (PST)
\r
7 X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
\r
11 X-Spam-Status: No, score=-1.9 tagged_above=-999 required=5
\r
12 tests=[BAYES_00=-1.9] autolearn=ham
\r
13 Received: from olra.theworths.org ([127.0.0.1])
\r
14 by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
\r
15 with ESMTP id PKOzSoBVm2uk for <notmuch@notmuchmail.org>;
\r
16 Wed, 17 Nov 2010 11:28:28 -0800 (PST)
\r
17 Received: from dmz-mailsec-scanner-8.mit.edu (DMZ-MAILSEC-SCANNER-8.MIT.EDU
\r
19 by olra.theworths.org (Postfix) with ESMTP id 9B44740DEF0
\r
20 for <notmuch@notmuchmail.org>; Wed, 17 Nov 2010 11:28:28 -0800 (PST)
\r
21 X-AuditID: 12074425-b7c98ae000000a04-b9-4ce42cdc7f33
\r
22 Received: from mailhub-auth-1.mit.edu ( [18.9.21.35])
\r
23 by dmz-mailsec-scanner-8.mit.edu (Symantec Brightmail Gateway) with
\r
24 SMTP id 8A.37.02564.CDC24EC4; Wed, 17 Nov 2010 14:28:28 -0500 (EST)
\r
25 Received: from outgoing.mit.edu (OUTGOING-AUTH.MIT.EDU [18.7.22.103])
\r
26 by mailhub-auth-1.mit.edu (8.13.8/8.9.2) with ESMTP id oAHJSRH4018322
\r
27 for <notmuch@notmuchmail.org>; Wed, 17 Nov 2010 14:28:27 -0500
\r
28 Received: from awakening.csail.mit.edu (awakening.csail.mit.edu [18.26.4.91])
\r
29 (authenticated bits=0)
\r
30 (User authenticated as amdragon@ATHENA.MIT.EDU)
\r
31 by outgoing.mit.edu (8.13.6/8.12.4) with ESMTP id oAHJSQ1T007610
\r
32 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NOT)
\r
33 for <notmuch@notmuchmail.org>; Wed, 17 Nov 2010 14:28:27 -0500 (EST)
\r
34 Received: from amthrax by awakening.csail.mit.edu with local (Exim 4.72)
\r
35 (envelope-from <amdragon@mit.edu>) id 1PIngA-0001ET-Tq
\r
36 for notmuch@notmuchmail.org; Wed, 17 Nov 2010 14:28:26 -0500
\r
37 Date: Wed, 17 Nov 2010 14:28:26 -0500
\r
38 From: Austin Clements <amdragon@MIT.EDU>
\r
39 To: notmuch@notmuchmail.org
\r
40 Subject: [PATCH 3/4] Optimize thread search using matched docid sets.
\r
41 Message-ID: <20101117192826.GU2439@mit.edu>
\r
43 Content-Type: text/plain; charset=us-ascii
\r
44 Content-Disposition: inline
\r
45 User-Agent: Mutt/1.5.20 (2009-06-14)
\r
46 X-Brightmail-Tracker: AAAAAA==
\r
47 X-BeenThere: notmuch@notmuchmail.org
\r
48 X-Mailman-Version: 2.1.13
\r
50 List-Id: "Use and development of the notmuch mail system."
\r
51 <notmuch.notmuchmail.org>
\r
52 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
\r
53 <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
\r
54 List-Archive: <http://notmuchmail.org/pipermail/notmuch>
\r
55 List-Post: <mailto:notmuch@notmuchmail.org>
\r
56 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
\r
57 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
\r
58 <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
\r
59 X-List-Received-Date: Wed, 17 Nov 2010 19:28:42 -0000
\r
61 This reduces thread search's 1+2t Xapian queries (where t is the
\r
62 number of matched threads) to 1+t queries and constructs exactly one
\r
63 notmuch_message_t for each message instead of 2 to 3.
\r
64 notmuch_query_search_threads eagerly fetches the docids of all
\r
65 messages matching the user query instead of lazily constructing
\r
66 message objects and fetching thread ID's from term lists.
\r
67 _notmuch_thread_create takes a seed docid and the set of all matched
\r
68 docids and uses a single Xapian query to expand this docid to its
\r
69 containing thread, using the matched docid set to determine which
\r
70 messages in the thread match the user query instead of using a second
\r
73 As a side effect, this fixes author order so authors are always sorted
\r
74 by first occurrence in each thread. This breaks two emacs tests that
\r
75 hard-code the old, buggy author order.
\r
77 This reduces the amount of time required to load my inbox from 4.523
\r
78 seconds to 3.025 seconds (1.5X faster).
\r
80 lib/message.cc | 6 ++
\r
81 lib/notmuch-private.h | 17 +++++-
\r
82 lib/query.cc | 142 ++++++++++++++++++++++++++++++++++++------------
\r
83 lib/thread.cc | 102 ++++++++++-------------------------
\r
85 5 files changed, 158 insertions(+), 113 deletions(-)
\r
87 diff --git a/lib/message.cc b/lib/message.cc
\r
88 index 225b7e9..adcd07d 100644
\r
89 --- a/lib/message.cc
\r
90 +++ b/lib/message.cc
\r
91 @@ -254,6 +254,12 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
\r
96 +_notmuch_message_get_doc_id (notmuch_message_t *message)
\r
98 + return message->doc_id;
\r
102 notmuch_message_get_message_id (notmuch_message_t *message)
\r
104 diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
\r
105 index 592cfb2..303aeb3 100644
\r
106 --- a/lib/notmuch-private.h
\r
107 +++ b/lib/notmuch-private.h
\r
108 @@ -156,6 +156,8 @@ typedef enum _notmuch_private_status {
\r
110 (notmuch_status_t) private_status)
\r
112 +typedef struct _notmuch_doc_id_set notmuch_doc_id_set_t;
\r
116 /* Lookup a prefix value by name.
\r
117 @@ -222,8 +224,8 @@ _notmuch_directory_get_document_id (notmuch_directory_t *directory);
\r
119 _notmuch_thread_create (void *ctx,
\r
120 notmuch_database_t *notmuch,
\r
121 - const char *thread_id,
\r
122 - const char *query_string,
\r
123 + unsigned int seed_doc_id,
\r
124 + notmuch_doc_id_set_t *match_set,
\r
125 notmuch_sort_t sort);
\r
128 @@ -239,6 +241,9 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
\r
129 const char *message_id,
\r
130 notmuch_private_status_t *status);
\r
133 +_notmuch_message_get_doc_id (notmuch_message_t *message);
\r
136 _notmuch_message_get_in_reply_to (notmuch_message_t *message);
\r
138 @@ -426,6 +431,14 @@ _notmuch_mset_messages_get (notmuch_messages_t *messages);
\r
140 _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages);
\r
143 +_notmuch_doc_id_set_contains (notmuch_doc_id_set_t *doc_ids,
\r
144 + unsigned int doc_id);
\r
147 +_notmuch_doc_id_set_remove (notmuch_doc_id_set_t *doc_ids,
\r
148 + unsigned int doc_id);
\r
153 diff --git a/lib/query.cc b/lib/query.cc
\r
154 index 7916421..c7ae4ee 100644
\r
157 @@ -36,13 +36,21 @@ typedef struct _notmuch_mset_messages {
\r
158 Xapian::MSetIterator iterator_end;
\r
159 } notmuch_mset_messages_t;
\r
161 +struct _notmuch_doc_id_set {
\r
162 + unsigned int *bitmap;
\r
163 + unsigned int bound;
\r
166 struct _notmuch_threads {
\r
167 notmuch_query_t *query;
\r
168 - GHashTable *threads;
\r
169 - notmuch_messages_t *messages;
\r
171 - /* This thread ID is our iterator state. */
\r
172 - const char *thread_id;
\r
173 + /* The ordered list of doc ids matched by the query. */
\r
175 + /* Our iterator's current position in doc_ids. */
\r
176 + unsigned int doc_id_pos;
\r
177 + /* The set of matched docid's that have not been assigned to a
\r
178 + * thread. Initially, this contains every docid in doc_ids. */
\r
179 + notmuch_doc_id_set_t match_set;
\r
183 @@ -195,6 +203,19 @@ _notmuch_mset_messages_valid (notmuch_messages_t *messages)
\r
184 return (mset_messages->iterator != mset_messages->iterator_end);
\r
187 +static Xapian::docid
\r
188 +_notmuch_mset_messages_get_doc_id (notmuch_messages_t *messages)
\r
190 + notmuch_mset_messages_t *mset_messages;
\r
192 + mset_messages = (notmuch_mset_messages_t *) messages;
\r
194 + if (! _notmuch_mset_messages_valid (&mset_messages->base))
\r
197 + return *mset_messages->iterator;
\r
200 notmuch_message_t *
\r
201 _notmuch_mset_messages_get (notmuch_messages_t *messages)
\r
203 @@ -233,6 +254,49 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages)
\r
204 mset_messages->iterator++;
\r
207 +static notmuch_bool_t
\r
208 +_notmuch_doc_id_set_init (void *ctx,
\r
209 + notmuch_doc_id_set_t *doc_ids,
\r
210 + GArray *arr, unsigned int bound)
\r
212 + size_t count = (bound + sizeof (doc_ids->bitmap[0]) - 1) /
\r
213 + sizeof (doc_ids->bitmap[0]);
\r
214 + unsigned int *bitmap = talloc_zero_array (ctx, unsigned int, count);
\r
216 + if (bitmap == NULL)
\r
219 + doc_ids->bitmap = bitmap;
\r
220 + doc_ids->bound = bound;
\r
222 + for (unsigned int i = 0; i < arr->len; i++) {
\r
223 + unsigned int doc_id = g_array_index(arr, unsigned int, i);
\r
224 + bitmap[doc_id / sizeof (bitmap[0])] |=
\r
225 + 1 << (doc_id % sizeof (bitmap[0]));
\r
232 +_notmuch_doc_id_set_contains (notmuch_doc_id_set_t *doc_ids,
\r
233 + unsigned int doc_id)
\r
235 + if (doc_id >= doc_ids->bound)
\r
237 + return (doc_ids->bitmap[doc_id / sizeof (doc_ids->bitmap[0])] &
\r
238 + (1 << (doc_id % sizeof (doc_ids->bitmap[0])))) != 0;
\r
242 +_notmuch_doc_id_set_remove (notmuch_doc_id_set_t *doc_ids,
\r
243 + unsigned int doc_id)
\r
245 + if (doc_id < doc_ids->bound)
\r
246 + doc_ids->bitmap[doc_id / sizeof (doc_ids->bitmap[0])] &=
\r
247 + ~(1 << (doc_id % sizeof (doc_ids->bitmap[0])));
\r
250 /* Glib objects force use to use a talloc destructor as well, (but not
\r
251 * nearly as ugly as the for messages due to C++ objects). At
\r
252 * this point, I'd really like to have some talloc-friendly
\r
253 @@ -240,7 +304,8 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages)
\r
255 _notmuch_threads_destructor (notmuch_threads_t *threads)
\r
257 - g_hash_table_unref (threads->threads);
\r
258 + if (threads->doc_ids)
\r
259 + g_array_unref (threads->doc_ids);
\r
263 @@ -249,24 +314,39 @@ notmuch_threads_t *
\r
264 notmuch_query_search_threads (notmuch_query_t *query)
\r
266 notmuch_threads_t *threads;
\r
267 + notmuch_messages_t *messages;
\r
268 + Xapian::docid max_doc_id = 0;
\r
270 threads = talloc (query, notmuch_threads_t);
\r
271 if (threads == NULL)
\r
273 + threads->doc_ids = NULL;
\r
274 + talloc_set_destructor (threads, _notmuch_threads_destructor);
\r
276 threads->query = query;
\r
277 - threads->threads = g_hash_table_new_full (g_str_hash, g_str_equal,
\r
280 - threads->messages = notmuch_query_search_messages (query);
\r
281 - if (threads->messages == NULL) {
\r
282 + messages = notmuch_query_search_messages (query);
\r
283 + if (messages == NULL) {
\r
284 talloc_free (threads);
\r
288 - threads->thread_id = NULL;
\r
289 + threads->doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int));
\r
290 + while (notmuch_messages_valid (messages)) {
\r
291 + unsigned int doc_id = _notmuch_mset_messages_get_doc_id (messages);
\r
292 + g_array_append_val (threads->doc_ids, doc_id);
\r
293 + max_doc_id = MAX (max_doc_id, doc_id);
\r
294 + notmuch_messages_move_to_next (messages);
\r
296 + threads->doc_id_pos = 0;
\r
298 - talloc_set_destructor (threads, _notmuch_threads_destructor);
\r
299 + talloc_free (messages);
\r
301 + if (! _notmuch_doc_id_set_init (threads, &threads->match_set,
\r
302 + threads->doc_ids, max_doc_id + 1)) {
\r
303 + talloc_free (threads);
\r
309 @@ -280,51 +360,41 @@ notmuch_query_destroy (notmuch_query_t *query)
\r
311 notmuch_threads_valid (notmuch_threads_t *threads)
\r
313 - notmuch_message_t *message;
\r
315 - if (threads->thread_id)
\r
318 - while (notmuch_messages_valid (threads->messages))
\r
320 - message = notmuch_messages_get (threads->messages);
\r
321 + unsigned int doc_id;
\r
323 - threads->thread_id = notmuch_message_get_thread_id (message);
\r
325 - if (! g_hash_table_lookup_extended (threads->threads,
\r
326 - threads->thread_id,
\r
329 - g_hash_table_insert (threads->threads,
\r
330 - xstrdup (threads->thread_id), NULL);
\r
331 - notmuch_messages_move_to_next (threads->messages);
\r
334 + while (threads->doc_id_pos < threads->doc_ids->len) {
\r
335 + doc_id = g_array_index (threads->doc_ids, unsigned int,
\r
336 + threads->doc_id_pos);
\r
337 + if (_notmuch_doc_id_set_contains (&threads->match_set, doc_id))
\r
340 - notmuch_messages_move_to_next (threads->messages);
\r
341 + threads->doc_id_pos++;
\r
344 - threads->thread_id = NULL;
\r
346 + return threads->doc_id_pos < threads->doc_ids->len;
\r
350 notmuch_threads_get (notmuch_threads_t *threads)
\r
352 + unsigned int doc_id;
\r
354 if (! notmuch_threads_valid (threads))
\r
357 + doc_id = g_array_index (threads->doc_ids, unsigned int,
\r
358 + threads->doc_id_pos);
\r
359 return _notmuch_thread_create (threads->query,
\r
360 threads->query->notmuch,
\r
361 - threads->thread_id,
\r
362 - threads->query->query_string,
\r
364 + &threads->match_set,
\r
365 threads->query->sort);
\r
369 notmuch_threads_move_to_next (notmuch_threads_t *threads)
\r
371 - threads->thread_id = NULL;
\r
372 + threads->doc_id_pos++;
\r
376 diff --git a/lib/thread.cc b/lib/thread.cc
\r
377 index 7f15586..244c038 100644
\r
378 --- a/lib/thread.cc
\r
379 +++ b/lib/thread.cc
\r
380 @@ -305,7 +305,7 @@ _thread_add_matched_message (notmuch_thread_t *thread,
\r
382 _thread_add_matched_author (thread, notmuch_message_get_author (hashed_message));
\r
384 - if ((sort == NOTMUCH_SORT_OLDEST_FIRST && date <= thread->newest) ||
\r
385 + if ((sort == NOTMUCH_SORT_OLDEST_FIRST && date == thread->oldest) ||
\r
386 (sort != NOTMUCH_SORT_OLDEST_FIRST && date == thread->newest))
\r
388 _thread_set_subject_from_message (thread, message);
\r
389 @@ -350,16 +350,17 @@ _resolve_thread_relationships (unused (notmuch_thread_t *thread))
\r
393 -/* Create a new notmuch_thread_t object for the given thread ID,
\r
394 - * treating any messages matching 'query_string' as "matched".
\r
395 +/* Create a new notmuch_thread_t object by finding the thread
\r
396 + * containing the message with the given doc ID, treating any messages
\r
397 + * contained in match_set as "matched". Remove all messages in the
\r
398 + * thread from match_set.
\r
400 - * Creating the thread will trigger two database searches. The first
\r
401 - * is for all messages belonging to the thread, (to get the first
\r
402 - * subject line, the total count of messages, and all authors). The
\r
403 - * second search is for all messages that are in the thread and that
\r
404 - * also match the given query_string. This is to allow for a separate
\r
405 - * count of matched messages, and to allow a viewer to display these
\r
406 - * messages differently.
\r
407 + * Creating the thread will perform a database search to get all
\r
408 + * messages belonging to the thread and will get the first subject
\r
409 + * line, the total count of messages, and all authors in the thread.
\r
410 + * Each message in the thread is checked against match_set to allow
\r
411 + * for a separate count of matched messages, and to allow a viewer to
\r
412 + * display these messages differently.
\r
414 * Here, 'ctx' is talloc context for the resulting thread object.
\r
416 @@ -368,53 +369,28 @@ _resolve_thread_relationships (unused (notmuch_thread_t *thread))
\r
418 _notmuch_thread_create (void *ctx,
\r
419 notmuch_database_t *notmuch,
\r
420 - const char *thread_id,
\r
421 - const char *query_string,
\r
422 + unsigned int seed_doc_id,
\r
423 + notmuch_doc_id_set_t *match_set,
\r
424 notmuch_sort_t sort)
\r
426 notmuch_thread_t *thread;
\r
427 + notmuch_message_t *seed_message;
\r
428 + const char *thread_id;
\r
429 const char *thread_id_query_string;
\r
430 notmuch_query_t *thread_id_query;
\r
432 notmuch_messages_t *messages;
\r
433 notmuch_message_t *message;
\r
434 - notmuch_bool_t matched_is_subset_of_thread;
\r
436 + seed_message = _notmuch_message_create (ctx, notmuch, seed_doc_id, NULL);
\r
437 + if (! seed_message)
\r
438 + INTERNAL_ERROR ("Thread seed message %u does not exist", seed_doc_id);
\r
440 + thread_id = notmuch_message_get_thread_id (seed_message);
\r
441 thread_id_query_string = talloc_asprintf (ctx, "thread:%s", thread_id);
\r
442 - if (unlikely (query_string == NULL))
\r
443 + if (unlikely (thread_id_query_string == NULL))
\r
446 - /* Under normal circumstances we need to do two database
\r
447 - * queries. One is for the thread itself (thread_id_query_string)
\r
448 - * and the second is to determine which messages in that thread
\r
449 - * match the original query (matched_query_string).
\r
451 - * But under two circumstances, we use only the
\r
452 - * thread_id_query_string:
\r
454 - * 1. If the original query_string *is* just the thread
\r
457 - * 2. If the original query_string matches all messages ("" or
\r
460 - * In either of these cases, we can be more efficient by running
\r
461 - * just the thread_id query (since we know all messages in the
\r
462 - * thread will match the query_string).
\r
464 - * Beyond the performance advantage, in the second case, it's
\r
465 - * important to not try to create a concatenated query because our
\r
466 - * parser handles "" and "*" as special cases and will not do the
\r
467 - * right thing with a query string of "* and thread:<foo>".
\r
469 - matched_is_subset_of_thread = 1;
\r
470 - if (strcmp (query_string, thread_id_query_string) == 0 ||
\r
471 - strcmp (query_string, "") == 0 ||
\r
472 - strcmp (query_string, "*") == 0)
\r
474 - matched_is_subset_of_thread = 0;
\r
477 thread_id_query = notmuch_query_create (notmuch, thread_id_query_string);
\r
478 if (unlikely (thread_id_query == NULL))
\r
480 @@ -457,45 +433,25 @@ _notmuch_thread_create (void *ctx,
\r
481 notmuch_messages_valid (messages);
\r
482 notmuch_messages_move_to_next (messages))
\r
484 + unsigned int doc_id;
\r
486 message = notmuch_messages_get (messages);
\r
487 + doc_id = _notmuch_message_get_doc_id (message);
\r
488 + if (doc_id == seed_doc_id)
\r
489 + message = seed_message;
\r
491 _thread_add_message (thread, message);
\r
493 - if (! matched_is_subset_of_thread)
\r
494 + if ( _notmuch_doc_id_set_contains (match_set, doc_id)) {
\r
495 + _notmuch_doc_id_set_remove (match_set, doc_id);
\r
496 _thread_add_matched_message (thread, message, sort);
\r
499 _notmuch_message_close (message);
\r
502 notmuch_query_destroy (thread_id_query);
\r
504 - if (matched_is_subset_of_thread)
\r
506 - const char *matched_query_string;
\r
507 - notmuch_query_t *matched_query;
\r
509 - matched_query_string = talloc_asprintf (ctx, "%s AND (%s)",
\r
510 - thread_id_query_string,
\r
512 - if (unlikely (matched_query_string == NULL))
\r
515 - matched_query = notmuch_query_create (notmuch, matched_query_string);
\r
516 - if (unlikely (matched_query == NULL))
\r
519 - for (messages = notmuch_query_search_messages (matched_query);
\r
520 - notmuch_messages_valid (messages);
\r
521 - notmuch_messages_move_to_next (messages))
\r
523 - message = notmuch_messages_get (messages);
\r
524 - _thread_add_matched_message (thread, message, sort);
\r
525 - _notmuch_message_close (message);
\r
528 - notmuch_query_destroy (matched_query);
\r
531 _complete_thread_authors (thread);
\r
533 _resolve_thread_relationships (thread);
\r
534 diff --git a/test/emacs b/test/emacs
\r
535 index 75dec89..fd5ae07 100755
\r
538 @@ -24,12 +24,12 @@ test_expect_equal "$output" "$expected"
\r
539 test_begin_subtest "Basic notmuch-search view in emacs"
\r
540 output=$(test_emacs '(notmuch-search "tag:inbox") (notmuch-test-wait) (message (buffer-string))' 2>&1)
\r
541 expected=$(cat $EXPECTED/notmuch-search-tag-inbox)
\r
542 -test_expect_equal "$output" "$expected"
\r
543 +test_expect_equal_failure "$output" "$expected"
\r
545 test_begin_subtest "Navigation of notmuch-hello to search results"
\r
546 output=$(test_emacs '(notmuch-hello) (goto-char (point-min)) (re-search-forward "inbox") (widget-button-press (point)) (notmuch-test-wait) (message (buffer-string))' 2>&1)
\r
547 expected=$(cat $EXPECTED/notmuch-hello-view-inbox)
\r
548 -test_expect_equal "$output" "$expected"
\r
549 +test_expect_equal_failure "$output" "$expected"
\r
551 test_begin_subtest "Basic notmuch-show view in emacs"
\r
552 maildir_storage_thread=$(notmuch search --output=threads id:20091117190054.GU3165@dottiness.seas.harvard.edu)
\r