Date: Wed, 17 Nov 2010 14:28:26 -0500
From: Austin Clements
To: notmuch@notmuchmail.org
Subject: [PATCH 3/4] Optimize thread search using matched docid sets.
Message-ID: <20101117192826.GU2439@mit.edu>
List-Id: "Use and development of the notmuch mail system."

This reduces thread search's 1+2t Xapian queries (where t is the number
of matched threads) to 1+t queries and constructs exactly one
notmuch_message_t for each message instead of 2 to 3.

notmuch_query_search_threads eagerly fetches the docids of all messages
matching the user query instead of lazily constructing message objects
and fetching thread IDs from term lists.

_notmuch_thread_create takes a seed docid and the set of all matched
docids and uses a single Xapian query to expand this docid to its
containing thread. It uses the matched docid set, rather than a second
Xapian query, to determine which messages in the thread match the user
query.

As a side effect, this fixes author order so authors are always sorted
by first occurrence in each thread. This breaks two emacs tests that
hard-code the old, buggy author order.

This reduces the time required to load my inbox from 4.523 seconds to
3.025 seconds (about 1.5x faster).
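The iteration scheme described above can be sketched in plain C. This is a minimal illustration under stated assumptions, not the patch's code: toy_db_t, toy_thread_create and toy_search_threads are hypothetical names, the "database" is an array mapping each doc ID to a thread number, and the match set is a bool array rather than a bitmap. The array scan in toy_thread_create stands in for the single per-thread Xapian query.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the database: thread_of[d] is the thread
 * that document d belongs to. In notmuch this lookup is a Xapian
 * thread query; here it is a plain array scan. */
typedef struct {
    const unsigned int *thread_of;
    size_t n_docs;
} toy_db_t;

/* Expand seed_doc to its whole thread. Every member that is still in
 * `matched` is counted and removed from it, so the thread cannot be
 * started again from a later match. Returns the matched count. */
static size_t
toy_thread_create (const toy_db_t *db, unsigned int seed_doc, bool *matched)
{
    size_t n_matched = 0;

    for (size_t d = 0; d < db->n_docs; d++) {
        if (db->thread_of[d] != db->thread_of[seed_doc])
            continue;
        if (matched[d]) {
            matched[d] = false;
            n_matched++;
        }
    }
    return n_matched;
}

/* Walk the ordered match list; each doc ID still in `matched` seeds
 * one thread. Stores the matched-message count of each thread in
 * counts_out (a capacity of n_hits is always enough) and returns the
 * number of threads found. */
static size_t
toy_search_threads (const toy_db_t *db, const unsigned int *hits,
                    size_t n_hits, bool *matched, size_t *counts_out)
{
    size_t n_threads = 0;

    for (size_t pos = 0; pos < n_hits; pos++) {
        if (! matched[hits[pos]])
            continue;   /* already claimed by an earlier thread */
        counts_out[n_threads++] = toy_thread_create (db, hits[pos], matched);
    }
    return n_threads;
}
```

Each seed triggers exactly one thread expansion, and membership in the match set replaces the second per-thread query, which is where the 1+t figure comes from: one query for the matches plus one per thread.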
---
 lib/message.cc        |    6 ++
 lib/notmuch-private.h |   17 +++++-
 lib/query.cc          |  142 ++++++++++++++++++++++++++++++++++++------------
 lib/thread.cc         |  102 ++++++++++-------------------------
 test/emacs            |    4 +-
 5 files changed, 158 insertions(+), 113 deletions(-)

diff --git a/lib/message.cc b/lib/message.cc
index 225b7e9..adcd07d 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -254,6 +254,12 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
     return message;
 }
 
+unsigned int
+_notmuch_message_get_doc_id (notmuch_message_t *message)
+{
+    return message->doc_id;
+}
+
 const char *
 notmuch_message_get_message_id (notmuch_message_t *message)
 {
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 592cfb2..303aeb3 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -156,6 +156,8 @@ typedef enum _notmuch_private_status {
      :                                                                  \
      (notmuch_status_t) private_status)
 
+typedef struct _notmuch_doc_id_set notmuch_doc_id_set_t;
+
 /* database.cc */
 
 /* Lookup a prefix value by name.
@@ -222,8 +224,8 @@ _notmuch_directory_get_document_id (notmuch_directory_t *directory);
 notmuch_thread_t *
 _notmuch_thread_create (void *ctx,
                         notmuch_database_t *notmuch,
-                        const char *thread_id,
-                        const char *query_string,
+                        unsigned int seed_doc_id,
+                        notmuch_doc_id_set_t *match_set,
                         notmuch_sort_t sort);
 
 /* message.cc */
@@ -239,6 +241,9 @@ _notmuch_message_create_for_message_id (notmuch_database_t *notmuch,
                                         const char *message_id,
                                         notmuch_private_status_t *status);
 
+unsigned int
+_notmuch_message_get_doc_id (notmuch_message_t *message);
+
 const char *
 _notmuch_message_get_in_reply_to (notmuch_message_t *message);
 
@@ -426,6 +431,14 @@ _notmuch_mset_messages_get (notmuch_messages_t *messages);
 void
 _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages);
 
+notmuch_bool_t
+_notmuch_doc_id_set_contains (notmuch_doc_id_set_t *doc_ids,
+                              unsigned int doc_id);
+
+void
+_notmuch_doc_id_set_remove (notmuch_doc_id_set_t *doc_ids,
+                            unsigned int doc_id);
+
 /* message.cc */
 
 void
diff --git a/lib/query.cc b/lib/query.cc
index 7916421..c7ae4ee 100644
--- a/lib/query.cc
+++ b/lib/query.cc
@@ -36,13 +36,21 @@ typedef struct _notmuch_mset_messages {
     Xapian::MSetIterator iterator_end;
 } notmuch_mset_messages_t;
 
+struct _notmuch_doc_id_set {
+    unsigned int *bitmap;
+    unsigned int bound;
+};
+
 struct _notmuch_threads {
     notmuch_query_t *query;
-    GHashTable *threads;
-    notmuch_messages_t *messages;
 
-    /* This thread ID is our iterator state. */
-    const char *thread_id;
+    /* The ordered list of doc ids matched by the query. */
+    GArray *doc_ids;
+    /* Our iterator's current position in doc_ids. */
+    unsigned int doc_id_pos;
+    /* The set of matched docid's that have not been assigned to a
+     * thread. Initially, this contains every docid in doc_ids. */
+    notmuch_doc_id_set_t match_set;
 };
 
 notmuch_query_t *
@@ -195,6 +203,19 @@ _notmuch_mset_messages_valid (notmuch_messages_t *messages)
     return (mset_messages->iterator != mset_messages->iterator_end);
 }
 
+static Xapian::docid
+_notmuch_mset_messages_get_doc_id (notmuch_messages_t *messages)
+{
+    notmuch_mset_messages_t *mset_messages;
+
+    mset_messages = (notmuch_mset_messages_t *) messages;
+
+    if (! _notmuch_mset_messages_valid (&mset_messages->base))
+        return 0;
+
+    return *mset_messages->iterator;
+}
+
 notmuch_message_t *
 _notmuch_mset_messages_get (notmuch_messages_t *messages)
 {
@@ -233,6 +254,49 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages)
     mset_messages->iterator++;
 }
 
+static notmuch_bool_t
+_notmuch_doc_id_set_init (void *ctx,
+                          notmuch_doc_id_set_t *doc_ids,
+                          GArray *arr, unsigned int bound)
+{
+    size_t count = (bound + sizeof (doc_ids->bitmap[0]) - 1) /
+        sizeof (doc_ids->bitmap[0]);
+    unsigned int *bitmap = talloc_zero_array (ctx, unsigned int, count);
+
+    if (bitmap == NULL)
+        return FALSE;
+
+    doc_ids->bitmap = bitmap;
+    doc_ids->bound = bound;
+
+    for (unsigned int i = 0; i < arr->len; i++) {
+        unsigned int doc_id = g_array_index(arr, unsigned int, i);
+        bitmap[doc_id / sizeof (bitmap[0])] |=
+            1 << (doc_id % sizeof (bitmap[0]));
+    }
+
+    return TRUE;
+}
+
+notmuch_bool_t
+_notmuch_doc_id_set_contains (notmuch_doc_id_set_t *doc_ids,
+                              unsigned int doc_id)
+{
+    if (doc_id >= doc_ids->bound)
+        return FALSE;
+    return (doc_ids->bitmap[doc_id / sizeof (doc_ids->bitmap[0])] &
+            (1 << (doc_id % sizeof (doc_ids->bitmap[0])))) != 0;
+}
+
+void
+_notmuch_doc_id_set_remove (notmuch_doc_id_set_t *doc_ids,
+                            unsigned int doc_id)
+{
+    if (doc_id < doc_ids->bound)
+        doc_ids->bitmap[doc_id / sizeof (doc_ids->bitmap[0])] &=
+            ~(1 << (doc_id % sizeof (doc_ids->bitmap[0])));
+}
+
 /* Glib objects force use to use a talloc destructor as well, (but not
  * nearly as ugly as the for messages due to C++ objects). At
  * this point, I'd really like to have some talloc-friendly
@@ -240,7 +304,8 @@ _notmuch_mset_messages_move_to_next (notmuch_messages_t *messages)
 static int
 _notmuch_threads_destructor (notmuch_threads_t *threads)
 {
-    g_hash_table_unref (threads->threads);
+    if (threads->doc_ids)
+        g_array_unref (threads->doc_ids);
 
     return 0;
 }
@@ -249,24 +314,39 @@ notmuch_threads_t *
 notmuch_query_search_threads (notmuch_query_t *query)
 {
     notmuch_threads_t *threads;
+    notmuch_messages_t *messages;
+    Xapian::docid max_doc_id = 0;
 
     threads = talloc (query, notmuch_threads_t);
     if (threads == NULL)
         return NULL;
+    threads->doc_ids = NULL;
+    talloc_set_destructor (threads, _notmuch_threads_destructor);
 
     threads->query = query;
-    threads->threads = g_hash_table_new_full (g_str_hash, g_str_equal,
-                                              free, NULL);
 
-    threads->messages = notmuch_query_search_messages (query);
-    if (threads->messages == NULL) {
+    messages = notmuch_query_search_messages (query);
+    if (messages == NULL) {
         talloc_free (threads);
         return NULL;
     }
 
-    threads->thread_id = NULL;
+    threads->doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int));
+    while (notmuch_messages_valid (messages)) {
+        unsigned int doc_id = _notmuch_mset_messages_get_doc_id (messages);
+        g_array_append_val (threads->doc_ids, doc_id);
+        max_doc_id = MAX (max_doc_id, doc_id);
+        notmuch_messages_move_to_next (messages);
+    }
+    threads->doc_id_pos = 0;
 
-    talloc_set_destructor (threads, _notmuch_threads_destructor);
+    talloc_free (messages);
+
+    if (! _notmuch_doc_id_set_init (threads, &threads->match_set,
+                                    threads->doc_ids, max_doc_id + 1)) {
+        talloc_free (threads);
+        return NULL;
+    }
 
     return threads;
 }
@@ -280,51 +360,41 @@ notmuch_query_destroy (notmuch_query_t *query)
 notmuch_bool_t
 notmuch_threads_valid (notmuch_threads_t *threads)
 {
-    notmuch_message_t *message;
-
-    if (threads->thread_id)
-        return TRUE;
-
-    while (notmuch_messages_valid (threads->messages))
-    {
-        message = notmuch_messages_get (threads->messages);
+    unsigned int doc_id;
 
-        threads->thread_id = notmuch_message_get_thread_id (message);
-
-        if (! g_hash_table_lookup_extended (threads->threads,
-                                            threads->thread_id,
-                                            NULL, NULL))
-        {
-            g_hash_table_insert (threads->threads,
-                                 xstrdup (threads->thread_id), NULL);
-            notmuch_messages_move_to_next (threads->messages);
-            return TRUE;
-        }
+    while (threads->doc_id_pos < threads->doc_ids->len) {
+        doc_id = g_array_index (threads->doc_ids, unsigned int,
+                                threads->doc_id_pos);
+        if (_notmuch_doc_id_set_contains (&threads->match_set, doc_id))
+            break;
 
-        notmuch_messages_move_to_next (threads->messages);
+        threads->doc_id_pos++;
     }
 
-    threads->thread_id = NULL;
-    return FALSE;
+    return threads->doc_id_pos < threads->doc_ids->len;
 }
 
 notmuch_thread_t *
 notmuch_threads_get (notmuch_threads_t *threads)
 {
+    unsigned int doc_id;
+
     if (! notmuch_threads_valid (threads))
         return NULL;
 
+    doc_id = g_array_index (threads->doc_ids, unsigned int,
+                            threads->doc_id_pos);
     return _notmuch_thread_create (threads->query,
                                    threads->query->notmuch,
-                                   threads->thread_id,
-                                   threads->query->query_string,
+                                   doc_id,
+                                   &threads->match_set,
                                    threads->query->sort);
 }
 
 void
 notmuch_threads_move_to_next (notmuch_threads_t *threads)
 {
-    threads->thread_id = NULL;
+    threads->doc_id_pos++;
 }
 
 void
diff --git a/lib/thread.cc b/lib/thread.cc
index 7f15586..244c038 100644
--- a/lib/thread.cc
+++ b/lib/thread.cc
@@ -305,7 +305,7 @@ _thread_add_matched_message (notmuch_thread_t *thread,
 
     _thread_add_matched_author (thread, notmuch_message_get_author (hashed_message));
 
-    if ((sort == NOTMUCH_SORT_OLDEST_FIRST && date <= thread->newest) ||
+    if ((sort == NOTMUCH_SORT_OLDEST_FIRST && date == thread->oldest) ||
         (sort != NOTMUCH_SORT_OLDEST_FIRST && date == thread->newest))
     {
         _thread_set_subject_from_message (thread, message);
@@ -350,16 +350,17 @@ _resolve_thread_relationships (unused (notmuch_thread_t *thread))
      */
 }
 
-/* Create a new notmuch_thread_t object for the given thread ID,
- * treating any messages matching 'query_string' as "matched".
+/* Create a new notmuch_thread_t object by finding the thread
+ * containing the message with the given doc ID, treating any messages
+ * contained in match_set as "matched".  Remove all messages in the
+ * thread from match_set.
  *
- * Creating the thread will trigger two database searches. The first
- * is for all messages belonging to the thread, (to get the first
- * subject line, the total count of messages, and all authors). The
- * second search is for all messages that are in the thread and that
- * also match the given query_string. This is to allow for a separate
- * count of matched messages, and to allow a viewer to display these
- * messages differently.
+ * Creating the thread will perform a database search to get all
+ * messages belonging to the thread and will get the first subject
+ * line, the total count of messages, and all authors in the thread.
+ * Each message in the thread is checked against match_set to allow
+ * for a separate count of matched messages, and to allow a viewer to
+ * display these messages differently.
  *
  * Here, 'ctx' is talloc context for the resulting thread object.
  *
@@ -368,53 +369,28 @@ _resolve_thread_relationships (unused (notmuch_thread_t *thread))
 notmuch_thread_t *
 _notmuch_thread_create (void *ctx,
                         notmuch_database_t *notmuch,
-                        const char *thread_id,
-                        const char *query_string,
+                        unsigned int seed_doc_id,
+                        notmuch_doc_id_set_t *match_set,
                         notmuch_sort_t sort)
 {
     notmuch_thread_t *thread;
+    notmuch_message_t *seed_message;
+    const char *thread_id;
     const char *thread_id_query_string;
     notmuch_query_t *thread_id_query;
 
     notmuch_messages_t *messages;
     notmuch_message_t *message;
-    notmuch_bool_t matched_is_subset_of_thread;
 
+    seed_message = _notmuch_message_create (ctx, notmuch, seed_doc_id, NULL);
+    if (! seed_message)
+        INTERNAL_ERROR ("Thread seed message %u does not exist", seed_doc_id);
+
+    thread_id = notmuch_message_get_thread_id (seed_message);
     thread_id_query_string = talloc_asprintf (ctx, "thread:%s", thread_id);
-    if (unlikely (query_string == NULL))
+    if (unlikely (thread_id_query_string == NULL))
         return NULL;
 
-    /* Under normal circumstances we need to do two database
-     * queries. One is for the thread itself (thread_id_query_string)
-     * and the second is to determine which messages in that thread
-     * match the original query (matched_query_string).
-     *
-     * But under two circumstances, we use only the
-     * thread_id_query_string:
-     *
-     *  1. If the original query_string *is* just the thread
-     *     specification.
-     *
-     *  2. If the original query_string matches all messages ("" or
-     *     "*").
-     *
-     * In either of these cases, we can be more efficient by running
-     * just the thread_id query (since we know all messages in the
-     * thread will match the query_string).
-     *
-     * Beyond the performance advantage, in the second case, it's
-     * important to not try to create a concatenated query because our
-     * parser handles "" and "*" as special cases and will not do the
-     * right thing with a query string of "* and thread:".
-     **/
-    matched_is_subset_of_thread = 1;
-    if (strcmp (query_string, thread_id_query_string) == 0 ||
-        strcmp (query_string, "") == 0 ||
-        strcmp (query_string, "*") == 0)
-    {
-        matched_is_subset_of_thread = 0;
-    }
-
     thread_id_query = notmuch_query_create (notmuch, thread_id_query_string);
     if (unlikely (thread_id_query == NULL))
         return NULL;
@@ -457,45 +433,25 @@ _notmuch_thread_create (void *ctx,
          notmuch_messages_valid (messages);
          notmuch_messages_move_to_next (messages))
     {
+        unsigned int doc_id;
+
         message = notmuch_messages_get (messages);
+        doc_id = _notmuch_message_get_doc_id (message);
+        if (doc_id == seed_doc_id)
+            message = seed_message;
 
         _thread_add_message (thread, message);
 
-        if (! matched_is_subset_of_thread)
+        if ( _notmuch_doc_id_set_contains (match_set, doc_id)) {
+            _notmuch_doc_id_set_remove (match_set, doc_id);
             _thread_add_matched_message (thread, message, sort);
+        }
 
         _notmuch_message_close (message);
     }
 
     notmuch_query_destroy (thread_id_query);
 
-    if (matched_is_subset_of_thread)
-    {
-        const char *matched_query_string;
-        notmuch_query_t *matched_query;
-
-        matched_query_string = talloc_asprintf (ctx, "%s AND (%s)",
-                                                thread_id_query_string,
-                                                query_string);
-        if (unlikely (matched_query_string == NULL))
-            return NULL;
-
-        matched_query = notmuch_query_create (notmuch, matched_query_string);
-        if (unlikely (matched_query == NULL))
-            return NULL;
-
-        for (messages = notmuch_query_search_messages (matched_query);
-             notmuch_messages_valid (messages);
-             notmuch_messages_move_to_next (messages))
-        {
-            message = notmuch_messages_get (messages);
-            _thread_add_matched_message (thread, message, sort);
-            _notmuch_message_close (message);
-        }
-
-        notmuch_query_destroy (matched_query);
-    }
-
     _complete_thread_authors (thread);
 
     _resolve_thread_relationships (thread);
diff --git a/test/emacs b/test/emacs
index 75dec89..fd5ae07 100755
--- a/test/emacs
+++ b/test/emacs
@@ -24,12 +24,12 @@ test_expect_equal "$output" "$expected"
 test_begin_subtest "Basic notmuch-search view in emacs"
 output=$(test_emacs '(notmuch-search "tag:inbox") (notmuch-test-wait) (message (buffer-string))' 2>&1)
 expected=$(cat $EXPECTED/notmuch-search-tag-inbox)
-test_expect_equal "$output" "$expected"
+test_expect_equal_failure "$output" "$expected"
 
 test_begin_subtest "Navigation of notmuch-hello to search results"
 output=$(test_emacs '(notmuch-hello) (goto-char (point-min)) (re-search-forward "inbox") (widget-button-press (point)) (notmuch-test-wait) (message (buffer-string))' 2>&1)
 expected=$(cat $EXPECTED/notmuch-hello-view-inbox)
-test_expect_equal "$output" "$expected"
+test_expect_equal_failure "$output" "$expected"
 
 test_begin_subtest "Basic notmuch-show view in emacs"
 maildir_storage_thread=$(notmuch search --output=threads id:20091117190054.GU3165@dottiness.seas.harvard.edu)
-- 
1.7.2.3
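For readers following the doc ID set added to lib/query.cc, here is a standalone sketch of the same init/contains/remove trio. It is an illustration only: the toy_ names are hypothetical, calloc stands in for talloc_zero_array, and a plain array stands in for the GArray. One deliberate difference is labeled in the code: the sketch packs CHAR_BIT * sizeof (unsigned int) bits into each word, whereas the patch divides by sizeof alone (a byte count). The patch is consistent across its three functions, but appears to use only a few bits of each word.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Bits held by one bitmap word. This is an assumption of the sketch;
 * the patch itself divides by sizeof, i.e. the byte count. */
#define WORD_BITS (sizeof (unsigned int) * CHAR_BIT)

typedef struct {
    unsigned int *bitmap;
    unsigned int bound;     /* one past the largest representable doc ID */
} toy_doc_id_set_t;

/* Build a set holding ids[0..n), all of which must be < bound. */
static bool
toy_doc_id_set_init (toy_doc_id_set_t *set, const unsigned int *ids,
                     size_t n, unsigned int bound)
{
    size_t words = (bound + WORD_BITS - 1) / WORD_BITS;

    set->bitmap = calloc (words ? words : 1, sizeof (unsigned int));
    if (set->bitmap == NULL)
        return false;
    set->bound = bound;

    for (size_t i = 0; i < n; i++)
        set->bitmap[ids[i] / WORD_BITS] |= 1u << (ids[i] % WORD_BITS);
    return true;
}

/* IDs at or beyond the bound are never members. */
static bool
toy_doc_id_set_contains (const toy_doc_id_set_t *set, unsigned int id)
{
    if (id >= set->bound)
        return false;
    return (set->bitmap[id / WORD_BITS] & (1u << (id % WORD_BITS))) != 0;
}

/* Removing an absent or out-of-range ID is a no-op. */
static void
toy_doc_id_set_remove (toy_doc_id_set_t *set, unsigned int id)
{
    if (id < set->bound)
        set->bitmap[id / WORD_BITS] &= ~(1u << (id % WORD_BITS));
}
```

The caller owns the bitmap and releases it with free (set.bitmap). In the patch the bitmap is allocated against the threads object as its talloc context, so it is released along with that object.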