Return-Path:
X-Original-To: notmuch@notmuchmail.org
Delivered-To: notmuch@notmuchmail.org
Received: from localhost (localhost [127.0.0.1])
 by olra.theworths.org (Postfix) with ESMTP id 6017C431FB6
 for ; Tue, 3 May 2011 21:10:05 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
X-Spam-Flag: NO
X-Spam-Score: 0
X-Spam-Level:
X-Spam-Status: No, score=0 tagged_above=-999 required=5 tests=[none]
 autolearn=disabled
Received: from olra.theworths.org ([127.0.0.1])
 by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id TpetoHBN9I9K for ;
 Tue, 3 May 2011 21:10:04 -0700 (PDT)
X-Greylist: delayed 1755 seconds by postgrey-1.32 at olra;
 Tue, 03 May 2011 21:10:04 PDT
Received: from imarko.xen.prgmr.com (imarko.xen.prgmr.com [72.13.95.244])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by olra.theworths.org (Postfix) with ESMTPS id 7595A431FB5
 for ; Tue, 3 May 2011 21:10:04 -0700 (PDT)
Received: from localhost ([127.0.0.1] helo=zsu.kismala.com)
 by imarko.xen.prgmr.com with esmtp (Exim 4.72) (envelope-from )
 id 1QHSxC-0002V3-VL for notmuch@notmuchmail.org;
 Tue, 03 May 2011 20:40:47 -0700
From: Istvan Marko
To: notmuch@notmuchmail.org
Subject: storing From and Subject in xapian
Date: Tue, 03 May 2011 20:40:45 -0700
Message-ID:
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.0.50 (gnu/linux)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-Mailman-Approved-At: Wed, 04 May 2011 15:35:17 -0700
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 04 May 2011 04:10:05 -0000

--=-=-=
Content-Type: text/plain

I have been looking at the I/O patterns of "notmuch search" with the
default output format and noticed that it has to parse the maildir file
of every matched message to get the From and Subject headers. I figured
that this must be slowing things down, especially when the files are
not in the filesystem cache.

So I wanted to see how much difference it would make to have the From
and Subject stored in Xapian to avoid this parsing. With the attached
patch I get a speedup of 2x with cached files and almost 10x with
uncached files for searches with many matches.

The attached patch is only intended as a proof of concept. I am not
familiar with Xapian, so I wasn't sure whether this kind of data should
be stored as terms, values or data. I went with values simply because I
saw that message-id and timestamp were already stored that way. Perhaps
storing the fields as data would be more appropriate, since they are
not used for searching or sorting.

Oh, and for some reason I get a blank Subject for about 1% of the
matches.

Is there a downside to this approach? The only one I see is that the
Xapian db size increases by about 1%, but to me the speed increase
would be well worth it.
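To make the values-versus-data question concrete: as far as I understand Xapian's document model, each value lives in its own numbered slot, while the data field is a single opaque string per document. Storing both headers as data would therefore need some packing scheme. The sketch below shows one possibility using only the standard library. The names pack_headers, unpack_headers and example_roundtrip, and the length-prefixed layout, are assumptions made for illustration; none of them is part of notmuch or Xapian.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: pack From and Subject into one blob, such as the
// single string a document's data field could hold.
// Layout: "<decimal length of from>\n<from><subject>".
std::string pack_headers(const std::string &from, const std::string &subject)
{
    return std::to_string(from.size()) + "\n" + from + subject;
}

// Hypothetical inverse of pack_headers. Returns {from, subject}; both are
// empty if the blob does not follow the layout above.
std::pair<std::string, std::string> unpack_headers(const std::string &blob)
{
    const std::size_t nl = blob.find('\n');
    if (nl == std::string::npos || nl == 0)
        return {};

    // Parse the decimal length prefix by hand so that any non-digit
    // character is rejected instead of being silently skipped.
    std::size_t len = 0;
    for (std::size_t i = 0; i < nl; ++i) {
        if (blob[i] < '0' || blob[i] > '9')
            return {};
        len = len * 10 + static_cast<std::size_t>(blob[i] - '0');
    }

    const std::size_t start = nl + 1;
    if (len > blob.size() - start)
        return {};
    return {blob.substr(start, len), blob.substr(start + len)};
}

// Usage sketch: round-trip one pair of headers.
inline void example_roundtrip()
{
    const auto h = unpack_headers(
        pack_headers("Istvan Marko", "storing From and Subject in xapian"));
    assert(h.first == "Istvan Marko");
    assert(h.second == "storing From and Subject in xapian");
}
```

The length prefix means neither header needs escaping, even if it contains a newline. With values, as in the attached patch, no packing is needed because each header gets its own slot.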
--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline; filename=notmuch-xapian-headers.patch

diff --git a/lib/database.cc b/lib/database.cc
index 7f79cf4..5f7f197 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1654,7 +1654,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
             goto DONE;
 
         date = notmuch_message_file_get_header (message_file, "date");
-        _notmuch_message_set_date (message, date);
+        _notmuch_message_set_header_values (message, date, from, subject);
 
         _notmuch_message_index_file (message, filename);
     } else {
diff --git a/lib/message.cc b/lib/message.cc
index ecda75a..8c85c40 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -726,6 +726,14 @@ notmuch_message_get_date (notmuch_message_t *message)
     return Xapian::sortable_unserialise (value);
 }
 
+const char *
+_notmuch_message_get_header_value (notmuch_message_t *message,int valuetag)
+{
+    std::string value;
+    value = message->doc.get_value (valuetag);
+    return value.c_str();
+}
+
 notmuch_tags_t *
 notmuch_message_get_tags (notmuch_message_t *message)
 {
@@ -762,8 +770,10 @@ notmuch_message_set_author (notmuch_message_t *message,
 }
 
 void
-_notmuch_message_set_date (notmuch_message_t *message,
-                           const char *date)
+_notmuch_message_set_header_values (notmuch_message_t *message,
+                                    const char *date,
+                                    const char *from,
+                                    const char *subject)
 {
     time_t time_value;
 
@@ -776,6 +786,8 @@ _notmuch_message_set_date (notmuch_message_t *message,
 
     message->doc.add_value (NOTMUCH_VALUE_TIMESTAMP,
                             Xapian::sortable_serialise (time_value));
+    message->doc.add_value (NOTMUCH_VALUE_FROM, from);
+    message->doc.add_value (NOTMUCH_VALUE_SUBJECT, subject);
 }
 
 /* Synchronize changes made to message->doc out into the database. */
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 0856751..ef6348a 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -105,7 +105,9 @@ _internal_error (const char *format, ...) PRINTF_ATTRIBUTE (1, 2);
 
 typedef enum {
     NOTMUCH_VALUE_TIMESTAMP = 0,
-    NOTMUCH_VALUE_MESSAGE_ID
+    NOTMUCH_VALUE_MESSAGE_ID,
+    NOTMUCH_VALUE_FROM,
+    NOTMUCH_VALUE_SUBJECT
 } notmuch_value_t;
 
 /* Xapian (with flint backend) complains if we provide a term longer
@@ -281,8 +283,14 @@ void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);
 
 void
-_notmuch_message_set_date (notmuch_message_t *message,
-                           const char *date);
+_notmuch_message_set_header_values (notmuch_message_t *message,
+                                    const char *date,
+                                    const char *from,
+                                    const char *subject);
+const char *
+_notmuch_message_get_header_value (notmuch_message_t *message,
+                                   int valuetag);
+
 
 void
 _notmuch_message_sync (notmuch_message_t *message);
diff --git a/lib/thread.cc b/lib/thread.cc
index ace5ce7..636a3dc 100644
--- a/lib/thread.cc
+++ b/lib/thread.cc
@@ -231,7 +231,8 @@ _thread_add_message (notmuch_thread_t *thread,
                          xstrdup (notmuch_message_get_message_id (message)),
                          message);
 
-    from = notmuch_message_get_header (message, "from");
+    from = _notmuch_message_get_header_value(message,NOTMUCH_VALUE_FROM);
+    //notmuch_message_get_header (message, "from");
     if (from)
         list = internet_address_list_parse_string (from);
 
@@ -253,7 +254,8 @@ _thread_add_message (notmuch_thread_t *thread,
     if (! thread->subject) {
         const char *subject;
 
-        subject = notmuch_message_get_header (message, "subject");
+        subject = _notmuch_message_get_header_value(message,NOTMUCH_VALUE_SUBJECT);
+        // subject = notmuch_message_get_header (message, "subject");
         thread->subject = talloc_strdup (thread, subject ? subject : "");
     }
 
@@ -273,7 +275,8 @@ _thread_set_subject_from_message (notmuch_thread_t *thread,
     const char *subject;
     const char *cleaned_subject;
 
-    subject = notmuch_message_get_header (message, "subject");
+    subject = _notmuch_message_get_header_value(message,NOTMUCH_VALUE_SUBJECT);
+    // subject = notmuch_message_get_header (message, "subject");
     if (! subject)
         return;
 

--=-=-=
Content-Type: text/plain


-- 
Istvan

--=-=-=--
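
A note on _notmuch_message_get_header_value in the patch: it returns value.c_str() from a std::string that is local to the function, so the pointer refers to storage that is released when the function returns. Reading through it afterwards is undefined behavior. This might be related to the blank Subjects mentioned in the message, though that is only a guess. A minimal sketch of one way to keep the returned pointer valid follows, using only the standard library. FakeDoc, FakeMessage and get_header_value are hypothetical stand-ins invented for illustration, not notmuch or Xapian types; inside notmuch itself, copying the string with talloc_strdup onto the message would probably be the more natural choice.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Hypothetical stand-in for a Xapian document. Like
// Xapian::Document::get_value, it returns the string by value, so every
// call hands back a fresh copy, and an empty string for an unset slot.
struct FakeDoc {
    std::map<int, std::string> values;

    std::string get_value(int slot) const
    {
        const auto it = values.find(slot);
        return it == values.end() ? std::string() : it->second;
    }
};

// Hypothetical stand-in for notmuch_message_t. The cache owns every
// string that has been handed out, so the pointers stay valid for as
// long as the message itself is alive.
struct FakeMessage {
    FakeDoc doc;
    std::map<int, std::string> cache;
};

// Return a pointer into storage owned by the message rather than into a
// local std::string that dies at the end of the function.
const char *get_header_value(FakeMessage &message, int slot)
{
    auto it = message.cache.find(slot);
    if (it == message.cache.end())
        it = message.cache.emplace(slot, message.doc.get_value(slot)).first;
    return it->second.c_str();
}
```

Note that a getter of this shape never returns NULL: a missing value comes back as an empty string. Checks such as "if (! subject)" in thread.cc would therefore no longer fire for messages without a Subject.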