From: Istvan Marko <notmuch@kismala.com>
To: notmuch@notmuchmail.org
Subject: storing From and Subject in xapian
Date: Tue, 03 May 2011 20:40:45 -0700
Message-ID: <m3sjsv2kw2.fsf@zsu.kismala.com>

I have been looking at the I/O patterns of "notmuch search" with the
default output format and noticed that it has to parse the maildir file
of every matched message to get the From and Subject headers. I figured
that this must be slowing things down, especially when the files are not
in the filesystem cache.

So I wanted to see how much difference it would make to have the From
and Subject stored in xapian to avoid this parsing.

With the attached patch I get a speedup of 2x with cached and almost 10x
with uncached files for searches with many matches.

The attached patch is only intended as a proof of concept. I am not
familiar with xapian, so I wasn't sure if this kind of data should be
stored as terms, values or data. I went with values simply because I saw
that message-id and timestamp were already stored that way. Perhaps
storing them as data would be more appropriate, since the fields are not
used for searching or sorting. Oh, and for some reason I get a blank
Subject for about 1% of the matches.
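
For reference, the three options differ roughly as follows in Xapian: terms feed the search index, values live in numbered slots that are typically used for sorting and range lookups, and the document data is a single opaque string returned with the document, which is commonly used for display-only fields. Below is a minimal sketch of the data option. It does not use Xapian itself: `FakeDocument`, `pack_headers` and `unpack_header` are hypothetical names, and the one-header-per-line encoding is only an assumption about how such a blob could be laid out.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for Xapian::Document: numbered value slots plus one
// opaque data blob. Only the storage shape is modelled, not the real API.
struct FakeDocument {
    std::map<unsigned, std::string> values;  // role of add_value()/get_value()
    std::string data;                        // role of set_data()/get_data()
};

// Pack display-only headers into one blob, one "Name: value" pair per line.
// Assumes the header values contain no newline (already unfolded).
std::string pack_headers(const std::string &from, const std::string &subject)
{
    return "From: " + from + "\n" + "Subject: " + subject + "\n";
}

// Return the value stored for `name` in a packed blob, or "" if absent.
std::string unpack_header(const std::string &blob, const std::string &name)
{
    const std::string prefix = name + ": ";
    std::istringstream lines(blob);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0)
            return line.substr(prefix.size());
    }
    return "";
}
```

With the real library the blob would be written with `Xapian::Document::set_data()` and read back with `get_data()`, so both headers come back in a single lookup and the value slots stay reserved for fields that are actually sorted on.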

Is there a downside to this approach? The only one I see is that the
xapian db size increases by about 1% but to me the speed increase would

Content-Type: text/x-patch
Content-Disposition: inline; filename=notmuch-xapian-headers.patch

diff --git a/lib/database.cc b/lib/database.cc
index 7f79cf4..5f7f197 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -1654,7 +1654,7 @@ notmuch_database_add_message (notmuch_database_t *notmuch,
             date = notmuch_message_file_get_header (message_file, "date");
-            _notmuch_message_set_date (message, date);
+            _notmuch_message_set_header_values (message, date, from, subject);
             _notmuch_message_index_file (message, filename);
diff --git a/lib/message.cc b/lib/message.cc
index ecda75a..8c85c40 100644
--- a/lib/message.cc
+++ b/lib/message.cc
@@ -726,6 +726,14 @@ notmuch_message_get_date (notmuch_message_t *message)
     return Xapian::sortable_unserialise (value);
+_notmuch_message_get_header_value (notmuch_message_t *message,int valuetag)
+    std::string value;
+    value = message->doc.get_value (valuetag);
+    return value.c_str();
 notmuch_message_get_tags (notmuch_message_t *message)
@@ -762,8 +770,10 @@ notmuch_message_set_author (notmuch_message_t *message,
-_notmuch_message_set_date (notmuch_message_t *message,
-                           const char *date)
+_notmuch_message_set_header_values (notmuch_message_t *message,
+                                    const char *date,
+                                    const char *from,
+                                    const char *subject)
@@ -776,6 +786,8 @@ _notmuch_message_set_date (notmuch_message_t *message,
     message->doc.add_value (NOTMUCH_VALUE_TIMESTAMP,
                             Xapian::sortable_serialise (time_value));
+    message->doc.add_value (NOTMUCH_VALUE_FROM, from);
+    message->doc.add_value (NOTMUCH_VALUE_SUBJECT, subject);
 /* Synchronize changes made to message->doc out into the database. */
diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
index 0856751..ef6348a 100644
--- a/lib/notmuch-private.h
+++ b/lib/notmuch-private.h
@@ -105,7 +105,9 @@ _internal_error (const char *format, ...) PRINTF_ATTRIBUTE (1, 2);
     NOTMUCH_VALUE_TIMESTAMP = 0,
-    NOTMUCH_VALUE_MESSAGE_ID
+    NOTMUCH_VALUE_MESSAGE_ID,
+    NOTMUCH_VALUE_FROM,
+    NOTMUCH_VALUE_SUBJECT
 /* Xapian (with flint backend) complains if we provide a term longer
@@ -281,8 +283,14 @@ void
 _notmuch_message_ensure_thread_id (notmuch_message_t *message);
-_notmuch_message_set_date (notmuch_message_t *message,
-                           const char *date);
+_notmuch_message_set_header_values (notmuch_message_t *message,
+                                    const char *date,
+                                    const char *from,
+                                    const char *subject);
+_notmuch_message_get_header_value (notmuch_message_t *message,
 _notmuch_message_sync (notmuch_message_t *message);
diff --git a/lib/thread.cc b/lib/thread.cc
index ace5ce7..636a3dc 100644
--- a/lib/thread.cc
+++ b/lib/thread.cc
@@ -231,7 +231,8 @@ _thread_add_message (notmuch_thread_t *thread,
                          xstrdup (notmuch_message_get_message_id (message)),
-    from = notmuch_message_get_header (message, "from");
+    from = _notmuch_message_get_header_value(message,NOTMUCH_VALUE_FROM);
+    //notmuch_message_get_header (message, "from");
     list = internet_address_list_parse_string (from);
@@ -253,7 +254,8 @@ _thread_add_message (notmuch_thread_t *thread,
     if (! thread->subject) {
         const char *subject;
-        subject = notmuch_message_get_header (message, "subject");
+        subject = _notmuch_message_get_header_value(message,NOTMUCH_VALUE_SUBJECT);
+        // subject = notmuch_message_get_header (message, "subject");
         thread->subject = talloc_strdup (thread, subject ? subject : "");
@@ -273,7 +275,8 @@ _thread_set_subject_from_message (notmuch_thread_t *thread,
     const char *subject;
     const char *cleaned_subject;
-    subject = notmuch_message_get_header (message, "subject");
+    subject = _notmuch_message_get_header_value(message,NOTMUCH_VALUE_SUBJECT);
+    // subject = notmuch_message_get_header (message, "subject");
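
One thing to watch in `_notmuch_message_get_header_value` as written in the patch: it returns `value.c_str()` for a `std::string` that is local to the function, so the pointer refers to storage that is destroyed when the function returns. Reading through it is undefined behaviour. This may be related to the occasional blank Subject, though that is only a guess and has not been verified. A minimal sketch of one way around it is to copy the slot value into storage owned by the message before handing out a pointer. `FakeDocument`, `FakeMessage`, `header_cache` and `get_header_value` are hypothetical stand-ins, not notmuch or Xapian names; in notmuch itself something like `talloc_strdup (message, ...)` would probably be the more natural choice.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Hypothetical stand-in for Xapian::Document.
struct FakeDocument {
    std::map<unsigned, std::string> slots;

    // Returns the slot contents by value, as Xapian::Document::get_value()
    // does; an unset slot yields an empty string.
    std::string get_value(unsigned slot) const
    {
        auto it = slots.find(slot);
        return it == slots.end() ? std::string() : it->second;
    }
};

// Hypothetical stand-in for notmuch_message_t.
struct FakeMessage {
    FakeDocument doc;
    std::map<unsigned, std::string> header_cache;  // owns the returned strings
};

// Copy the slot value into storage owned by the message on first use, then
// return a pointer into that storage. The pointer stays valid for as long as
// the message exists: std::map never relocates its elements, and a cached
// string is never modified after it has been inserted.
const char *get_header_value(FakeMessage *message, unsigned slot)
{
    auto it = message->header_cache.find(slot);
    if (it == message->header_cache.end())
        it = message->header_cache
                 .emplace(slot, message->doc.get_value(slot))
                 .first;
    return it->second.c_str();
}
```

The caller can then keep using the returned pointer after the call, which is what `_thread_add_message` needs when it passes the string on to `internet_address_list_parse_string` or `talloc_strdup`.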