From: Austin Clements
Date: Sat, 28 Jan 2012 18:33:40 +0000 (+1900)
Subject: Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=9e138bd7267e346a901da32c71ce3d068fb5aaca;p=notmuch-archives.git

Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
---

diff --git a/df/b6720041590a7c4fe5ce7fc1c22408e2690f6c b/df/b6720041590a7c4fe5ce7fc1c22408e2690f6c
new file mode 100644
index 000000000..cd60efb4e
--- /dev/null
+++ b/df/b6720041590a7c4fe5ce7fc1c22408e2690f6c
@@ -0,0 +1,232 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+	by olra.theworths.org (Postfix) with ESMTP id B0F4E431FB6
+	for ; Sat, 28 Jan 2012 10:34:28 -0800 (PST)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: -0.7
+X-Spam-Level:
+X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5
+	tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+	by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+	with ESMTP id oESKNv6pQEQY for ;
+	Sat, 28 Jan 2012 10:34:27 -0800 (PST)
+Received: from dmz-mailsec-scanner-6.mit.edu (DMZ-MAILSEC-SCANNER-6.MIT.EDU
+	[18.7.68.35])
+	by olra.theworths.org (Postfix) with ESMTP id AA998431FAE
+	for ; Sat, 28 Jan 2012 10:34:27 -0800 (PST)
+X-AuditID: 12074423-b7f9c6d0000008c3-a4-4f243fb20c29
+Received: from mailhub-auth-3.mit.edu ( [18.9.21.43])
+	by dmz-mailsec-scanner-6.mit.edu (Symantec Messaging Gateway) with SMTP
+	id B6.67.02243.2BF342F4; Sat, 28 Jan 2012 13:34:26 -0500 (EST)
+Received: from outgoing.mit.edu (OUTGOING-AUTH.MIT.EDU [18.7.22.103])
+	by mailhub-auth-3.mit.edu (8.13.8/8.9.2) with ESMTP id q0SIYPMe026548;
+	Sat, 28 Jan 2012 13:34:26 -0500
+Received: from awakening.csail.mit.edu (awakening.csail.mit.edu [18.26.4.91])
+	(authenticated bits=0)
+	(User authenticated as amdragon@ATHENA.MIT.EDU)
+	by outgoing.mit.edu (8.13.6/8.12.4) with ESMTP id q0SIYOcQ001042
+	(version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NOT);
+	Sat, 28 Jan 2012 13:34:25 -0500 (EST)
+Received: from amthrax by awakening.csail.mit.edu with local (Exim 4.77)
+	(envelope-from )
+	id 1RrD5o-0006MC-8C; Sat, 28 Jan 2012 13:33:40 -0500
+Date: Sat, 28 Jan 2012 13:33:40 -0500
+From: Austin Clements
+To: Mark Walters
+Subject: Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag
+Message-ID: <20120128183340.GD17991@mit.edu>
+References: <20120124011609.GX16740@mit.edu>
+	<1327367923-18228-2-git-send-email-markwalters1009@gmail.com>
+	<20120124024521.GY16740@mit.edu> <874nvg6qxn.fsf@qmul.ac.uk>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=us-ascii
+Content-Disposition: inline
+In-Reply-To: <874nvg6qxn.fsf@qmul.ac.uk>
+User-Agent: Mutt/1.5.21 (2010-09-15)
+X-Brightmail-Tracker:
+	H4sIAAAAAAAAA+NgFmphleLIzCtJLcpLzFFi42IR4hTV1t1kr+JvMHu1mMXquTwW12/OZHZg
+	8tg56y67x7NVt5gDmKK4bFJSczLLUov07RK4Mo7feclSsNmwYsXBF8wNjLfUuhg5OSQETCTm
+	da9hg7DFJC7cWw9kc3EICexjlJj58wc7hLOBUWLPow1MEM5JJokHW/ezQjhLGCVuTDvFAtLP
+	IqAq8eDbbFYQm01AQ2Lb/uWMILaIgI7E7UML2EFsZgFpiW+/m5lAbGEBZ4nnTV/B6nmBatbc
+	WAC1ey2jxKenP9ghEoISJ2c+YYFo1pK48e8lUDMH2KDl/zhAwpxAu14uXgg2R1RARWLKyW1s
+	ExiFZiHpnoWkexZC9wJG5lWMsim5Vbq5iZk5xanJusXJiXl5qUW6Znq5mSV6qSmlmxhBgc3u
+	oryD8c9BpUOMAhyMSjy8F14p+QuxJpYVV+YeYpTkYFIS5T1rq+IvxJeUn1KZkVicEV9UmpNa
+	fIhRgoNZSYT3gxxQjjclsbIqtSgfJiXNwaIkzquh9c5PSCA9sSQ1OzW1ILUIJivDwaEkwcsM
+	jGAhwaLU9NSKtMycEoQ0EwcnyHAeoOGyIDW8xQWJucWZ6RD5U4yKUuK8j+2AEgIgiYzSPLhe
+	WOJ5xSgO9Iow7w+QKh5g0oLrfgU0mAlocMRVRZDBJYkIKakGRt51fB84szKa9nGt82NRLlfY
+	3CT0nknN+mIIe/SLlyybeSUO8uq/W8mfr7fWO+7i3nBX7bKS+Vla4v97pzw3aq3Ke39k0aNH
+	XM8Wn9g44Wa8InNrVTHzvt356/ZvkG3km3QhNnjmn7r7qpfWpywKvJrN4nn7hrRuxQVxiR9J
+	p3bE/llYdHbCdiWW4oxEQy3mouJEACnVGSIXAwAA
+Cc: notmuch@notmuchmail.org
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+	
+List-Unsubscribe: ,
+	
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+	
+X-List-Received-Date: Sat, 28 Jan 2012 18:34:28 -0000
+
+Quoth Mark Walters on Jan 28 at 10:51 am:
+>
+> > >      exclude_query = _notmuch_exclude_tags (query, final_query);
+> > >
+> > > -    final_query = Xapian::Query (Xapian::Query::OP_AND_NOT,
+> > > -                                 final_query, exclude_query);
+> > > +    enquire.set_weighting_scheme (Xapian::BoolWeight());
+> > > +    enquire.set_query (exclude_query);
+> > > +
+> > > +    mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ());
+> > > +
+> > > +    GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int));
+> > > +
+> > > +    for (iterator = mset.begin (); iterator != mset.end (); iterator++)
+> > > +    {
+> > > +        unsigned int doc_id = *iterator;
+> > > +        g_array_append_val (excluded_doc_ids, doc_id);
+> > > +    }
+> > > +    messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set);
+> > > +    _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids,
+> > > +                              excluded_doc_ids);
+> >
+> > This might be inefficient for message-only queries, since it will
+> > fetch *all* excluded docids.  This highlights a basic difference
+> > between message and thread search: thread search can return messages
+> > that don't match the original query and hence needs to know all
+> > potentially excluded messages, while message search can only return
+> > messages that match the original query.
+>
+> I now have some benchmarks (not run enough times to be hugely accurate
+> so ignore minor differences). The full results are below. The summary
+> is:
+>
+> Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of
+> lkml). I mark 1 000 000 deleted
+> Small-archive = 70 000 messages in 35 000 threads. 10 000 marked
+> deleted.
+>
+> Doing the initial exclude work on the big collection takes about 0.8s
+> and on the small collection about 0.01s. So any query to the big
+> collection takes at least 0.8s longer and this all occurs before any
+> results appear.
+
+Interesting.  Do you know where that time is spent?
+
+Also, it might be reasonable to assume that no more than, say, 10% of
+a person's mail store is excluded, but maybe that depends on how
+people use this feature.
+
+> I then implemented the exclude doing it once for each thread query in
+> _notmuch_create_thread. Roughly this made any query 50% slower.
+
+That's not terrible.
+
+> In normal front end use even the 0.8s is not totally unusable, but it is
+> totally unacceptable in the backend where a user might do something like
+>
+> for i in ` notmuch search --output=threads from:xxx ` ;
+> do
+>     notmuch search --output=messages $i;
+> done
+>
+> to list all messages in all matching threads.
+>
+> So I think my conclusions are:
+>
+> (1) message only queries must be done without the full exclude.
+> (2) thread queries which only match one message should not do the full
+> exclude
+> (3) it would be nice to switch between the two approaches depending on
+> size but I don't see how to do that without extra(!) queries
+> (4) One possibility might be to do something that, say, does thirty
+> threads with the by-thread method and then, if not finished, does the
+> full exclude.
+> (5) thread-by-thread might be best for Jani's limit-match
+> id:"1327692900-22926-1-git-send-email-jani@nikula.org"
+>
+> Obviously, anything setting an exclude flag like this will be slower
+> (since it is doing more work): the question is are either of these (or a
+> combination like (4) above) acceptable?
+
+Or only mark matched messages as excluded.
+
+Here's another idea (actually, a rehash of an old idea).  For message
+search do two queries, the original query and " AND
+", and use this to keep everything in order and mark excluded
+messages.  For thread search, use message search results so it's easy
+to both sort by unexcluded messages and include fully-excluded
+threads, but compute the excluded flag (either just for unmatched
+messages or for all messages) by examining each message's tags
+directly (which thread_add_message already iterates over, so this is
+easy and won't add any overhead).  If the excluded query is fast,
+which I think it will be, I think this should get the best of all
+worlds and be fairly straightforward to implement (no asymmetries
+between the queries used for message and thread search).  It would be
+easy and worth it to run the excluded query by hand on your test
+corpus; I suspect it will be much faster than 0.8s because the query
+already uses "Tmail", which is huge and doesn't seem to slow things
+down.
+
+> I now have a mostly working implementation from library to
+> emacs frontend and I do like the overall outcome.
+
+Awesome.
+
+> The complete benchmarks are below
+>
+> Best wishes
+>
+> Mark
+>
+> LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted
+> SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted
+>
+> benchmarks: all times in seconds, x/y/z means a query which matches x
+> threads with y matching messages and z messages in total. Ig or ignore
+> means with the tag-exclude turned off (i.e. with a query matching the
+> excluded tag). list all messages is the time for the for loop listed
+> above giving all message-ids for all messages in any thread matching a
+> query.
+>
+> Finally the three columns are master with exclude code disabled,
+> thread-thread is doing excludes once per thread construction, and
+> in-advance does all the exclude work in advance as in the patches I posted.
+>
+> In most cases the benchmark is the average of a lot of runs so the
+> database should have been as cached as one could hope.
+>
+>                           master-(all)  thread-thread  in-advance
+> LARGE COLLECTION
+> show single message       0.016         0.018          0.78
+> search single message     0.015         0.016          0.78
+> search single with tag    0.015         0.015          0.009
+> 945/2627/20000
+> query ignore              2.9           n/a            3
+> query                     2.9           4.2            3.8
+> list all messages (ig)    13            n/a            13
+> list all messages         13            14             12mins
+> 4754/13000/110000
+> query ignore              15.9          n/a            17
+> query                     15.9          22             17.6
+> only messages             1.25          1.26           1.9
+> 177/483/1752
+> query                     0.3           0.42           1.1
+>
+> search '*'                20mins        28mins         21.5mins
+>
+> SMALL COLLECTION
+> 1500/2800/5600
+> query                     1.8           2.7            2
+> list all messages         14.5          16.4           30
+> single message            0.008         0.008          0.018
+>
+> search '*'                28            49             32
+>
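
The two-query proposal in the reply above can be sketched roughly as follows. This is a minimal standalone illustration, not code from the patch series and not written against libnotmuch or Xapian. It assumes the second query is the original query restricted to the exclude tags, so its results are a subset of the first query's results. The names DocId, FlaggedMessage, mark_excluded and has_excluded_tag are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a Xapian document id.
using DocId = unsigned int;

// Hypothetical result record: a matched message plus its excluded flag.
struct FlaggedMessage {
    DocId doc_id;
    bool excluded;
};

// Message search: `matches` holds the doc ids returned by the original
// query, in result order. `excluded_matches` holds the doc ids returned
// by the second query (assumed: original query AND exclude tags).
// The output keeps the original order and flags the excluded messages.
std::vector<FlaggedMessage>
mark_excluded(const std::vector<DocId>& matches,
              const std::vector<DocId>& excluded_matches)
{
    // Under the stated assumption the second result set cannot be
    // larger than the first.
    assert(excluded_matches.size() <= matches.size());

    std::unordered_set<DocId> excluded(excluded_matches.begin(),
                                       excluded_matches.end());
    std::vector<FlaggedMessage> out;
    out.reserve(matches.size());
    for (DocId id : matches)
        out.push_back({id, excluded.count(id) != 0});
    return out;
}

// Thread search: decide the flag from the message's own tags, which the
// thread-building code is described as iterating over already.
bool has_excluded_tag(const std::vector<std::string>& message_tags,
                      const std::unordered_set<std::string>& exclude_tags)
{
    for (const std::string& tag : message_tags)
        if (exclude_tags.count(tag) != 0)
            return true;
    return false;
}
```

With this shape, message search should only pay for one extra query whose result is bounded by the matched set, rather than fetching every excluded docid up front (the roughly 0.8s cost reported for the large collection), and thread search reuses a tag iteration it already performs. Whether the second query is actually fast is the open question raised in the reply, and this sketch does not measure it.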