Return-Path: X-Original-To: notmuch@notmuchmail.org Delivered-To: notmuch@notmuchmail.org Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id 3409D431FB6 for ; Sat, 28 Jan 2012 02:50:20 -0800 (PST) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -1.098 X-Spam-Level: X-Spam-Status: No, score=-1.098 tagged_above=-999 required=5 tests=[DKIM_ADSP_CUSTOM_MED=0.001, FREEMAIL_FROM=0.001, NML_ADSP_CUSTOM_MED=1.2, RCVD_IN_DNSWL_MED=-2.3] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id rQjEfrIfDrz6 for ; Sat, 28 Jan 2012 02:50:19 -0800 (PST) Received: from mail2.qmul.ac.uk (mail2.qmul.ac.uk [138.37.6.6]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id 4163A431FAE for ; Sat, 28 Jan 2012 02:50:19 -0800 (PST) Received: from smtp.qmul.ac.uk ([138.37.6.40]) by mail2.qmul.ac.uk with esmtp (Exim 4.71) (envelope-from ) id 1Rr5rJ-0004do-Np; Sat, 28 Jan 2012 10:50:14 +0000 Received: from 94-192-233-223.zone6.bethere.co.uk ([94.192.233.223] helo=localhost) by smtp.qmul.ac.uk with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.69) (envelope-from ) id 1Rr5rJ-0001IY-AW; Sat, 28 Jan 2012 10:50:13 +0000 From: Mark Walters To: Austin Clements Subject: Re: [RFC PATCH 2/4] Add NOTMUCH_MESSAGE_FLAG_EXCLUDED flag In-Reply-To: <20120124024521.GY16740@mit.edu> References: <20120124011609.GX16740@mit.edu> <1327367923-18228-2-git-send-email-markwalters1009@gmail.com> <20120124024521.GY16740@mit.edu> User-Agent: Notmuch/0.11+107~g185f859 (http://notmuchmail.org) Emacs/23.2.1 (i486-pc-linux-gnu) Date: Sat, 28 Jan 2012 10:51:16 +0000 Message-ID: <874nvg6qxn.fsf@qmul.ac.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Sender-Host-Address: 94.192.233.223 X-QM-SPAM-Info: Sender has good ham record. :) X-QM-Body-MD5: 04068cf5df31b37566bcf27dfc64a333 (of first 20000 bytes) X-SpamAssassin-Score: -1.8 X-SpamAssassin-SpamBar: - X-SpamAssassin-Report: The QM spam filters have analysed this message to determine if it is spam. We require at least 5.0 points to mark a message as spam. This message scored -1.8 points. Summary of the scoring: * -2.3 RCVD_IN_DNSWL_MED RBL: Sender listed at http://www.dnswl.org/, * medium trust * [138.37.6.40 listed in list.dnswl.org] * 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider * (markwalters1009[at]gmail.com) * -0.0 T_RP_MATCHES_RCVD Envelope sender domain matches handover relay * domain * 0.5 AWL AWL: From: address is in the auto white-list X-QM-Scan-Virus: ClamAV says the message is clean Cc: notmuch@notmuchmail.org X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 28 Jan 2012 10:50:20 -0000 > > exclude_query = _notmuch_exclude_tags (query, final_query); > > > > - final_query = Xapian::Query (Xapian::Query::OP_AND_NOT, > > - final_query, exclude_query); > > + enquire.set_weighting_scheme (Xapian::BoolWeight()); > > + enquire.set_query (exclude_query); > > + > > + mset = enquire.get_mset (0, notmuch->xapian_db->get_doccount ()); > > + > > + GArray *excluded_doc_ids = g_array_new (FALSE, FALSE, sizeof (unsigned int)); > > + > > + for (iterator = mset.begin (); iterator != mset.end (); iterator++) > > + { > > + unsigned int doc_id = *iterator; > > + g_array_append_val (excluded_doc_ids, doc_id); > > + } > > + messages->base.excluded_doc_ids = talloc (query, _notmuch_doc_id_set); > > + _notmuch_doc_id_set_init (query, messages->base.excluded_doc_ids, > > + excluded_doc_ids); > > This might be inefficient for message-only queries, since it will > fetch *all* excluded docids. This highlights a basic difference > between message and thread search: thread search can return messages > that don't match the original query and hence needs to know all > potentially excluded messages, while message search can only return > messages that match the original query. I now have some benchmarks (not run enough times to be hugely accurate so ignore minor differences). The full results are below. The summary is: Large-archive = 1 100 000 messages in 290 000 threads (about 10 years of lkml). I mark 1 000 000 deleted Small-archive = 70 000 messages in 35 000 threads. 10 000 marked deleted. Doing the initial exclude work on the big collection takes about 0.8s and on the small collection about 0.01s. So any query to the big collection takes at least 0.8s longer and this all occurs before any results appear. I then implemented the exclude doing it once for each thread query in _notmuch_create_thread. Roughly this made any query 50% slower. In normal front end use even the 0.8s is not totally unusable, but it is totally unacceptable in the backend where a user might do something like for i in ` notmuch search --output=threads from:xxx ` ; do notmuch search --output=messages $i; done to list all messages in all matching threads. So I think my conclusions are: (1) message only queries must be done without the full exclude. (2) thread queries which only match one message should not do the full exclude (3) it would be nice to switch between the two approaches depending on size but I don't see how to do that without extra(!) queries (4) One possible might be do something that say does thirty threads with the by thread method and then if not finished does the full exclude. (5) thread-by-thread might be best for Jani's limit-match id:"1327692900-22926-1-git-send-email-jani@nikula.org" Obviously, anything setting an exclude flag like this will be slower (since it is doing more work): the question is are either of these (or a combination like (4) above) acceptable? I now have a mostly working implementation from library to emacs frontend and I do like the overall outcome. The complete benchmarks are below Best wishes Mark LARGE COLLECTION is 1,100,000 messages 290,000 threads 1,000,000 deleted SMALL COLLECTION is 70,000 messages in 35,000 threads 10,000 deleted benchmarks: all times in seconds, x/y/z means a query which matches x threads with y matching messages and z messages in total. Ig or ignore means with the tag-exclude turned off (i.e. with a query matching the excluded tag). list all messages is the time for the for loop listed above giving all message-ids for all messages in any thread matching a query. Finally the three columns are master with exclude code disabled, thread-thread is doing excludes once per thread construction, and in-advance does all the exclude work in advance as in the patches I posted. In most cases the benchmark is the average of a lot of runs so the database should have been as cached as one could hope. master-(all) thread-thread in-advance LARGE COLLECTION show single message 0.016 0.018 0.78 search single message 0.015 0.016 0.78 search single with tag 0.015 0.015 0.009 945/2627/20000 query ignore 2.9 n/a 3 query 2.9 4.2 3.8 list all messages (ig) 13 n/a 13 list all messages 13 14 12mins 4754/13000/110000 query ignore 15.9 n/a 17 query 15.9 22 17.6 only messages 1.25 1.26 1.9 177/483/1752 query 0.3 0.42 1.1 search '*' 20mins 28mins 21.5mins SMALL COLLECTION 1500/2800/5600 query 1.8 2.7 2 list all messages 14.5 16.4 30 single message 0.008 0.008 0.018 search '*' 28 49 32