From: Mark Walters Date: Sun, 20 Apr 2014 12:03:34 +0000 (+0100) Subject: Re: [RFC PATCH] Re: excessive thread fusing X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=69b9f02df6c05f789f147fed189c5b00b08f5463;p=notmuch-archives.git Re: [RFC PATCH] Re: excessive thread fusing --- diff --git a/fd/736ee059348ad68be96cbc5decec746ab6755d b/fd/736ee059348ad68be96cbc5decec746ab6755d new file mode 100644 index 000000000..25ca705b1 --- /dev/null +++ b/fd/736ee059348ad68be96cbc5decec746ab6755d @@ -0,0 +1,281 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by olra.theworths.org (Postfix) with ESMTP id 16050431FBD + for ; Sun, 20 Apr 2014 05:04:09 -0700 (PDT) +X-Virus-Scanned: Debian amavisd-new at olra.theworths.org +X-Spam-Flag: NO +X-Spam-Score: 3.001 +X-Spam-Level: *** +X-Spam-Status: No, score=3.001 tagged_above=-999 required=5 + tests=[DKIM_ADSP_CUSTOM_MED=0.001, FREEMAIL_FROM=0.001, + FREEMAIL_REPLY=2.499, NML_ADSP_CUSTOM_MED=1.2, RCVD_IN_DNSWL_LOW=-0.7] + autolearn=disabled +Received: from olra.theworths.org ([127.0.0.1]) + by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id Z8NZUlzkc1g8 for ; + Sun, 20 Apr 2014 05:04:05 -0700 (PDT) +Received: from mail2.qmul.ac.uk (mail2.qmul.ac.uk [138.37.6.6]) + (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) + (No client certificate requested) + by olra.theworths.org (Postfix) with ESMTPS id 71C30431FBC + for ; Sun, 20 Apr 2014 05:04:05 -0700 (PDT) +Received: from smtp.qmul.ac.uk ([138.37.6.40]) + by mail2.qmul.ac.uk with esmtp (Exim 4.71) + (envelope-from ) + id 1WbqTK-0001Zg-HT; Sun, 20 Apr 2014 13:03:59 +0100 +Received: from 188.29.253.189.threembb.co.uk ([188.29.253.189] helo=localhost) + by smtp.qmul.ac.uk with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.71) + (envelope-from ) + id 1WbqTJ-0008FP-52; Sun, 20 Apr 2014 13:03:46 +0100 +From: Mark Walters +To: Carl Worth , David Bremner , + notmuch +Subject: Re: [RFC PATCH] Re: excessive thread fusing +In-Reply-To: <87oazwjq1e.fsf@yoom.home.cworth.org> +References: <87ioq5mrbz.fsf@maritornes.cs.unb.ca> <87fvl8mpzj.fsf@qmul.ac.uk> + <87oazwjq1e.fsf@yoom.home.cworth.org> +User-Agent: Notmuch/0.15.2+615~g78e3a93 (http://notmuchmail.org) Emacs/23.4.1 + (x86_64-pc-linux-gnu) +Date: Sun, 20 Apr 2014 13:03:34 +0100 +Message-ID: <877g6kmcmh.fsf@qmul.ac.uk> +MIME-Version: 1.0 +Content-Type: text/plain; charset=us-ascii +X-Sender-Host-Address: 188.29.253.189 +X-QM-Geographic: According to ripencc, + this message was delivered by a machine in Britain (UK) (GB). +X-QM-SPAM-Info: Sender has good ham record. :) +X-QM-Body-MD5: 38660748c1edc2097058a623755a6d01 (of first 20000 bytes) +X-SpamAssassin-Score: 1.0 +X-SpamAssassin-SpamBar: + +X-SpamAssassin-Report: The QM spam filters have analysed this message to + determine if it is + spam. We require at least 5.0 points to mark a message as spam. + This message scored 1.0 points. Summary of the scoring: + * 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail + provider * (markwalters1009[at]gmail.com) + * 0.0 AC_HTML_NONSENSE_TAGS RAW: Many consecutive multi-letter HTML + tags, * likely nonsense/spam + * 1.0 FREEMAIL_REPLY From and body contain different freemails +X-QM-Scan-Virus: ClamAV says the message is clean +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.13 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Sun, 20 Apr 2014 12:04:09 -0000 + + +On Sun, 20 Apr 2014, Carl Worth wrote: +> Mark Walters writes: +>> I have done dome debugging of this. +> +> Thanks for looking closely, Mark! +> +>> There is a patch below which fixes this test case but who knows what +>> it breaks! Please DO NOT apply unless someone who knows this code says +>> it's OK. +> +> I wrote much of the original code being patched here, so hopefully I +> understand it and can say something useful. +> +> I agree that the patch should not be applied. I don't like to see one +> piece of code not trusting another in the same code base. If the +> parse_references() function doesn't deal well with a malformed header, +> then we should fix it, not step around it. + +> +> Meanwhile, not treating all potential referenced message IDs +> consistently could definitely make the notmuch algorithm more fragile +> and sensitive to the order of message indexing, etc. So let's not do +> that. + +I agree. This bug first came up in id:874nvcekjk.fsf@qmul.ac.uk; I think +that got mostly fixed by cf8aaafbad68 +(id:1361836225-17279-1-git-send-email-aaronecay@gmail.com and related +thread) so we may want to check whether that change is still wanted if +we fix the actual bug. + +> Instead, let's track down and fix the actual bug. +> +> Thanks for the idea of using two-digit names for these messages. That +> makes it much easier to inspect the relevant headers. +> +> Below, I've grepped out the actual References and In-Reply-To headers +> From the messages, and then simply substituted minimal, and +> easy-to-understand values for the message IDs. +> +> With these minimally modified headers, it's easy to manually inspect the +> relationships and see that messages 17 and 18 belong in one thread, and +> messages 32-52 belong in a separate thread. +> +> It's also quite easy to see the potential source of the bug. The +> In-Reply-To headers for messages 18, 32, and 52 all share a common +> string (an email address) formatted to look like a message-id, +> "". If notmuch looks at those headers, and treats +> that string as a message-id, then all of theses messages will be +> connected into a single thread. +> +> And since that's the reported behavior, it seems likely that +> "" is the cause of this bug. +> +>> I put some debug stuff in _notmuch_database_link_message_to_parents and +>> I think that the problem comes from the call to parse_references on line +>> 1767 which adds the malformed in-reply-to header to the hash table, so +>> this malformed line gets added as a potential parent. +> +> Am I correct that your debugging showed that "" is +> being added to the hash table? + +Yes that is correct. + +> My inspection of _parse_references() and parse_message_id() suggests +> that that's exactly what notmuch is doing, (treating both of the +> angle-bracketed portions ("" as well as the actual +> message-ID, "" or "" or "") as message IDs. +> +> So it seems like we need a new _parse_in_reply_to() function to use in +> place of _parse_references() and the new function will need a better +> heuristic for dealing with the unpredictability of In-Reply-To. +> +> The only real reason that we are trying to grab multiple message ID +> values from an In-Reply-To header is that RFC 2822 explicitly allows +> that, (to support a message simultaneously replying to multiple +> messages). I don't believe that that's common, but we might as well +> support it. At the same time, RFC 2822 also explicitly specifies that +> the In-Reply-To header will consist of nothing but message IDs. +> +> So perhaps the heuristic here could be to notice any characters outside +> of angle brackets, (like "Message" in the headers below), and in that +> case go to a strictly "not RFC 2822" mode and look for exactly one +> message ID. At that point, JWZ would recommend "the first <>-bracketed +> text found therein", but that would give precisely the wrong answer in +> this particular case. Here the correct Message ID appears in the last +> <>-bracketed text. I have not surveyed a large email corpus to determine +> how often "last <>-bracketed text" would fail as a heuristic. +> +> Another idea would be to trigger specifically on common forms. Judging +> From the samples in this particular thread, it seems like a workable +> heuristic would be: +> +> If the In-Reply-To header begins with '<': +> +> Parse that initial portion as a message ID +> +> Else if it ends with '>': +> +> Parse that final portion as a message ID +> +> Else +> +> Ignore this garbage-valued header. +> +> That's probably the best and most reliably thing to do here. +> +> Does anyone have any better ideas? + +Is there a case coming before all the above: if the In-Reply-To header +is correctly formed then parse as we do currently? (You sort of suggest +so above but I just wanted to check) + +>> As a clear example that I don't understand this code I don't know why +>> this no longer causes a problem if message 17 gets added too. +> +> I wanted to test my own knowledge of the code to see if I could explain +> this. But I didn't precisely follow your explanation of the behavior you +> saw. In cases (1) and (2) of your description, what order are you using +> to "add all messages" or "add all apart from 52"? + +I just untarred the tar file David posted. Then the messages get added +in the following order: + +45 39 47 33 31 18 42 51 41 46 37 44 35 36 34 49 40 48 38 52 17 50 32 43 + +which is the same as the order in the tar file. (I think this is notmuch +using inode based sort as it has not seen the directory before) + +In Case 2 I started with a fresh untar; then moved message 52 out of the +Maildir; ran notmuch new, then moved message 52 back +into the the Maildir tree and ran notmuch new again. + +> Then, for cases (3) and (4), what is done before adding the messages +> mentioned in these cases? Add all other messages? Again, in what order? + +In case 3 I started with a fresh untar. Moved all the message except 18 +elsewhere. ran notmuch new. moved message 52 back and ran notmuch new. + +In have checked case 4 carefully adding messages 1 at a time and running +notmuch new between each addition. + +If I add 18 17 52 I get 2 threads. +If I add 17 18 52 I get 1 thread + +> I haven't tracked through all the logic of the existing algorithm for +> this case. But I don't like hearing that notmuch constructs different +> threads for the same messages presented in different orders. This sounds +> like a bug separate from what we've discussed above. + +I agree but I don't know the logic well enough to be sure. + +Best wishes + +Mark + +> +> -Carl +> +> 18:References: +> 32:References: +> 33:References: +> 34:References: +> 35:References: +> 36:References: +> 37:References: +> 38:References: +> 39:References: +> 40:References: +> 41:References: +> 42:References: +> 43:References: +> 44:References: +> 45:References: +> 46:References: +> 47:References: +> 48:References: +> 49:References: +> 50:References: +> 51:References: +> 52:References: +> +> 18:In-reply-to: Message from Eric S Fraga of "Tue, 01 Mar 2011 15:25:38 GMT." +> 32:In-Reply-To: Message from Eric S Fraga of "Thu, 10 Mar 2011 21:00:16 GMT." +> 33:In-Reply-To: (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500") +> 34:In-Reply-To: +> 35:In-Reply-To: +> 36:In-Reply-To: (Carsten Dominik's message of "Sun, 13 Mar 2011 08:39:13 +0100") +> 37:In-Reply-To: +> 38:In-Reply-To: (Carsten Dominik's message of "Mon, 14 Mar 2011 08:40:33 +0100") +> 39:In-Reply-To: (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500") +> 40:In-Reply-To: +> 41:In-Reply-To: (Carsten Dominik's message of "Fri, 11 Mar 2011 12:36:13 +0100") +> 42:In-Reply-To: +> 43:In-Reply-To: +> 44:In-Reply-To: +> 45:In-reply-to: Message from Carsten Dominik of "Fri, 11 Mar 2011 12:36:13 +0100." +> 46:In-Reply-To: +> 47:In-reply-to: Message from Carsten Dominik of "Mon, 14 Mar 2011 11:21:36 BST." +> 48:In-Reply-To: +> 49:In-reply-to: Message from Carsten Dominik of "Mon, 14 Mar 2011 18:02:54 BST." +> 51:In-Reply-To: +> 52:In-reply-to: Message from Eric S Fraga of "Fri, 11 Mar 2011 08:47:58 GMT." +> +> -- +> carl.d.worth@intel.com