--- /dev/null
+Return-Path: <m.walters@qmul.ac.uk>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+ by olra.theworths.org (Postfix) with ESMTP id 16050431FBD\r
+ for <notmuch@notmuchmail.org>; Sun, 20 Apr 2014 05:04:09 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: 3.001\r
+X-Spam-Level: ***\r
+X-Spam-Status: No, score=3.001 tagged_above=-999 required=5\r
+ tests=[DKIM_ADSP_CUSTOM_MED=0.001, FREEMAIL_FROM=0.001,\r
+ FREEMAIL_REPLY=2.499, NML_ADSP_CUSTOM_MED=1.2, RCVD_IN_DNSWL_LOW=-0.7]\r
+ autolearn=disabled\r
+Received: from olra.theworths.org ([127.0.0.1])\r
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
+ with ESMTP id Z8NZUlzkc1g8 for <notmuch@notmuchmail.org>;\r
+ Sun, 20 Apr 2014 05:04:05 -0700 (PDT)\r
+Received: from mail2.qmul.ac.uk (mail2.qmul.ac.uk [138.37.6.6])\r
+ (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))\r
+ (No client certificate requested)\r
+ by olra.theworths.org (Postfix) with ESMTPS id 71C30431FBC\r
+ for <notmuch@notmuchmail.org>; Sun, 20 Apr 2014 05:04:05 -0700 (PDT)\r
+Received: from smtp.qmul.ac.uk ([138.37.6.40])\r
+ by mail2.qmul.ac.uk with esmtp (Exim 4.71)\r
+ (envelope-from <m.walters@qmul.ac.uk>)\r
+ id 1WbqTK-0001Zg-HT; Sun, 20 Apr 2014 13:03:59 +0100\r
+Received: from 188.29.253.189.threembb.co.uk ([188.29.253.189] helo=localhost)\r
+ by smtp.qmul.ac.uk with esmtpsa (TLSv1:AES128-SHA:128) (Exim 4.71)\r
+ (envelope-from <m.walters@qmul.ac.uk>)\r
+ id 1WbqTJ-0008FP-52; Sun, 20 Apr 2014 13:03:46 +0100\r
+From: Mark Walters <markwalters1009@gmail.com>\r
+To: Carl Worth <cworth@cworth.org>, David Bremner <david@tethera.net>,\r
+ notmuch <notmuch@notmuchmail.org>\r
+Subject: Re: [RFC PATCH] Re: excessive thread fusing\r
+In-Reply-To: <87oazwjq1e.fsf@yoom.home.cworth.org>\r
+References: <87ioq5mrbz.fsf@maritornes.cs.unb.ca> <87fvl8mpzj.fsf@qmul.ac.uk>\r
+ <87oazwjq1e.fsf@yoom.home.cworth.org>\r
+User-Agent: Notmuch/0.15.2+615~g78e3a93 (http://notmuchmail.org) Emacs/23.4.1\r
+ (x86_64-pc-linux-gnu)\r
+Date: Sun, 20 Apr 2014 13:03:34 +0100\r
+Message-ID: <877g6kmcmh.fsf@qmul.ac.uk>\r
+MIME-Version: 1.0\r
+Content-Type: text/plain; charset=us-ascii\r
+X-Sender-Host-Address: 188.29.253.189\r
+X-QM-Geographic: According to ripencc,\r
+ this message was delivered by a machine in Britain (UK) (GB).\r
+X-QM-SPAM-Info: Sender has good ham record. :)\r
+X-QM-Body-MD5: 38660748c1edc2097058a623755a6d01 (of first 20000 bytes)\r
+X-SpamAssassin-Score: 1.0\r
+X-SpamAssassin-SpamBar: +\r
+X-SpamAssassin-Report: The QM spam filters have analysed this message to\r
+ determine if it is\r
+ spam. We require at least 5.0 points to mark a message as spam.\r
+ This message scored 1.0 points. Summary of the scoring: \r
+ * 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail\r
+ provider * (markwalters1009[at]gmail.com)\r
+ * 0.0 AC_HTML_NONSENSE_TAGS RAW: Many consecutive multi-letter HTML\r
+ tags, * likely nonsense/spam\r
+ * 1.0 FREEMAIL_REPLY From and body contain different freemails\r
+X-QM-Scan-Virus: ClamAV says the message is clean\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.13\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+ <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Sun, 20 Apr 2014 12:04:09 -0000\r
+\r
+\r
+On Sun, 20 Apr 2014, Carl Worth <cworth@cworth.org> wrote:\r
+> Mark Walters <markwalters1009@gmail.com> writes:\r
+>> I have done some debugging of this.\r
+>\r
+> Thanks for looking closely, Mark!\r
+>\r
+>> There is a patch below which fixes this test case but who knows what\r
+>> it breaks! Please DO NOT apply unless someone who knows this code says\r
+>> it's OK.\r
+>\r
+> I wrote much of the original code being patched here, so hopefully I\r
+> understand it and can say something useful.\r
+>\r
+> I agree that the patch should not be applied. I don't like to see one\r
+> piece of code not trusting another in the same code base. If the\r
+> parse_references() function doesn't deal well with a malformed header,\r
+> then we should fix it, not step around it.\r
+\r
+>\r
+> Meanwhile, not treating all potential referenced message IDs\r
+> consistently could definitely make the notmuch algorithm more fragile\r
+> and sensitive to the order of message indexing, etc. So let's not do\r
+> that.\r
+\r
+I agree. This bug first came up in id:874nvcekjk.fsf@qmul.ac.uk; I think\r
+that got mostly fixed by cf8aaafbad68\r
+(id:1361836225-17279-1-git-send-email-aaronecay@gmail.com and related\r
+thread) so we may want to check whether that change is still wanted if\r
+we fix the actual bug.\r
+\r
+> Instead, let's track down and fix the actual bug.\r
+>\r
+> Thanks for the idea of using two-digit names for these messages. That\r
+> makes it much easier to inspect the relevant headers.\r
+>\r
+> Below, I've grepped out the actual References and In-Reply-To headers\r
+> from the messages, and then simply substituted minimal, and\r
+> easy-to-understand values for the message IDs.\r
+>\r
+> With these minimally modified headers, it's easy to manually inspect the\r
+> relationships and see that messages 17 and 18 belong in one thread, and\r
+> messages 32-52 belong in a separate thread.\r
+>\r
+> It's also quite easy to see the potential source of the bug. The\r
+> In-Reply-To headers for messages 18, 32, and 52 all share a common\r
+> string (an email address) formatted to look like a message-id,\r
+> "<e.fraga@ucl.ac.uk>". If notmuch looks at those headers, and treats\r
+> that string as a message-id, then all of these messages will be\r
+> connected into a single thread.\r
+>\r
+> And since that's the reported behavior, it seems likely that\r
+> "<e.fraga@ucl.ac.uk>" is the cause of this bug.\r
+>\r
+>> I put some debug stuff in _notmuch_database_link_message_to_parents and\r
+>> I think that the problem comes from the call to parse_references on line\r
+>> 1767 which adds the malformed in-reply-to header to the hash table, so\r
+>> this malformed line gets added as a potential parent. \r
+>\r
+> Am I correct that your debugging showed that "<e.fraga@ucl.ac.uk>" is\r
+> being added to the hash table?\r
+\r
+Yes that is correct.\r
+\r
+> My inspection of _parse_references() and parse_message_id() suggests\r
+> that that's exactly what notmuch is doing: treating both of the\r
+> angle-bracketed portions ("<e.fraga@ucl.ac.uk>" as well as the actual\r
+> message-ID, "<ID17>" or "<ID31>" or "<ID39>") as message IDs.\r
+>\r
+> So it seems like we need a new _parse_in_reply_to() function to use in\r
+> place of _parse_references() and the new function will need a better\r
+> heuristic for dealing with the unpredictability of In-Reply-To.\r
+>\r
+> The only real reason that we are trying to grab multiple message ID\r
+> values from an In-Reply-To header is that RFC 2822 explicitly allows\r
+> that, (to support a message simultaneously replying to multiple\r
+> messages). I don't believe that that's common, but we might as well\r
+> support it. At the same time, RFC 2822 also explicitly specifies that\r
+> the In-Reply-To header will consist of nothing but message IDs.\r
+>\r
+> So perhaps the heuristic here could be to notice any characters outside\r
+> of angle brackets, (like "Message" in the headers below), and in that\r
+> case go to a strictly "not RFC 2822" mode and look for exactly one\r
+> message ID. At that point, JWZ would recommend "the first <>-bracketed\r
+> text found therein", but that would give precisely the wrong answer in\r
+> this particular case. Here the correct Message ID appears in the last\r
+> <>-bracketed text. I have not surveyed a large email corpus to determine\r
+> how often "last <>-bracketed text" would fail as a heuristic.\r
+>\r
+> Another idea would be to trigger specifically on common forms. Judging\r
+> from the samples in this particular thread, it seems like a workable\r
+> heuristic would be:\r
+>\r
+> If the In-Reply-To header begins with '<':\r
+>\r
+> Parse that initial portion as a message ID\r
+>\r
+> Else if it ends with '>':\r
+>\r
+> Parse that final portion as a message ID\r
+>\r
+> Else\r
+>\r
+> Ignore this garbage-valued header.\r
+>\r
+> That's probably the best and most reliable thing to do here.\r
+>\r
+> Does anyone have any better ideas?\r
+\r
+Should there be a case before all of the above: if the In-Reply-To\r
+header is correctly formed, parse it as we do currently? (You sort of\r
+suggest so above, but I just wanted to check.)\r
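+\r
+As a rough sketch of the cases above, with the well-formed check placed\r
+first, something like the following Python captures the logic. It is\r
+illustration only: notmuch itself is C, and the name parse_in_reply_to\r
+and the message-ID regex are assumptions, not existing notmuch code.\r
+\r
```python
import re

# A bracketed message ID: no whitespace or nested angle brackets inside.
_MSGID = re.compile(r"<[^<>\s]+>")


def parse_in_reply_to(value):
    """Return message IDs (brackets stripped) from an In-Reply-To value.

    Hypothetical helper illustrating the heuristic from this thread;
    it is not the notmuch implementation.
    """
    value = value.strip()
    ids = _MSGID.findall(value)

    # Well-formed case: nothing but message IDs separated by whitespace,
    # as RFC 2822 specifies. Keep every ID.
    if ids and _MSGID.sub("", value).strip() == "":
        return [i[1:-1] for i in ids]

    # Header begins with '<': parse that initial portion as a message ID.
    if value.startswith("<"):
        m = _MSGID.match(value)
        return [m.group()[1:-1]] if m else []

    # Header ends with '>': parse that final portion as a message ID.
    if value.endswith(">"):
        m = re.search(r"<[^<>\s]+>$", value)
        return [m.group()[1:-1]] if m else []

    # Otherwise ignore this garbage-valued header.
    return []
```
+\r
+For message 18's header this returns only ID17, so "<e.fraga@ucl.ac.uk>"\r
+never becomes a potential parent.\r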
+\r
+>> As a clear example that I don't understand this code I don't know why\r
+>> this no longer causes a problem if message 17 gets added too.\r
+>\r
+> I wanted to test my own knowledge of the code to see if I could explain\r
+> this. But I didn't precisely follow your explanation of the behavior you\r
+> saw. In cases (1) and (2) of your description, what order are you using\r
+> to "add all messages" or "add all apart from 52"?\r
+\r
+I just untarred the tar file David posted. Then the messages get added\r
+in the following order:\r
+\r
+45 39 47 33 31 18 42 51 41 46 37 44 35 36 34 49 40 48 38 52 17 50 32 43\r
+\r
+which is the same as the order in the tar file. (I think this is notmuch\r
+using an inode-based sort, as it has not seen the directory before.)\r
+\r
+In case 2 I started with a fresh untar, then moved message 52 out of the\r
+Maildir, ran notmuch new, then moved message 52 back\r
+into the Maildir tree and ran notmuch new again.\r
+\r
+> Then, for cases (3) and (4), what is done before adding the messages\r
+> mentioned in these cases? Add all other messages? Again, in what order?\r
+\r
+In case 3 I started with a fresh untar, moved all the messages except 18\r
+elsewhere, ran notmuch new, then moved message 52 back and ran notmuch\r
+new again.\r
+\r
+I have checked case 4 carefully, adding messages one at a time and\r
+running notmuch new between each addition.\r
+\r
+If I add 18 17 52 I get 2 threads.\r
+If I add 17 18 52 I get 1 thread.\r
+\r
+> I haven't tracked through all the logic of the existing algorithm for\r
+> this case. But I don't like hearing that notmuch constructs different\r
+> threads for the same messages presented in different orders. This sounds\r
+> like a bug separate from what we've discussed above. \r
+\r
+I agree but I don't know the logic well enough to be sure.\r
+\r
+Best wishes\r
+\r
+Mark\r
+\r
+>\r
+> -Carl\r
+>\r
+> 18:References: <ID17>\r
+> 32:References: <ID31>\r
+> 33:References: <ID31> <ID32>\r
+> 34:References: <ID31> <ID32> <ID33>\r
+> 35:References: <ID31> <ID32> <ID33>\r
+> 36:References: <ID31> <ID32> <ID33> <ID35>\r
+> 37:References: <ID31> <ID32> <ID33> <ID35> <ID36>\r
+> 38:References: <ID31> <ID32> <ID33> <ID35> <ID36> <ID37>\r
+> 39:References: <ID31> <ID32>\r
+> 40:References: <ID31> <ID32> <ID39>\r
+> 41:References: <ID31> <ID32> <ID39> <ID40>\r
+> 42:References: <ID31> <ID32> <ID39> <ID40> <ID41>\r
+> 43:References: <ID31> <ID32> <ID39> <ID40> <ID41> <ID42>\r
+> 44:References: <ID31> <ID32> <ID39> <ID40> <ID41> <ID42>\r
+> 45:References: <ID31> <ID32> <ID39> <ID40>\r
+> 46:References: <ID31> <ID32> <ID39> <ID40> <ID45>\r
+> 47:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46>\r
+> 48:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47>\r
+> 49:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48>\r
+> 50:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48> <ID49>\r
+> 51:References: <ID31> <ID32> <ID39> <ID40> <ID45> <ID46> <ID47> <ID48> <ID49> <ID50>\r
+> 52:References: <ID31> <ID32> <ID39>\r
+>\r
+> 18:In-reply-to: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Tue, 01 Mar 2011 15:25:38 GMT." <ID17>\r
+> 32:In-Reply-To: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Thu, 10 Mar 2011 21:00:16 GMT." <ID31>\r
+> 33:In-Reply-To: <ID32> (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500")\r
+> 34:In-Reply-To: <ID33>\r
+> 35:In-Reply-To: <ID33>\r
+> 36:In-Reply-To: <ID35> (Carsten Dominik's message of "Sun, 13 Mar 2011 08:39:13 +0100")\r
+> 37:In-Reply-To: <ID36>\r
+> 38:In-Reply-To: <ID37> (Carsten Dominik's message of "Mon, 14 Mar 2011 08:40:33 +0100")\r
+> 39:In-Reply-To: <ID32> (Nick Dokos's message of "Thu, 10 Mar 2011 18:06:33 -0500")\r
+> 40:In-Reply-To: <ID39>\r
+> 41:In-Reply-To: <ID40> (Carsten Dominik's message of "Fri, 11 Mar 2011 12:36:13 +0100")\r
+> 42:In-Reply-To: <ID41>\r
+> 43:In-Reply-To: <ID42>\r
+> 44:In-Reply-To: <ID42>\r
+> 45:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Fri, 11 Mar 2011 12:36:13 +0100." <ID40>\r
+> 46:In-Reply-To: <ID45>\r
+> 47:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Mon, 14 Mar 2011 11:21:36 BST." <ID46>\r
+> 48:In-Reply-To: <ID47>\r
+> 49:In-reply-to: Message from Carsten Dominik <carsten.dominik@gmail.com> of "Mon, 14 Mar 2011 18:02:54 BST." <ID48>\r
+> 51:In-Reply-To: <ID50>\r
+> 52:In-reply-to: Message from Eric S Fraga <e.fraga@ucl.ac.uk> of "Fri, 11 Mar 2011 08:47:58 GMT." <ID39>\r
+>\r
+> -- \r
+> carl.d.worth@intel.com\r