From: Austin Clements Date: Mon, 21 Apr 2014 16:20:58 +0000 (+2000) Subject: Re: [RFC PATCH] Re: excessive thread fusing X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=646ed6bf75cca62634c223bcb0c73af94b258a12;p=notmuch-archives.git Re: [RFC PATCH] Re: excessive thread fusing --- diff --git a/0c/a512d4523c549a0959a211c12440ec9076ba1d b/0c/a512d4523c549a0959a211c12440ec9076ba1d new file mode 100644 index 000000000..3335756b3 --- /dev/null +++ b/0c/a512d4523c549a0959a211c12440ec9076ba1d @@ -0,0 +1,182 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by olra.theworths.org (Postfix) with ESMTP id C475B431FC0 + for ; Mon, 21 Apr 2014 09:21:10 -0700 (PDT) +X-Virus-Scanned: Debian amavisd-new at olra.theworths.org +X-Spam-Flag: NO +X-Spam-Score: -0.7 +X-Spam-Level: +X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5 + tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled +Received: from olra.theworths.org ([127.0.0.1]) + by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id 0zhyed3mCmm0 for ; + Mon, 21 Apr 2014 09:21:06 -0700 (PDT) +Received: from dmz-mailsec-scanner-7.mit.edu (dmz-mailsec-scanner-7.mit.edu + [18.7.68.36]) + (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) + (No client certificate requested) + by olra.theworths.org (Postfix) with ESMTPS id AFF99431FBC + for ; Mon, 21 Apr 2014 09:21:05 -0700 (PDT) +X-AuditID: 12074424-f79e26d000000c70-48-53554570e296 +Received: from mailhub-auth-1.mit.edu ( [18.9.21.35]) + (using TLS with cipher AES256-SHA (256/256 bits)) + (Client did not present a certificate) + by dmz-mailsec-scanner-7.mit.edu (Symantec Messaging Gateway) with SMTP + id 24.4D.03184.07545535; Mon, 21 Apr 2014 12:21:04 -0400 (EDT) +Received: from outgoing.mit.edu (outgoing-auth-1.mit.edu [18.9.28.11]) + by mailhub-auth-1.mit.edu (8.13.8/8.9.2) with ESMTP id s3LGL2ds002667; + Mon, 21 Apr 2014 12:21:03 
-0400 +Received: from awakening.csail.mit.edu (awakening.csail.mit.edu [18.26.4.91]) + (authenticated bits=0) + (User authenticated as amdragon@ATHENA.MIT.EDU) + by outgoing.mit.edu (8.13.8/8.12.4) with ESMTP id s3LGKxkl020572 + (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); + Mon, 21 Apr 2014 12:21:01 -0400 +Received: from amthrax by awakening.csail.mit.edu with local (Exim 4.80) + (envelope-from ) + id 1WcGxn-0006vR-H6; Mon, 21 Apr 2014 12:20:59 -0400 +Date: Mon, 21 Apr 2014 12:20:58 -0400 +From: Austin Clements +To: Mark Walters +Subject: Re: [RFC PATCH] Re: excessive thread fusing +Message-ID: <20140421162058.GE25817@mit.edu> +References: <87ioq5mrbz.fsf@maritornes.cs.unb.ca> <87fvl8mpzj.fsf@qmul.ac.uk> + <87oazwjq1e.fsf@yoom.home.cworth.org> <877g6kmcmh.fsf@qmul.ac.uk> + <8738h7kv2q.fsf@qmul.ac.uk> +MIME-Version: 1.0 +Content-Type: text/plain; charset=iso-8859-1 +Content-Disposition: inline +Content-Transfer-Encoding: 8bit +In-Reply-To: <8738h7kv2q.fsf@qmul.ac.uk> +User-Agent: Mutt/1.5.21 (2010-09-15) +X-Brightmail-Tracker: + H4sIAAAAAAAAA+NgFlrHKsWRmVeSWpSXmKPExsUixCmqrFvgGhpssPGFrsXNn3PYLG60djNa + rJ7LY3H95kxmBxaP3ZsfsHjsnHWX3ePZqlvMHlsOvWcOYInisklJzcksSy3St0vgyjh5byNT + wRSNinXTfzM2MP6S72Lk4JAQMJE4tjy9i5ETyBSTuHBvPVsXIxeHkMBsJokTj+4zQTgbGSWu + PdkP5Zxmkmg4uowZwlkClJl/nw2kn0VAVaKp+RUTiM0moCGxbf9yRhBbREBH4vahBewg65gF + CiVOnawECQsDbb689AcziM0LVPJ/yzeomZsZJfrbNrFAJAQlTs58AmYzAxXt3HqHDWKOtMTy + fxwQYXmJ5q2zweZwAq29v+o02DmiAioSU05uY5vAKDwLyaRZSCbNQpg0C8mkBYwsqxhlU3Kr + dHMTM3OKU5N1i5MT8/JSi3TN9XIzS/RSU0o3MYKihd1FZQdj8yGlQ4wCHIxKPLwSBqHBQqyJ + ZcWVuYcYJTmYlER5fSyBQnxJ+SmVGYnFGfFFpTmpxYcYJTiYlUR412sC5XhTEiurUovyYVLS + HCxK4rxvra2ChQTSE0tSs1NTC1KLYLIyHBxKErwvXIAaBYtS01Mr0jJzShDSTBycIMN5gIY/ + AqnhLS5IzC3OTIfIn2JUlBLnVQBJCIAkMkrz4HphyewVozjQK8K8p0CqeICJEK77FdBgJqDB + T7aEgAwuSURISTUwmieHX6k00/ioMPO9evac4y4/p7WXTdM9/4qvp1CiWEp49751gaY/J7jz + 7zNg02LkqVl6yHRzf8GL42+9PnEeUvzN2aTfo3bnMkfp/0mHr4gtY1NOuHT3fqJUXz5ruYcV + 
e/MDpytzBcVE6lefSHcK8tzwpSxk5sLgRxLb38RxhzlLmQqVLfqkxFKckWioxVxUnAgApCb5 + MEEDAAA= +Cc: notmuch +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.13 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Mon, 21 Apr 2014 16:21:10 -0000 + +Quoth Mark Walters on Apr 21 at 8:20 am: +> +> >> I haven't tracked through all the logic of the existing algorithm for +> >> this case. But I don't like hearing that notmuch constructs different +> >> threads for the same messages presented in different orders. This sounds +> >> like a bug separate from what we've discussed above. +> +> I think I have now found this bug and it is separate from the malformed +> In-Reply-To problems. +> +> The problem is that when we merge threads we update all the thread-ids +> of documents in the loser thread. But we don't (if I understand the code +> correctly) update dangling "metadata" references to threads which don't +> (yet) have any documents. + +This is exactly the problem I wrote +id:1395608456-9673-1-git-send-email-amdragon@mit.edu to test, but I +had convinced myself everything was okay because we link a message to +both its parents and all of its children. But that's only true if you +eventually receive the linking message (which in the test I made, you +do). In this case, you never receive the linking message, so even +though notmuch has enough information to bring the two threads +together, it doesn't. + +Maybe I should create a second variant of that test where all of the +messages reference their entire heritage (rather than just their +immediate parent) and test that they're *always* in one thread +regardless of receipt order (rather than only checking once they've +all been received)? I think that would weed out this case. + +> To make this explicit consider the 2 messages 17,18 in the set. 
+> +> Message 17 has id <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk> and has no +> references/in-reply-to so has no parents. +> +> Message 18 has a reference to <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk> +> and an in-reply-to to and +> <87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk> +> +> If you add 17 first then it gets thread-id 1 and then when you add 18 message 18 gets +> thread-id 2 as does the dangling link to the "unseen" message +> e.fraga@ucl.ac.uk, and then message 17 is moved to thread-id 2. +> +> However, if you add 18 first then it gets thread-id 1 and the dangling +> link gets id 1. When 17 is added it gets thread-id 2, message 18 gets +> thread-id updated to 2 *but* the dangling link to e.fraga@ucl.ac.uk does +> not get updated so stays thread-id 1. +> +> In particular when 52 comes along with its link to e.fraga@ucl.ac.uk +> then: +> +> In the first case it gets put in thread-id 3 and the other two +> messages get moved into thread 3. +> +> In the second case, message 52 gets put in thread 3 and thread 1 +> (the dangling link) gets moved into thread 3 leaving thread 2 +> (containing messages 17 and 18) untouched. + +So, there's an obvious, messy way to fix this: update the metadata +references when we do thread renumbering. This is messy because that +data *isn't indexed*. The only way to find the records we need to +update is to scan them all. This isn't completely terrible because +it's a sequential scan and we could cache it in memory, but it +certainly isn't going to help notmuch new's performance. (My database +has 6,749 of these, which takes ~1 second to scan on a cold cache, +though that's with an SSD [1]). + + +But let me propose an idea I've been kicking around for a while: ghost +message documents. Rather than using user metadata for tracking these +missing messages, use regular documents with the exact same terms we +use now for message IDs and thread IDs, but with a Tghost term instead +of a Tmail term to distinguish their type. 
This solves the problem +using infrastructure we already have in place, simplifies the message +linking code, and may even make it faster. It's a schema update, but +a simple and fast one. I think the hardest part is that things like +notmuch_database_find_message would need to distinguish ghosts and +regular messages (which may require pre-fetching the Tghost or Tmail +posting list to do efficiently). + +This also sets us up to do some cool things in the future, though +they're more invasive. If we have message-like documents for these +ghosts, we can store other message-like metadata as well. If we store +tags on them, then we can keep tags around for deleted messages and +*reapply them* if the message comes back. This would finally fix the +races we have now where, if a message is renamed or moved during a +notmuch new, we may think it's deleted only to reindex it with default +tags on the next run. We could also pre-tag messages that haven't +been indexed yet, say from procmail or when sending a message. This +could simplify or even obviate notmuch insert. If we add message +ctimes as proposed by Dave Mazières, this would give us a place to +store and query ctimes of deleted messages (otherwise it's unclear how +you find out about deletions without a full database scan). In +effect, the database becomes truly monotonic. + +[1] Curious? + yes n | xapian-inspect postlist.DB | \ + awk '!/Key/ {next} /Key: \\x00\\xc0thread_id_/ {N++} /Key: \\x00\\xd0/ {exit} END {print N}'
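The receipt-order dependence in Mark's walkthrough, and the way ghost
documents would sidestep it, can be reproduced with a toy model. What
follows is a minimal Python sketch, not notmuch's code or schema: the
ToyDB class, its method names, the "newest thread-id wins" merge rule
(chosen to mirror the numbering in the walkthrough), and the message
IDs for messages 18 and 52 are all assumptions made for illustration.
With use_ghosts=False, dangling references live in a side table that
thread renumbering never visits; with use_ghosts=True they are stored
as ordinary documents of type "ghost", so renumbering reaches them.

```python
from itertools import permutations

M17 = "87wrkidfrh.fsf@pinto.chemeng.ucl.ac.uk"  # message 17 (ID from the thread)
UNSEEN = "e.fraga@ucl.ac.uk"                     # referenced but never received
# The IDs for messages 18 and 52 are hypothetical placeholders.
MAIL = {
    "17": (M17, ()),
    "18": ("msg18@example.invalid", (M17, UNSEEN)),
    "52": ("msg52@example.invalid", (UNSEEN,)),
}


class ToyDB:
    """Toy thread linker (an assumption-laden model, not notmuch itself)."""

    def __init__(self, use_ghosts):
        self.use_ghosts = use_ghosts
        self.next_tid = 1
        self.docs = {}      # message-id -> [kind, thread-id]; "indexed" documents
        self.metadata = {}  # message-id -> thread-id; unindexed dangling references

    def _merge(self, loser, winner):
        # Renumbering walks documents only; self.metadata is never touched.
        for doc in self.docs.values():
            if doc[1] == loser:
                doc[1] = winner

    def _dangling_tid(self, mid):
        """Thread-id recorded for a not-yet-seen message, or None."""
        if self.use_ghosts:
            doc = self.docs.get(mid)
            return doc[1] if doc and doc[0] == "ghost" else None
        return self.metadata.get(mid)

    def _set_dangling(self, mid, tid):
        if self.use_ghosts:
            self.docs[mid] = ["ghost", tid]
        else:
            self.metadata[mid] = tid

    def add(self, mid, refs=()):
        tid = self.next_tid            # every new message starts a fresh thread
        self.next_tid += 1
        old = self._dangling_tid(mid)  # did an earlier message reference us?
        self.metadata.pop(mid, None)
        self.docs[mid] = ["mail", tid]  # replaces a ghost for this ID, if any
        if old is not None:
            self._merge(old, tid)
        for ref in refs:
            doc = self.docs.get(ref)
            if doc and doc[0] == "mail":
                self._merge(doc[1], tid)
            else:
                old = self._dangling_tid(ref)
                if old is not None:
                    self._merge(old, tid)
                self._set_dangling(ref, tid)

    def threads(self):
        """Partition of received mail into threads (ghosts excluded)."""
        groups = {}
        for mid, (kind, tid) in self.docs.items():
            if kind == "mail":
                groups.setdefault(tid, set()).add(mid)
        return sorted(sorted(group) for group in groups.values())


def run(order, use_ghosts):
    db = ToyDB(use_ghosts)
    for name in order:
        mid, refs = MAIL[name]
        db.add(mid, refs)
    return db.threads()


for order in permutations(("17", "18", "52")):
    print(order, "metadata:", len(run(order, False)),
          "ghosts:", len(run(order, True)))
```

In this model the side-table variant leaves messages 17 and 18 apart
from 52 when 18 is received before 17, as in the second case above,
while the ghost variant yields a single thread for every receipt
order. The real fix would still have to deal with
notmuch_database_find_message telling ghosts and mail apart, which the
sketch reduces to a kind check.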