From 0ff8e168f0b78b727affe1372404d66b141438f7 Mon Sep 17 00:00:00 2001
From: David Bremner
Date: Sun, 20 Apr 2014 21:59:26 +0900
Subject: [PATCH] Re: [RFC PATCH] Re: excessive thread fusing

---
 32/6e489b2f8098a071a15b42d0b353708ff99b12 | 96 +++++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 32/6e489b2f8098a071a15b42d0b353708ff99b12

diff --git a/32/6e489b2f8098a071a15b42d0b353708ff99b12 b/32/6e489b2f8098a071a15b42d0b353708ff99b12
new file mode 100644
index 000000000..425bcb76b
--- /dev/null
+++ b/32/6e489b2f8098a071a15b42d0b353708ff99b12
@@ -0,0 +1,96 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by olra.theworths.org (Postfix) with ESMTP id 17791431FBD
+ for ; Sun, 20 Apr 2014 05:59:57 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: 0
+X-Spam-Level:
+X-Spam-Status: No, score=0 tagged_above=-999 required=5 tests=[none]
+ autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id BaCWYoFZAnqA for ;
+ Sun, 20 Apr 2014 05:59:49 -0700 (PDT)
+Received: from mx.xen14.node3324.gplhost.com (gitolite.debian.net
+ [87.98.215.224]) (using TLSv1 with cipher AES256-SHA (256/256 bits))
+ (No client certificate requested)
+ by olra.theworths.org (Postfix) with ESMTPS id EB126431FBC
+ for ; Sun, 20 Apr 2014 05:59:48 -0700 (PDT)
+Received: from remotemail by mx.xen14.node3324.gplhost.com with local (Exim
+ 4.72) (envelope-from )
+ id 1WbrLV-0005Wu-Od; Sun, 20 Apr 2014 12:59:45 +0000
+Received: (nullmailer pid 17456 invoked by uid 1000); Sun, 20 Apr 2014
+ 12:59:26 -0000
+From: David Bremner
+To: Carl Worth , Mark Walters ,
+ notmuch
+Subject: Re: [RFC PATCH] Re: excessive thread fusing
+In-Reply-To: <87oazwjq1e.fsf@yoom.home.cworth.org>
+References: <87ioq5mrbz.fsf@maritornes.cs.unb.ca> <87fvl8mpzj.fsf@qmul.ac.uk>
+ <87oazwjq1e.fsf@yoom.home.cworth.org>
+User-Agent: Notmuch/0.17+202~gb65f328 (http://notmuchmail.org) Emacs/24.3.1
+ (x86_64-pc-linux-gnu)
+Date: Sun, 20 Apr 2014 21:59:26 +0900
+Message-ID: <87fvl8upg1.fsf@maritornes.cs.unb.ca>
+MIME-Version: 1.0
+Content-Type: text/plain
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Sun, 20 Apr 2014 12:59:57 -0000
+
+Carl Worth writes:
+>
+> Another idea would be to trigger specifically on common forms. Judging
+> from the samples in this particular thread, it seems like a workable
+> heuristic would be:
+>
+> If the In-Reply-To header begins with '<':
+>
+> Parse that initial portion as a message ID
+>
+> Else if it ends with '>':
+>
+> Parse that final portion as a message ID
+>
+> Else
+>
+> Ignore this garbage-valued header.
+>
+
+Using the hacky script below, I scanned my own mail collection of about
+300k messages. I can make the following observations:
+
+- I have some RFC-compliant in-reply-to's with multiple ids
+- I have a non-trivial number of the form "Message from $NAME of $date"
+- I didn't see any cases where using the last angle-bracketed thing
+  would fail.
+- I did see some cases where the header starts with '<' but the
+  matching '>' was missing
+- I also noticed some RFC 2047 encoding of in-reply-to headers.
+
+
+######################################################################
+# hacky script follows
+dir=$1
+echo "Scanning $dir"
+
+tempdir=$(mktemp -d)
+echo "Writing to ${tempdir}"
+
+find "$dir" -type f -exec sh -c 'formail -c -xIn-reply-to < "$1"' sh {} \; \
+ > "${tempdir}/ids"
+
+sed -e 's/\t/ /g' -e 's/  */ /g' -e 's/<[^ ]*>//g' -e 's/(.*)/(comment)/' < "${tempdir}/ids" | sort | uniq | tee "${tempdir}/report"
-- 
2.26.2
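
The heuristic quoted in the message above can be sketched in a few lines of POSIX shell. This is only an illustration, not code from notmuch: the function name parse_in_reply_to is hypothetical, and the extra check that rejects a candidate containing '<' or whitespace is an assumption added for the sketch. Note that a header which starts with '<' but has no matching '>' (one of the cases observed in the scan) falls through to the "garbage" branch here.

```sh
# Hypothetical helper (not part of notmuch): print one message ID taken
# from an In-Reply-To value, or return 1 if the value looks like garbage.
parse_in_reply_to() {
    value=$1
    case $value in
        '<'*'>'*)
            # Begins with '<': take the text up to the first '>'.
            id=${value#<}
            id=${id%%>*}
            ;;
        *'<'*'>')
            # Ends with '>': take the text after the last '<'.
            id=${value##*<}
            id=${id%>}
            ;;
        *)
            # Neither form matches: ignore the header.
            return 1
            ;;
    esac
    # Assumption for this sketch: a usable ID is non-empty and contains
    # neither '<' nor whitespace.
    case $id in
        ''|*'<'*|*[[:space:]]*) return 1 ;;
    esac
    printf '%s\n' "$id"
}
```

Called as parse_in_reply_to "$header", it prints the bare ID without angle brackets, so a caller can tell a usable value from an ignored one by the exit status.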