From: David Edmondson
Date: Mon, 2 Jun 2014 17:25:42 +0000 (+0100)
Subject: Re: Deduplication ?
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=33da69f970e357604d7a49505d3faf54a3abfd31;p=notmuch-archives.git

Re: Deduplication ?
---

diff --git a/3c/c26265e3f53c1426aad77f5c2d25fbef87b222 b/3c/c26265e3f53c1426aad77f5c2d25fbef87b222
new file mode 100644
index 000000000..5b6a6d6cc
--- /dev/null
+++ b/3c/c26265e3f53c1426aad77f5c2d25fbef87b222
@@ -0,0 +1,93 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by olra.theworths.org (Postfix) with ESMTP id 1145A431FAF
+ for ; Mon, 2 Jun 2014 10:26:10 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: -2.299
+X-Spam-Level:
+X-Spam-Status: No, score=-2.299 tagged_above=-999 required=5
+ tests=[RCVD_IN_DNSWL_MED=-2.3, UNPARSEABLE_RELAY=0.001]
+ autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id RU0FL3VeG2sL for ;
+ Mon, 2 Jun 2014 10:26:04 -0700 (PDT)
+X-Greylist: delayed 13336 seconds by postgrey-1.32 at olra;
+ Mon, 02 Jun 2014 10:26:04 PDT
+Received: from aserp1040.oracle.com (aserp1040.oracle.com [141.146.126.69])
+ (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
+ (No client certificate requested)
+ by olra.theworths.org (Postfix) with ESMTPS id 90D4E431FAE
+ for ; Mon, 2 Jun 2014 10:26:04 -0700 (PDT)
+Received: from ucsinet22.oracle.com (ucsinet22.oracle.com [156.151.31.94])
+ by aserp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with
+ ESMTP id s52HPkfJ009837
+ (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
+ Mon, 2 Jun 2014 17:25:47 GMT
+Received: from aserz7022.oracle.com (aserz7022.oracle.com [141.146.126.231])
+ by ucsinet22.oracle.com (8.14.5+Sun/8.14.5) with ESMTP id
+ s52HPi16022038
+ (version=TLSv1/SSLv3
cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
+ Mon, 2 Jun 2014 17:25:45 GMT
+Received: from abhmp0014.oracle.com (abhmp0014.oracle.com [141.146.116.20])
+ by aserz7022.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id
+ s52HPiWt010428; Mon, 2 Jun 2014 17:25:44 GMT
+Received: from localhost (/81.149.164.25)
+ by default (Oracle Beehive Gateway v4.0)
+ with ESMTP ; Mon, 02 Jun 2014 10:25:44 -0700
+To: Jani Nikula , Mark Walters ,
+ Tomi Ollila ,
+ Vladimir Marek , notmuch@notmuchmail.org
+Subject: Re: Deduplication ?
+In-Reply-To: <87y4xfz1fi.fsf@nikula.org>
+References: <20140602123212.GA12639@virt.cz.oracle.com>
+ <87d2ers9mi.fsf@qmul.ac.uk>
+ <87ppirqtfa.fsf@qmul.ac.uk> <87y4xfz1fi.fsf@nikula.org>
+User-Agent: Notmuch/0.18 (http://notmuchmail.org) Emacs/24.3.1
+ (x86_64-pc-linux-gnu)
+Sender: david.edmondson@oracle.com
+From: David Edmondson
+Date: Mon, 02 Jun 2014 18:25:42 +0100
+Message-ID:
+MIME-Version: 1.0
+Content-Type: text/plain
+X-Source-IP: ucsinet22.oracle.com [156.151.31.94]
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Mon, 02 Jun 2014 17:26:10 -0000
+
+On Mon, Jun 02 2014, Jani Nikula wrote:
+>>> One should also have some message content heuristics to determine that the
+>>> content is indeed duplicate and not something totally different (not that
+>>> we can see the different content anyway... but...)
+>>
+>> That would be nice.
+>
+> And quite hard.
+
+Thinking about this a bit...
+
+The headers are likely to be different, so you could remove them (get
+rid of everything up to the first empty line).
+
+Various mailing lists add footers, so you would need to remove them (a
+regular expression based approach would catch most of them easily).
+
+The remaining content should be the same for identical messages, so a
+sensible hash (md5) could be used to compare.
+
+Although, some MTAs modify the body of the message when manipulating
+encoding. I don't know how to address this.
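
The three steps suggested in the message (drop the headers, strip list footers, hash what is left) could be sketched roughly as follows. This is only an illustration and not anything notmuch itself does: the function name `body_fingerprint` and the footer pattern `FOOTER_RE` are assumptions made up for this example, and the pattern only targets a Mailman-style footer (a line of underscores followed by a "... mailing list" line). It does not address the encoding problem raised in the last paragraph.

```python
import hashlib
import re

# Hypothetical footer pattern (an assumption for this sketch): a line of
# underscores, then a line ending in "mailing list", then anything up to
# the end of the message. Real lists use many other footer styles.
FOOTER_RE = re.compile(
    r"\n_{10,}\n.*?mailing list\n.*\Z",
    re.DOTALL | re.IGNORECASE,
)


def body_fingerprint(raw_message: str) -> str:
    """Return an md5 hex digest of the message body, ignoring headers
    and a trailing Mailman-style footer."""
    # Normalise line endings so CRLF and LF copies compare equal.
    text = raw_message.replace("\r\n", "\n")
    # Remove the headers: everything up to the first empty line.
    # If there is no empty line, the body is treated as empty.
    _, _, body = text.partition("\n\n")
    # Remove the list footer, if one matches.
    body = FOOTER_RE.sub("", body)
    # Ignore leading/trailing blank space and per-line trailing whitespace.
    body = "\n".join(line.rstrip() for line in body.strip().split("\n"))
    return hashlib.md5(body.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    direct = "From: a@example.org\nSubject: hi\n\nHello world\n"
    via_list = (
        "Received: from x\nFrom: a@example.org\nSubject: hi\n\n"
        "Hello world\n"
        "_______________________________________________\n"
        "notmuch mailing list\n"
        "notmuch@notmuchmail.org\n"
    )
    # Both copies should yield the same fingerprint.
    print(body_fingerprint(direct) == body_fingerprint(via_list))
```

Two copies of a message would then be treated as duplicates when their fingerprints match. md5 is used here only because the message suggests it; any stable hash would serve for comparison.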