To: Jani Nikula, Mark Walters, Tomi Ollila, Vladimir Marek, notmuch@notmuchmail.org
Subject: Re: Deduplication ?
In-Reply-To: <87y4xfz1fi.fsf@nikula.org>
References: <20140602123212.GA12639@virt.cz.oracle.com> <87d2ers9mi.fsf@qmul.ac.uk> <87ppirqtfa.fsf@qmul.ac.uk> <87y4xfz1fi.fsf@nikula.org>
From: David Edmondson
Date: Mon, 02 Jun 2014 18:25:42 +0100

On Mon, Jun 02 2014, Jani Nikula wrote:
>>> One should also have some message content heuristics to determine that the
>>> content is indeed duplicate and not something totally different (not that
>>> we can see the different content anyway... but...)
>>
>> That would be nice.
>
> And quite hard.

Thinking about this a bit...

The headers are likely to be different, so you could remove them (get
rid of everything up to the first empty line).

Various mailing lists add footers, so you would need to remove them (a
regular expression based approach would catch most of them easily).

The remaining content should be the same for identical messages, so a
sensible hash (md5) could be used to compare.

Although, some MTAs modify the body of the message when manipulating
encoding. I don't know how to address this.
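
The steps above could be sketched roughly as follows in Python. This is only an illustration, not anything notmuch does: the function names are made up, and the footer pattern assumes a Mailman-style footer (a line of underscores followed by list details running to the end of the message), which will not match every list.

```python
import hashlib
import re

# Assumed footer shape: a line of ten or more underscores, then anything up
# to the end of the message. Other lists would need other patterns.
FOOTER_RE = re.compile(r"\n_{10,}\n.*\Z", re.DOTALL)


def body_fingerprint(raw_message: str) -> str:
    """Return the MD5 hex digest of a message body, without headers or footer."""
    # Normalise line endings so CRLF and LF copies compare equal.
    text = raw_message.replace("\r\n", "\n")
    # Remove the headers: everything up to and including the first empty line.
    _, _, body = text.partition("\n\n")
    # Remove a trailing list footer if the assumed pattern matches.
    body = FOOTER_RE.sub("", body)
    # Ignore leading and trailing whitespace differences.
    body = body.strip()
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def is_duplicate(first: str, second: str) -> bool:
    """Treat two raw messages as duplicates when their body digests match."""
    return body_fingerprint(first) == body_fingerprint(second)
```

Note that a line of underscores inside the body itself would also be treated as the start of a footer, and nothing here helps with the case where an MTA has re-encoded the body.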