From: Jani Nikula
Date: Mon, 2 Jun 2014 18:29:26 +0000 (+0300)
Subject: Re: Deduplication ?
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=8511a0fdbcd843d917cf5975d57e287e8480974d;p=notmuch-archives.git

Re: Deduplication ?
---

diff --git a/d5/a8aece93c85a8478f881a2f18672bee6fee8a7 b/d5/a8aece93c85a8478f881a2f18672bee6fee8a7
new file mode 100644
index 000000000..0e95dc1da
--- /dev/null
+++ b/d5/a8aece93c85a8478f881a2f18672bee6fee8a7
@@ -0,0 +1,122 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by olra.theworths.org (Postfix) with ESMTP id 116D1431FAF
+ for ; Mon, 2 Jun 2014 11:29:39 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: -0.7
+X-Spam-Level:
+X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5
+ tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id d5Hd1BqBWelG for ;
+ Mon, 2 Jun 2014 11:29:31 -0700 (PDT)
+Received: from mail-wg0-f42.google.com (mail-wg0-f42.google.com
+ [74.125.82.42]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client
+ certificate requested) by olra.theworths.org (Postfix) with ESMTPS id
+ CB6A7431FAE for ; Mon, 2 Jun 2014 11:29:30 -0700
+ (PDT)
+Received: by mail-wg0-f42.google.com with SMTP id y10so5465822wgg.25
+ for ; Mon, 02 Jun 2014 11:29:29 -0700 (PDT)
+X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
+ d=1e100.net; s=20130820;
+ h=x-gm-message-state:from:to:subject:in-reply-to:references
+ :user-agent:date:message-id:mime-version:content-type;
+ bh=94KOYEw+4phKzcpXqklJDclTdrYAbM7A50Le9JpUlLo=;
+ b=jVdK7IFpDTPGR0wA1mPjxFIcXVvBLd2lM6xtDRmMW5uFjbrHuMjikZX+EkFkX1miVz
+ rBqm1XkfJUZpALD0f3Jsjx8kZRfx1KDkY2OsSEM3iAFTZjJYJUGxDOyrJUm9yxFdWIlv
+ 4medO1LKEoY6/kMJRArOv87lNGsdrLptq4keljp+HxAXioElOghDPe2Jz2uu6XecILcz
+ Lqas6BWIxCpPXFsEyf+HbyQeI4MnKJGM0+n/z7KYkYK+h4fcLzAimwglpscCrXWkzNK6
+ yayt1H7kfXaaDZv0q9wVn4/HttbNdas6Nx24lDNfIIEZeH7j8BycYVnQTr+xM9/2i6Xx
+ ALOw==
+X-Gm-Message-State:
+ ALoCoQlXKq7TIgXGmphuTVSBlkAHKo/hWccCbMmzLsBh4XSCj1j5FStrtsY7mzDEIo00PuF4A1Rf
+X-Received: by 10.180.73.66 with SMTP id j2mr1723831wiv.36.1401733769370;
+ Mon, 02 Jun 2014 11:29:29 -0700 (PDT)
+Received: from localhost (dsl-hkibrasgw2-58c36f-91.dhcp.inet.fi.
+ [88.195.111.91]) by mx.google.com with ESMTPSA id
+ gp6sm34626931wib.12.2014.06.02.11.29.27 for
+ (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
+ Mon, 02 Jun 2014 11:29:28 -0700 (PDT)
+From: Jani Nikula
+To: David Edmondson ,
+ Mark Walters , Tomi Ollila ,
+ Vladimir Marek , notmuch@notmuchmail.org
+Subject: Re: Deduplication ?
+In-Reply-To:
+References: <20140602123212.GA12639@virt.cz.oracle.com>
+ <87d2ers9mi.fsf@qmul.ac.uk>
+ <87ppirqtfa.fsf@qmul.ac.uk> <87y4xfz1fi.fsf@nikula.org>
+
+User-Agent: Notmuch/0.18+24~gfe8cd90 (http://notmuchmail.org) Emacs/24.3.1
+ (x86_64-pc-linux-gnu)
+Date: Mon, 02 Jun 2014 21:29:26 +0300
+Message-ID: <87vbsjyxkp.fsf@nikula.org>
+MIME-Version: 1.0
+Content-Type: text/plain
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Mon, 02 Jun 2014 18:29:39 -0000
+
+On Mon, 02 Jun 2014, David Edmondson wrote:
+> On Mon, Jun 02 2014, Jani Nikula wrote:
+>>>> One should also have some message content heuristics to determine that the
+>>>> content is indeed duplicate and not something totally different (not that
+>>>> we can see the different content anyway... but...)
+>>>
+>>> That would be nice.
+>>
+>> And quite hard.
+>
+> Thinking about this a bit...
+>
+> The headers are likely to be different, so you could remove them (get
+> rid of everything up to the first empty line).
+>
+> Various mailing lists add footers, so you would need to remove them (a
+> regular expression based approach would catch most of them easily).
+
+This may work for text/plain messages, but for mime messages (and I
+think text/html too) an extra layer of mime structure is usually
+added. The problem becomes matching a subtree of mime structure, and
+deciding the non-matching layer is noise that can be ignored. The
+mailing list manager adding the extra layer may also decode and
+reconstruct the existing parts instead of using them as-is.
+
+> The remaining content should be the same for identical messages, so a
+> sensible hash (md5) could be used to compare.
+>
+> Although, some MTAs modify the body of the message when manipulating
+> encoding. I don't know how to address this.
+
+Let's assume we can figure it all out and find the duplicates. The
+question remains, which one to save and which ones to remove? For list
+mail, perhaps you'd like to save the copy you received through the list
+so you know it's list mail (and you could search for it using list-id:
+header *cough* if we indexed that *cough*). Or perhaps you'd like to
+save the copy you received directly because some lists let people have
+their addresses filtered from cc: header before distributing.
+
+More useful would probably be raising some flags if the heuristics
+detect messages with the same message-id that are clearly *different*
+messages. (Perhaps that's what Tomi was after to begin with?)
+
+Finally, I personally wouldn't want any duplicates removed; rather I'd
+like notmuch to index information across all duplicates, and provide UI
+features to see the alternatives if desired.
+
+BR,
+Jani.
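
The quoted proposal (drop the headers up to the first empty line, strip list footers with a regular expression, then compare an md5 hash of what remains) can be sketched roughly as below. This is only an illustration for text/plain messages, not anything notmuch implements: the function name body_digest and the footer pattern FOOTER_RE are hypothetical, and the pattern assumes a Mailman-style footer introduced by a line of underscores. It does not attempt the MIME subtree matching or re-encoding problems raised in the reply.

```python
import hashlib
import re

# Hypothetical footer pattern (an assumption, not taken from any list's
# configuration): a line of ten or more underscores, then everything to the
# end of the message.
FOOTER_RE = re.compile(r"\n_{10,}\n.*\Z", re.DOTALL)


def body_digest(raw_message: str) -> str:
    """Return an MD5 hex digest of a text/plain body, minus headers and footer."""
    # Normalize line endings so CRLF and LF copies compare equal.
    text = raw_message.replace("\r\n", "\n")
    # Drop everything up to and including the first empty line (the headers).
    _, separator, body = text.partition("\n\n")
    if not separator:
        body = ""  # no empty line found: treat the message as headers only
    # Strip a trailing list footer, if the assumed pattern matches.
    body = FOOTER_RE.sub("", body)
    # Ignore trailing-whitespace differences between copies.
    body = "\n".join(line.rstrip() for line in body.split("\n")).strip()
    return hashlib.md5(body.encode("utf-8")).hexdigest()
```

Two copies of a message would then be treated as candidate duplicates when body_digest returns the same value for both. A body that itself contains a line of underscores would be cut short by this pattern, which is one more reason to use the digest to flag candidates rather than to delete anything.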