From: Jani Nikula
To: David Edmondson, Mark Walters, Tomi Ollila, Vladimir Marek, notmuch@notmuchmail.org
Subject: Re: Deduplication ?
References: <20140602123212.GA12639@virt.cz.oracle.com> <87d2ers9mi.fsf@qmul.ac.uk> <87ppirqtfa.fsf@qmul.ac.uk> <87y4xfz1fi.fsf@nikula.org>
Date: Mon, 02 Jun 2014 21:29:26 +0300
Message-ID: <87vbsjyxkp.fsf@nikula.org>
List-Id: "Use and development of the notmuch mail system."

On Mon, 02 Jun 2014, David Edmondson wrote:
> On Mon, Jun 02 2014, Jani Nikula wrote:
>>>> One should also have some message content heuristics to determine that the
>>>> content is indeed duplicate and not something totally different (not that
>>>> we can see the different content anyway... but...)
>>>
>>> That would be nice.
>>
>> And quite hard.
>
> Thinking about this a bit...
>
> The headers are likely to be different, so you could remove them (get
> rid of everything up to the first empty line).
>
> Various mailing lists add footers, so you would need to remove them (a
> regular expression based approach would catch most of them easily).

This may work for text/plain messages, but for mime messages (and I
think text/html too) an extra layer of mime structure is usually
added. The problem becomes matching a subtree of mime structure, and
deciding the non-matching layer is noise that can be ignored.
The mailing list manager adding the extra layer may also decode and
reconstruct the existing parts instead of using them as-is.

> The remaining content should be the same for identical messages, so a
> sensible hash (md5) could be used to compare.
>
> Although, some MTAs modify the body of the message when manipulating
> encoding. I don't know how to address this.

Let's assume we can figure it all out and find the duplicates. The
question remains, which one to save and which ones to remove? For list
mail, perhaps you'd like to save the copy you received through the list
so you know it's list mail (and you could search for it using list-id:
header *cough* if we indexed that *cough*). Or perhaps you'd like to
save the copy you received directly because some lists let people have
their addresses filtered from cc: header before distributing.

More useful would probably be raising some flags if the heuristics
detect messages with the same message-id that are clearly *different*
messages. (Perhaps that's what Tomi was after to begin with?)

Finally, I personally wouldn't want any duplicates removed; rather I'd
like notmuch to index information across all duplicates, and provide UI
features to see the alternatives if desired.

BR,
Jani.
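
For illustration only, the text/plain heuristic quoted above (drop the
headers, strip a list footer, hash what remains) together with the
"same message-id, clearly different content" check could be sketched
roughly as below. This is a minimal sketch, not anything notmuch
implements: the names body_digest and conflicting_ids and the footer
pattern FOOTER_RE are hypothetical, the footer regex assumes a
Mailman-style line of underscores, and the MIME and re-encoding problems
raised in the reply are not handled at all.

```python
import hashlib
import re
from collections import defaultdict

# Assumption: a Mailman-style footer starts with a line of 10+ underscores
# and runs to the end of the message. Real lists vary widely.
FOOTER_RE = re.compile(r"\n_{10,}\n.*\Z", re.DOTALL)


def body_digest(raw: str) -> str:
    """Hash a text/plain message body, ignoring headers and list footer."""
    raw = raw.replace("\r\n", "\n")
    # Headers end at the first empty line; keep only what follows it.
    _, _, body = raw.partition("\n\n")
    body = FOOTER_RE.sub("", body)
    # Normalize trailing whitespace so trivial differences do not matter.
    body = "\n".join(line.rstrip() for line in body.strip().splitlines())
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def conflicting_ids(messages):
    """Return message-ids whose copies do not all share one body digest.

    `messages` is an iterable of (message_id, raw_text) pairs.
    """
    digests = defaultdict(set)
    for message_id, raw in messages:
        digests[message_id].add(body_digest(raw))
    return sorted(mid for mid, seen in digests.items() if len(seen) > 1)
```

With this, a direct copy and a list copy that differ only in headers and
footer hash the same, while two unrelated bodies sharing a message-id
are reported by conflicting_ids rather than silently merged or removed.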