From: David Edmondson
To: Vladimir Marek, notmuch@notmuchmail.org
Subject: Re: Deduplication ?
Date: Mon, 02 Jun 2014 14:43:43 +0100
In-Reply-To: <20140602123212.GA12639@virt.cz.oracle.com>
References: <20140602123212.GA12639@virt.cz.oracle.com>

On Mon, Jun 02 2014, Vladimir Marek wrote:
> Hi,
>
> I want to import bigger chunk of archived messages into my notmuch
> database. It's about 100k messages. The problem is, that I most probably
> have quite a lot of those messages in the DB. Basically I would like to
> add only those I don't have already.
>
> There are two possibilities
>
> a) I will add all the 100k messages and then remove the duplicities.
>
> b) I will write a script which will parse the message ID's of the
> to-be-added messages and try to match them to the notmuch DB. Adding
> only files I can't find already.
>
> Ad b) might be better option, but I started to play with the idea of
> deduplication. I'm thinking about listing all the message IDs stored in
> DB, listing all files belonging to the IDs and deleting all but one.
> Also I'm thinking about implementing some simple algorithm telling me
> whether the messages are really very similar. Just to be sure I don't
> delete something I don't want to.
>
> Was anyone playing with the idea?

notsync[1] used the (lack of) existence of a message id in the store to
decide whether to add something from an IMAP server, but it is old,
crufty, unused and unloved code.

> --
> Vlad

Footnotes:
[1] https://github.com/dme/notsync
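
As a rough illustration of the message-id check discussed in this thread
(option b in the quoted message), a minimal Python sketch might look like
the following. It is not taken from notsync; the function names
message_id_of, known_ids_from_notmuch and files_to_add are made up for
this example. It assumes that `notmuch search --output=messages '*'`
prints one `id:`-prefixed message id per line, and it treats files
without a Message-ID header as new, which is a choice made here rather
than anything notmuch requires.

```python
import subprocess
from email.parser import BytesParser


def message_id_of(path):
    """Return the Message-ID of a mail file without angle brackets, or None."""
    with open(path, "rb") as f:
        # Only the headers are needed, so skip parsing the body.
        msg = BytesParser().parse(f, headersonly=True)
    raw = msg.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>")


def known_ids_from_notmuch():
    """Collect every message id already in the notmuch database.

    Assumption: the output is one 'id:<message-id>' entry per line.
    """
    out = subprocess.run(
        ["notmuch", "search", "--output=messages", "*"],
        check=True, capture_output=True, text=True,
    ).stdout
    prefix = "id:"
    return {line[len(prefix):] for line in out.splitlines()
            if line.startswith(prefix)}


def files_to_add(paths, known_ids):
    """Return the paths whose Message-ID is not yet known.

    Duplicates inside the batch are dropped too: only the first file
    seen for each id is kept. Files with no Message-ID are kept.
    """
    seen = set(known_ids)
    keep = []
    for path in paths:
        mid = message_id_of(path)
        if mid is None:
            keep.append(path)
        elif mid not in seen:
            seen.add(mid)
            keep.append(path)
    return keep
```

Something like files_to_add(archive_paths, known_ids_from_notmuch())
would then give the list of files worth copying into the mail store
before running `notmuch new`. Matching on message id alone says nothing
about whether two files with the same id really have the same content,
so the similarity check mentioned in the quoted message would still be
a separate step.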