Date: Mon, 2 Jun 2014 14:32:12 +0200
From: Vladimir Marek
To: notmuch@notmuchmail.org
Subject: Deduplication ?
Message-ID: <20140602123212.GA12639@virt.cz.oracle.com>
List-Id: "Use and development of the notmuch mail system."

Hi,

I want to import a larger batch of archived messages into my notmuch database, about 100k messages. The problem is that I most probably already have quite a lot of those messages in the DB. Basically, I would like to add only those I don't already have. There are two possibilities:

a) Add all 100k messages and then remove the duplicates.

b) Write a script that parses the Message-IDs of the to-be-added messages and tries to match them against the notmuch DB, adding only the files I can't find there.

Option b) might be the better one, but I started to play with the idea of deduplication. I'm thinking about listing all the message IDs stored in the DB, listing all files belonging to each ID, and deleting all but one. I'm also thinking about implementing some simple algorithm that tells me whether the messages sharing an ID really are very similar, just to be sure I don't delete something I don't want to.

Has anyone played with this idea?

-- 
Vlad
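
For illustration, here is a rough sketch of what option b) and the similarity check could look like in Python. It is only a sketch under stated assumptions: the function names are made up for this example, the set of known IDs is assumed to come from `notmuch search --output=messages '*'` (which prints one `id:...` term per line), and the similarity check simply compares the raw text after the header block using the standard `difflib` module as a stand-in for "some simple algorithm".

```python
import difflib
import email
import subprocess
from pathlib import Path


def message_id(path):
    """Return the Message-ID of a mail file without angle brackets, or None."""
    with open(path, "rb") as fh:
        msg = email.message_from_binary_file(fh)
    raw = msg.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>")


def known_ids_from_notmuch():
    """Collect the IDs already indexed (assumes `notmuch` is on PATH)."""
    out = subprocess.run(
        ["notmuch", "search", "--output=messages", "*"],
        check=True, capture_output=True, text=True,
    ).stdout
    # Each output line is expected to look like "id:<message-id>".
    return {line[3:] for line in out.splitlines() if line.startswith("id:")}


def files_to_import(candidates, known_ids):
    """Option b): keep only files whose Message-ID is not known yet.

    Files without a Message-ID are kept, since they cannot be matched.
    Duplicates inside the candidate list itself are also skipped.
    """
    seen = set(known_ids)
    new_files = []
    for path in candidates:
        mid = message_id(path)
        if mid is None or mid not in seen:
            new_files.append(path)
            if mid is not None:
                seen.add(mid)
    return new_files


def body_similarity(path_a, path_b):
    """Ratio in [0, 1] comparing everything after the first blank line."""
    def body(path):
        text = Path(path).read_bytes().decode("utf-8", errors="replace")
        text = text.replace("\r\n", "\n")
        _, _, rest = text.partition("\n\n")
        return rest
    return difflib.SequenceMatcher(None, body(path_a), body(path_b)).ratio()
```

A possible use would be `files_to_import(paths, known_ids_from_notmuch())` to pick the files worth copying into the mail store, and `body_similarity(a, b)` on two files sharing an ID before deleting one of them. Note that the comparison ignores transfer encodings and list footers, and `SequenceMatcher` gets slow on large bodies, so a real tool would likely want something more careful.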