From: Vladimir Marek
Date: Mon, 2 Jun 2014 12:32:12 +0000 (+0200)
Subject: Deduplication ?
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=e933ee732462507cfcfa527c22e66c34e6be9024;p=notmuch-archives.git

Deduplication ?
---

diff --git a/0f/c9dac1c24a2d1c2173a8d681e5bbdebdd77c4e b/0f/c9dac1c24a2d1c2173a8d681e5bbdebdd77c4e
new file mode 100644
index 000000000..f8d26f781
--- /dev/null
+++ b/0f/c9dac1c24a2d1c2173a8d681e5bbdebdd77c4e
@@ -0,0 +1,96 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by olra.theworths.org (Postfix) with ESMTP id 61266431FBC
+ for ; Mon, 2 Jun 2014 06:22:48 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: -2.299
+X-Spam-Level:
+X-Spam-Status: No, score=-2.299 tagged_above=-999 required=5
+ tests=[RCVD_IN_DNSWL_MED=-2.3, UNPARSEABLE_RELAY=0.001]
+ autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id XIyWT63Mg4Er for ;
+ Mon, 2 Jun 2014 06:22:42 -0700 (PDT)
+X-Greylist: delayed 3017 seconds by postgrey-1.32 at olra;
+ Mon, 02 Jun 2014 06:22:41 PDT
+Received: from aserp1050.oracle.com (aserp1050.oracle.com [141.146.126.70])
+ (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
+ (No client certificate requested)
+ by olra.theworths.org (Postfix) with ESMTPS id 00227431FAE
+ for ; Mon, 2 Jun 2014 06:22:41 -0700 (PDT)
+Received: from aserp1040.oracle.com (aserp1040.oracle.com [141.146.126.69])
+ by aserp1050.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with
+ ESMTP id s52CWOLT026192
+ (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
+ for ; Mon, 2 Jun 2014 12:32:24 GMT
+Received: from ucsinet21.oracle.com (ucsinet21.oracle.com [156.151.31.93])
+ by aserp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with
+ ESMTP id s52CWKMI005824
+ (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
+ for ; Mon, 2 Jun 2014 12:32:21 GMT
+Received: from userz7022.oracle.com (userz7022.oracle.com [156.151.31.86])
+ by ucsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id
+ s52CWJ8M027689
+ (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
+ for ; Mon, 2 Jun 2014 12:32:20 GMT
+Received: from abhmp0007.oracle.com (abhmp0007.oracle.com [141.146.116.13])
+ by userz7022.oracle.com (8.14.5+Sun/8.14.4) with ESMTP id
+ s52CWHIY025494
+ for ; Mon, 2 Jun 2014 12:32:18 GMT
+Received: from virt.cz.oracle.com (/10.163.102.127)
+ by default (Oracle Beehive Gateway v4.0)
+ with ESMTP ; Mon, 02 Jun 2014 05:32:16 -0700
+Date: Mon, 2 Jun 2014 14:32:12 +0200
+From: Vladimir Marek
+To: notmuch@notmuchmail.org
+Subject: Deduplication ?
+Message-ID: <20140602123212.GA12639@virt.cz.oracle.com>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+User-Agent: Mutt/1.5.22.1-rc1 (2013-10-16)
+X-Source-IP: aserp1040.oracle.com [141.146.126.69]
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Mon, 02 Jun 2014 13:22:48 -0000
+
+Hi,
+
+I want to import a bigger chunk of archived messages into my notmuch
+database, about 100k messages. The problem is that I most probably
+already have quite a lot of those messages in the DB. Basically, I would
+like to add only those I don't have already.
+
+There are two possibilities:
+
+a) I add all the 100k messages and then remove the duplicates.
+
+b) I write a script which parses the message IDs of the
+   to-be-added messages and tries to match them against the notmuch DB,
+   adding only the files I can't find there already.
+
+Option b) might be the better one, but I started to play with the idea of
+deduplication. I'm thinking about listing all the message IDs stored in
+the DB, listing all files belonging to each ID, and deleting all but one.
+I'm also thinking about implementing some simple algorithm telling me
+whether the messages are really very similar, just to be sure I don't
+delete something I don't want to.
+
+Has anyone played with this idea?
+
+--
+ Vlad
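
Option b) from the message, together with the "are these really the same
message" check, could be sketched roughly as below. This is only an
illustration, not something taken from the thread: the function names
(message_id_of, known_ids_from_notmuch, split_new_and_known, similarity)
are made up for the example, and it assumes that
`notmuch search --output=messages '*'` prints one `id:<message-id>` line
per message in the database. Files without a Message-ID header are treated
as new, since they cannot be matched this way.

```python
import difflib
import subprocess
from email import policy
from email.parser import BytesParser
from pathlib import Path


def message_id_of(path):
    """Return the Message-ID of a mail file without angle brackets, or None."""
    with open(path, "rb") as fh:
        # Only the headers are needed, so skip parsing the body.
        headers = BytesParser(policy=policy.compat32).parse(fh, headersonly=True)
    raw = headers.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>") or None


def known_ids_from_notmuch():
    """Collect all message IDs from the notmuch CLI (assumed output: 'id:...')."""
    out = subprocess.run(
        ["notmuch", "search", "--output=messages", "*"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line[3:] for line in out.splitlines() if line.startswith("id:")}


def split_new_and_known(paths, known_ids):
    """Split candidate files into (new, already_known) by Message-ID."""
    new, known = [], []
    for path in paths:
        msgid = message_id_of(path)
        # A missing Message-ID (None) is never in known_ids, so it counts as new.
        (known if msgid in known_ids else new).append(path)
    return new, known


def body_of(path):
    """Return the text after the first blank line, i.e. the message body."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    _, _, body = text.partition("\n\n")
    return body


def similarity(path_a, path_b):
    """Ratio in [0, 1] of how similar two message bodies are (1.0 = identical)."""
    return difflib.SequenceMatcher(None, body_of(path_a), body_of(path_b)).ratio()


# Possible use (the archive path is hypothetical):
#   known_ids = known_ids_from_notmuch()
#   new, known = split_new_and_known(Path("/path/to/archive").rglob("*"), known_ids)
```

Only the bodies are compared because copies of one message delivered by
different routes usually differ in their Received: headers. For the
deduplication variant, `notmuch search --output=files id:<message-id>`
should list every file notmuch has recorded for one message, and
similarity() could then be run on those files before deleting any of them.
SequenceMatcher can be slow on very large messages, so a cheaper pre-check
such as comparing sizes may be worth adding first.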