Return-Path: <david.edmondson@oracle.com>
X-Original-To: notmuch@notmuchmail.org
Delivered-To: notmuch@notmuchmail.org
Received: from localhost (localhost [127.0.0.1])
 by olra.theworths.org (Postfix) with ESMTP id 1BD75431FBC
 for <notmuch@notmuchmail.org>; Mon, 2 Jun 2014 06:43:59 -0700 (PDT)
X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
X-Spam-Status: No, score=-2.299 tagged_above=-999 required=5
 tests=[RCVD_IN_DNSWL_MED=-2.3, UNPARSEABLE_RELAY=0.001]
Received: from olra.theworths.org ([127.0.0.1])
 by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id 8QQ94+860zuW for <notmuch@notmuchmail.org>;
 Mon, 2 Jun 2014 06:43:51 -0700 (PDT)
Received: from aserp1050.oracle.com (aserp1050.oracle.com [141.146.126.70])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by olra.theworths.org (Postfix) with ESMTPS id 3C406431FAE
 for <notmuch@notmuchmail.org>; Mon, 2 Jun 2014 06:43:51 -0700 (PDT)
Received: from aserp1040.oracle.com (aserp1040.oracle.com [141.146.126.69])
 by aserp1050.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with
 ESMTP id s52DhmN7029906
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
 for <notmuch@notmuchmail.org>; Mon, 2 Jun 2014 13:43:49 GMT
Received: from ucsinet21.oracle.com (ucsinet21.oracle.com [156.151.31.93])
 by aserp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with
 ESMTP id s52DhluS025358
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
 for <notmuch@notmuchmail.org>; Mon, 2 Jun 2014 13:43:48 GMT
Received: from userz7021.oracle.com (userz7021.oracle.com [156.151.31.85])
 by ucsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL)
 for <notmuch@notmuchmail.org>; Mon, 2 Jun 2014 13:43:47 GMT
Received: from abhmp0013.oracle.com (abhmp0013.oracle.com [141.146.116.19])
 by userz7021.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id
 for <notmuch@notmuchmail.org>; Mon, 2 Jun 2014 13:43:46 GMT
Received: from localhost (/81.149.164.25)
 by default (Oracle Beehive Gateway v4.0)
 with ESMTP ; Mon, 02 Jun 2014 06:43:45 -0700
To: Vladimir Marek <Vladimir.Marek@oracle.com>, notmuch@notmuchmail.org
Subject: Re: Deduplication ?
In-Reply-To: <20140602123212.GA12639@virt.cz.oracle.com>
References: <20140602123212.GA12639@virt.cz.oracle.com>
User-Agent: Notmuch/0.18 (http://notmuchmail.org) Emacs/24.3.1
 (x86_64-pc-linux-gnu)
Sender: david.edmondson@oracle.com
From: David Edmondson <david.edmondson@oracle.com>
Date: Mon, 02 Jun 2014 14:43:43 +0100
Message-ID: <cuniooj1l68.fsf@gargravarr.hh.sledj.net>
Content-Type: text/plain
X-Source-IP: aserp1040.oracle.com [141.146.126.69]
X-Mailman-Approved-At: Mon, 02 Jun 2014 07:13:33 -0700
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.13
List-Id: "Use and development of the notmuch mail system."
 <notmuch.notmuchmail.org>
List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
 <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
List-Archive: <http://notmuchmail.org/pipermail/notmuch>
List-Post: <mailto:notmuch@notmuchmail.org>
List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
 <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
X-List-Received-Date: Mon, 02 Jun 2014 13:43:59 -0000

On Mon, Jun 02 2014, Vladimir Marek wrote:
> I want to import a bigger chunk of archived messages into my notmuch
> database, about 100k messages. The problem is that I most probably
> already have quite a lot of those messages in the DB. Basically, I would
> like to add only those I don't have already.
>
> There are two possibilities:
>
> a) Add all 100k messages and then remove the duplicates.
>
> b) Write a script that parses the message IDs of the to-be-added
> messages and tries to match them against the notmuch DB, adding only
> the files I can't find already.
>
> Option b) might be the better one, but I started to play with the idea
> of deduplication. I'm thinking about listing all the message IDs stored
> in the DB, listing all files belonging to each ID, and deleting all but
> one. I'm also thinking about implementing some simple algorithm that
> tells me whether the messages are really very similar, just to be sure
> I don't delete something I don't want to.
>
> Has anyone played with this idea?

notsync[1] checked whether a message id was already present in the store
to decide whether to add a message from an IMAP server, but it is old,
crufty, unused and unloved code.
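
For what it's worth, option b) could be sketched roughly as below. This is
not how notsync does it. It is a minimal Python sketch which assumes that
`notmuch search --output=messages '*'` prints one `id:<message-id>` line per
message; the function names (`message_id_of`, `known_ids_from_notmuch`,
`files_to_add`) are made up for illustration. Treating files without a
Message-ID header as new is also an assumption.

```python
import email.parser
import subprocess


def message_id_of(path):
    """Return the Message-ID of a mail file without angle brackets, or None."""
    with open(path, "rb") as f:
        # Only the headers are needed, so skip parsing the body.
        headers = email.parser.BytesHeaderParser().parse(f)
    raw = headers.get("Message-ID")
    if raw is None:
        return None
    return str(raw).strip().strip("<>") or None


def known_ids_from_notmuch():
    """Collect every message id in the notmuch database.

    Assumption: each output line has the form "id:<message-id>".
    """
    out = subprocess.run(
        ["notmuch", "search", "--output=messages", "*"],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line[len("id:"):] for line in out.splitlines()
            if line.startswith("id:")}


def files_to_add(paths, known_ids):
    """Return the files whose Message-ID is neither known nor already seen."""
    seen = set(known_ids)
    result = []
    for path in paths:
        mid = message_id_of(path)
        if mid is None:
            # Assumption: a file without a Message-ID is treated as new.
            result.append(path)
            continue
        if mid in seen:
            continue  # already in the DB, or a duplicate within the archive
        seen.add(mid)
        result.append(path)
    return result
```

Something like `files_to_add(archive_files, known_ids_from_notmuch())` would
then give the list of files worth copying into the mail store before running
`notmuch new`. It compares message ids only, so it says nothing about whether
two files with the same id really have similar content.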
\r
[1] https://github.com/dme/notsync