From: Tomi Ollila <tomi.ollila@iki.fi>
Date: Mon, 2 Jun 2014 14:17:33 +0000 (+0300)
Subject: Re: Deduplication ?
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=0598cba3b9cc98a5dd25a95b1988846a6622548b;p=notmuch-archives.git

Re: Deduplication ?
---

diff --git a/bd/4f05e435130a53c4ff6bb4bf59008f52f6fc0f b/bd/4f05e435130a53c4ff6bb4bf59008f52f6fc0f
new file mode 100644
index 000000000..d1c2471a3
--- /dev/null
+++ b/bd/4f05e435130a53c4ff6bb4bf59008f52f6fc0f
@@ -0,0 +1,122 @@
+Return-Path: <tomi.ollila@iki.fi>
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+	by olra.theworths.org (Postfix) with ESMTP id 55757431FAF
+	for <notmuch@notmuchmail.org>; Mon,  2 Jun 2014 07:17:47 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: 0
+X-Spam-Level: 
+X-Spam-Status: No, score=0 tagged_above=-999 required=5 tests=[none]
+	autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+	by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+	with ESMTP id a6qh5T1ss-aH for <notmuch@notmuchmail.org>;
+	Mon,  2 Jun 2014 07:17:38 -0700 (PDT)
+Received: from guru.guru-group.fi (guru.guru-group.fi [46.183.73.34])
+	by olra.theworths.org (Postfix) with ESMTP id 6B477431FAE
+	for <notmuch@notmuchmail.org>; Mon,  2 Jun 2014 07:17:38 -0700 (PDT)
+Received: from guru.guru-group.fi (localhost [IPv6:::1])
+	by guru.guru-group.fi (Postfix) with ESMTP id D19C710005E;
+	Mon,  2 Jun 2014 17:17:33 +0300 (EEST)
+From: Tomi Ollila <tomi.ollila@iki.fi>
+To: Mark Walters <markwalters1009@gmail.com>,
+	Vladimir Marek <Vladimir.Marek@oracle.com>, notmuch@notmuchmail.org
+Subject: Re: Deduplication ?
+In-Reply-To: <87d2ers9mi.fsf@qmul.ac.uk>
+References: <20140602123212.GA12639@virt.cz.oracle.com>
+	<87d2ers9mi.fsf@qmul.ac.uk>
+User-Agent: Notmuch/0.18+28~gcecaba1 (http://notmuchmail.org) Emacs/24.3.1
+	(x86_64-unknown-linux-gnu)
+X-Face: HhBM'cA~<r"^Xv\KRN0P{vn'Y"Kd;zg_y3S[4)KSN~s?O\"QPoL
+	$[Xv_BD:i/F$WiEWax}R(MPS`^UaptOGD`*/=@\1lKoVa9tnrg0TW?"r7aRtgk[F
+	!)g;OY^,BjTbr)Np:%c_o'jj,Z
+Date: Mon, 02 Jun 2014 17:17:33 +0300
+Message-ID: <m2ppirs8ea.fsf@guru.guru-group.fi>
+MIME-Version: 1.0
+Content-Type: text/plain
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+	<notmuch.notmuchmail.org>
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
+	<mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>
+List-Post: <mailto:notmuch@notmuchmail.org>
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
+	<mailto:notmuch-request@notmuchmail.org?subject=subscribe>
+X-List-Received-Date: Mon, 02 Jun 2014 14:17:47 -0000
+
+On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:
+
+> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
+>
+>> Hi,
+>>
+>> I want to import a bigger chunk of archived messages into my notmuch
+>> database. It's about 100k messages. The problem is that I most probably
+>> have quite a lot of those messages in the DB. Basically I would like to
+>> add only those I don't have already.
+>>
+>> There are two possibilities:
+>>
+>> a) I will add all the 100k messages and then remove the duplicates.
+>>
+>> b) I will write a script which will parse the message IDs of the
+>>    to-be-added messages and try to match them to the notmuch DB, adding
+>>    only the files I can't find already.
+>>
+>> Ad b) might be the better option, but I started to play with the idea of
+>> deduplication. I'm thinking about listing all the message IDs stored in
+>> DB, listing all files belonging to the IDs and deleting all but one.
+>> Also I'm thinking about implementing some simple algorithm telling me
+>> whether the messages are really very similar. Just to be sure I don't
+>> delete something I don't want to.
+>>
+>> Has anyone played with the idea?
+>
+> I am not sure what your use case is, but notmuch automatically
+> deduplicates: that is, if the message-id is one it has already seen, no
+> further indexing takes place. The only thing that happens is the new
+> filename gets added to the list of filenames for the message.
+>
+> Thus importing should be almost as fast as if the message were not
+> there, and the database should be almost identical to what you would get
+> if you only imported the genuinely new messages.
+>
+> If you want to save disk space then you could delete the duplicates
+> afterwards with something like
+>
+> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
+> xargs -0
+
+What if there are 3 duplicates (or 4... ;)
+
+>
+> (but please test it carefully first!)
+
+One should also have some message content heuristics to determine that the
+content is indeed duplicate and not something totally different (not that
+we can see the different content anyway... but...)
+
+>
+> I would think something like this is better than trying to parse the
+> message-ids yourself.
+
+
+>
+> Best wishes
+>
+> Mark
+>
+
+Tomi
+
+
+>
+>>
+>> -- 
+>> 	Vlad