From: Mark Walters <markwalters1009@gmail.com>
To: Tomi Ollila <tomi.ollila@iki.fi>,
	Vladimir Marek <Vladimir.Marek@oracle.com>, notmuch@notmuchmail.org
Subject: Re: Deduplication ?
In-Reply-To: <m2ppirs8ea.fsf@guru.guru-group.fi>
References: <20140602123212.GA12639@virt.cz.oracle.com>
	<87d2ers9mi.fsf@qmul.ac.uk> <m2ppirs8ea.fsf@guru.guru-group.fi>
User-Agent: Notmuch/0.15.2+615~g78e3a93 (http://notmuchmail.org) Emacs/23.4.1
	(i486-pc-linux-gnu)
Date: Mon, 02 Jun 2014 15:26:17 +0100
Message-ID: <87ppirqtfa.fsf@qmul.ac.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Precedence: list


Tomi Ollila <tomi.ollila@iki.fi> writes:

> On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:
>
>> Vladimir Marek <Vladimir.Marek@oracle.com> writes:
>>
>>> Hi,
>>>
>>> I want to import a bigger chunk of archived messages into my notmuch
>>> database. It's about 100k messages. The problem is that I most probably
>>> have quite a lot of those messages in the DB. Basically I would like to
>>> add only those I don't have already.
>>>
>>> There are two possibilities
>>>
>>> a) I will add all the 100k messages and then remove the duplicates.
>>>
>>> b) I will write a script which will parse the message ID's of the
>>>    to-be-added messages and try to match them to the notmuch DB. Adding
>>>    only files I can't find already.
>>>
>>> Ad b) might be the better option, but I started to play with the idea of
>>> deduplication. I'm thinking about listing all the message IDs stored in
>>> DB, listing all files belonging to the IDs and deleting all but one.
>>> Also I'm thinking about implementing some simple algorithm telling me
>>> whether the messages are really very similar. Just to be sure I don't
>>> delete something I don't want to.
>>>
>>> Was anyone playing with the idea?
>>
>> I am not sure what your use case is, but notmuch automatically
>> deduplicates: that is, if the message-id is one it has already seen, no
>> further indexing takes place. The only thing that happens is that the new
>> filename gets added to the list of filenames for the message.
>>
>> Thus importing should be almost as fast as if the message were not
>> there, and the database should be almost identical to what you would get
>> if you only imported the genuine new messages.
>>
>> If you want to save disk space then you could delete the duplicates
>> after with something like
>>
>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>> xargs -0
>
> What if there are 3 duplicates (or 4... ;)

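I was assuming that this was merging two duplicate-free bunches of
messages, but I guess the new 100000 might contain duplicates themselves.
In that case running the above repeatedly (i.e. until it is a no-op, with a
"notmuch new" between passes so the database forgets the deleted files)
would be fine.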
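
For what it's worth, here is a rough sketch of that repeated pass in
Python. The helper names (second_copies, reindex, dedupe) are made up for
illustration, and it assumes that "notmuch new" has to run between passes
so that the database drops the deleted filenames before the next search.
I have not run it against a real mail store, so treat it as a starting
point only.

```python
import os
import subprocess


def second_copies(query="*"):
    """List the second filename of every matching message that has one."""
    out = subprocess.run(
        ["notmuch", "search", "--output=files", "--format=text0",
         "--duplicate=2", query],
        check=True, capture_output=True,
    ).stdout
    # text0 output is NUL-separated, so odd filenames survive intact.
    return [os.fsdecode(p) for p in out.split(b"\0") if p]


def reindex():
    """Let notmuch notice the deleted files (assumed necessary here)."""
    subprocess.run(["notmuch", "new"], check=True)


def dedupe(list_copies=second_copies, remove=os.remove,
           refresh=reindex, max_passes=20):
    """Delete second copies until a pass finds nothing; return what went."""
    removed = []
    for _ in range(max_passes):
        extra = list_copies()
        if not extra:          # no-op pass: nothing left to delete
            break
        for path in extra:
            remove(path)
            removed.append(path)
        refresh()              # a third copy now becomes the "second" one
    return removed
```

Calling dedupe(remove=print, max_passes=1) first gives a dry run that only
prints what a single pass would delete.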

>
>>
>> (but please test it carefully first!)
>
> One should also have some message content heuristics to determine that the
> content is indeed duplicate and not something totally different (not that
> we can see the different content anyway... but...)

That would be nice.
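
One very crude version of such a heuristic, as a sketch only: hash the
decoded body parts of each file, ignoring the headers (Received: lines
and the like differ between copies anyway), and compare the digests. The
function names below are hypothetical, and this would wrongly report a
difference whenever a mailing list appends a footer to one copy.

```python
import email
import hashlib


def body_digest(raw):
    """Hash the decoded, whitespace-normalised body parts of a message."""
    msg = email.message_from_bytes(raw)
    digest = hashlib.sha256()
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        payload = part.get_payload(decode=True) or b""
        # Normalise line endings and trailing whitespace before hashing.
        lines = payload.replace(b"\r\n", b"\n").split(b"\n")
        text = b"\n".join(line.rstrip() for line in lines).strip()
        digest.update(part.get_content_type().encode("ascii"))
        digest.update(b"\0")
        digest.update(text)
        digest.update(b"\0")
    return digest.hexdigest()


def same_content(raw_a, raw_b):
    """True if two raw messages have identical normalised bodies."""
    return body_digest(raw_a) == body_digest(raw_b)
```

The raw bytes would come from reading the files that
"notmuch search --output=files" lists for a given message.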

Best wishes

Mark


>>
>> I would think something like this is better than trying to parse the
>> message-ids yourself.
>
>
>>
>> Best wishes
>>
>> Mark
>>
>
> Tomi
>
>
>>
>>>
>>> -- 
>>> 	Vlad