From: Eirik Byrkjeflot Anonsen <eirik@eirikba.org>
To: notmuch@notmuchmail.org
Subject: Re: Automatic suppression of non-duplicate messages
References: <87mwyz3s9d.fsf@star.eba> <87390qxvb4.fsf@maritornes.cs.unb.ca>
Date: Sun, 04 Nov 2012 11:06:18 +0100
In-Reply-To: <87390qxvb4.fsf@maritornes.cs.unb.ca> (David Bremner's message of
	"Sat, 03 Nov 2012 16:53:19 -0400")
Message-ID: <87wqy1u1gl.fsf@star.eba>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Precedence: list

David Bremner <david@tethera.net> writes:

> Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:
>
>> That's not what I see.  If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> __notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.

Hmm, depends.  Assuming indexing is intended to be used for searching,
one might want to search for something that occurs in one subject but
not the other.  In practice I doubt it matters.


> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).

I don't think the real problem here is the duplicate detection algorithm
itself.  It is rather that notmuch forces a particular duplicate
detection algorithm on its users.  Duplicate detection should really be
delegated to a different application, thus allowing people to experiment
with whatever algorithm works best for them.  (Just like notmuch
delegates the choice of initial tags on messages to an external
application.)

But first notmuch must be modified so it can sensibly treat multiple
instances having the same message-id as separate messages.  That seems
to me to be the hard part.  (And some way for external applications to
join and split copies, of course.)
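
To make the join/split idea a bit more concrete, here is a toy sketch in
Python.  Nothing in it is notmuch API: InstanceStore and its methods are
names I made up, and a real implementation would live in the database
layer rather than in a dict.  It only shows the shape of the data: one
message-id maps to several groups, and each group of files is treated
as one message.

```python
class InstanceStore:
    """Toy in-memory model (hypothetical, not notmuch code).

    Each Message-Id maps to a list of groups; each group is a set of
    file paths that are treated as copies of one message.
    """

    def __init__(self):
        self.groups = {}  # message-id -> list of sets of paths

    def add(self, message_id, path):
        # A new instance starts out as a separate message.
        self.groups.setdefault(message_id, []).append({path})

    def _group_of(self, message_id, path):
        return next(g for g in self.groups[message_id] if path in g)

    def join(self, message_id, path_a, path_b):
        # Declare two instances to be copies of the same message.
        group_a = self._group_of(message_id, path_a)
        group_b = self._group_of(message_id, path_b)
        if group_a is not group_b:
            group_a |= group_b
            groups = self.groups[message_id]
            groups[:] = [g for g in groups if g is not group_b]

    def split(self, message_id, path):
        # Turn one instance back into a message of its own.
        group = self._group_of(message_id, path)
        if len(group) > 1:
            group.remove(path)
            self.groups[message_id].append({path})
```

An external duplicate detector would then only need to call join() and
split(); notmuch itself would not have to know which algorithm made the
decision.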




However, if you want an algorithm that is likely to get rid of most
duplicates while keeping most non-duplicates separate, here's a quick
suggestion:


Just to clarify: The goal is to suppress most copies of the same message
while never suppressing a message that is actually different.  It
doesn't matter if a few duplicates make it through, but it is
imperative that no "real" message is dropped.

To check whether two instances are duplicates, I suspect something like
this algorithm would be "good enough":

- Message-Id must be the same.  This isn't actually necessary, but it
  makes sense to require it anyway.

- From and Date must be the same.  These form important context that may
  change the meaning of the message (e.g. "me too" depends heavily on
  From, and "let's meet tomorrow" depends heavily on Date).  (Are there
  more context-supplying headers we should worry about?)

- If Subject and body are also the same, the instances are duplicates.

- Otherwise, if neither of the messages comes from a mailing list,
  they're probably not duplicates.

- Otherwise, grab a few other (recent) mails from the same mailing list.
  If all the bodies end with the same text, ignore that text when
  comparing the bodies.

- For the Subject, again use a few other (recent) mails from the same
  mailing list for comparison.  But this time only look for one of the
  well-known common patterns (e.g. a "[listname]" tag).  If all the
  mails match the same pattern, ignore that pattern when comparing the
  Subject.

- For both of the above, it would be good to pick messages from
  different threads, to avoid accidental similarities.  I suspect this
  is more important for subjects than bodies, though.

- Also, leading and trailing whitespace should probably be dropped.

- (Some other transformations may make sense, such as reflowing text or
  converting between character sets.  In practice I doubt that will make
  much of a difference.)

- If the "canonicalized" body and Subject are the same, the messages are
  duplicates.  At least there's now pretty much no chance that there is
  anything interesting that will be missed by dropping one of the
  messages.
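
The steps above could be sketched roughly like this in Python.  This is
only an illustration under a few assumptions of mine: messages are plain
dicts with the keys "message_id", "from", "date", "subject", "body" and
"list_id" (None when the mail did not come via a list), the only subject
pattern handled is a bracketed "[listname]" tag, and all function names
are made up.

```python
import os
import re

# Stand-in for the "well-known" subject patterns: a bracketed tag such
# as "[notmuch]".  A real list of patterns would be longer.
LIST_TAG = re.compile(r"\[[^\]]+\]")


def common_footer(bodies):
    """Text that every sample body ends with, cut at a line boundary."""
    if len(bodies) < 2:
        return ""  # a single sample proves nothing
    reversed_bodies = [b.rstrip()[::-1] for b in bodies]
    suffix = os.path.commonprefix(reversed_bodies)[::-1]
    newline = suffix.find("\n")
    return suffix[newline:] if newline != -1 else ""


def common_tag(subjects):
    """The one tag shared by all sample subjects, or None."""
    tags = set()
    for subject in subjects:
        match = LIST_TAG.search(subject)
        if match is None:
            return None
        tags.add(match.group(0))
    return tags.pop() if len(tags) == 1 else None


def canonical_body(body, footer):
    body = body.strip()
    if footer and body.endswith(footer):
        body = body[: -len(footer)]
    return body.strip()


def canonical_subject(subject, tag):
    if tag:
        subject = subject.replace(tag, "", 1)
    # Also collapses inner whitespace left behind by the removed tag.
    return " ".join(subject.split())


def is_duplicate(a, b, list_samples=()):
    """list_samples: other recent mails from the same mailing list."""
    for key in ("message_id", "from", "date"):
        if a[key] != b[key]:
            return False
    if a["subject"] == b["subject"] and a["body"] == b["body"]:
        return True
    if a.get("list_id") is None and b.get("list_id") is None:
        return False
    footer = common_footer([m["body"] for m in list_samples])
    tag = common_tag([m["subject"] for m in list_samples])
    return (
        canonical_body(a["body"], footer) == canonical_body(b["body"], footer)
        and canonical_subject(a["subject"], tag)
        == canonical_subject(b["subject"], tag)
    )
```

The caller is expected to pick list_samples from different threads, as
suggested above.  With fewer than two samples no footer is stripped, so
the function errs on the side of keeping both copies.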


(I'm assuming that identifying mailing lists is usually
straightforward, e.g. using the List-Id header.)
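
For what it's worth, pulling out the List-Id could look something like
the following, using Python's standard email module.  The function name
is made up, and lower-casing the identifier is my own choice so that
comparisons ignore case.

```python
import email
import re


def list_id(raw_message):
    """Return the identifier from the List-Id header, or None."""
    msg = email.message_from_string(raw_message)
    header = msg.get("List-Id")
    if header is None:
        return None  # not (recognisably) from a mailing list
    # The identifier normally sits in angle brackets, optionally
    # preceded by a free-text description of the list.
    match = re.search(r"<([^<>]+)>", header)
    value = match.group(1) if match else header
    return value.strip().lower()
```

Two instances would count as coming from "the same mailing list" when
this returns the same non-None value for both.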

eirik