From: Eirik Byrkjeflot Anonsen
To: notmuch@notmuchmail.org
Subject: Re: Automatic suppression of non-duplicate messages
Date: Sun, 04 Nov 2012 11:06:18 +0100
Message-ID: <87wqy1u1gl.fsf@star.eba>
In-Reply-To: <87390qxvb4.fsf@maritornes.cs.unb.ca> (David Bremner's message of "Sat, 03 Nov 2012 16:53:19 -0400")
References: <87mwyz3s9d.fsf@star.eba> <87390qxvb4.fsf@maritornes.cs.unb.ca>
List-Id: "Use and development of the notmuch mail system."

David Bremner writes:

> Eirik Byrkjeflot Anonsen writes:
>
>> That's not what I see. If I search for a term that only appears in
>> one of the "copies", none of the copies are included in the search
>> result.
>
> The offending code is at line 1813 of lib/database.cc; the message is
> only indexed if the message-id is new.
>
> It might be sensible to move _notmuch_message_index_file into the other
> branch of the if, but even if that works fine, something more
> sophisticated is needed for the call to
> __notmuch_message_set_header_values; the invariant that each message has
> a single subject seems reasonable.

Hmm, depends. Assuming indexing is intended to be used for searching,
one might want to search for something that occurs in one subject but
not the other. In practice I doubt it matters.

> Offhand I'm not sure of a good method of automatically deciding what is
> the same message (with e.g. headers and footer text added by a mailing
> list).

I don't think the real problem here is the duplicate detection algorithm
itself. It is rather that notmuch forces a particular duplicate
detection algorithm on its users. Duplicate detection should really be
delegated to a different application, thus allowing people to experiment
with whatever algorithm works best for them. (Just like notmuch
delegates the choice of initial tags on messages to an external
application.)

But first notmuch must be modified so it can sensibly treat multiple
instances having the same message-id as separate messages. That seems to
me to be the hard part. (And some way for external applications to join
and split copies, of course.)

However, if you want an algorithm that is likely to get rid of most
duplicates while keeping most non-duplicates separate, here's a quick
suggestion:

Just to clarify: The goal is to suppress most copies of the same message
while not suppressing a single instance of a different message. It isn't
important if a few duplicate messages make it through, but it is
imperative that no "real" message is dropped.

To check whether two instances are duplicates, I suspect something like
this algorithm would be "good enough":

 - Message-Id must be the same. This isn't actually necessary, but it
   makes sense to require it anyway.

 - From and Date must be the same. These form important context that may
   change the meaning of the message (e.g. "me too" depends heavily on
   From, and "let's meet tomorrow" depends heavily on Date). (Are there
   more context-supplying headers we should worry about?)

 - If Subject and body are also the same, the instances are duplicates.

 - Otherwise, if neither of the messages comes from a mailing list,
   they're probably not duplicates.

 - Otherwise, grab a few other (recent) mails from the same mailing
   list. If all the bodies end with the same text, ignore that text when
   comparing the bodies.

 - For the Subject, again use a few other (recent) mails from the same
   mailing list for comparison. But this time only look for one of the
   well-known common patterns. If all the mails match the same pattern,
   ignore that pattern when comparing the Subject.

 - For both of the above, it would be good to pick messages from
   different threads, to avoid accidental similarities. I suspect this
   is more important for subjects than bodies, though.

 - Also, leading and trailing whitespace should probably be dropped.

 - (Some other transformations may make sense, such as reflowing text or
   converting between character sets. In practice I doubt that will make
   much of a difference.)

 - If the "canonicalized" body and Subject are the same, the messages
   are duplicates.
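
To make the steps above concrete, here is a rough sketch in Python using only the standard email package. It is not notmuch code and has not been tried on real mail. The function names, the "[listname]" subject pattern, and the rule of requiring at least two sample mails are all my own assumptions for illustration. It also only handles simple non-multipart bodies, and leaves picking samples from different threads to the caller.

```python
import re
from email.message import Message
from typing import Optional, Sequence

# Assumption: the only "well-known pattern" handled here is a "[listname]" tag.
LIST_TAG = re.compile(r"\[[^\]]+\]")


def body_text(msg: Message) -> str:
    """Return the payload of a simple, non-multipart message (else "")."""
    payload = msg.get_payload()
    return payload if isinstance(payload, str) else ""


def common_footer(samples: Sequence[Message]) -> str:
    """Trailing lines shared by every sample body (the likely list footer)."""
    reversed_lines = [body_text(m).rstrip().splitlines()[::-1] for m in samples]
    shared = []
    for lines in zip(*reversed_lines):
        if len(set(lines)) != 1:
            break
        shared.append(lines[0])
    return "\n".join(reversed(shared))


def common_subject_tag(samples: Sequence[Message]) -> Optional[str]:
    """The "[tag]" found in every sample subject, or None if they disagree."""
    tags = set()
    for m in samples:
        found = LIST_TAG.search(m.get("Subject") or "")
        tags.add(found.group(0) if found else None)
    return tags.pop() if len(tags) == 1 else None


def canonical_body(msg: Message, footer: str) -> str:
    body = body_text(msg).strip()
    if footer and body.endswith(footer):
        body = body[: -len(footer)]
    return body.strip()


def canonical_subject(msg: Message, tag: Optional[str]) -> str:
    subject = msg.get("Subject") or ""
    if tag:
        subject = subject.replace(tag, "", 1)
    return " ".join(subject.split())  # also drops leading/trailing whitespace


def is_duplicate(a: Message, b: Message,
                 list_samples: Sequence[Message] = ()) -> bool:
    """Decide whether a and b are copies of the same message.

    list_samples: a few other recent mails from the same list, ideally
    from different threads (choosing them is left to the caller).
    """
    # Message-Id, From and Date must all be present and equal.
    for header in ("Message-ID", "From", "Date"):
        va, vb = a.get(header), b.get(header)
        if va is None or vb is None or va.strip() != vb.strip():
            return False
    # Identical Subject and body: duplicates.
    if (a.get("Subject"), body_text(a)) == (b.get("Subject"), body_text(b)):
        return True
    # Neither copy came through a list: probably not duplicates.
    if a.get("List-Id") is None and b.get("List-Id") is None:
        return False
    # With fewer than two samples a "common" footer is meaningless.
    samples = list(list_samples) if len(list_samples) >= 2 else []
    footer = common_footer(samples)
    tag = common_subject_tag(samples)
    return (canonical_body(a, footer) == canonical_body(b, footer)
            and canonical_subject(a, tag) == canonical_subject(b, tag))
```

The sketch errs on the side of reporting "not a duplicate": without sample mails from the list nothing is stripped, so a list copy and a direct copy simply stay separate.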

At least there's now pretty much no chance that there is anything
interesting that will be missed by dropping one of the messages.

(I'm assuming that identifying mailing lists is usually straightforward,
e.g. using the List-Id header).

eirik