   1 Return-Path: <eirik@eirikba.org>\r
   2 X-Original-To: notmuch@notmuchmail.org\r
   3 Delivered-To: notmuch@notmuchmail.org\r
   4 Received: from localhost (localhost [127.0.0.1])\r
   5         by olra.theworths.org (Postfix) with ESMTP id 20B74431FAF\r
   6         for <notmuch@notmuchmail.org>; Sun,  4 Nov 2012 02:06:22 -0800 (PST)\r
   7 X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
   8 X-Spam-Flag: NO\r
   9 X-Spam-Score: -0.7\r
  10 X-Spam-Level: \r
  11 X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5\r
  12         tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled\r
  13 Received: from olra.theworths.org ([127.0.0.1])\r
  14         by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
  15         with ESMTP id nrU++yEQ6QzM for <notmuch@notmuchmail.org>;\r
  16         Sun,  4 Nov 2012 02:06:20 -0800 (PST)\r
  17 Received: from atmail.labs2.com (atmail.labs2.com [93.182.166.49])\r
  18         (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))\r
  19         (No client certificate requested)\r
  20         by olra.theworths.org (Postfix) with ESMTPS id 0B655431FAE\r
  21         for <notmuch@notmuchmail.org>; Sun,  4 Nov 2012 02:06:20 -0800 (PST)\r
  22 Received: from [178.74.1.248] (helo=star.eba)\r
  23         by atmail.labs2.com with esmtps (TLSv1:AES128-SHA:128) (Exim 4.77)\r
  24         (envelope-from <eirik@eirikba.org>) id 1TUx5q-0006GM-QA\r
  25         for notmuch@notmuchmail.org; Sun, 04 Nov 2012 11:06:14 +0100\r
  26 Received: from eirik by star.eba with local (Exim 4.80)\r
  27         (envelope-from <eirik@eirikba.org>) id 1TUx5u-00015r-34\r
  28         for notmuch@notmuchmail.org; Sun, 04 Nov 2012 11:06:18 +0100\r
  29 From: Eirik Byrkjeflot Anonsen <eirik@eirikba.org>\r
  30 To: notmuch@notmuchmail.org\r
  31 Subject: Re: Automatic suppression of non-duplicate messages\r
  32 References: <87mwyz3s9d.fsf@star.eba> <87390qxvb4.fsf@maritornes.cs.unb.ca>\r
  33 Date: Sun, 04 Nov 2012 11:06:18 +0100\r
  34 In-Reply-To: <87390qxvb4.fsf@maritornes.cs.unb.ca> (David Bremner's message of\r
  35         "Sat, 03 Nov 2012 16:53:19 -0400")\r
  36 Message-ID: <87wqy1u1gl.fsf@star.eba>\r
  37 User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)\r
  38 MIME-Version: 1.0\r
  39 Content-Type: text/plain; charset=us-ascii\r
  40 X-ACL-Warn: Authenticated as:  Sent as: eirik@eirikba.org\r
  41 X-BeenThere: notmuch@notmuchmail.org\r
  42 X-Mailman-Version: 2.1.13\r
  43 Precedence: list\r
  44 List-Id: "Use and development of the notmuch mail system."\r
  45         <notmuch.notmuchmail.org>\r
  46 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
  47         <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
  48 List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
  49 List-Post: <mailto:notmuch@notmuchmail.org>\r
  50 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
  51 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
  52         <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
  53 X-List-Received-Date: Sun, 04 Nov 2012 10:06:22 -0000\r
  54 \r
  55 David Bremner <david@tethera.net> writes:\r
  56 \r
  57 > Eirik Byrkjeflot Anonsen <eirik@eirikba.org> writes:\r
  58 >\r
  59 >> That's not what I see.  If I search for a term that only appears in\r
  60 >> one of the "copies", none of the copies are included in the search\r
  61 >> result.\r
  62 >\r
  63 > The offending code is at line 1813 of lib/database.cc; the message is\r
  64 > only indexed if the message-id is new.\r
  65 >\r
  66 > It might be sensible to move _notmuch_message_index_file into the other\r
  67 > branch of the if, but even if that works fine, something more\r
  68 > sophisticated is needed for the call to\r
  69 > __notmuch_message_set_header_values; the invariant that each message has\r
  70 > a single subject seems reasonable.\r
  71 \r
  72 Hmm, depends.  Assuming indexing is intended to be used for searching,\r
  73 one might want to search for something that occurs in one subject but\r
  74 not the other.  In practice I doubt it matters.\r
  75 \r
  76 \r
  77 > Offhand I'm not sure of a good method of automatically deciding what is\r
  78 > the same message (with e.g. headers and footer text added by a mailing\r
  79 > list).\r
  80 \r
  81 I don't think the real problem here is the duplicate detection algorithm\r
  82 itself.  It is rather that notmuch forces a particular duplicate\r
  83 detection algorithm on its users.  Duplicate detection should really be\r
  84 delegated to a different application, thus allowing people to experiment\r
  85 with whatever algorithm works best for them.  (Just like notmuch\r
  86 delegates the choice of initial tags on messages to an external\r
  87 application.)\r
  88 \r
  89 But first notmuch must be modified so it can sensibly treat multiple\r
  90 instances having the same message-id as separate messages.  That seems\r
  91 to me to be the hard part.  (And some way for external applications to\r
  92 join and split copies, of course.)\r
  93 \r
  94 \r
  95 \r
  96 \r
  97 However, if you want an algorithm that is likely to get rid of most\r
  98 duplicates while keeping most non-duplicates separate, here's a quick\r
  99 suggestion:\r
 100 \r
 101 \r
 102 Just to clarify: The goal is to suppress most copies of the same message\r
 103 while not suppressing a single instance of a different message.  It\r
 104 isn't important if a few duplicate messages make it through, but it is\r
 105 imperative that no "real" message is dropped.\r
 106 \r
 107 To check whether two instances are duplicates, I suspect something like\r
 108 this algorithm would be "good enough":\r
 109 \r
 110 - Message-Id must be the same.  This isn't actually necessary, but it\r
 111   makes sense to require it anyway.\r
 112 \r
 113 - From and Date must be the same.  These form important context that may\r
 114   change the meaning of the message (e.g. "me too" depends heavily on\r
 115   From, and "let's meet tomorrow" depends heavily on Date).  (Are there\r
 116   more context-supplying headers we should worry about?)\r
 117 \r
 118 - If Subject and body are also the same, the instances are duplicates.\r
 119 \r
 120 - Otherwise, if neither of the messages comes from a mailing list,\r
 121   they're probably not duplicates.\r
 122 \r
 123 - Otherwise, grab a few other (recent) mails from the same mailing list.\r
 124   If all the bodies end with the same text, ignore that text when\r
 125   comparing the bodies.\r
 126 \r
 127 - For the Subject, again use a few other (recent) mails from the same\r
 128   mailing list for comparison.  But this time only look for one of the\r
 129   well-known common patterns.  If all the mails match the same\r
 130   pattern, ignore that pattern when comparing the Subject.\r
 131 \r
 132 - For both of the above, it would be good to pick messages from\r
 133   different threads, to avoid accidental similarities.  I suspect this\r
 134   is more important for subjects than bodies, though.\r
 135 \r
 136 - Also, leading and trailing whitespace should probably be dropped.\r
 137 \r
 138 - (Some other transformations may make sense, such as reflowing text or\r
 139   converting between character sets.  In practice I doubt that will make\r
 140   much of a difference.)\r
 141 \r
 142 - If the "canonicalized" body and Subject are the same, the messages are\r
 143   duplicates.  At least there's now pretty much no chance that there is\r
 144   anything interesting that will be missed by dropping one of the\r
 145   messages.\r
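
The checks above could be sketched roughly like this in Python.  This is
only an illustration, not notmuch code: the dict keys, the function
names and the "[listname]" subject pattern are all made up for the
example, and cutting the shared trailer at a line boundary is an extra
assumption to reduce accidental matches.

```python
import os
import re

# Hypothetical "well-known pattern": a leading "[listname]" subject tag.
LIST_TAG = re.compile(r"\s*\[([^\]]+)\]\s*")


def common_trailer(bodies):
    """Return the text all sample bodies end with, cut at a line boundary."""
    if len(bodies) < 2:
        return ""  # one sample cannot tell a footer from the message itself
    reversed_bodies = [b.rstrip()[::-1] for b in bodies]
    trailer = os.path.commonprefix(reversed_bodies)[::-1]
    start = trailer.find("\n")
    return trailer[start:] if start >= 0 else ""


def canonical_body(body, samples):
    """Drop surrounding whitespace and the trailer shared by the samples."""
    body = body.strip()
    trailer = common_trailer(samples)
    if trailer and body.endswith(trailer):
        body = body[: -len(trailer)]
    return body.strip()


def canonical_subject(subject, samples):
    """Drop a list tag only if every sample subject carries the same tag."""
    subject = subject.strip()
    tags = set()
    for sample in samples:
        match = LIST_TAG.match(sample)
        if not match:
            return subject
        tags.add(match.group(1))
    if len(tags) == 1:
        match = LIST_TAG.match(subject)
        if match and match.group(1) in tags:
            subject = subject[match.end():]
    return subject.strip()


def is_duplicate(a, b, list_bodies=(), list_subjects=()):
    """a and b are dicts: message_id, from, date, subject, body, list_id."""
    if any(a[key] != b[key] for key in ("message_id", "from", "date")):
        return False
    if a["subject"] == b["subject"] and a["body"] == b["body"]:
        return True
    if not a.get("list_id") and not b.get("list_id"):
        return False  # neither copy went through a list
    same_body = (canonical_body(a["body"], list_bodies)
                 == canonical_body(b["body"], list_bodies))
    same_subject = (canonical_subject(a["subject"], list_subjects)
                    == canonical_subject(b["subject"], list_subjects))
    return same_body and same_subject
```

Here list_bodies and list_subjects stand for the "few other (recent)
mails" from the same list, ideally taken from different threads.  The
sketch errs on the side of keeping messages: anything it cannot
canonicalize to an exact match is treated as a separate message.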
 146 \r
 147 \r
 148 (I'm assuming that identifying mailing lists is usually\r
 149 straightforward, e.g. using the List-Id header).\r
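
For instance, a minimal sketch using Python's standard email module
(the function name list_id is hypothetical, and a message without a
List-Id header is simply assumed not to come from a list):

```python
import re
from email import message_from_string

# The identifier is the part in angle brackets at the end of the header;
# an optional description may precede it.
LIST_ID = re.compile(r"<([^<>]+)>\s*$")


def list_id(raw_message):
    """Return the list identifier from List-Id, or None if there is none."""
    value = message_from_string(raw_message).get("List-Id")
    if value is None:
        return None
    match = LIST_ID.search(value)
    return match.group(1) if match else value.strip()
```

For a message carrying this list's own List-Id header, this would
return "notmuch.notmuchmail.org".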
 150 \r
 151 eirik\r