c5/9bdd27dcfab98b5922adb659805e7abbcf5e44

   1 Return-Path: <james@jameswestby.net>\r
   2 X-Original-To: notmuch@notmuchmail.org\r
   3 Delivered-To: notmuch@notmuchmail.org\r
   4 Received: from localhost (localhost [127.0.0.1])\r
   5         by olra.theworths.org (Postfix) with ESMTP id C01E1431FC0\r
   6         for <notmuch@notmuchmail.org>; Fri, 18 Dec 2009 11:53:21 -0800 (PST)\r
   7 X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
   8 Received: from olra.theworths.org ([127.0.0.1])\r
   9         by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
  10         with ESMTP id dayfWdd0jCLY for <notmuch@notmuchmail.org>;\r
  11         Fri, 18 Dec 2009 11:53:20 -0800 (PST)\r
  12 X-Greylist: delayed 3052 seconds by postgrey-1.32 at olra;\r
  13         Fri, 18 Dec 2009 11:53:20 PST\r
  14 Received: from jameswestby.net (jameswestby.net [89.145.97.141])\r
  15         by olra.theworths.org (Postfix) with ESMTP id A71EA431FBF\r
  16         for <notmuch@notmuchmail.org>; Fri, 18 Dec 2009 11:53:20 -0800 (PST)\r
  17 Received: from cpc4-aztw22-2-0-cust59.aztw.cable.virginmedia.com\r
  18         ([94.169.116.60] helo=flash)\r
  19         by jameswestby.net with esmtpa (Exim 4.69)\r
  20         (envelope-from <james@jameswestby.net>)\r
  21         id 1NLit5-0005sl-3a; Fri, 18 Dec 2009 19:53:19 +0000\r
  22 Received: by flash (Postfix, from userid 1000)\r
  23         id B42AE6E546A; Fri, 18 Dec 2009 19:53:13 +0000 (GMT)\r
  24 From: James Westby <jw+debian@jameswestby.net>\r
  25 To: Carl Worth <cworth@cworth.org>, notmuch@notmuchmail.org\r
  26 In-Reply-To: <874onoysrl.fsf@yoom.home.cworth.org>\r
  27 References: <87oclwrtqa.fsf@jameswestby.net>\r
  28         <874onoysrl.fsf@yoom.home.cworth.org>\r
  29 Date: Fri, 18 Dec 2009 19:53:13 +0000\r
  30 Message-ID: <87my1grrdi.fsf@jameswestby.net>\r
  31 MIME-Version: 1.0\r
  32 Content-Type: text/plain; charset=us-ascii\r
  33 Subject: Re: [notmuch] Missing messages breaking threads\r
  34 X-BeenThere: notmuch@notmuchmail.org\r
  35 X-Mailman-Version: 2.1.12\r
  36 Precedence: list\r
  37 List-Id: "Use and development of the notmuch mail system."\r
  38         <notmuch.notmuchmail.org>\r
  39 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
  40         <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
  41 List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
  42 List-Post: <mailto:notmuch@notmuchmail.org>\r
  43 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
  44 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
  45         <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
  46 X-List-Received-Date: Fri, 18 Dec 2009 19:53:21 -0000\r
  47 \r
  48 On Fri, 18 Dec 2009 11:41:18 -0800, Carl Worth <cworth@cworth.org> wrote:\r
  49 > On Fri, 18 Dec 2009 19:02:21 +0000, James Westby <jw+debian@jameswestby.net> wrote:\r
  50 > > Therefore I'd like to fix this. The obvious way is to\r
  51 > > introduce documents in to the db for each id we see, and\r
  52 > > threading should then naturally work better.\r
  53 > \r
  54 > That sounds like a fine idea.\r
  55 \r
  56 Good, at least I'm not totally off the map.\r
  57  \r
  58 > > The only issue I see with doing this is with mail delays.\r
  59 > > Once we do this we will sometimes receive a message that\r
  60 > > already has a dummy document. What happens currently with\r
  61 > > message-id collisions?\r
  62 > \r
  63 > The current message-ID collision logic is pretty brain-dead. It just\r
  64 > says "Oh, I've seen a file with this message before, so I'll skip this\r
  65 > additional file".\r
  66 > \r
  67 > But I'm just putting the finishing touches on a patch that instead does:\r
  68 > \r
  69 >       Oh, and here's an additional filename for that message ID. Add\r
  70 >       that too, please.\r
  71 > \r
  72 > Beyond that, all we would need to do as well is to also index the new\r
  73 > content. I don't want to do useless re-indexing when files just get\r
  74 > renamed. So maybe all we need to do is to save the filesize of the\r
  75 > last-indexed file for a document and then when we encounter a file with\r
  76 > the same message ID and a larger file size, then index it as well?\r
  77 \r
  78 I would say different file size, but I imagine larger is the majority\r
  79 of interesting cases.\r
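
A minimal sketch of the size-based collision check discussed above, comparing for a different size rather than a larger one. The `Doc` class, the dict-backed `db`, and `add_file` are hypothetical stand-ins for illustration, not notmuch's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the single document kept per message ID."""
    filenames: list = field(default_factory=list)
    indexed_size: int = 0  # size of the file whose content was last indexed


def add_file(db, message_id, filename, size):
    """Record a file for message_id; return True if its content needs indexing."""
    doc = db.get(message_id)
    if doc is None:
        # First file seen for this message ID: always index it.
        db[message_id] = Doc([filename], size)
        return True
    if filename not in doc.filenames:
        # Collision: remember the additional filename for the same message.
        doc.filenames.append(filename)
    if size != doc.indexed_size:
        # A different size suggests new content (e.g. list-added footers).
        doc.indexed_size = size
        return True
    # Same size: most likely a rename or an identical copy, so skip re-indexing.
    return False
```

With this shape, a plain rename (same message ID, same size) costs only a filename update, while a copy that arrived via the list with extra content is indexed again.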
  80 \r
  81 > That would even take care of providing the opportunity to index\r
  82 > additional mailing-list-added content for messages also sent directly\r
  83 > via CC.\r
  84 > \r
  85 > The file-size heuristic wouldn't be perfect for these other cases. I\r
  86 > guess we save a list of sha-1 sums for indexed files or so, (assuming\r
  87 > that's cheaper than just re-indexing---before the Xapian Defect 250 fix\r
  88 > I'm sure it is, but after I'm not sure---we maybe should just always\r
  89 > re-index---but I think I have seen the TermGenerator appear in profiles\r
  90 > of indexing runs.)\r
  91 \r
   92 I'm not sure this is really needed, but it would obviously be\r
   93 correct.\r
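
For comparison, the SHA-1 variant could look roughly like the sketch below. It assumes the set of digests is kept per message ID somewhere in the database; `should_index` and `seen_digests` are illustrative names only. Whether hashing is actually cheaper than re-indexing is the open question raised in the quoted text, and this sketch does not settle it.

```python
import hashlib


def should_index(seen_digests, content):
    """Return True if this exact file content has not been indexed before.

    seen_digests is a set of hex SHA-1 digests of already-indexed files
    (hypothetical storage; a real database would persist it per document).
    """
    digest = hashlib.sha1(content).hexdigest()
    if digest in seen_digests:
        return False  # byte-identical copy, nothing new to index
    seen_digests.add(digest)
    return True
```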
  94 \r
  95 On Xapian 250, I have a very slow spinning disk, and it was hitting\r
  96 me hard, making processing my inbox far too slow. I built Xapian SVN\r
  97 with the patch from the bug and it is now lightning fast, so\r
  98 consider this another endorsement. I also tried the supplemental\r
  99 patch and it showed no further improvement for notmuch tag.\r
 100 \r
 101 > >   * When we get a message-id conflict check for dummy:True\r
 102 > >     and replace the document if it is there.\r
 103 > > \r
 104 > > How does this sound?\r
 105 > \r
 106 > That sounds fine. It's the same as what I propose above with\r
 107 > "filesize:0" instead of "dummy:true".\r
 108 \r
 109 That works. However, we would want the old content to go away\r
  110 in these cases, wouldn't we?\r
 111 \r
 112 Or do we not index whatever dummy text we add? Or do we not\r
 113 even put it in? Or not even show it at all? I was just thinking\r
 114 of having "Missing messages..." showing up as the start of\r
  115 the thread, but maybe it's not needed.\r
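
One way to picture the "filesize:0" placeholder idea is sketched below, under the assumption that a placeholder carries no indexed text at all, so there is no old content to remove when the real message arrives. The dict layout and the function names are hypothetical.

```python
def add_placeholder(db, message_id, thread_id):
    """Create an empty document for a referenced but unseen message ID."""
    # setdefault leaves an already-known message (real or placeholder) untouched.
    db.setdefault(message_id, {"filesize": 0, "body": "", "thread": thread_id})


def add_message(db, message_id, body, thread_id):
    """Add a real message, replacing a placeholder if one exists."""
    existing = db.get(message_id)
    if existing is None:
        db[message_id] = {"filesize": len(body), "body": body, "thread": thread_id}
        return "added"
    if existing["filesize"] == 0:
        # Late arrival: fill in the placeholder but keep its thread,
        # since replies may already have been attached to it.
        existing["filesize"] = len(body)
        existing["body"] = body
        return "replaced"
    return "collision"  # a real document already exists for this ID
```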
 116 \r
 117 > > There could be an issue with synthesising too many threads\r
 118 > > and then ending up having to try and put a message in two\r
 119 > > threads? I see there is code for merging threads, would that\r
 120 > > handle this?\r
 121 > \r
 122 > It should, yes.\r
 123 > \r
 124 > The current logic is that a message can only appear in a single\r
 125 > thread. So if a message has children or parents with distinct thread IDs\r
 126 > then those threads are merged.\r
 127 > \r
 128 > I can imagine some strange cross-posting scenario where one could argue\r
 129 > that the merging shouldn't happen, but I'm not sure we want to try to\r
 130 > respect that.\r
 131 \r
 132 Fair enough.\r
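
The merging rule quoted above (a message belongs to exactly one thread, so distinct thread IDs among its parents and children get merged) can be sketched with a small union-find structure. This is an illustration of the rule only, not the existing merge code; `find` and `merge_threads` are hypothetical names.

```python
def find(parent, thread_id):
    """Follow merge links to the surviving thread ID."""
    parent.setdefault(thread_id, thread_id)
    while parent[thread_id] != thread_id:
        thread_id = parent[thread_id]
    return thread_id


def merge_threads(parent, thread_ids):
    """Merge every thread touched by one message into a single thread.

    thread_ids holds the thread IDs of the message's known parents and
    children. Returns the surviving ID, or None if the list is empty.
    """
    roots = sorted({find(parent, t) for t in thread_ids})
    if not roots:
        return None
    for root in roots[1:]:
        parent[root] = roots[0]  # redirect the other threads to the survivor
    return roots[0]
```

Synthesising a placeholder for every referenced ID may create several provisional threads, but each later message that references two of them collapses them into one.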
 133 \r
 134 So, to summarise, I should first look at storing filesizes, then\r
 135 the collision code to make it index further when the filesize grows,\r
 136 and then finally the code to add documents for missing messages?\r
 137 \r
  138 The only thing I am unclear on is how to handle existing databases.\r
 139 Do we have any concept of versioning? Or should I just assume that\r
 140 filesize: may not be in the document and act appropriately?\r
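
The second option, tolerating documents that predate the filesize field, might look like the sketch below. It assumes a missing value is treated as "unknown" and therefore triggers one re-index, after which the size is stored. `needs_reindex` and the dict-shaped document are hypothetical.

```python
def needs_reindex(doc, new_size):
    """Decide whether a colliding file should be indexed.

    doc is a dict-like document; documents written by an older database
    version are assumed to have no "filesize" key at all.
    """
    stored = doc.get("filesize")
    if stored is None:
        return True  # old document: size unknown, index once to be safe
    return new_size != stored
```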
 141 \r
 142 Thanks,\r
 143 \r
 144 James\r
 145 \r