From: James Westby <jw+debian@jameswestby.net>
To: Carl Worth <cworth@cworth.org>, notmuch@notmuchmail.org
In-Reply-To: <874onoysrl.fsf@yoom.home.cworth.org>
References: <87oclwrtqa.fsf@jameswestby.net>
	<874onoysrl.fsf@yoom.home.cworth.org>
Date: Fri, 18 Dec 2009 19:53:13 +0000
Message-ID: <87my1grrdi.fsf@jameswestby.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Subject: Re: [notmuch] Missing messages breaking threads
Precedence: list

On Fri, 18 Dec 2009 11:41:18 -0800, Carl Worth <cworth@cworth.org> wrote:
> On Fri, 18 Dec 2009 19:02:21 +0000, James Westby <jw+debian@jameswestby.net> wrote:
> > Therefore I'd like to fix this. The obvious way is to
> > introduce documents in to the db for each id we see, and
> > threading should then naturally work better.
> 
> That sounds like a fine idea.

Good, at least I'm not totally off the map.
 
> > The only issue I see with doing this is with mail delays.
> > Once we do this we will sometimes receive a message that
> > already has a dummy document. What happens currently with
> > message-id collisions?
> 
> The current message-ID collision logic is pretty brain-dead. It just
> says "Oh, I've seen a file with this message before, so I'll skip this
> additional file".
> 
> But I'm just putting the finishing touches on a patch that instead does:
> 
> 	Oh, and here's an additional filename for that message ID. Add
> 	that too, please.
> 
> Beyond that, all we would need to do as well is to also index the new
> content. I don't want to do useless re-indexing when files just get
> renamed. So maybe all we need to do is to save the filesize of the
> last-indexed file for a document and then when we encounter a file with
> the same message ID and a larger file size, then index it as well?

I would say different file size, but I imagine larger is the majority
of interesting cases.

> That would even take care of providing the opportunity to index
> additional mailing-list-added content for messages also sent directly
> via CC.
> 
> The file-size heuristic wouldn't be perfect for these other cases. I
> guess we save a list of sha-1 sums for indexed files or so, (assuming
> that's cheaper than just re-indexing---before the Xapian Defect 250 fix
> I'm sure it is, but after I'm not sure---we maybe should just always
> re-index---but I think I have seen the TermGenerator appear in profiles
> of indexing runs.)

I'm not sure this is needed too much, but would obviously be
correct.

On Xapian 250, I have a very slow spinning disk, and it was hitting
me hard, making processing my inbox far too slow. I built Xapian SVN
with the patch from the bug and it is now lightning fast, so
consider this another endorsement. I also tried the supplemental
patch and it showed no further improvement for notmuch tag.

> >   * When we get a message-id conflict check for dummy:True
> >     and replace the document if it is there.
> > 
> > How does this sound?
> 
> That sounds fine. It's the same as what I propose above with
> "filesize:0" instead of "dummy:true".

That works. However, we would want the old content to go away
in these cases wouldn't we.

Or do we not index whatever dummy text we add? Or do we not
even put it in? Or not even show it at all? I was just thinking
of having "Missing messages..." showing up as the start of
the thread, but maybe it's no needed.

> > There could be an issue with synthesising too many threads
> > and then ending up having to try and put a message in two
> > threads? I see there is code for merging threads, would that
> > handle this?
> 
> It should, yes.
> 
> The current logic is that a message can only appear in a single
> thread. So if a message has children or parents with distinct thread IDs
> then those threads are merged.
> 
> I can imagine some strange cross-posting scenario where one could argue
> that the merging shouldn't happen, but I'm not sure we want to try to
> respect that.

Fair enough.

So, to summarise, I should first look at storing filesizes, then
the collision code to make it index further when the filesize grows,
and then finally the code to add documents for missing messages?

The only thing I am unclear on is how to handle existing databases?
Do we have any concept of versioning? Or should I just assume that
filesize: may not be in the document and act appropriately?

Thanks,

James