From: Carl Worth <cworth@cworth.org>
To: Brett Viren <brett.viren@gmail.com>
In-Reply-To: <46263c600911211436s5826015eqc5fc18a4164245cb@mail.gmail.com>
References: <20091121145111.GB19397@excalibur.local>
	<87fx874xj5.fsf@yoom.home.cworth.org>
	<46263c600911211436s5826015eqc5fc18a4164245cb@mail.gmail.com>
Date: Sun, 22 Nov 2009 04:28:18 +0100
Message-ID: <87hbsn2q7h.fsf@yoom.home.cworth.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: notmuch@notmuchmail.org
Subject: Re: [notmuch] 25 minutes load time with emacs -f notmuch
Precedence: list

On Sat, 21 Nov 2009 17:36:18 -0500, Brett Viren <brett.viren@gmail.com> wrote:
> Processed 130871 total files in 38m 7s (57 files/sec.).
> Added 102723 new messages to the database (not much, really).

Just be glad that you have so little mail. ;-)

> This was ~2GB of mail on a 2.5GHz CPU.  That seems pretty reasonable
> to me but I'd like to rerun the "notmuch new" under google perftools
> to see if there are any obvious bottlenecks that might be cleaned up.

To me, here are the obvious things to fix after looking at a profile:

  1. We're spending a *lot* of time searching in the Xapian database.

But our initial indexing operation should only be *writing* data into
the database, so what's this searching about?

Well, at each new message, we're looking up the ID from its In-Reply-To
header to find a thread-ID to link to, and then we're looking up all of
the IDs from its References header to find thread IDs that need to be
merged with ours. So both parent and child lookups.

And since those are taking a bunch of time, I think it might make sense
to just keep a hashtable mapping message-ID -> thread-ID and do lookups
in that, (should have plenty of memory on current machines even with
lots of mail).
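
As a rough illustration of the hashtable idea, here is a minimal sketch
in Python (used only for brevity; notmuch itself is C). The ThreadMap
class, its method names, and the integer thread IDs are all hypothetical
and not notmuch code. It also assumes that an ID seen only in an
In-Reply-To or References header gets recorded in the table, so that a
parent arriving later lands in the same thread.

```python
import itertools


class ThreadMap:
    """Hypothetical in-memory map from message-ID to thread-ID."""

    def __init__(self):
        self.thread_of = {}              # message-ID -> thread-ID
        self.members = {}                # thread-ID -> set of message-IDs
        self._next = itertools.count(1)  # source of fresh thread IDs

    def _merge(self, keep, drop):
        # Move every message of thread `drop` into thread `keep`.
        for mid in self.members.pop(drop):
            self.thread_of[mid] = keep
            self.members[keep].add(mid)

    def add(self, message_id, in_reply_to=None, references=()):
        # All IDs that must end up in one thread with this message.
        related = [message_id]
        if in_reply_to:
            related.append(in_reply_to)
        related.extend(references)

        # Distinct thread IDs already known for any of them (dict lookups
        # only, no database search).
        known = []
        for mid in related:
            tid = self.thread_of.get(mid)
            if tid is not None and tid not in known:
                known.append(tid)

        if known:
            tid = known[0]
        else:
            tid = next(self._next)
            self.members[tid] = set()

        # Merge any other threads we touched into the one we keep.
        for other in known[1:]:
            self._merge(tid, other)

        for mid in related:
            self.thread_of[mid] = tid
            self.members[tid].add(mid)
        return tid
```

With this, adding "a" and "b" as unrelated messages and then "c" with
In-Reply-To "a" and References "b" leaves all three in a single thread,
without any search in the database.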

  2. We're hitting the slow Xapian document updates for thread-ID
  merging.

Whenever we find a child that is already in the database under a
different thread ID, we simply want to set its thread ID to ours. But
as we've talked about recently, Xapian has a bug (defect 250) that makes
it much more expensive than it should be to update a single term.

So, we could do a first pass over the messages to find all their thread
IDs and get them to settle down before doing any indexing in a separate,
second pass.
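
The two-pass idea could look roughly like the sketch below (again in
Python for brevity, not notmuch's C). Everything here is an assumption
made for illustration: the function names, the (message_id,
in_reply_to, references) tuples, the write_document callback standing
in for the actual indexing, and the use of a union-find structure to
let thread IDs settle. The thread ID is simply the message-ID of
whichever message ends up as the representative.

```python
def resolve_threads(messages):
    """Pass 1: settle thread IDs without touching the database.

    `messages` is a list of (message_id, in_reply_to, references).
    Returns a dict mapping each message-ID to its final thread ID.
    """
    parent = {}  # union-find forest over message-IDs

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point everything on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for mid, in_reply_to, references in messages:
        find(mid)
        linked = ([in_reply_to] if in_reply_to else []) + list(references)
        for other in linked:
            union(mid, other)

    return {mid: find(mid) for mid, _, _ in messages}


def index_all(messages, write_document):
    """Pass 2: write each message exactly once, with its final thread ID."""
    thread_ids = resolve_threads(messages)
    for mid, _, _ in messages:
        write_document(mid, thread_ids[mid])
```

Since every document is written once with a thread ID that no longer
changes, the expensive single-term update never happens during the
initial index.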

The fix for (2) should help even if we don't do the fix for (1), but
clearly we can do both.

It would be great if anyone wants to take a look at either or both of
these, otherwise I will when I can.

-Carl