From: Carl Worth <cworth@cworth.org>
To: Jan Janak <jan@ryngle.com>, Not Much Mail <notmuch@notmuchmail.org>
In-Reply-To: <f35dbb950911211437q34923ee8w14b1ef65a204b09f@mail.gmail.com>
References: <f35dbb950911211437q34923ee8w14b1ef65a204b09f@mail.gmail.com>
Date: Sun, 22 Nov 2009 05:21:52 +0100
Message-ID: <878wdz2nq7.fsf@yoom.home.cworth.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Subject: Re: [notmuch] RFC: Multiple filenames for email messages
Precedence: list

On Sat, 21 Nov 2009 23:37:24 +0100, Jan Janak <jan@ryngle.com> wrote:
> The comment of _notmuch_message_set_filename says:
> 
>    XXX: We should still figure out if we think it's important to store
>    multiple filenames for email messages with identical message IDs.
...
> I'd like to propose that we store all filenames for email messages in
> the database, not just one per message. I'd be happy to work on it and
> submit a patch if others think that this would be good to have.

Oh, sure. As soon as we start using filenames for searches, that makes
a lot of sense.

Currently, notmuch isn't storing the filename as search terms at all,
but it should be. That just needs a prefix added to the table at the
top of lib/database.cc, documentation for it, and an indexing stage that
generates terms from the filename with that prefix.

The term generator and query parser should do the right thing, which is
to split the filename into individual terms at each '/', store position
data with each, and then turn a search like:

	filename:some/filename/segment

into a phrase search that looks for the terms "some", "filename", and
"segment", each with the filename prefix you choose and each in
sequential position. Note that if you compile notmuch with CFLAGS
including -DDEBUG then you'll see a nice report of the post-parsed query
that's useful for debugging stuff like this.
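
To make the term splitting and the phrase search concrete, here is a
rough sketch in Python. It is a toy model, not the real Xapian term
generator or query parser. The prefix "XFILENAME" and the helper names
filename_terms and phrase_match are made up for illustration; the actual
prefix would be whatever gets added to the table in lib/database.cc.

```python
# Illustrative only: a toy model of prefixed, positional filename terms.
# PREFIX and both helper names are hypothetical, not notmuch or Xapian API.
PREFIX = "XFILENAME"


def filename_terms(filename):
    """Split a filename at each '/' and return (term, position) pairs."""
    segments = [s for s in filename.split("/") if s]
    return [(PREFIX + seg, pos) for pos, seg in enumerate(segments, start=1)]


def phrase_match(indexed, query):
    """Return True if the query's segments occur at sequential positions."""
    wanted = [term for term, _ in filename_terms(query)]
    if not wanted:
        return False
    # Map each indexed term to the set of positions where it occurs.
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, set()).add(pos)
    # Try every position of the first term as the start of the phrase.
    for start in positions.get(wanted[0], ()):
        if all(start + i in positions.get(t, ()) for i, t in enumerate(wanted)):
            return True
    return False


indexed = filename_terms("/home/user/mail/some/filename/segment/cur/1234.msg")
print(phrase_match(indexed, "some/filename/segment"))  # True
print(phrase_match(indexed, "filename/some"))          # False
```

The second query fails because both terms are present but not in
sequential order, which is the point of storing position data.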

My comment was about the other use of the filename, (which is the only
one we have at the moment): querying the database for a message's
actual filename, rather than searching on it. For this, we don't use
terms, but instead use the "data" field of the document. I was wondering
whether, when presenting an email message, it would ever be important to
have access to the multiple files.

Can anyone think of a case where they would need that? That is, a case
where you care about the distinct content of two messages that have the
same message ID?

I suppose that in the case of getting a message by two paths, (say
through a mailing list and also via CC), one might want to inspect the
different headers in the two versions. So maybe we'll need to break down
and provide this information to the interfaces.

Also, if we're going to support file deletion well, then I suppose we
really will need to store all the filenames, (so if one disappears we
can still point to the others). We'll also need to update the filename
terms accurately when one of those files disappears, and that means
keeping all of the complete filenames around.

So I guess I'm convincing myself that we really should store all the
filenames, and also provide an interface to get a list of filenames for
a message, (but also expect that many users of the API will only want to
look at the first filename in the list).
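
Roughly, I'm imagining bookkeeping along these lines. This is a Python
sketch only; the Message class and its method names are invented here
and are not the notmuch library API.

```python
# Illustrative only: a toy record keeping every filename for one message ID.
# The class and method names are hypothetical, not the notmuch library API.
class Message:
    def __init__(self, message_id):
        self.message_id = message_id
        self.filenames = []  # all known files, in the order they were added

    def add_filename(self, filename):
        if filename not in self.filenames:
            self.filenames.append(filename)

    def remove_filename(self, filename):
        """Forget one file; return True if the message still has a file."""
        if filename in self.filenames:
            self.filenames.remove(filename)
        return bool(self.filenames)

    def get_filenames(self):
        return list(self.filenames)

    def get_filename(self):
        """Convenience for callers that only want the first filename."""
        return self.filenames[0] if self.filenames else None

    def filename_terms(self):
        """Segments to index, rebuilt from the filenames that remain."""
        terms = set()
        for filename in self.filenames:
            terms.update(seg for seg in filename.split("/") if seg)
        return terms


msg = Message("some-id@example.org")
msg.add_filename("mail/list/cur/1")
msg.add_filename("mail/inbox/cur/2")
msg.remove_filename("mail/list/cur/1")
print(msg.get_filename())            # mail/inbox/cur/2
print(sorted(msg.filename_terms()))  # ['2', 'cur', 'inbox', 'mail']
```

Rebuilding the terms from the remaining filenames is why the complete
filenames have to be kept: a segment such as "mail" can only be dropped
once no remaining file uses it.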

-Carl