From: Carl Worth
To: Jan Janak, Not Much Mail
Date: Sun, 22 Nov 2009 05:21:52 +0100
Message-ID: <878wdz2nq7.fsf@yoom.home.cworth.org>
Subject: Re: [notmuch] RFC: Multiple filenames for email messages
List-Id: "Use and development of the notmuch mail system."

On Sat, 21 Nov 2009 23:37:24 +0100, Jan Janak wrote:
> The comment of _notmuch_message_set_filename says:
>
>   XXX: We should still figure out if we think it's important to store
>   multiple filenames for email messages with identical message IDs.
...
> I'd like to propose that we store all filenames for email messages in
> the database, not just one per message. I'd be happy to work on it and
> submit a patch if others think that this would be good to have.

Oh, sure. As soon as we start using filenames for searches, then that
makes a lot of sense.

Currently, notmuch isn't storing any filename that way, but should be,
(need to just add a prefix to the table at the top of lib/database.cc,
document it, and then make the indexing stage generate terms from the
filename with that prefix).

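As a rough illustration of that indexing step, here is a minimal,
self-contained sketch that does not use Xapian at all. It assumes the
filename is split at each '/' and that every segment becomes one
prefixed term with a sequential position. The prefix "XFILENAME" and
the names PositionedTerm, filename_terms and phrase_matches are
hypothetical, invented for this example; the real prefix would be
whatever gets added to the table in lib/database.cc, and the real
term generator may normalize or split terms differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One indexed term plus its position within the filename.
struct PositionedTerm {
    std::string term;
    unsigned position;
};

// Hypothetical prefix, chosen only for this sketch.
static const std::string FILENAME_PREFIX = "XFILENAME";

// Split a filename at each '/' and emit one prefixed term per
// non-empty segment, numbering the segments sequentially from 0.
std::vector<PositionedTerm> filename_terms(const std::string &filename)
{
    std::vector<PositionedTerm> terms;
    unsigned position = 0;
    std::size_t start = 0;

    while (start <= filename.size()) {
        std::size_t end = filename.find('/', start);
        if (end == std::string::npos)
            end = filename.size();
        if (end > start) {
            terms.push_back({FILENAME_PREFIX +
                             filename.substr(start, end - start),
                             position++});
        }
        start = end + 1;
    }
    return terms;
}

// Naive phrase check: do the phrase terms occur in the document at
// consecutive positions? Document positions are sequential by
// construction, so adjacent vector entries are adjacent positions.
bool phrase_matches(const std::vector<PositionedTerm> &doc,
                    const std::vector<PositionedTerm> &phrase)
{
    if (phrase.empty() || phrase.size() > doc.size())
        return false;
    for (std::size_t i = 0; i + phrase.size() <= doc.size(); i++) {
        std::size_t j = 0;
        while (j < phrase.size() && doc[i + j].term == phrase[j].term)
            j++;
        if (j == phrase.size())
            return true;
    }
    return false;
}
```

With this sketch, the terms for "some/filename/segment" match as a
phrase against the terms for a longer path that contains those three
segments in a row, while "some/segment" does not, because those two
terms are not adjacent in the longer path.
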
The term generator and query parser should do the right thing, which is
to split the filename into individual terms at each '/', store position
data with each, and then turn a search like:

	filename:some/filename/segment

into a phrase search that looks for the terms "some", "filename", and
"segment", each with the filename prefix you choose and each in
sequential position.

Note that if you compile notmuch with CFLAGS including -DDEBUG then
you'll see a nice report of the post-parsed query that's useful for
debugging stuff like this.

The reason for my comment was related to the other use of the filename,
(that is, the only one we're currently using). This is with regard to
querying the database for the actual filename, rather than searching on
it. For this, we don't use terms, but instead use the "data" field of
the document.

I was wondering if in the presentation of an email message it would
ever be important to have access to the multiple files. Can anyone
think of a case where they would need that? That is, a case where you
care about the distinct content of two messages that have the same
message ID?

I suppose that in the case of getting a message by two paths, (say
through a mailing list and also via CC), one might want to inspect the
different headers in the two versions. So maybe we'll need to break
down and provide this information to the interfaces.

Also, if we're going to support file deletion well, then I suppose we
really will need to store all the filenames, (so if one disappears we
can still point to the others). Also, we'll need to be able to
accurately update the filename terms when a message disappears, so that
means having all of the complete filenames around.

So I guess I'm convincing myself that we really should store all the
filenames, and also provide an interface to get a list of filenames for
a message, (but also expect that many users of the API will only want
to look at the first filename in the list).

-Carl