From: David Bremner <david@tethera.net>
To: Daniel Kahn Gillmor <dkg@fifthhorseman.net>, notmuch@notmuchmail.org
Subject: Re: [RFC2 Patch 5/5] lib: iterator API for message properties
In-Reply-To: <87eg8ht2sb.fsf@alice.fifthhorseman.net>
References: <1463927339-5441-1-git-send-email-david@tethera.net>
 <1464608999-14774-1-git-send-email-david@tethera.net>
 <1464608999-14774-6-git-send-email-david@tethera.net>
 <8760tthfuy.fsf@zancas.localnet> <87pos1u14p.fsf@alice.fifthhorseman.net>
 <87eg8ht2sb.fsf@alice.fifthhorseman.net>
User-Agent: Notmuch/0.22+28~gb9bf3f4 (http://notmuchmail.org) Emacs/24.5.1
 (x86_64-pc-linux-gnu)
Date: Wed, 01 Jun 2016 20:29:59 -0300
Message-ID: <87lh2ofpxk.fsf@zancas.localnet>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: list

Daniel Kahn Gillmor <dkg@fifthhorseman.net> writes:

> On Tue 2016-05-31 21:52:06 -0400, Daniel Kahn Gillmor wrote:
>
> do we actually need this abstraction?  If we're aiming to build specific
> new features (the two i'm thinking of are cryptographic-session-keys and
> reference-adjustments), couldn't we implement those features explicitly
> in xapian with their own special prefix, rather than treating them as a
> generic "property"?

Sure, it's certainly possible.

I guess if you don't care about the possibility of iterating all pairs
with given key prefix (which I admit makes more sense for the config
API), then the code could be simplified to look more like the tag list
handling code.  C is pretty crap at generics, but I guess looking at
tags.c, it's really about iterators for notmuch_string_list_t. So it
could probably be generalized to serve here.

For each such prefix, one would need to roughly duplicate patches 1/5
and 3/5.  It took me a little while to figure 1/5 out, but now that I
know, it would be less trouble.  I guess my thinking here was that I
would provide a low level interface that people using the C API or
bindings could use without hacking xapian.

> We already have a bit of an uncomfortable fit with tags and special
> flags (encrypted, signed, attachment, etc), where some are expected to
> be set and cleared automagically and some are expected to be manipulated
> directly by the user.  Are we setting ourselves up for more of the same,
> or is there a principled way that a user can know which properties it's
> kosher for them to set and clear, and which ones they should leave
> alone?

XPROPERTY is an internal prefix, which means it isn't added to the query
parser.  As it happens, I didn't plan on CLI access to these terms
either. Both of those choices are tradeoffs to say that these are
internal metadata, suitable for manipulation by programs. Such programs
could be scripts using python or ruby.

> If we add new specific features, we could potentially augment the dump
> format explicitly for them, without having the property abstraction.

We could, but I think should change the dump format quite rarely, since
we risk breaking people's scripts. So if we did it for one prefix, I'd
like to do in an extensible way so that adding new prefixes is somewhat
transparent. It also means some duplication of effort/code in notmuch
dump/restore to dump/restore each new prefix.

It's probably true that per-prefix dump format would be more compact,
since the keys would be implicit, rather than repeated for every pair.

> We already have some explicit features for each message (subject,
> from, to, attachment, mimetype, thread id, etc), and most of them are
> derived from the message itself, with the hope that it could be
> re-derived given just the message body.  Is there a distinction
> between properties that can be derived from the message body and
> properties that need to be additionally derived from some other data?

As Tomi always says, naming is the hardest thing; properties is a bit
generic. I'm not sure the distinction you make between the "message" and
the "message body" here. I think most of our derived terms are from the
message header.  My intent here is that "properties" are used for things
that cannot be derived from the message (header or body).

TL;DR:

     - per prefix requires new code in the library and dump/restore
       for every prefix
     + the dump format might be more compact if done in a per prefix way.
     + this code would be simpler than the generic properties code,
       mainly because it would not need key value pairs,
     - the library and dump/restore are parts of notmuch that have the
       potential to "break the world".  Not too many people are
       comfortable hacking on them.
     - changing the dump format is something like an ABI change for
       people whose scripts rely on dump / restore.