From: Jani Nikula <jani@nikula.org>
To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org
Subject: Re: [PATCH v2 2/7] lib: add a date/time parser module
In-Reply-To: <877gtdmqol.fsf@zancas.localnet>
References: <cover.1344065790.git.jani@nikula.org>
	<133d16fa9b63e4cd91aa2b8816a6ad7285b3bd4c.1344065790.git.jani@nikula.org>
	<877gtdmqol.fsf@zancas.localnet>
User-Agent: Notmuch/0.13.2+125~ga3f01dd (http://notmuchmail.org) Emacs/23.3.1
	(i686-pc-linux-gnu)
Date: Mon, 06 Aug 2012 00:43:18 +0300
Message-ID: <87628xnhft.fsf@nikula.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Precedence: list


Hi David, thanks for the review!

On Sun, 05 Aug 2012, David Bremner <david@tethera.net> wrote:
> Jani Nikula <jani@nikula.org> writes:
>
>> +
>> +static enum field
>> +abs_to_rel_field (enum field field)
>> +{
>> +    assert (field <= TM_ABS_YEAR);
>> +
>> +    /* note: depends on the enum ordering */
>> +    return field + (TM_REL_SEC - TM_ABS_SEC);
>> +}
>> +
>
> I wonder if this would be slightly nicer of you defined a TM_FIRST_REL
> or so as a synonym like TM_NONE and TM_SIZE

Good idea.

>> +/* get zero value for field */
>> +static int
>> +field_zero (enum field field)
>> +{
>> +    if (field == TM_ABS_MDAY || field == TM_ABS_MON)
>> +	return 1;
>> +    else if (field == TM_ABS_YEAR)
>> +	return 1970;
>> +    else
>> +	return 0;
>> +}
>
> what do you think about using the word "epoch" instead of zero here?

As a non-native speaker, I'll just take your word for it if you think it
would be better. :)

>> +static bool
>> +get_postponed_number (struct state *state, int *v, int *n, char *d)
>> +{
>
> I found the 1 letter names not quite obvious here.

True. I think v and n are used fairly consistently throughout the source
file, so I'll have to consider longer names vs. documentation comment
for them.

> At this point reading the code, I have not trouble understanding each
> line/function, but I feel like I'm missing the big picture a bit. 

Yeah, perhaps the code reads better if you follow from the top level
parse_time_string() down. Which is at the very end. A "big picture"
documentation comment would do no harm.

> What is a postponed number?

I see that you grasped that later, but I'll describe anyway. Parsing is
done from left to right, in a greedy fashion (i.e. match the longest
possible known expression). If after that we encounter a number that we
don't know what to do with yet, postpone it until we move on to the next
expression. The parser for that might eat the preceding "postponed"
number (for example "5 August", 5 would be postponed because we don't
know what to do with yet, but then "August" would be the context to make
it day of month). If that is not the case, the number will be parsed by
parse_postponed_number() as a lonely, single number between there. (Yes,
this should be documented better with the "big picture".)

>
>> +    /*
>> +     * REVISIT: There could be a "next_field" that would be set from
>> +     * "field" for the duration of the handle_postponed_number() call,
>> +     * so it has more information to work with.
>> +     */
>
> The notmuch convention seems to be to use XXX: for this. I'm not sure
> I'd bother changing, especially if we can't decide how to package this.

Okay, can be changed if needed.

>
>> +/* Time set helper. No input checking. Use UNSET (-1) to leave unset. */
>> +static int
>> +set_abs_time (struct state *state, int hour, int min, int sec)
>> +{
>> +    int r;
>> +
>> +    if (hour != UNSET) {
>> +	if ((r = set_field (state, TM_ABS_HOUR, hour)))
>> +	    return r;
>> +    }
>
> So for this function and the next, the first match wins? I don't really
> see the motivation for this, maybe you can explain a bit.

The whole parser tries to be as unambiguous as possible. If the input
leads to a situation in which any absolute time field (see enum field)
is attempted to set twice, it means it appears twice in the input. For
example, "2012-08-06 August", where the month is set twice. By design,
that is not allowed, and set_field() fails, even if it's the same
value. The only semi-exception is having redundant weekday there, for
example "Monday, August 6".

>
>
>> +    /* timezone codes: offset in minutes. FIXME: add more codes. */
>
> Did you think about trying to delegate the list of timezones to the
> system?

No. :) I'll look into it.

>
>> + * Compare strings s and keyword. Return number of matching chars on
>> + * match, 0 for no match. Match must be at least n chars (n == 0 all
>> + * of keyword), otherwise it's not a match. Use match_case for case
>> + * sensitive matching.
>> + */
>
> I guess that's fine, and it is internal, but maybe -1 for whole string
> would be slightly nicer (although I can't imagine what good matching 0
> length strings is at the moment).

Can be changed.

>
>> +	/* Minimum match length. */
>> +	p = strchr (keyword, '|');
>> +	if (p) {
>> +	    minlen = p - keyword;
>> +	    memmove (p, p + 1, strlen (p + 1) + 1);
>> +	}
>
> Something about that memmove creeps me out, but I trust you that it's
> correct. Alternatively I guess you could represent keywords as pairs of
> strings, which is probably more of a pain.

I didn't bother to double check it now, but I remember thinking it over
very carefully. :) I agree it could use more clarity to be more
obviously correct.

(Initially the minlen was coded as an int in the table, but the above
allows the localization to decide how long the match must be.)

>
>
>> +
>> +/* Parse a single number. Typically postpone parsing until later. */
>
> OK, so I finally start to understand what a postponed number is :)
> I understand the compiler likes bottom up declarations, but some
> top down declarations/comments are needed I think.

I agree more comments would be in order.

>
>> +static int
>> +parse_date (struct state *state, char sep,
>> +	    unsigned long v1, unsigned long v2, unsigned long v3,
>> +	    size_t n1, size_t n2, size_t n3)
>> +{
>> +    int year = UNSET, mon = UNSET, mday = UNSET;
>> +
>> +    assert (is_date_sep (sep));
>> +
>> +    switch (sep) {
>> +    case '/': /* Date: M[M]/D[D][/YY[YY]] or M[M]/YYYY */
>
> If I understand correctly, this chooses between American (?) month, day,
> year ordering and "sensible" day, month, year ordering by delimiter. I
> never thought about this as a way to tell (I often write D/M/Y), but
> that could be just me. I agree it's fine as a convention.

You understand correctly. It's obviously not a reliable way to tell
unless you make it the convention, which I chose to do. Some parsers,
notably the one in git, also look at the values to see if it could be
D/M/Y if M/D/Y is not possible, but I don't like the ambiguity that
introduces.

>
>> +/*
>> + * Parse delimiter(s). Return < 0 on error, number of parsed chars on
>> + * success.
>> + */
>
> So 1:-2 will parse as 1-2 ?, i.e. last delimiter wins? Maybe better to
> say so explicitly.

Agreed. It just throws out any extra delimiters. Perhaps this should
also be more strict about the allowed delimiters (now anything is
allowed).

>
>> +/* Combine absolute and relative fields, and round. */
>> +static int
>> +create_output (struct state *state, time_t *t_out, const time_t *tnow,
>> +	       int round)
>> +{
>
> It seems like most of non-obvious logic like (when is "wednesday") is
> encoded here. From a maintenence point of view, it would be nice to be
> able to seperate out the heuristic stuff from the mechanical, to the
> degree that it is possible.

Agreed. Basically this is step 2, turning all information parsed in step
1 (function parse_input()) into a sensible result. Figuring out the
current time is also done here.


BR,
Jani.