Date: Sun, 28 Oct 2012 18:52:04 -0400
From: Austin Clements <amdragon@MIT.EDU>
To: Jani Nikula <jani@nikula.org>
Subject: Re: [PATCH v5 2/9] parse-time-string: add a date/time parser to
	notmuch
Message-ID: <20121028225204.GD15377@mit.edu>
References: <cover.1350854171.git.jani@nikula.org>
	<a90d3b687895a26f765539d6c0420038a74ee42f.1350854171.git.jani@nikula.org>
	<20121022081444.GM14861@mit.edu> <87lieqkz4t.fsf@nikula.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87lieqkz4t.fsf@nikula.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: notmuch@notmuchmail.org
Precedence: list

Quoth Jani Nikula on Oct 29 at 12:30 am:
> On Mon, 22 Oct 2012, Austin Clements <amdragon@MIT.EDU> wrote:
> >> +/*
> >> + * Accepted keywords.
> >> + *
> >> + * A keyword may optionally contain a '|' to indicate the minimum
> >> + * match length. Without one, full match is required. It's advisable
> >> + * to keep the minimum match parts unique across all keywords.
> >> + *
> >> + * If keyword begins with upper case letter, then the matching will be
> >> + * case sensitive. Otherwise the matching is case insensitive.
> >> + *
> >> + * If setter is NULL, set_default will be used.
> >> + *
> >> + * Note: Order matters. Matching is greedy, longest match is used, but
> >> + * of equal length matches the first one is used, unless there's an
> >> + * equal length case sensitive match which trumps case insensitive
> >> + * matches.
> >
> > If you do have a tokenizer (or disallow mashing keywords together),
> > then all of complexity arising from longest match goes away because
> > the keyword token either will or won't match a keyword.  If you also
> > eliminate the rule for case sensitivity and put case-sensitive things
> > before conflicting case-insensitive things (so put "M" before
> > "m|inutes"), then you can simply use the first match.
> 
> At least one reason for going through the whole table is that if this
> ever gets i18n support, the conflicting things might be different. While
> order matters in principle, you should create the table so that it
> really doesn't matter.

While that's true, if the input keyword has to be syntactically
delimited, there's still no such thing as a "longest match", since the
length of any match will be the length of the input.  You may still
want to scan the whole table, but if you find multiple matches, it's a
bug in the table indicating that |ed prefixes aren't unique.  Hence,
if you're not interested in finding bugs in the table, you can just
find the first match.

Or you could remove the |'s from the table, scan the whole table, and
consider the input string ambiguous if it matches multiple table
entries (being careful with case sensitivity), just like you do now if
the input string is shorter than the |ed prefixes.  That would
simplify your table, your matching logic, and possibly your scanning
logic.

> >
> >> + */
> >> +static struct keyword keywords[] = {
> >> +    /* Weekdays. */
> >> +    { N_("sun|day"),	TM_ABS_WDAY,	0,	NULL },
> >> +    { N_("mon|day"),	TM_ABS_WDAY,	1,	NULL },
> >> +    { N_("tue|sday"),	TM_ABS_WDAY,	2,	NULL },
> >> +    { N_("wed|nesday"),	TM_ABS_WDAY,	3,	NULL },
> >> +    { N_("thu|rsday"),	TM_ABS_WDAY,	4,	NULL },
> >> +    { N_("fri|day"),	TM_ABS_WDAY,	5,	NULL },
> >> +    { N_("sat|urday"),	TM_ABS_WDAY,	6,	NULL },
> >> +
> >> +    /* Months. */
> >> +    { N_("jan|uary"),	TM_ABS_MON,	1,	kw_set_month },
> >> +    { N_("feb|ruary"),	TM_ABS_MON,	2,	kw_set_month },
> >> +    { N_("mar|ch"),	TM_ABS_MON,	3,	kw_set_month },
> >> +    { N_("apr|il"),	TM_ABS_MON,	4,	kw_set_month },
> >> +    { N_("may"),	TM_ABS_MON,	5,	kw_set_month },
> >> +    { N_("jun|e"),	TM_ABS_MON,	6,	kw_set_month },
> >> +    { N_("jul|y"),	TM_ABS_MON,	7,	kw_set_month },
> >> +    { N_("aug|ust"),	TM_ABS_MON,	8,	kw_set_month },
> >> +    { N_("sep|tember"),	TM_ABS_MON,	9,	kw_set_month },
> >> +    { N_("oct|ober"),	TM_ABS_MON,	10,	kw_set_month },
> >> +    { N_("nov|ember"),	TM_ABS_MON,	11,	kw_set_month },
> >> +    { N_("dec|ember"),	TM_ABS_MON,	12,	kw_set_month },
> >> +
> >> +    /* Durations. */
> >> +    { N_("y|ears"),	TM_REL_YEAR,	1,	kw_set_rel },
> >> +    { N_("w|eeks"),	TM_REL_WEEK,	1,	kw_set_rel },
> >> +    { N_("d|ays"),	TM_REL_DAY,	1,	kw_set_rel },
> >> +    { N_("h|ours"),	TM_REL_HOUR,	1,	kw_set_rel },
> >> +    { N_("hr|s"),	TM_REL_HOUR,	1,	kw_set_rel },
> >> +    { N_("m|inutes"),	TM_REL_MIN,	1,	kw_set_rel },
> >> +    /* M=months, m=minutes */
> >> +    { N_("M"),		TM_REL_MON,	1,	kw_set_rel },
> >> +    { N_("mins"),	TM_REL_MIN,	1,	kw_set_rel },
> >> +    { N_("mo|nths"),	TM_REL_MON,	1,	kw_set_rel },
> >> +    { N_("s|econds"),	TM_REL_SEC,	1,	kw_set_rel },
> >> +    { N_("secs"),	TM_REL_SEC,	1,	kw_set_rel },
> >> +
> >> +    /* Numbers. */
> >> +    { N_("one"),	TM_NONE,	1,	kw_set_number },
> >> +    { N_("two"),	TM_NONE,	2,	kw_set_number },
> >> +    { N_("three"),	TM_NONE,	3,	kw_set_number },
> >> +    { N_("four"),	TM_NONE,	4,	kw_set_number },
> >> +    { N_("five"),	TM_NONE,	5,	kw_set_number },
> >> +    { N_("six"),	TM_NONE,	6,	kw_set_number },
> >> +    { N_("seven"),	TM_NONE,	7,	kw_set_number },
> >> +    { N_("eight"),	TM_NONE,	8,	kw_set_number },
> >> +    { N_("nine"),	TM_NONE,	9,	kw_set_number },
> >> +    { N_("ten"),	TM_NONE,	10,	kw_set_number },
> >> +    { N_("dozen"),	TM_NONE,	12,	kw_set_number },
> >> +    { N_("hundred"),	TM_NONE,	100,	kw_set_number },
> >> +
> >> +    /* Special number forms. */
> >> +    { N_("this"),	TM_NONE,	0,	kw_set_number },
> >> +    { N_("last"),	TM_NONE,	1,	kw_set_number },
> >> +
> >> +    /* Other special keywords. */
> >> +    { N_("yesterday"),	TM_REL_DAY,	1,	kw_set_rel },
> >> +    { N_("today"),	TM_NONE,	0,	kw_set_today },
> >> +    { N_("now"),	TM_NONE,	0,	kw_set_now },
> >> +    { N_("noon"),	TM_NONE,	12,	kw_set_timeofday },
> >> +    { N_("midnight"),	TM_NONE,	0,	kw_set_timeofday },
> >> +    { N_("am"),		TM_AMPM,	0,	kw_set_ampm },
> >> +    { N_("a.m."),	TM_AMPM,	0,	kw_set_ampm },
> >> +    { N_("pm"),		TM_AMPM,	1,	kw_set_ampm },
> >> +    { N_("p.m."),	TM_AMPM,	1,	kw_set_ampm },
> >> +    { N_("st"),		TM_NONE,	0,	kw_set_ordinal },
> >> +    { N_("nd"),		TM_NONE,	0,	kw_set_ordinal },
> >> +    { N_("rd"),		TM_NONE,	0,	kw_set_ordinal },
> >> +    { N_("th"),		TM_NONE,	0,	kw_set_ordinal },
> >> +
> >> +    /* Timezone codes: offset in minutes. XXX: Add more codes. */
> >> +    { N_("pst"),	TM_TZ,		-8*60,	NULL },
> >> +    { N_("mst"),	TM_TZ,		-7*60,	NULL },
> >> +    { N_("cst"),	TM_TZ,		-6*60,	NULL },
> >> +    { N_("est"),	TM_TZ,		-5*60,	NULL },
> >> +    { N_("ast"),	TM_TZ,		-4*60,	NULL },
> >> +    { N_("nst"),	TM_TZ,		-(3*60+30),	NULL },
> >> +
> >> +    { N_("gmt"),	TM_TZ,		0,	NULL },
> >> +    { N_("utc"),	TM_TZ,		0,	NULL },
> >> +
> >> +    { N_("wet"),	TM_TZ,		0,	NULL },
> >> +    { N_("cet"),	TM_TZ,		1*60,	NULL },
> >> +    { N_("eet"),	TM_TZ,		2*60,	NULL },
> >> +    { N_("fet"),	TM_TZ,		3*60,	NULL },
> >> +
> >> +    { N_("wat"),	TM_TZ,		1*60,	NULL },
> >> +    { N_("cat"),	TM_TZ,		2*60,	NULL },
> >> +    { N_("eat"),	TM_TZ,		3*60,	NULL },
> >> +};
> >> +
> >> +/*
> >> + * Compare strings s and keyword. Return number of matching chars on
> >> + * match, 0 for no match. Match must be at least n chars, or all of
> >> + * keyword if n < 0, otherwise it's not a match. Use match_case for
> >> + * case sensitive matching.
> >> + */
> >> +static size_t
> >> +match_keyword (const char *s, const char *keyword, ssize_t n, bool match_case)
> >> +{
> >> +    ssize_t i;
> >> +
> >> +    if (!n)
> >> +	return 0;
> >> +
> >> +    for (i = 0; *s && *keyword; i++, s++, keyword++) {
> >> +	if (match_case) {
> >> +	    if (*s != *keyword)
> >
> > The pointer arithmetic doesn't seem to buy anything here.  What about
> > just looping over i and using s[i] and keyword[i]?
> 
> The pointer arithmetic will be useful when I implement your other
> suggestion of handling '|' here. ;) Otherwise, I'd need two index
> variables.

Fair enough.

> >
> >> +		break;
> >> +	} else {
> >> +	    if (tolower ((unsigned char) *s) !=
> >> +		tolower ((unsigned char) *keyword))
> >
> > I don't think the cast to unsigned char is necessary.
> 
> As discussed on IRC, pedantically it is necessary, as ctype.h functions
> accept an int that must have the value of an unsigned char or EOF, and
> char might be signed.

It wouldn't be C without the pedantic.

> >> +/* Combine absolute and relative fields, and round. */
> >> +static int
> >> +create_output (struct state *state, time_t *t_out, const time_t *ref,
> >> +	       int round)
> >> +{
> >> +    struct tm tm = { .tm_isdst = -1 };
> >> +    struct tm now;
> >> +    time_t t;
> >> +    enum field f;
> >> +    int r;
> >> +    int week_round = PARSE_TIME_NO_ROUND;
> >> +
> >> +    r = initialize_now (state, &now, ref);
> >> +    if (r)
> >> +	return r;
> >> +
> >> +    /* Initialize fields flagged as "now" to reference time. */
> >> +    for (f = TM_ABS_SEC; f != TM_NONE; f = next_abs_field (f)) {
> >> +	if (state->set[f] == FIELD_NOW) {
> >> +	    state->tm[f] = tm_get_field (&now, f);
> >> +	    state->set[f] = FIELD_SET;
> >> +	}
> >> +    }
> >> +
> >> +    /*
> >> +     * If WDAY is set but MDAY is not, we consider WDAY relative
> >> +     *
> >> +     * XXX: This fails on stuff like "two months monday" because two
> >> +     * months ago wasn't the same day as today. Postpone until we know
> >> +     * date?
> >> +     */
> >> +    if (is_field_set (state, TM_ABS_WDAY) &&
> >> +	!is_field_set (state, TM_ABS_MDAY)) {
> >> +	int wday = get_field (state, TM_ABS_WDAY);
> >> +	int today = tm_get_field (&now, TM_ABS_WDAY);
> >> +	int rel_days;
> >> +
> >> +	if (today > wday)
> >> +	    rel_days = today - wday;
> >> +	else
> >> +	    rel_days = today + 7 - wday;
> >> +
> >> +	/* This also prevents special week rounding from happening. */
> >> +	mod_field (state, TM_REL_DAY, rel_days);
> >> +
> >> +	unset_field (state, TM_ABS_WDAY);
> >> +    }
> >> +
> >> +    r = fixup_ampm (state);
> >> +    if (r)
> >> +	return r;
> >> +
> >> +    /*
> >> +     * Iterate fields from most accurate to least accurate, and set
> >> +     * unset fields according to requested rounding.
> >> +     */
> >> +    for (f = TM_ABS_SEC; f != TM_NONE; f = next_abs_field (f)) {
> >> +	if (round != PARSE_TIME_NO_ROUND) {
> >> +	    enum field r = abs_to_rel_field (f);
> >> +
> >> +	    if (is_field_set (state, f) || is_field_set (state, r)) {
> >> +		if (round >= PARSE_TIME_ROUND_UP && f != TM_ABS_SEC) {
> >> +		    mod_field (state, r, -1);
> >
> > Crazy.  This could use a comment.  It took me a while to figure out
> > why this was -1, though maybe that's just because it's late.
> 
> Will do.
> 
> /* You're not expected to understand this */ ;)

Hah.  You're not allowed to use that on me!  I *do* understand the
code that comment is originally from.  ]:--8)