From: Jani Nikula <jani@nikula.org>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Subject: Re: [PATCH] lib: Add a new prefix "list" to the search-terms syntax
In-Reply-To: <20131217180322.GA9272@node.dhcp.inet.fi>
References: <20130409083010.GA27675@raorn.name>
	<1365549369-12776-1-git-send-email-raorn@raorn.name>
	<87bo2ougmb.fsf@nikula.org>
	<20131217180322.GA9272@node.dhcp.inet.fi>
User-Agent: Notmuch/0.17~rc2+18~g39a67a6 (http://notmuchmail.org) Emacs/24.3.1
	(x86_64-pc-linux-gnu)
Date: Wed, 01 Jan 2014 14:04:58 +0200
Message-ID: <87y52z29hx.fsf@nikula.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: notmuch@notmuchmail.org, "Alexey I. Froloff" <raorn@raorn.name>
Precedence: list

On Tue, 17 Dec 2013, "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> On Thu, Oct 17, 2013 at 05:17:00PM +0300, Jani Nikula wrote:
>> On Wed, 10 Apr 2013, "Alexey I. Froloff" <raorn@raorn.name> wrote:
>> > From: "Alexey I. Froloff" <raorn@raorn.name>
>> >
>> > Add support for indexing and searching the message's List-Id header.
>> > This is useful when matching all the messages belonging to a particular
>> > mailing list.
>>=20
>> There's an issue with our duplicate message-id handling that is likely
>> to cause confusion with List-Id: searches. If you receive several
>> duplicates of the same message (judged by the message-id), only the
>> first one of them gets indexed, and the rest are ignored. This means
>> that for messages you receive both directly and through a list, it will
>> be arbitrary whether the List-Id: gets indexed or not. Therefore a list:
>> search might not return all the messages you'd expect.
>
> I've tried to address this. The patch also adds few tests for the feature.
>
> There's still missing functionality: re-indexing existing messages for
> list-id, handling message removal, etc.
>
> Any comments?

Hi Kirill, sorry it took me so long to get to this!

I've looked into our duplicate message-id handling and indexing before,
and it's not very good.

First, we should pay more attention to checking whether the messages
really are duplicates or not. This is not trivial, but we should go a
bit further than just comparing the message-ids. Sadly, handling the
case of colliding message-ids on clearly different messages is not
trivial either, as we rely on the message-ids being unique all around.

Second, we should be more clever about indexing duplicates that we think
are the same message. This is orthogonal to the first point. Currently,
only the first duplicate gets indexed, and will remain indexed even if
it's deleted and other copies remain. A message that matches a search
might end up not having the matching search terms, for example. A
rebuild of the database might index a different duplicate from the last
time.

Having said that (partially just to write the thoughts down somewhere!),
I think your basic approach of indexing the list-id for duplicates is
sane, and we can grow more smarts to _notmuch_message_index_file() for
duplicate =3D=3D true in the future, checking more headers etc. One thing I
wonder about though: what if more than one duplicate has list-id, and
_index_list_id() gets called multiple times on a message? (CC Austin, he
probably has more clues on this than me.)

For merging, you should also address the previous comments to the
original patch. There's been plenty of dropping the ball here it
seems... I think we've also agreed (perhaps only on IRC, I forget) that
we should use "listid" as the prefix, not "list" (sadly hyphens are not
allowed). Splitting the patch to code, test, and man parts might be a
good idea too.

BR,
Jani.


>
> diff --git a/lib/database.cc b/lib/database.cc
> index f395061e3a73..196243e15d1a 100644
> --- a/lib/database.cc
> +++ b/lib/database.cc
> @@ -205,6 +205,7 @@ static prefix_t BOOLEAN_PREFIX_INTERNAL[] =3D {
>  };
>=20=20
>  static prefix_t BOOLEAN_PREFIX_EXTERNAL[] =3D {
> +    { "list",			"XLIST"},
>      { "thread",			"G" },
>      { "tag",			"K" },
>      { "is",			"K" },
> @@ -2025,10 +2026,13 @@ notmuch_database_add_message (notmuch_database_t =
*notmuch,
>  	    date =3D notmuch_message_file_get_header (message_file, "date");
>  	    _notmuch_message_set_header_values (message, date, from, subject);
>=20=20
> -	    ret =3D _notmuch_message_index_file (message, filename);
> +	    ret =3D _notmuch_message_index_file (message, filename, false);
>  	    if (ret)
>  		goto DONE;
>  	} else {
> +	    ret =3D _notmuch_message_index_file (message, filename, true);
> +	    if (ret)
> +		goto DONE;
>  	    ret =3D NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
>  	}
>=20=20
> diff --git a/lib/index.cc b/lib/index.cc
> index 78c18cf36d10..9fe1ad6502ed 100644
> --- a/lib/index.cc
> +++ b/lib/index.cc
> @@ -304,6 +304,47 @@ _index_address_list (notmuch_message_t *message,
>      }
>  }
>=20=20
> +static void
> +_index_list_id (notmuch_message_t *message,
> +               const char *list_id_header)
> +{
> +    const char *begin_list_id, *end_list_id, *list_id;
> +    void *local;
> +
> +    if (list_id_header =3D=3D NULL)
> +	return;
> +
> +    /* RFC2919 says that the list-id is found at the end of the header
> +     * and enclosed between angle brackets. If we cannot find a
> +     * matching pair of brackets containing at least one character,
> +     * we ignore the list id header. */
> +    begin_list_id =3D strrchr (list_id_header, '<');
> +    if (!begin_list_id) {
> +	fprintf (stderr, "Warning: Not indexing mailformed List-Id tag.\n");
> +	return;
> +    }
> +
> +    end_list_id =3D strrchr(begin_list_id, '>');
> +    if (!end_list_id || (end_list_id - begin_list_id < 2)) {
> +	fprintf (stderr, "Warning: Not indexing mailformed List-Id tag.\n");
> +	return;
> +    }
> +
> +    local =3D talloc_new (message);
> +
> +    /* We extract the list id between the angle brackets */
> +    list_id =3D talloc_strndup (local, begin_list_id + 1,
> +			      end_list_id - begin_list_id - 1);
> +
> +    /* _notmuch_message_add_term() may return
> +     * NOTMUCH_PRIVATE_STATUS_TERM_TOO_LONG here.  We can't fix it, but
> +     * this is not a reason to exit with error... */
> +    if (_notmuch_message_add_term (message, "list", list_id))
> +	fprintf (stderr, "Warning: Not indexing List-Id: <%s>\n", list_id);
> +
> +    talloc_free (local);
> +}
> +
>  /* Callback to generate terms for each mime part of a message. */
>  static void
>  _index_mime_part (notmuch_message_t *message,
> @@ -425,14 +466,15 @@ _index_mime_part (notmuch_message_t *message,
>=20=20
>  notmuch_status_t
>  _notmuch_message_index_file (notmuch_message_t *message,
> -			     const char *filename)
> +			     const char *filename,
> +			     notmuch_bool_t duplicate)
>  {
>      GMimeStream *stream =3D NULL;
>      GMimeParser *parser =3D NULL;
>      GMimeMessage *mime_message =3D NULL;
>      InternetAddressList *addresses;
>      FILE *file =3D NULL;
> -    const char *from, *subject;
> +    const char *from, *subject, *list_id;
>      notmuch_status_t ret =3D NOTMUCH_STATUS_SUCCESS;
>      static int initialized =3D 0;
>      char from_buf[5];
> @@ -485,6 +527,9 @@ mboxes is deprecated and may be removed in the future=
.\n", filename);
>=20=20
>      from =3D g_mime_message_get_sender (mime_message);
>=20=20
> +    if (duplicate)
> +	goto DUP;
> +
>      addresses =3D internet_address_list_parse_string (from);
>      if (addresses) {
>  	_index_address_list (message, "from", addresses);
> @@ -502,6 +547,10 @@ mboxes is deprecated and may be removed in the futur=
e.\n", filename);
>=20=20
>      _index_mime_part (message, g_mime_message_get_mime_part (mime_messag=
e));
>=20=20
> +  DUP:
> +    list_id =3D g_mime_object_get_header (GMIME_OBJECT (mime_message), "=
List-Id");
> +    _index_list_id (message, list_id);
> +
>    DONE:
>      if (mime_message)
>  	g_object_unref (mime_message);
> diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
> index af185c7c5ba8..138dfa58efc8 100644
> --- a/lib/notmuch-private.h
> +++ b/lib/notmuch-private.h
> @@ -322,7 +322,8 @@ notmuch_message_get_author (notmuch_message_t *messag=
e);
>=20=20
>  notmuch_status_t
>  _notmuch_message_index_file (notmuch_message_t *message,
> -			     const char *filename);
> +			     const char *filename,
> +			     notmuch_bool_t duplicate);
>=20=20
>  /* message-file.c */
>=20=20
> diff --git a/man/man7/notmuch-search-terms.7 b/man/man7/notmuch-search-te=
rms.7
> index f1627b3488f8..29b30b7b0b00 100644
> --- a/man/man7/notmuch-search-terms.7
> +++ b/man/man7/notmuch-search-terms.7
> @@ -52,6 +52,8 @@ terms to match against specific portions of an email, (=
where
>=20=20
>  	thread:<thread-id>
>=20=20
> +	list:<list-id>
> +
>  	folder:<directory-path>
>=20=20
>  	date:<since>..<until>
> @@ -109,6 +111,12 @@ within a matching directory. Only the directory comp=
onents below the
>  top-level mail database path are available to be searched.
>=20=20
>  The
> +.BR list: ,
> +is used to match mailing list ID of an email message \- contents of the
> +List\-Id: header without the '<', '>' delimiters or decoded list
> +description.
> +
> +The
>  .B date:
>  prefix can be used to restrict the results to only messages within a
>  particular time range (based on the Date: header) with a range syntax
> diff --git a/test/corpus/cur/18:2, b/test/corpus/cur/18:2,
> index f522f69eb933..2b54925bd5d1 100644
> --- a/test/corpus/cur/18:2,
> +++ b/test/corpus/cur/18:2,
> @@ -3,6 +3,7 @@ To: notmuch@notmuchmail.org
>  Date: Tue, 17 Nov 2009 18:21:38 -0500
>  Subject: [notmuch] archive
>  Message-ID: <20091117232137.GA7669@griffis1.net>
> +List-Id: <test1.example.com>
>=20=20
>  Just subscribed, I'd like to catch up on the previous postings,
>  but the archive link seems to be bogus?
> diff --git a/test/corpus/cur/51:2, b/test/corpus/cur/51:2,
> index f522f69eb933..b155e6ee64a5 100644
> --- a/test/corpus/cur/51:2,
> +++ b/test/corpus/cur/51:2,
> @@ -3,6 +3,7 @@ To: notmuch@notmuchmail.org
>  Date: Tue, 17 Nov 2009 18:21:38 -0500
>  Subject: [notmuch] archive
>  Message-ID: <20091117232137.GA7669@griffis1.net>
> +List-Id: <test2.example.com>
>=20=20
>  Just subscribed, I'd like to catch up on the previous postings,
>  but the archive link seems to be bogus?
> diff --git a/test/search b/test/search
> index a7a0b18d2e48..bef42971226c 100755
> --- a/test/search
> +++ b/test/search
> @@ -129,4 +129,28 @@ add_message '[subject]=3D"utf8-message-body-subject"=
' '[date]=3D"Sat, 01 Jan 2000 12
>  output=3D$(notmuch search "b=C3=B6d=C3=BD" | notmuch_search_sanitize)
>  test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test =
Suite; utf8-message-body-subject (inbox unread)"
>=20=20
> +test_begin_subtest "Search by List-Id"
> +notmuch search list:notmuch.notmuchmail.org | notmuch_search_sanitize > =
OUTPUT
> +cat <<EOF >EXPECTED
> +thread:XXX   2009-11-18 [2/2] Lars Kellogg-Stedman; [notmuch] "notmuch h=
elp" outputs to stderr? (attachment inbox signed unread)
> +thread:XXX   2009-11-18 [4/7] Lars Kellogg-Stedman, Mikhail Gusarov| Kei=
th Packard, Carl Worth; [notmuch] Working with Maildir storage? (inbox sign=
ed unread)
> +thread:XXX   2009-11-18 [1/2] Alex Botero-Lowry| Carl Worth; [notmuch] [=
PATCH] Error out if no query is supplied to search instead of going into an=
 infinite loop (attachment inbox unread)
> +thread:XXX   2009-11-17 [1/3] Adrian Perez de Castro| Keith Packard, Car=
l Worth; [notmuch] Introducing myself (inbox signed unread)
> +thread:XXX   2009-11-17 [1/2] Alex Botero-Lowry| Carl Worth; [notmuch] p=
reliminary FreeBSD support (attachment inbox unread)
> +EOF
> +test_expect_equal_file OUTPUT EXPECTED
> +
> +test_begin_subtest "Search by List-Id, duplicated messages, step 1"
> +notmuch search list:test1.example.com | notmuch_search_sanitize > OUTPUT
> +cat <<EOF >EXPECTED
> +thread:XXX   2009-11-17 [1/3] Aron Griffis| Keith Packard, Carl Worth; [=
notmuch] archive (inbox unread)
> +EOF
> +test_expect_equal_file OUTPUT EXPECTED
> +
> +test_begin_subtest "Search by List-Id, duplicated messages, step 2"
> +notmuch search list:test2.example.com | notmuch_search_sanitize > OUTPUT
> +cat <<EOF >EXPECTED
> +thread:XXX   2009-11-17 [1/3] Aron Griffis| Keith Packard, Carl Worth; [=
notmuch] archive (inbox unread)
> +EOF
> +test_expect_equal_file OUTPUT EXPECTED
>  test_done
> diff --git a/test/test-lib.sh b/test/test-lib.sh
> index d8e0d9115a69..981bde4a4004 100644
> --- a/test/test-lib.sh
> +++ b/test/test-lib.sh
> @@ -576,9 +576,9 @@ test_expect_equal_json () {
>      # The test suite forces LC_ALL=3DC, but this causes Python 3 to
>      # decode stdin as ASCII.  We need to read JSON in UTF-8, so
>      # override Python's stdio encoding defaults.
> -    output=3D$(echo "$1" | PYTHONIOENCODING=3Dutf-8 python -mjson.tool \
> +    output=3D$(echo "$1" | PYTHONIOENCODING=3Dutf-8 python2 -mjson.tool \
>          || echo "$1")
> -    expected=3D$(echo "$2" | PYTHONIOENCODING=3Dutf-8 python -mjson.tool=
 \
> +    expected=3D$(echo "$2" | PYTHONIOENCODING=3Dutf-8 python2 -mjson.too=
l \
>          || echo "$2")
>      shift 2
>      test_expect_equal "$output" "$expected" "$@"
> --=20
>  Kirill A. Shutemov