From: Jani Nikula Date: Wed, 1 Jan 2014 12:04:58 +0000 (+0200) Subject: Re: [PATCH] lib: Add a new prefix "list" to the search-terms syntax X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=78563e494ea65a49c88b7b01c098e3ee3f40e31e;p=notmuch-archives.git Re: [PATCH] lib: Add a new prefix "list" to the search-terms syntax --- diff --git a/91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c b/91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c new file mode 100644 index 000000000..fe9e77fad --- /dev/null +++ b/91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c @@ -0,0 +1,394 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by olra.theworths.org (Postfix) with ESMTP id 5E878431FC2 + for ; Wed, 1 Jan 2014 04:05:14 -0800 (PST) +X-Virus-Scanned: Debian amavisd-new at olra.theworths.org +X-Spam-Flag: NO +X-Spam-Score: -0.7 +X-Spam-Level: +X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5 + tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled +Received: from olra.theworths.org ([127.0.0.1]) + by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id Ig-jlEt3qmPX for ; + Wed, 1 Jan 2014 04:05:04 -0800 (PST) +Received: from mail-ea0-f176.google.com (mail-ea0-f176.google.com + [209.85.215.176]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) + (No client certificate requested) + by olra.theworths.org (Postfix) with ESMTPS id 88C6F431FC0 + for ; Wed, 1 Jan 2014 04:05:04 -0800 (PST) +Received: by mail-ea0-f176.google.com with SMTP id h14so5812179eaj.35 + for ; Wed, 01 Jan 2014 04:05:03 -0800 (PST) +X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; + d=1e100.net; s=20130820; + h=x-gm-message-state:from:to:cc:subject:in-reply-to:references + :user-agent:date:message-id:mime-version:content-type + :content-transfer-encoding; + bh=yfCvxMG62xd9D+0aOngvooYflgtc9V7expqjW04XtAM=; + b=lJwLjMZlEUgVIMmUmYca9nz2y/hu8IlcHL0md+cawt99xQdUSFnpgaUzjuSmcWS9Th + cNajHWdwuD2bPqg0IWn/9mpA5LfgWNRiYCQSMjPH5/UUY8K7WUsAoMNG1cBuxVwRcSwF + k7nNgCx4+v8D0VtTp9Ep5n3KVLPdRCPj9td+9arqtYuce3AXhAIft3g6qpScm+Ea0bCV + jTBSzqhU8ctFDPHS7YpKi1xmQ9KU63ze+1QORgM3HyuTEU761nKUj9Pn4pyQ1mtxsbs2 + hQ+4+lNJKIVghmbnbVqF4rJowBSrJOEnxXi7P2rW/oh1nne+MmNxGI8uYsYTMyKbEaO7 + vfUg== +X-Gm-Message-State: + ALoCoQnS9e9hvjCtjKKOGiLxrL1Pc9fdvaOc/0pVvg2haWgkI0QC8bFDGdw8M6s+bLbkPmKUJ0Lq +X-Received: by 10.14.69.200 with SMTP id n48mr10888063eed.54.1388577901865; + Wed, 01 Jan 2014 04:05:01 -0800 (PST) +Received: from localhost (dsl-hkibrasgw2-58c36f-91.dhcp.inet.fi. + [88.195.111.91]) + by mx.google.com with ESMTPSA id m1sm126542611eeg.0.2014.01.01.04.04.59 + for + (version=TLSv1.2 cipher=RC4-SHA bits=128/128); + Wed, 01 Jan 2014 04:05:00 -0800 (PST) +From: Jani Nikula +To: "Kirill A. Shutemov" +Subject: Re: [PATCH] lib: Add a new prefix "list" to the search-terms syntax +In-Reply-To: <20131217180322.GA9272@node.dhcp.inet.fi> +References: <20130409083010.GA27675@raorn.name> + <1365549369-12776-1-git-send-email-raorn@raorn.name> + <87bo2ougmb.fsf@nikula.org> + <20131217180322.GA9272@node.dhcp.inet.fi> +User-Agent: Notmuch/0.17~rc2+18~g39a67a6 (http://notmuchmail.org) Emacs/24.3.1 + (x86_64-pc-linux-gnu) +Date: Wed, 01 Jan 2014 14:04:58 +0200 +Message-ID: <87y52z29hx.fsf@nikula.org> +MIME-Version: 1.0 +Content-Type: text/plain; charset=utf-8 +Content-Transfer-Encoding: quoted-printable +Cc: notmuch@notmuchmail.org, "Alexey I. Froloff" +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.13 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Wed, 01 Jan 2014 12:05:14 -0000 + +On Tue, 17 Dec 2013, "Kirill A. Shutemov" wrote: +> On Thu, Oct 17, 2013 at 05:17:00PM +0300, Jani Nikula wrote: +>> On Wed, 10 Apr 2013, "Alexey I. Froloff" wrote: +>> > From: "Alexey I. Froloff" +>> > +>> > Add support for indexing and searching the message's List-Id header. +>> > This is useful when matching all the messages belonging to a particular +>> > mailing list. +>>=20 +>> There's an issue with our duplicate message-id handling that is likely +>> to cause confusion with List-Id: searches. If you receive several +>> duplicates of the same message (judged by the message-id), only the +>> first one of them gets indexed, and the rest are ignored. This means +>> that for messages you receive both directly and through a list, it will +>> be arbitrary whether the List-Id: gets indexed or not. Therefore a list: +>> search might not return all the messages you'd expect. +> +> I've tried to address this. The patch also adds few tests for the feature. +> +> There's still missing functionality: re-indexing existing messages for +> list-id, handling message removal, etc. +> +> Any comments? + +Hi Kirill, sorry it took me so long to get to this! + +I've looked into our duplicate message-id handling and indexing before, +and it's not very good. + +First, we should pay more attention to checking whether the messages +really are duplicates or not. This is not trivial, but we should go a +bit further than just comparing the message-ids. Sadly, handling the +case of colliding message-ids on clearly different messages is not +trivial either, as we rely on the message-ids being unique all around. + +Second, we should be more clever about indexing duplicates that we think +are the same message. This is orthogonal to the first point. Currently, +only the first duplicate gets indexed, and will remain indexed even if +it's deleted and other copies remain. A message that matches a search +might end up not having the matching search terms, for example. A +rebuild of the database might index a different duplicate from the last +time. + +Having said that (partially just to write the thoughts down somewhere!), +I think your basic approach of indexing the list-id for duplicates is +sane, and we can grow more smarts to _notmuch_message_index_file() for +duplicate =3D=3D true in the future, checking more headers etc. One thing I +wonder about though: what if more than one duplicate has list-id, and +_index_list_id() gets called multiple times on a message? (CC Austin, he +probably has more clues on this than me.) + +For merging, you should also address the previous comments to the +original patch. There's been plenty of dropping the ball here it +seems... I think we've also agreed (perhaps only on IRC, I forget) that +we should use "listid" as the prefix, not "list" (sadly hyphens are not +allowed). Splitting the patch to code, test, and man parts might be a +good idea too. + +BR, +Jani. + + +> +> diff --git a/lib/database.cc b/lib/database.cc +> index f395061e3a73..196243e15d1a 100644 +> --- a/lib/database.cc +> +++ b/lib/database.cc +> @@ -205,6 +205,7 @@ static prefix_t BOOLEAN_PREFIX_INTERNAL[] =3D { +> }; +>=20=20 +> static prefix_t BOOLEAN_PREFIX_EXTERNAL[] =3D { +> + { "list", "XLIST"}, +> { "thread", "G" }, +> { "tag", "K" }, +> { "is", "K" }, +> @@ -2025,10 +2026,13 @@ notmuch_database_add_message (notmuch_database_t = +*notmuch, +> date =3D notmuch_message_file_get_header (message_file, "date"); +> _notmuch_message_set_header_values (message, date, from, subject); +>=20=20 +> - ret =3D _notmuch_message_index_file (message, filename); +> + ret =3D _notmuch_message_index_file (message, filename, false); +> if (ret) +> goto DONE; +> } else { +> + ret =3D _notmuch_message_index_file (message, filename, true); +> + if (ret) +> + goto DONE; +> ret =3D NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID; +> } +>=20=20 +> diff --git a/lib/index.cc b/lib/index.cc +> index 78c18cf36d10..9fe1ad6502ed 100644 +> --- a/lib/index.cc +> +++ b/lib/index.cc +> @@ -304,6 +304,47 @@ _index_address_list (notmuch_message_t *message, +> } +> } +>=20=20 +> +static void +> +_index_list_id (notmuch_message_t *message, +> + const char *list_id_header) +> +{ +> + const char *begin_list_id, *end_list_id, *list_id; +> + void *local; +> + +> + if (list_id_header =3D=3D NULL) +> + return; +> + +> + /* RFC2919 says that the list-id is found at the end of the header +> + * and enclosed between angle brackets. If we cannot find a +> + * matching pair of brackets containing at least one character, +> + * we ignore the list id header. */ +> + begin_list_id =3D strrchr (list_id_header, '<'); +> + if (!begin_list_id) { +> + fprintf (stderr, "Warning: Not indexing mailformed List-Id tag.\n"); +> + return; +> + } +> + +> + end_list_id =3D strrchr(begin_list_id, '>'); +> + if (!end_list_id || (end_list_id - begin_list_id < 2)) { +> + fprintf (stderr, "Warning: Not indexing mailformed List-Id tag.\n"); +> + return; +> + } +> + +> + local =3D talloc_new (message); +> + +> + /* We extract the list id between the angle brackets */ +> + list_id =3D talloc_strndup (local, begin_list_id + 1, +> + end_list_id - begin_list_id - 1); +> + +> + /* _notmuch_message_add_term() may return +> + * NOTMUCH_PRIVATE_STATUS_TERM_TOO_LONG here. We can't fix it, but +> + * this is not a reason to exit with error... */ +> + if (_notmuch_message_add_term (message, "list", list_id)) +> + fprintf (stderr, "Warning: Not indexing List-Id: <%s>\n", list_id); +> + +> + talloc_free (local); +> +} +> + +> /* Callback to generate terms for each mime part of a message. */ +> static void +> _index_mime_part (notmuch_message_t *message, +> @@ -425,14 +466,15 @@ _index_mime_part (notmuch_message_t *message, +>=20=20 +> notmuch_status_t +> _notmuch_message_index_file (notmuch_message_t *message, +> - const char *filename) +> + const char *filename, +> + notmuch_bool_t duplicate) +> { +> GMimeStream *stream =3D NULL; +> GMimeParser *parser =3D NULL; +> GMimeMessage *mime_message =3D NULL; +> InternetAddressList *addresses; +> FILE *file =3D NULL; +> - const char *from, *subject; +> + const char *from, *subject, *list_id; +> notmuch_status_t ret =3D NOTMUCH_STATUS_SUCCESS; +> static int initialized =3D 0; +> char from_buf[5]; +> @@ -485,6 +527,9 @@ mboxes is deprecated and may be removed in the future= +.\n", filename); +>=20=20 +> from =3D g_mime_message_get_sender (mime_message); +>=20=20 +> + if (duplicate) +> + goto DUP; +> + +> addresses =3D internet_address_list_parse_string (from); +> if (addresses) { +> _index_address_list (message, "from", addresses); +> @@ -502,6 +547,10 @@ mboxes is deprecated and may be removed in the futur= +e.\n", filename); +>=20=20 +> _index_mime_part (message, g_mime_message_get_mime_part (mime_messag= +e)); +>=20=20 +> + DUP: +> + list_id =3D g_mime_object_get_header (GMIME_OBJECT (mime_message), "= +List-Id"); +> + _index_list_id (message, list_id); +> + +> DONE: +> if (mime_message) +> g_object_unref (mime_message); +> diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h +> index af185c7c5ba8..138dfa58efc8 100644 +> --- a/lib/notmuch-private.h +> +++ b/lib/notmuch-private.h +> @@ -322,7 +322,8 @@ notmuch_message_get_author (notmuch_message_t *messag= +e); +>=20=20 +> notmuch_status_t +> _notmuch_message_index_file (notmuch_message_t *message, +> - const char *filename); +> + const char *filename, +> + notmuch_bool_t duplicate); +>=20=20 +> /* message-file.c */ +>=20=20 +> diff --git a/man/man7/notmuch-search-terms.7 b/man/man7/notmuch-search-te= +rms.7 +> index f1627b3488f8..29b30b7b0b00 100644 +> --- a/man/man7/notmuch-search-terms.7 +> +++ b/man/man7/notmuch-search-terms.7 +> @@ -52,6 +52,8 @@ terms to match against specific portions of an email, (= +where +>=20=20 +> thread: +>=20=20 +> + list: +> + +> folder: +>=20=20 +> date:.. +> @@ -109,6 +111,12 @@ within a matching directory. Only the directory comp= +onents below the +> top-level mail database path are available to be searched. +>=20=20 +> The +> +.BR list: , +> +is used to match mailing list ID of an email message \- contents of the +> +List\-Id: header without the '<', '>' delimiters or decoded list +> +description. +> + +> +The +> .B date: +> prefix can be used to restrict the results to only messages within a +> particular time range (based on the Date: header) with a range syntax +> diff --git a/test/corpus/cur/18:2, b/test/corpus/cur/18:2, +> index f522f69eb933..2b54925bd5d1 100644 +> --- a/test/corpus/cur/18:2, +> +++ b/test/corpus/cur/18:2, +> @@ -3,6 +3,7 @@ To: notmuch@notmuchmail.org +> Date: Tue, 17 Nov 2009 18:21:38 -0500 +> Subject: [notmuch] archive +> Message-ID: <20091117232137.GA7669@griffis1.net> +> +List-Id: +>=20=20 +> Just subscribed, I'd like to catch up on the previous postings, +> but the archive link seems to be bogus? +> diff --git a/test/corpus/cur/51:2, b/test/corpus/cur/51:2, +> index f522f69eb933..b155e6ee64a5 100644 +> --- a/test/corpus/cur/51:2, +> +++ b/test/corpus/cur/51:2, +> @@ -3,6 +3,7 @@ To: notmuch@notmuchmail.org +> Date: Tue, 17 Nov 2009 18:21:38 -0500 +> Subject: [notmuch] archive +> Message-ID: <20091117232137.GA7669@griffis1.net> +> +List-Id: +>=20=20 +> Just subscribed, I'd like to catch up on the previous postings, +> but the archive link seems to be bogus? +> diff --git a/test/search b/test/search +> index a7a0b18d2e48..bef42971226c 100755 +> --- a/test/search +> +++ b/test/search +> @@ -129,4 +129,28 @@ add_message '[subject]=3D"utf8-message-body-subject"= +' '[date]=3D"Sat, 01 Jan 2000 12 +> output=3D$(notmuch search "b=C3=B6d=C3=BD" | notmuch_search_sanitize) +> test_expect_equal "$output" "thread:XXX 2000-01-01 [1/1] Notmuch Test = +Suite; utf8-message-body-subject (inbox unread)" +>=20=20 +> +test_begin_subtest "Search by List-Id" +> +notmuch search list:notmuch.notmuchmail.org | notmuch_search_sanitize > = +OUTPUT +> +cat <EXPECTED +> +thread:XXX 2009-11-18 [2/2] Lars Kellogg-Stedman; [notmuch] "notmuch h= +elp" outputs to stderr? (attachment inbox signed unread) +> +thread:XXX 2009-11-18 [4/7] Lars Kellogg-Stedman, Mikhail Gusarov| Kei= +th Packard, Carl Worth; [notmuch] Working with Maildir storage? (inbox sign= +ed unread) +> +thread:XXX 2009-11-18 [1/2] Alex Botero-Lowry| Carl Worth; [notmuch] [= +PATCH] Error out if no query is supplied to search instead of going into an= + infinite loop (attachment inbox unread) +> +thread:XXX 2009-11-17 [1/3] Adrian Perez de Castro| Keith Packard, Car= +l Worth; [notmuch] Introducing myself (inbox signed unread) +> +thread:XXX 2009-11-17 [1/2] Alex Botero-Lowry| Carl Worth; [notmuch] p= +reliminary FreeBSD support (attachment inbox unread) +> +EOF +> +test_expect_equal_file OUTPUT EXPECTED +> + +> +test_begin_subtest "Search by List-Id, duplicated messages, step 1" +> +notmuch search list:test1.example.com | notmuch_search_sanitize > OUTPUT +> +cat <EXPECTED +> +thread:XXX 2009-11-17 [1/3] Aron Griffis| Keith Packard, Carl Worth; [= +notmuch] archive (inbox unread) +> +EOF +> +test_expect_equal_file OUTPUT EXPECTED +> + +> +test_begin_subtest "Search by List-Id, duplicated messages, step 2" +> +notmuch search list:test2.example.com | notmuch_search_sanitize > OUTPUT +> +cat <EXPECTED +> +thread:XXX 2009-11-17 [1/3] Aron Griffis| Keith Packard, Carl Worth; [= +notmuch] archive (inbox unread) +> +EOF +> +test_expect_equal_file OUTPUT EXPECTED +> test_done +> diff --git a/test/test-lib.sh b/test/test-lib.sh +> index d8e0d9115a69..981bde4a4004 100644 +> --- a/test/test-lib.sh +> +++ b/test/test-lib.sh +> @@ -576,9 +576,9 @@ test_expect_equal_json () { +> # The test suite forces LC_ALL=3DC, but this causes Python 3 to +> # decode stdin as ASCII. We need to read JSON in UTF-8, so +> # override Python's stdio encoding defaults. +> - output=3D$(echo "$1" | PYTHONIOENCODING=3Dutf-8 python -mjson.tool \ +> + output=3D$(echo "$1" | PYTHONIOENCODING=3Dutf-8 python2 -mjson.tool \ +> || echo "$1") +> - expected=3D$(echo "$2" | PYTHONIOENCODING=3Dutf-8 python -mjson.tool= + \ +> + expected=3D$(echo "$2" | PYTHONIOENCODING=3Dutf-8 python2 -mjson.too= +l \ +> || echo "$2") +> shift 2 +> test_expect_equal "$output" "$expected" "$@" +> --=20 +> Kirill A. Shutemov