From 78563e494ea65a49c88b7b01c098e3ee3f40e31e Mon Sep 17 00:00:00 2001
From: Jani Nikula <jani@nikula.org>
Date: Wed, 1 Jan 2014 14:04:58 +0200
Subject: [PATCH] Re: [PATCH] lib: Add a new prefix "list" to the search-terms
 syntax

---
 91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c | 394 ++++++++++++++++++++++
 1 file changed, 394 insertions(+)
 create mode 100644 91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c
diff --git a/91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c b/91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c
new file mode 100644
index 000000000..fe9e77fad
--- /dev/null
+++ b/91/f87fd49a2f1c0c4102d9cc516cc59bcce7f29c
@@ -0,0 +1,394 @@
+Return-Path: <jani@nikula.org>
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+	by olra.theworths.org (Postfix) with ESMTP id 5E878431FC2
+	for <notmuch@notmuchmail.org>; Wed,  1 Jan 2014 04:05:14 -0800 (PST)
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
+X-Spam-Flag: NO
+X-Spam-Score: -0.7
+X-Spam-Level: 
+X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5
+	tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled
+Received: from olra.theworths.org ([127.0.0.1])
+	by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
+	with ESMTP id Ig-jlEt3qmPX for <notmuch@notmuchmail.org>;
+	Wed,  1 Jan 2014 04:05:04 -0800 (PST)
+Received: from mail-ea0-f176.google.com (mail-ea0-f176.google.com
+	[209.85.215.176]) (using TLSv1 with cipher RC4-SHA (128/128 bits))
+	(No client certificate requested)
+	by olra.theworths.org (Postfix) with ESMTPS id 88C6F431FC0
+	for <notmuch@notmuchmail.org>; Wed,  1 Jan 2014 04:05:04 -0800 (PST)
+Received: by mail-ea0-f176.google.com with SMTP id h14so5812179eaj.35
+	for <notmuch@notmuchmail.org>; Wed, 01 Jan 2014 04:05:03 -0800 (PST)
+X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
+	d=1e100.net; s=20130820;
+	h=x-gm-message-state:from:to:cc:subject:in-reply-to:references
+	:user-agent:date:message-id:mime-version:content-type
+	:content-transfer-encoding;
+	bh=yfCvxMG62xd9D+0aOngvooYflgtc9V7expqjW04XtAM=;
+	b=lJwLjMZlEUgVIMmUmYca9nz2y/hu8IlcHL0md+cawt99xQdUSFnpgaUzjuSmcWS9Th
+	cNajHWdwuD2bPqg0IWn/9mpA5LfgWNRiYCQSMjPH5/UUY8K7WUsAoMNG1cBuxVwRcSwF
+	k7nNgCx4+v8D0VtTp9Ep5n3KVLPdRCPj9td+9arqtYuce3AXhAIft3g6qpScm+Ea0bCV
+	jTBSzqhU8ctFDPHS7YpKi1xmQ9KU63ze+1QORgM3HyuTEU761nKUj9Pn4pyQ1mtxsbs2
+	hQ+4+lNJKIVghmbnbVqF4rJowBSrJOEnxXi7P2rW/oh1nne+MmNxGI8uYsYTMyKbEaO7
+	vfUg==
+X-Gm-Message-State:
+ ALoCoQnS9e9hvjCtjKKOGiLxrL1Pc9fdvaOc/0pVvg2haWgkI0QC8bFDGdw8M6s+bLbkPmKUJ0Lq
+X-Received: by 10.14.69.200 with SMTP id n48mr10888063eed.54.1388577901865;
+	Wed, 01 Jan 2014 04:05:01 -0800 (PST)
+Received: from localhost (dsl-hkibrasgw2-58c36f-91.dhcp.inet.fi.
+	[88.195.111.91])
+	by mx.google.com with ESMTPSA id m1sm126542611eeg.0.2014.01.01.04.04.59
+	for <multiple recipients>
+	(version=TLSv1.2 cipher=RC4-SHA bits=128/128);
+	Wed, 01 Jan 2014 04:05:00 -0800 (PST)
+From: Jani Nikula <jani@nikula.org>
+To: "Kirill A. Shutemov" <kirill@shutemov.name>
+Subject: Re: [PATCH] lib: Add a new prefix "list" to the search-terms syntax
+In-Reply-To: <20131217180322.GA9272@node.dhcp.inet.fi>
+References: <20130409083010.GA27675@raorn.name>
+	<1365549369-12776-1-git-send-email-raorn@raorn.name>
+	<87bo2ougmb.fsf@nikula.org>
+	<20131217180322.GA9272@node.dhcp.inet.fi>
+User-Agent: Notmuch/0.17~rc2+18~g39a67a6 (http://notmuchmail.org) Emacs/24.3.1
+	(x86_64-pc-linux-gnu)
+Date: Wed, 01 Jan 2014 14:04:58 +0200
+Message-ID: <87y52z29hx.fsf@nikula.org>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Transfer-Encoding: quoted-printable
+Cc: notmuch@notmuchmail.org, "Alexey I. Froloff" <raorn@raorn.name>
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.13
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+	<notmuch.notmuchmail.org>
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
+	<mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>
+List-Post: <mailto:notmuch@notmuchmail.org>
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
+	<mailto:notmuch-request@notmuchmail.org?subject=subscribe>
+X-List-Received-Date: Wed, 01 Jan 2014 12:05:14 -0000
+
+On Tue, 17 Dec 2013, "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
+> On Thu, Oct 17, 2013 at 05:17:00PM +0300, Jani Nikula wrote:
+>> On Wed, 10 Apr 2013, "Alexey I. Froloff" <raorn@raorn.name> wrote:
+>> > From: "Alexey I. Froloff" <raorn@raorn.name>
+>> >
+>> > Add support for indexing and searching the message's List-Id header.
+>> > This is useful when matching all the messages belonging to a particular
+>> > mailing list.
+>>=20
+>> There's an issue with our duplicate message-id handling that is likely
+>> to cause confusion with List-Id: searches. If you receive several
+>> duplicates of the same message (judged by the message-id), only the
+>> first one of them gets indexed, and the rest are ignored. This means
+>> that for messages you receive both directly and through a list, it will
+>> be arbitrary whether the List-Id: gets indexed or not. Therefore a list:
+>> search might not return all the messages you'd expect.
+>
+> I've tried to address this. The patch also adds few tests for the feature.
+>
+> There's still missing functionality: re-indexing existing messages for
+> list-id, handling message removal, etc.
+>
+> Any comments?
+
+Hi Kirill, sorry it took me so long to get to this!
+
+I've looked into our duplicate message-id handling and indexing before,
+and it's not very good.
+
+First, we should pay more attention to checking whether the messages
+really are duplicates or not. This is not trivial, but we should go a
+bit further than just comparing the message-ids. Sadly, handling the
+case of colliding message-ids on clearly different messages is not
+trivial either, as we rely on the message-ids being unique all around.
+
+Second, we should be more clever about indexing duplicates that we think
+are the same message. This is orthogonal to the first point. Currently,
+only the first duplicate gets indexed, and will remain indexed even if
+it's deleted and other copies remain. A message that matches a search
+might end up not having the matching search terms, for example. A
+rebuild of the database might index a different duplicate from the last
+time.
+
+Having said that (partially just to write the thoughts down somewhere!),
+I think your basic approach of indexing the list-id for duplicates is
+sane, and we can grow more smarts to _notmuch_message_index_file() for
+duplicate =3D=3D true in the future, checking more headers etc. One thing I
+wonder about though: what if more than one duplicate has list-id, and
+_index_list_id() gets called multiple times on a message? (CC Austin, he
+probably has more clues on this than me.)
+
+For merging, you should also address the previous comments to the
+original patch. There's been plenty of dropping the ball here it
+seems... I think we've also agreed (perhaps only on IRC, I forget) that
+we should use "listid" as the prefix, not "list" (sadly hyphens are not
+allowed). Splitting the patch to code, test, and man parts might be a
+good idea too.
+
+BR,
+Jani.
+
+
+>
+> diff --git a/lib/database.cc b/lib/database.cc
+> index f395061e3a73..196243e15d1a 100644
+> --- a/lib/database.cc
+> +++ b/lib/database.cc
+> @@ -205,6 +205,7 @@ static prefix_t BOOLEAN_PREFIX_INTERNAL[] =3D {
+>  };
+>=20=20
+>  static prefix_t BOOLEAN_PREFIX_EXTERNAL[] =3D {
+> +    { "list",			"XLIST"},
+>      { "thread",			"G" },
+>      { "tag",			"K" },
+>      { "is",			"K" },
+> @@ -2025,10 +2026,13 @@ notmuch_database_add_message (notmuch_database_t =
+*notmuch,
+>  	    date =3D notmuch_message_file_get_header (message_file, "date");
+>  	    _notmuch_message_set_header_values (message, date, from, subject);
+>=20=20
+> -	    ret =3D _notmuch_message_index_file (message, filename);
+> +	    ret =3D _notmuch_message_index_file (message, filename, false);
+>  	    if (ret)
+>  		goto DONE;
+>  	} else {
+> +	    ret =3D _notmuch_message_index_file (message, filename, true);
+> +	    if (ret)
+> +		goto DONE;
+>  	    ret =3D NOTMUCH_STATUS_DUPLICATE_MESSAGE_ID;
+>  	}
+>=20=20
+> diff --git a/lib/index.cc b/lib/index.cc
+> index 78c18cf36d10..9fe1ad6502ed 100644
+> --- a/lib/index.cc
+> +++ b/lib/index.cc
+> @@ -304,6 +304,47 @@ _index_address_list (notmuch_message_t *message,
+>      }
+>  }
+>=20=20
+> +static void
+> +_index_list_id (notmuch_message_t *message,
+> +               const char *list_id_header)
+> +{
+> +    const char *begin_list_id, *end_list_id, *list_id;
+> +    void *local;
+> +
+> +    if (list_id_header =3D=3D NULL)
+> +	return;
+> +
+> +    /* RFC2919 says that the list-id is found at the end of the header
+> +     * and enclosed between angle brackets. If we cannot find a
+> +     * matching pair of brackets containing at least one character,
+> +     * we ignore the list id header. */
+> +    begin_list_id =3D strrchr (list_id_header, '<');
+> +    if (!begin_list_id) {
+> +	fprintf (stderr, "Warning: Not indexing mailformed List-Id tag.\n");
+> +	return;
+> +    }
+> +
+> +    end_list_id =3D strrchr(begin_list_id, '>');
+> +    if (!end_list_id || (end_list_id - begin_list_id < 2)) {
+> +	fprintf (stderr, "Warning: Not indexing mailformed List-Id tag.\n");
+> +	return;
+> +    }
+> +
+> +    local =3D talloc_new (message);
+> +
+> +    /* We extract the list id between the angle brackets */
+> +    list_id =3D talloc_strndup (local, begin_list_id + 1,
+> +			      end_list_id - begin_list_id - 1);
+> +
+> +    /* _notmuch_message_add_term() may return
+> +     * NOTMUCH_PRIVATE_STATUS_TERM_TOO_LONG here.  We can't fix it, but
+> +     * this is not a reason to exit with error... */
+> +    if (_notmuch_message_add_term (message, "list", list_id))
+> +	fprintf (stderr, "Warning: Not indexing List-Id: <%s>\n", list_id);
+> +
+> +    talloc_free (local);
+> +}
+> +
+>  /* Callback to generate terms for each mime part of a message. */
+>  static void
+>  _index_mime_part (notmuch_message_t *message,
+> @@ -425,14 +466,15 @@ _index_mime_part (notmuch_message_t *message,
+>=20=20
+>  notmuch_status_t
+>  _notmuch_message_index_file (notmuch_message_t *message,
+> -			     const char *filename)
+> +			     const char *filename,
+> +			     notmuch_bool_t duplicate)
+>  {
+>      GMimeStream *stream =3D NULL;
+>      GMimeParser *parser =3D NULL;
+>      GMimeMessage *mime_message =3D NULL;
+>      InternetAddressList *addresses;
+>      FILE *file =3D NULL;
+> -    const char *from, *subject;
+> +    const char *from, *subject, *list_id;
+>      notmuch_status_t ret =3D NOTMUCH_STATUS_SUCCESS;
+>      static int initialized =3D 0;
+>      char from_buf[5];
+> @@ -485,6 +527,9 @@ mboxes is deprecated and may be removed in the future=
+.\n", filename);
+>=20=20
+>      from =3D g_mime_message_get_sender (mime_message);
+>=20=20
+> +    if (duplicate)
+> +	goto DUP;
+> +
+>      addresses =3D internet_address_list_parse_string (from);
+>      if (addresses) {
+>  	_index_address_list (message, "from", addresses);
+> @@ -502,6 +547,10 @@ mboxes is deprecated and may be removed in the futur=
+e.\n", filename);
+>=20=20
+>      _index_mime_part (message, g_mime_message_get_mime_part (mime_messag=
+e));
+>=20=20
+> +  DUP:
+> +    list_id =3D g_mime_object_get_header (GMIME_OBJECT (mime_message), "=
+List-Id");
+> +    _index_list_id (message, list_id);
+> +
+>    DONE:
+>      if (mime_message)
+>  	g_object_unref (mime_message);
+> diff --git a/lib/notmuch-private.h b/lib/notmuch-private.h
+> index af185c7c5ba8..138dfa58efc8 100644
+> --- a/lib/notmuch-private.h
+> +++ b/lib/notmuch-private.h
+> @@ -322,7 +322,8 @@ notmuch_message_get_author (notmuch_message_t *messag=
+e);
+>=20=20
+>  notmuch_status_t
+>  _notmuch_message_index_file (notmuch_message_t *message,
+> -			     const char *filename);
+> +			     const char *filename,
+> +			     notmuch_bool_t duplicate);
+>=20=20
+>  /* message-file.c */
+>=20=20
+> diff --git a/man/man7/notmuch-search-terms.7 b/man/man7/notmuch-search-te=
+rms.7
+> index f1627b3488f8..29b30b7b0b00 100644
+> --- a/man/man7/notmuch-search-terms.7
+> +++ b/man/man7/notmuch-search-terms.7
+> @@ -52,6 +52,8 @@ terms to match against specific portions of an email, (=
+where
+>=20=20
+>  	thread:<thread-id>
+>=20=20
+> +	list:<list-id>
+> +
+>  	folder:<directory-path>
+>=20=20
+>  	date:<since>..<until>
+> @@ -109,6 +111,12 @@ within a matching directory. Only the directory comp=
+onents below the
+>  top-level mail database path are available to be searched.
+>=20=20
+>  The
+> +.BR list: ,
+> +is used to match mailing list ID of an email message \- contents of the
+> +List\-Id: header without the '<', '>' delimiters or decoded list
+> +description.
+> +
+> +The
+>  .B date:
+>  prefix can be used to restrict the results to only messages within a
+>  particular time range (based on the Date: header) with a range syntax
+> diff --git a/test/corpus/cur/18:2, b/test/corpus/cur/18:2,
+> index f522f69eb933..2b54925bd5d1 100644
+> --- a/test/corpus/cur/18:2,
+> +++ b/test/corpus/cur/18:2,
+> @@ -3,6 +3,7 @@ To: notmuch@notmuchmail.org
+>  Date: Tue, 17 Nov 2009 18:21:38 -0500
+>  Subject: [notmuch] archive
+>  Message-ID: <20091117232137.GA7669@griffis1.net>
+> +List-Id: <test1.example.com>
+>=20=20
+>  Just subscribed, I'd like to catch up on the previous postings,
+>  but the archive link seems to be bogus?
+> diff --git a/test/corpus/cur/51:2, b/test/corpus/cur/51:2,
+> index f522f69eb933..b155e6ee64a5 100644
+> --- a/test/corpus/cur/51:2,
+> +++ b/test/corpus/cur/51:2,
+> @@ -3,6 +3,7 @@ To: notmuch@notmuchmail.org
+>  Date: Tue, 17 Nov 2009 18:21:38 -0500
+>  Subject: [notmuch] archive
+>  Message-ID: <20091117232137.GA7669@griffis1.net>
+> +List-Id: <test2.example.com>
+>=20=20
+>  Just subscribed, I'd like to catch up on the previous postings,
+>  but the archive link seems to be bogus?
+> diff --git a/test/search b/test/search
+> index a7a0b18d2e48..bef42971226c 100755
+> --- a/test/search
+> +++ b/test/search
+> @@ -129,4 +129,28 @@ add_message '[subject]=3D"utf8-message-body-subject"=
+' '[date]=3D"Sat, 01 Jan 2000 12
+>  output=3D$(notmuch search "b=C3=B6d=C3=BD" | notmuch_search_sanitize)
+>  test_expect_equal "$output" "thread:XXX   2000-01-01 [1/1] Notmuch Test =
+Suite; utf8-message-body-subject (inbox unread)"
+>=20=20
+> +test_begin_subtest "Search by List-Id"
+> +notmuch search list:notmuch.notmuchmail.org | notmuch_search_sanitize > =
+OUTPUT
+> +cat <<EOF >EXPECTED
+> +thread:XXX   2009-11-18 [2/2] Lars Kellogg-Stedman; [notmuch] "notmuch h=
+elp" outputs to stderr? (attachment inbox signed unread)
+> +thread:XXX   2009-11-18 [4/7] Lars Kellogg-Stedman, Mikhail Gusarov| Kei=
+th Packard, Carl Worth; [notmuch] Working with Maildir storage? (inbox sign=
+ed unread)
+> +thread:XXX   2009-11-18 [1/2] Alex Botero-Lowry| Carl Worth; [notmuch] [=
+PATCH] Error out if no query is supplied to search instead of going into an=
+ infinite loop (attachment inbox unread)
+> +thread:XXX   2009-11-17 [1/3] Adrian Perez de Castro| Keith Packard, Car=
+l Worth; [notmuch] Introducing myself (inbox signed unread)
+> +thread:XXX   2009-11-17 [1/2] Alex Botero-Lowry| Carl Worth; [notmuch] p=
+reliminary FreeBSD support (attachment inbox unread)
+> +EOF
+> +test_expect_equal_file OUTPUT EXPECTED
+> +
+> +test_begin_subtest "Search by List-Id, duplicated messages, step 1"
+> +notmuch search list:test1.example.com | notmuch_search_sanitize > OUTPUT
+> +cat <<EOF >EXPECTED
+> +thread:XXX   2009-11-17 [1/3] Aron Griffis| Keith Packard, Carl Worth; [=
+notmuch] archive (inbox unread)
+> +EOF
+> +test_expect_equal_file OUTPUT EXPECTED
+> +
+> +test_begin_subtest "Search by List-Id, duplicated messages, step 2"
+> +notmuch search list:test2.example.com | notmuch_search_sanitize > OUTPUT
+> +cat <<EOF >EXPECTED
+> +thread:XXX   2009-11-17 [1/3] Aron Griffis| Keith Packard, Carl Worth; [=
+notmuch] archive (inbox unread)
+> +EOF
+> +test_expect_equal_file OUTPUT EXPECTED
+>  test_done
+> diff --git a/test/test-lib.sh b/test/test-lib.sh
+> index d8e0d9115a69..981bde4a4004 100644
+> --- a/test/test-lib.sh
+> +++ b/test/test-lib.sh
+> @@ -576,9 +576,9 @@ test_expect_equal_json () {
+>      # The test suite forces LC_ALL=3DC, but this causes Python 3 to
+>      # decode stdin as ASCII.  We need to read JSON in UTF-8, so
+>      # override Python's stdio encoding defaults.
+> -    output=3D$(echo "$1" | PYTHONIOENCODING=3Dutf-8 python -mjson.tool \
+> +    output=3D$(echo "$1" | PYTHONIOENCODING=3Dutf-8 python2 -mjson.tool \
+>          || echo "$1")
+> -    expected=3D$(echo "$2" | PYTHONIOENCODING=3Dutf-8 python -mjson.tool=
+ \
+> +    expected=3D$(echo "$2" | PYTHONIOENCODING=3Dutf-8 python2 -mjson.too=
+l \
+>          || echo "$2")
+>      shift 2
+>      test_expect_equal "$output" "$expected" "$@"
+> --=20
+>  Kirill A. Shutemov
-- 
2.26.2