From: Jani Nikula Date: Thu, 28 Feb 2013 19:41:16 +0000 (+0200) Subject: Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=170b4623ba2cac4f22e722ac8b66fd2819fb5854;p=notmuch-archives.git Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated --- diff --git a/46/e0a2727858db26f0acdf9e541061c3e8de15b8 b/46/e0a2727858db26f0acdf9e541061c3e8de15b8 new file mode 100644 index 000000000..8609ae4d6 --- /dev/null +++ b/46/e0a2727858db26f0acdf9e541061c3e8de15b8 @@ -0,0 +1,241 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by olra.theworths.org (Postfix) with ESMTP id D7467431FBC + for ; Thu, 28 Feb 2013 11:41:21 -0800 (PST) +X-Virus-Scanned: Debian amavisd-new at olra.theworths.org +X-Spam-Flag: NO +X-Spam-Score: -0.7 +X-Spam-Level: +X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5 + tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled +Received: from olra.theworths.org ([127.0.0.1]) + by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id CVWGnYYK4NWU for ; + Thu, 28 Feb 2013 11:41:19 -0800 (PST) +Received: from mail-la0-f41.google.com (mail-la0-f41.google.com + [209.85.215.41]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) + (No client certificate requested) + by olra.theworths.org (Postfix) with ESMTPS id DF529431FAF + for ; Thu, 28 Feb 2013 11:41:18 -0800 (PST) +Received: by mail-la0-f41.google.com with SMTP id fo12so2188940lab.28 + for ; Thu, 28 Feb 2013 11:41:16 -0800 (PST) +X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; + d=google.com; s=20120113; + h=x-received:from:to:subject:in-reply-to:references:user-agent:date + :message-id:mime-version:content-type:x-gm-message-state; + bh=k9EuZu6+hVJPIbkf59XdmyV9Y+qxB8aX3oOJkHI8feQ=; + b=VWH/+NmnDNbbC/KNLHlxOW2KRyxqsbQyDuZFiQSu7AZFXF0/ItApTtMFNw/C3ZpoBd + bkoutZiQB6y+a5FXJnfCUIHi6nzvZJBQaYZtvoOewEwtS20ssMgNA8pLYmoQr7JuVPOy + j8mjrcOpBb3tauD19VTYAKM78ceaN2lfG4Gf7clmgO7RazGpyZD2kCdoOd2sTr+IwdPA + Bfolh7SpgyGycz/heqCn8CEw2/RQCJT5vv+9Gy0FlwMCTLThxXWMJ1WRRbwOIt3iRGuV + 9b8fqUfJvnNwM+i4SZ3uj489jAS7W6WsAAeTUqZK/948l1x9VoPIVuZraK5dv+/ltAiI + H2JQ== +X-Received: by 10.112.13.200 with SMTP id j8mr4057894lbc.68.1362080475852; + Thu, 28 Feb 2013 11:41:15 -0800 (PST) +Received: from localhost (dsl-hkibrasgw4-50df51-27.dhcp.inet.fi. + [80.223.81.27]) + by mx.google.com with ESMTPS id fm8sm3283414lbb.17.2013.02.28.11.41.13 + (version=TLSv1.2 cipher=RC4-SHA bits=128/128); + Thu, 28 Feb 2013 11:41:14 -0800 (PST) +From: Jani Nikula +To: Aaron Ecay , notmuch@notmuchmail.org +Subject: Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message + is calculated +In-Reply-To: <1361836225-17279-1-git-send-email-aaronecay@gmail.com> +References: <1361836225-17279-1-git-send-email-aaronecay@gmail.com> +User-Agent: Notmuch/0.15.2+33~g98253a3 (http://notmuchmail.org) Emacs/24.2.1 + (x86_64-pc-linux-gnu) +Date: Thu, 28 Feb 2013 21:41:16 +0200 +Message-ID: <87621cteeb.fsf@nikula.org> +MIME-Version: 1.0 +Content-Type: text/plain +X-Gm-Message-State: + ALoCoQmRb+br0tqWNRdxc1RyXji58iziwxTGZotvTWLSVA5AvRzDUu7LRxiZyORYP7S9dIcpuQDM +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.13 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Thu, 28 Feb 2013 19:41:22 -0000 + + +Hi Aaron - + +On Tue, 26 Feb 2013, Aaron Ecay wrote: +> Presently, the code which finds the parent of a message as it is being +> added to the database assumes that the first Message-ID-like substring +> of the In-Reply-To header is the parent Message ID. Some mail clients, +> however, put stuff other than the Message-ID of the parent in the +> In-Reply-To header, such as the email address of the sender of the +> parent. This can fool notmuch. + +I think the background is that RFC 822 defines In-Reply-To (and +References too for that matter) as *(phrase / msg-id), while RFC 2822 +defines them as 1*msg-id. I'd like something about RFC 822 being +mentioned in the commit message. + +The problem in the gmane message you link to in +id:87liaa3luc.fsf@gmail.com is likely related to the FAQ item 05.26 "How +do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ +http://www.newt.com/faq/mh.html. + +> The updated algorithm prefers the last Message ID in the References +> header. The References header lists messages oldest-first, so the last +> Message ID is the parent (RFC2822, p. 24). The References header is +> also less likely to be in a non-standard +> syntax (http://cr.yp.to/immhf/thread.html, +> http://www.jwz.org/doc/threading.html). In case the References header +> is not to be found, fall back to the old behavior. +> --- +> +> I especially notice this problem on public mailing lists, where +> certain people's messages always cause an "out-dent" of the threading, +> instead of being nested under whichever message they are replies to. +> +> Technically, putting non-Message-ID crud in the In-Reply-To field is a +> violation of RFC2822, but it appears that in practice the References +> header is respected more often than the In-Reply-To one. +> +> lib/database.cc | 30 ++++++++++++++++++++++-------- +> 1 file changed, 22 insertions(+), 8 deletions(-) +> +> diff --git a/lib/database.cc b/lib/database.cc +> index 91d4329..cbf33ae 100644 +> --- a/lib/database.cc +> +++ b/lib/database.cc +> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next) +> * 'message_id' in the result (to avoid mass confusion when a single +> * message references itself cyclically---and yes, mail messages are +> * not infrequent in the wild that do this---don't ask me why). +> + * +> + * Return the last reference parsed. +> */ +> -static void +> +static char * +> parse_references (void *ctx, +> const char *message_id, +> GHashTable *hash, +> @@ -511,7 +513,7 @@ parse_references (void *ctx, +> char *ref; +> +> if (refs == NULL || *refs == '\0') +> - return; +> + return NULL; +> +> while (*refs) { +> ref = _parse_message_id (ctx, refs, &refs); +> @@ -519,6 +521,8 @@ parse_references (void *ctx, +> if (ref && strcmp (ref, message_id)) +> g_hash_table_insert (hash, ref, NULL); +> } +> + +> + return ref; + +As the comment for the function says, we explicitly avoid including +self-references. I think I'd err on the safe side and return NULL if the +last ref equals message-id. + +> } +> +> notmuch_status_t +> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch) +> notmuch->last_doc_id++; +> +> if (notmuch->last_doc_id == 0) +> - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n"); +> + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n"); + +I don't know how you got this non-change hunk here, but please remove +it. :) + +> +> return notmuch->last_doc_id; +> } +> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch, +> const char **thread_id) +> { +> GHashTable *parents = NULL; +> - const char *refs, *in_reply_to, *in_reply_to_message_id; +> + const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id; +> GList *l, *keys = NULL; +> notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS; +> +> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch, +> _my_talloc_free_for_g_hash, NULL); +> +> refs = notmuch_message_file_get_header (message_file, "references"); +> - parse_references (message, notmuch_message_get_message_id (message), +> - parents, refs); +> + last_ref_message_id = parse_references (message, +> + notmuch_message_get_message_id (message), +> + parents, refs); +> +> in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to"); +> parse_references (message, notmuch_message_get_message_id (message), +> parents, in_reply_to); +> +> - /* Carefully avoid adding any self-referential in-reply-to term. */ +> in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL); + +I wonder if you should reuse your parse_references() change here, so +you'd set in_reply_to_message_id to the last message-id in +In-Reply-To. This might tackle some of the problematic cases directly, +but should still be all right per RFC 2822. I didn't verify how the +parser handles an RFC 2822 violating free form header though. + +> + /* If the parent message ID from the Reply-To and References +> + * headers are different, use the References one. This is because +> + * the Reply-To header is more likely to be in an non-standard +> + * format. */ +> + if (in_reply_to_message_id && +> + last_ref_message_id && +> + strcmp (last_ref_message_id, in_reply_to_message_id)) { +> + in_reply_to_message_id = last_ref_message_id; +> + } + +I suggest adding an else if branch (or revamp the above if condition) to +tackle the missing In-Reply-To header: + + else if (!in_reply_to_message_id && last_ref_message_id) { + in_reply_to_message_id = last_ref_message_id; + } + +> + /* Carefully avoid adding any self-referential in-reply-to term. */ +> if (in_reply_to_message_id && +> strcmp (in_reply_to_message_id, +> notmuch_message_get_message_id (message))) + +If you change parse_references() to be careful about never returning a +self-reference, and set in_reply_to_message_id from there, I think you +can drop the strcmp here. And move the comment to an appropriate place. + +Thanks for the patch, I think we should do this. But this is an area +where I think we need to be careful, so another reviewer wouldn't +harm. Some tests for this would be good too, obviously. + + +BR, +Jani. + +> { +> _notmuch_message_add_term (message, "replyto", +> - _parse_message_id (message, in_reply_to, NULL)); +> + in_reply_to_message_id); +> } +> +> keys = g_hash_table_get_keys (parents); +> -- +> 1.8.1.4 +> _______________________________________________ +> notmuch mailing list +> notmuch@notmuchmail.org +> http://notmuchmail.org/mailman/listinfo/notmuch