From: Aaron Ecay
To: notmuch@notmuchmail.org
Subject: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated
Date: Mon, 25 Feb 2013 18:50:25 -0500
Message-Id: <1361836225-17279-1-git-send-email-aaronecay@gmail.com>
X-Mailer: git-send-email 1.8.1.4

Presently, the code which finds the parent of a message as it is being
added to the database assumes that the first Message-ID-like substring
of the In-Reply-To header is the parent Message ID.  Some mail clients,
however, put stuff other than the Message-ID of the parent in the
In-Reply-To header, such as the email address of the sender of the
parent.  This can fool notmuch.

The updated algorithm prefers the last Message ID in the References
header.  The References header lists messages oldest-first, so the last
Message ID is the parent (RFC2822, p. 24).  The References header is
also less likely to be in a non-standard syntax
(http://cr.yp.to/immhf/thread.html,
http://www.jwz.org/doc/threading.html).  In case the References header
is not to be found, fall back to the old behavior.
---
I especially notice this problem on public mailing lists, where certain
people's messages always cause an "out-dent" of the threading, instead
of being nested under whichever message they are replies to.

Technically, putting non-Message-ID crud in the In-Reply-To field is a
violation of RFC2822, but it appears that in practice the References
header is respected more often than the In-Reply-To one.
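To make the intended selection rule concrete, here is a minimal C++
sketch of it.  It is not the notmuch code: extract_message_ids and
choose_parent are hypothetical names, and the sketch only recognizes IDs
written as <...>, a simplification that may differ from what
_parse_message_id does.  It follows the description above (last
References ID first, first In-Reply-To ID as fallback); note that the
patch as posted only substitutes the References ID when In-Reply-To
also yields an ID.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of notmuch): collect every "<...>" token
// found in a header value, in order of appearance, without the brackets.
std::vector<std::string> extract_message_ids(const std::string& header) {
    std::vector<std::string> ids;
    std::string::size_type pos = 0;
    while ((pos = header.find('<', pos)) != std::string::npos) {
        const std::string::size_type end = header.find('>', pos + 1);
        if (end == std::string::npos)
            break;  // unterminated token: ignore the rest of the header
        ids.push_back(header.substr(pos + 1, end - pos - 1));
        pos = end + 1;
    }
    return ids;
}

// Hypothetical sketch of the rule described in the commit message.
// Returns the chosen parent ID, or an empty string if there is none.
std::string choose_parent(const std::string& references,
                          const std::string& in_reply_to,
                          const std::string& own_id) {
    const std::vector<std::string> refs = extract_message_ids(references);
    const std::vector<std::string> irt = extract_message_ids(in_reply_to);

    std::string parent;
    if (!refs.empty())
        parent = refs.back();   // References is oldest-first: last one wins
    else if (!irt.empty())
        parent = irt.front();   // old behavior: first ID-like substring

    // Never report a message as its own parent.
    if (parent == own_id)
        return std::string();
    return parent;
}
```

With References "<m1@example.org> <m2@example.org>" and an In-Reply-To of
"Alice <alice@example.com> <m2@example.org>", this picks m2@example.org,
whereas taking the first ID-like substring of In-Reply-To would pick the
sender's address.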
 lib/database.cc | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/lib/database.cc b/lib/database.cc
index 91d4329..cbf33ae 100644
--- a/lib/database.cc
+++ b/lib/database.cc
@@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next)
  * 'message_id' in the result (to avoid mass confusion when a single
  * message references itself cyclically---and yes, mail messages are
  * not infrequent in the wild that do this---don't ask me why).
+ *
+ * Return the last reference parsed.
  */
-static void
+static char *
 parse_references (void *ctx,
                   const char *message_id,
                   GHashTable *hash,
@@ -511,7 +513,7 @@ parse_references (void *ctx,
     char *ref;
 
     if (refs == NULL || *refs == '\0')
-        return;
+        return NULL;
 
     while (*refs) {
         ref = _parse_message_id (ctx, refs, &refs);
@@ -519,6 +521,8 @@ parse_references (void *ctx,
         if (ref && strcmp (ref, message_id))
             g_hash_table_insert (hash, ref, NULL);
     }
+
+    return ref;
 }
 
 notmuch_status_t
@@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch)
     notmuch->last_doc_id++;
 
     if (notmuch->last_doc_id == 0)
-        INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
+        INTERNAL_ERROR ("Xapian document IDs are exhausted.\n");
 
     return notmuch->last_doc_id;
 }
@@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
                                            const char **thread_id)
 {
     GHashTable *parents = NULL;
-    const char *refs, *in_reply_to, *in_reply_to_message_id;
+    const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id;
     GList *l, *keys = NULL;
     notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS;
 
@@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch,
                                      _my_talloc_free_for_g_hash, NULL);
 
     refs = notmuch_message_file_get_header (message_file, "references");
-    parse_references (message, notmuch_message_get_message_id (message),
-                      parents, refs);
+    last_ref_message_id = parse_references (message,
+                                            notmuch_message_get_message_id (message),
+                                            parents, refs);
 
     in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to");
     parse_references (message, notmuch_message_get_message_id (message),
                       parents, in_reply_to);
 
-    /* Carefully avoid adding any self-referential in-reply-to term. */
     in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL);
+    /* If the parent message IDs from the In-Reply-To and References
+     * headers are different, use the References one.  This is because
+     * the In-Reply-To header is more likely to be in a non-standard
+     * format. */
+    if (in_reply_to_message_id &&
+        last_ref_message_id &&
+        strcmp (last_ref_message_id, in_reply_to_message_id)) {
+        in_reply_to_message_id = last_ref_message_id;
+    }
+    /* Carefully avoid adding any self-referential in-reply-to term. */
     if (in_reply_to_message_id &&
         strcmp (in_reply_to_message_id,
                 notmuch_message_get_message_id (message))) {
         _notmuch_message_add_term (message, "replyto",
-                     _parse_message_id (message, in_reply_to, NULL));
+                     in_reply_to_message_id);
     }
 
     keys = g_hash_table_get_keys (parents);
-- 
1.8.1.4