Return-Path: X-Original-To: notmuch@notmuchmail.org Delivered-To: notmuch@notmuchmail.org Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id AB920431FAF for ; Fri, 1 Mar 2013 09:06:34 -0800 (PST) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -0.7 X-Spam-Level: X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5 tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 1CAHXDA1aJ1Y for ; Fri, 1 Mar 2013 09:06:31 -0800 (PST) Received: from mail-la0-f46.google.com (mail-la0-f46.google.com [209.85.215.46]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id B9EFD431FAE for ; Fri, 1 Mar 2013 09:06:30 -0800 (PST) Received: by mail-la0-f46.google.com with SMTP id fq12so3176340lab.33 for ; Fri, 01 Mar 2013 09:06:29 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:from:to:cc:subject:in-reply-to:references:user-agent :date:message-id:mime-version:content-type:x-gm-message-state; bh=bq1QZUQOeVGm/fkLALOV81TtqSt+khqkR35uBp40gmA=; b=dZ/EjFDPTyTbVj+yqasnGu32v5tkmiUvQ49MQjMeiDIS9zelEwlZQfTQIqd1j5Jl4v CTvvRAKyyJUZp5tnG6Ri4Oqw8wEtfNO/+P90wTDhkGVOSje/BujhGC+zuBgzl3txVzE2 fLfsBlFjAa56H2WQn+YiSP5NMMI1bSRtLBf1WKA+6aYmniNNk9dojPyZxFcfgrMdYluV HnBwDh8mi1wNVUnuiL550fix0zZcxofrWu56cW/oe6F5IUXgGXLltyOxCrrXNz3pf9XG 6ZEB2xa1qJWM+ohqgWqtqV2n3vzF0GVGneo8ksp34WwagiJ5qEGjIjDeab5i5fKnyy49 TQEg== X-Received: by 10.112.36.2 with SMTP id m2mr1005714lbj.100.1362157589105; Fri, 01 Mar 2013 09:06:29 -0800 (PST) Received: from localhost (dsl-hkibrasgw4-50df51-27.dhcp.inet.fi. [80.223.81.27]) by mx.google.com with ESMTPS id b13sm4314722lbd.10.2013.03.01.09.06.27 (version=TLSv1.2 cipher=RC4-SHA bits=128/128); Fri, 01 Mar 2013 09:06:28 -0800 (PST) From: Jani Nikula To: Aaron Ecay , notmuch@notmuchmail.org Subject: Re: [RFC] [PATCH] lib/database.cc: change how the parent of a message is calculated In-Reply-To: <87621cteeb.fsf@nikula.org> References: <1361836225-17279-1-git-send-email-aaronecay@gmail.com> <87621cteeb.fsf@nikula.org> User-Agent: Notmuch/0.15.2+33~g98253a3 (http://notmuchmail.org) Emacs/24.2.1 (x86_64-pc-linux-gnu) Date: Fri, 01 Mar 2013 19:06:28 +0200 Message-ID: <871ubzt5gr.fsf@nikula.org> MIME-Version: 1.0 Content-Type: text/plain X-Gm-Message-State: ALoCoQkhrm09jETl70ZXzygC8E0PKHUI9zJQ0/JA3/vZ22tzJZpsbeqdu5GU6lhFJhSz2WhoWuNB X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 01 Mar 2013 17:06:34 -0000 On Thu, 28 Feb 2013, Jani Nikula wrote: > Hi Aaron - > > On Tue, 26 Feb 2013, Aaron Ecay wrote: >> Presently, the code which finds the parent of a message as it is being >> added to the database assumes that the first Message-ID-like substring >> of the In-Reply-To header is the parent Message ID. Some mail clients, >> however, put stuff other than the Message-ID of the parent in the >> In-Reply-To header, such as the email address of the sender of the >> parent. This can fool notmuch. > > I think the background is that RFC 822 defines In-Reply-To (and > References too for that matter) as *(phrase / msg-id), while RFC 2822 > defines them as 1*msg-id. I'd like something about RFC 822 being > mentioned in the commit message. > > The problem in the gmane message you link to in > id:87liaa3luc.fsf@gmail.com is likely related to the FAQ item 05.26 "How > do I fix a bogus In-Reply-To or missing References field?" in the MH FAQ > http://www.newt.com/faq/mh.html. > >> The updated algorithm prefers the last Message ID in the References >> header. The References header lists messages oldest-first, so the last >> Message ID is the parent (RFC2822, p. 24). The References header is >> also less likely to be in a non-standard >> syntax (http://cr.yp.to/immhf/thread.html, >> http://www.jwz.org/doc/threading.html). In case the References header >> is not to be found, fall back to the old behavior. >> --- >> >> I especially notice this problem on public mailing lists, where >> certain people's messages always cause an "out-dent" of the threading, >> instead of being nested under whichever message they are replies to. >> >> Technically, putting non-Message-ID crud in the In-Reply-To field is a >> violation of RFC2822, but it appears that in practice the References >> header is respected more often than the In-Reply-To one. >> >> lib/database.cc | 30 ++++++++++++++++++++++-------- >> 1 file changed, 22 insertions(+), 8 deletions(-) >> >> diff --git a/lib/database.cc b/lib/database.cc >> index 91d4329..cbf33ae 100644 >> --- a/lib/database.cc >> +++ b/lib/database.cc >> @@ -501,8 +501,10 @@ _parse_message_id (void *ctx, const char *message_id, const char **next) >> * 'message_id' in the result (to avoid mass confusion when a single >> * message references itself cyclically---and yes, mail messages are >> * not infrequent in the wild that do this---don't ask me why). >> + * >> + * Return the last reference parsed. >> */ >> -static void >> +static char * >> parse_references (void *ctx, >> const char *message_id, >> GHashTable *hash, >> @@ -511,7 +513,7 @@ parse_references (void *ctx, >> char *ref; >> >> if (refs == NULL || *refs == '\0') >> - return; >> + return NULL; >> >> while (*refs) { >> ref = _parse_message_id (ctx, refs, &refs); >> @@ -519,6 +521,8 @@ parse_references (void *ctx, >> if (ref && strcmp (ref, message_id)) >> g_hash_table_insert (hash, ref, NULL); >> } >> + >> + return ref; > > As the comment for the function says, we explicitly avoid including > self-references. I think I'd err on the safe side and return NULL if the > last ref equals message-id. > >> } >> >> notmuch_status_t >> @@ -1365,7 +1369,7 @@ _notmuch_database_generate_doc_id (notmuch_database_t *notmuch) >> notmuch->last_doc_id++; >> >> if (notmuch->last_doc_id == 0) >> - INTERNAL_ERROR ("Xapian document IDs are exhausted.\n"); >> + INTERNAL_ERROR ("Xapian document IDs are exhausted.\n"); > > I don't know how you got this non-change hunk here, but please remove > it. :) > >> >> return notmuch->last_doc_id; >> } >> @@ -1509,7 +1513,7 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch, >> const char **thread_id) >> { >> GHashTable *parents = NULL; >> - const char *refs, *in_reply_to, *in_reply_to_message_id; >> + const char *refs, *in_reply_to, *in_reply_to_message_id, *last_ref_message_id; >> GList *l, *keys = NULL; >> notmuch_status_t ret = NOTMUCH_STATUS_SUCCESS; >> >> @@ -1517,21 +1521,31 @@ _notmuch_database_link_message_to_parents (notmuch_database_t *notmuch, >> _my_talloc_free_for_g_hash, NULL); >> >> refs = notmuch_message_file_get_header (message_file, "references"); >> - parse_references (message, notmuch_message_get_message_id (message), >> - parents, refs); >> + last_ref_message_id = parse_references (message, >> + notmuch_message_get_message_id (message), >> + parents, refs); >> >> in_reply_to = notmuch_message_file_get_header (message_file, "in-reply-to"); >> parse_references (message, notmuch_message_get_message_id (message), >> parents, in_reply_to); >> >> - /* Carefully avoid adding any self-referential in-reply-to term. */ >> in_reply_to_message_id = _parse_message_id (message, in_reply_to, NULL); > > I wonder if you should reuse your parse_references() change here, so > you'd set in_reply_to_message_id to the last message-id in > In-Reply-To. This might tackle some of the problematic cases directly, > but should still be all right per RFC 2822. I didn't verify how the > parser handles an RFC 2822 violating free form header though. Strike that based on http://www.jwz.org/doc/threading.html: "If there are multiple things in In-Reply-To that look like Message-IDs, only use the first one of them: odds are that the later ones are actually email addresses, not IDs." >> + /* If the parent message ID from the Reply-To and References >> + * headers are different, use the References one. This is because >> + * the Reply-To header is more likely to be in an non-standard >> + * format. */ >> + if (in_reply_to_message_id && >> + last_ref_message_id && >> + strcmp (last_ref_message_id, in_reply_to_message_id)) { >> + in_reply_to_message_id = last_ref_message_id; >> + } I talked to Austin (CC) about the patch on IRC, and his comment was, perceptive as always: 23:38 amdragon Is the logic in that patch equivalent to always using the last message ID in references unless there is no references header? Seems like it is, but in a convoluted way. And that's actually the case, isn't it? To make the code reflect that, you should use last_ref_message_id, and if that's NULL, fallback to in_reply_to_message_id. > I suggest adding an else if branch (or revamp the above if condition) to > tackle the missing In-Reply-To header: > > else if (!in_reply_to_message_id && last_ref_message_id) { > in_reply_to_message_id = last_ref_message_id; > } Strike that, it should be the other way round. > >> + /* Carefully avoid adding any self-referential in-reply-to term. */ >> if (in_reply_to_message_id && >> strcmp (in_reply_to_message_id, >> notmuch_message_get_message_id (message))) > > If you change parse_references() to be careful about never returning a > self-reference, and set in_reply_to_message_id from there, I think you > can drop the strcmp here. And move the comment to an appropriate place. Strike that. > Thanks for the patch, I think we should do this. But this is an area > where I think we need to be careful, so another reviewer wouldn't > harm. Some tests for this would be good too, obviously. I got this part right. :) Jani. > > > BR, > Jani. > >> { >> _notmuch_message_add_term (message, "replyto", >> - _parse_message_id (message, in_reply_to, NULL)); >> + in_reply_to_message_id); >> } >> >> keys = g_hash_table_get_keys (parents); >> -- >> 1.8.1.4 >> _______________________________________________ >> notmuch mailing list >> notmuch@notmuchmail.org >> http://notmuchmail.org/mailman/listinfo/notmuch