From: Michal Sojka <sojkam1@fel.cvut.cz>
To: Tomi Ollila <tomi.ollila@iki.fi>, Adam Wolfe Gordon <awg+notmuch@xvx.ca>
Subject: Re: emacs complains about encoding?
In-Reply-To: <m27gw4nyfu.fsf@guru.guru-group.fi>
References: <20120515194455.B7AD5100646@guru.guru-group.fi>
	<878vgsbprq.fsf@nikula.org> <m23970bhre.fsf@guru.guru-group.fi>
	<CAMoJFUungAFPWy0d1Lh+rqmpK--P7MMEwNaewWHR=rbYo+BKsA@mail.gmail.com>
	<871umc1int.fsf@steelpick.2x.cz>
	<m27gw4nyfu.fsf@guru.guru-group.fi>
User-Agent: Notmuch/0.13+14~g2d2a5a4 (http://notmuchmail.org) Emacs/23.4.1
	(x86_64-pc-linux-gnu)
Date: Wed, 23 May 2012 12:15:18 +0200
Message-ID: <87r4uburt5.fsf@steelpick.2x.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: notmuch@notmuchmail.org
Precedence: list

Tomi Ollila <tomi.ollila@iki.fi> writes:
> Michal Sojka <sojkam1@fel.cvut.cz> writes:
>
>> Hello Adam,
>>
>> Adam Wolfe Gordon <awg+notmuch@xvx.ca> writes:
>>> It turns out it's actually not the emacs side, but an interaction
>>> between our JSON reply format and emacs.
>>>
>>> The JSON reply (and show) code includes part content for all text/*
>>> parts except text/html. Because all JSON is required to be UTF-8, it
>>> handles the encoding itself, puts UTF-8 text in, and omits a
>>> content-charset field from the output. Emacs passes on the
>>> content-charset field to mm-display-part-inline if it's available, but
>>> for text/plain parts it's not, leaving mm-display-part-inline to its
>>> own devices for figuring out what the charset is. It seems
>>> mm-display-part-inline correctly figures out that it's UTF-8, and puts
>>> in the series of ugly \nnn characters because that's what emacs does
>>> with UTF-8 sometimes.
>>>
>>> In the original reply stuff (pre-JSON reply format) emacs used the
>>> output of notmuch reply verbatim, so all the charset stuff was handled
>>> in notmuch. Before f6c170fabca8f39e74705e3813504137811bf162, emacs was
>>> using the JSON reply format, but was inserting the text itself instead
>>> of using mm-display-part-inline, so emacs still wasn't trying to do
>>> any charset manipulation. Using mm-display-part-inline is desirable
>>> because it lets us handle non-text/plain (e.g. text/html) parts
>>> correctly in reply, and makes the display more consistent (since we
>>> use it for show). But, it leads to this problem.
>>>
>>> So, there are a couple of solutions I can see:
>>>
>>> 1) Have the JSON formats include the original content-charset even
>>> though they're actually outputting UTF-8. Of the solutions I tried,
>>> this is the best, even though it doesn't sound like a good thing to
>>> do.
>>>
>>> 2) Have the JSON formats include content only if it's actually UTF-8.
>>> This means that for non-UTF-8 parts (including ASCII parts), the emacs
>>> interface has to do more work to display the part content, since it
>>> must fetch it from outside first. When I tried this, it worked but
>>> caused the \nnn to show up when viewing messages in emacs. I suspect
>>> this is because it sets a charset for the whole buffer, and can't
>>> accommodate messages with different charsets in the same buffer
>>> properly. Reply works correctly, though.
>>>
>>> 3) Have the JSON formats include the charset for all parts, but make
>>> it UTF-8 for all parts they include content for (since we're actually
>>> outputting UTF-8). This doesn't seem to fix the problem, even though
>>> it seems like it should.
>>>
>>> If no one has a better idea or a strong reason not to, I'll send a
>>> patch for solution (1).
>>
>> Thank you very much for your analysis. It encouraged me to dig into the
>> problem and I've found another solution, which might be better than
>> those you suggested.
>>
>> I traced what Emacs does with the text inside
>> notmuch-mm-display-part-inline, and the wrong charset conversion happens
>> deep in the elisp code, in mm-with-part called by mm-get-part, which is
>> in turn called by mm-inline-text. There is a way to make mm-inline-text
>> not call mm-get-part, which is to set the charset to 'gnus-decoded. This
>> sounds like something that applies to our situation, where the part is
>> already decoded.
>
> You've dug deeper than I did... :)
>
>>
>> The following patch (apply it with git am -c) solves the problem for me.
>> However, I'm not sure it is a universal solution. It sets the charset
>> only if it is not defined in the notmuch JSON output, and I'm not sure
>> that this is correct. text/html parts seem to have a charset defined,
>> but since, as you wrote, JSON is always UTF-8, we might need
>> 'gnus-decoded always, independently of the JSON output. What do you
>> think?
>
> No -- when non-inlined content is fetched by executing the command
> notmuch show --format=raw --part=n --decrypt id:"<message-id>" the content
> is received in its original charset -- and then the mm-* components need
> to have the correct charset set (well, I think; I have not tested ;).
>
> Also, we cannot rely on the JSON output not containing content-charset
> information in the future...
>
> I'm currently applying this to my build tree whenever I rebuild notmuch for
> my own use: id:"1337533094-5467-1-git-send-email-tomi.ollila@iki.fi"

Great, this is more or less the same solution :-)
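
For the archive, here is how I read the idea common to both
approaches. It is only a sketch, not either patch verbatim:
`my-part-charset' is a hypothetical helper name, and I am assuming the
part is a plist with optional :content and :content-charset keys, as
parsed from the JSON output.

```elisp
;; Sketch only.  `my-part-charset' is a hypothetical helper; PART is
;; assumed to be a plist shaped like a part in the JSON output.
(defun my-part-charset (part)
  "Return the charset to put into the mm handle for PART."
  (if (plist-member part :content)
      ;; The content came inline in the JSON, so it is already
      ;; decoded; `gnus-decoded' asks mm-inline-text not to decode
      ;; it a second time.
      'gnus-decoded
    ;; The content will be fetched raw, in its original charset.
    (plist-get part :content-charset)))

;; Inline text/plain part, no charset field in the JSON:
(my-part-charset '(:content-type "text/plain" :content "ahoj"))
;; => gnus-decoded

;; Part without inline content, fetched later with --format=raw:
(my-part-charset '(:content-type "text/html"
                   :content-charset "iso-8859-2"))
;; => "iso-8859-2"
```

Keying the decision on whether :content is present, rather than on
whether :content-charset is missing, also covers your point that the
JSON output may start carrying content-charset in the future.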

> I think the current plan is to use the same decoding lookup table that
> notmuch-show is using in reply too.

Which table are you referring to? notmuch-show-handlers-for?

> That is a good plan from a consistency point of view. It just requires
> some code to be moved from notmuch-show.el to some other file (maybe a
> new one).

Sounds good.

Cheers,
-Michal