From: Mark Walters
To: David Bremner, David Edmondson, notmuch@notmuchmail.org
Subject: Re: [PATCH v1 0/3] Improve the acquisition of text parts.
In-Reply-To: <87bn6h5lf3.fsf@zancas.localnet>
References: <1457457179-4707-1-git-send-email-dme@dme.org> <87ziu2s8rb.fsf@qmul.ac.uk> <87bn6h5lf3.fsf@zancas.localnet>
User-Agent: Notmuch/0.21+69~gd27d908 (http://notmuchmail.org) Emacs/24.4.1 (x86_64-pc-linux-gnu)
Date: Sat, 26 Mar 2016 09:18:20 +0000
Message-ID: <87zitlppgj.fsf@qmul.ac.uk>
MIME-Version: 1.0
Content-Type: text/plain
List-Id: "Use and development of the notmuch mail system."

Hi

Sorry this email ended up rather long.

Summary: I have run a test (see below) on all of the lkml part of the performance-corpus, and all the changes look expected. So this series looks good to me.

First note how we do the bodypart-insertion: for a mime type of text/plain we first try the text/plain handler, then a text/* handler, and finally a */* handler until one succeeds.

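As a rough illustration of that fallback order (a shell sketch for this email only, not the actual notmuch-emacs code, which is Emacs Lisp; the function name handler_order is made up):

```shell
# Print the handler patterns tried for a MIME type, most specific first.
# Illustration only: handler_order is a hypothetical name, and the real
# dispatch happens in Emacs Lisp inside notmuch-show.
handler_order() {
    mime_type=$1
    major=${mime_type%%/*}    # "text/plain" -> "text"
    # Quoting keeps the shell from expanding the "*" as a glob.
    printf '%s\n' "$mime_type" "$major/*" "*/*"
}

handler_order application/octet-stream
```

For application/octet-stream this prints application/octet-stream, application/* and */*, one per line; each handler in that list is tried in turn and the first one that succeeds wins.
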
Before this series, when the part is application/octet-stream but is detected as text/plain, the text/plain handler fails with a "bodypart insertion error" because notmuch-get-bodypart-text cannot get the text (the part is not officially text). Thus we fall back on the */* handler and that inserts the part. With this series notmuch-get-bodypart-text succeeds and we stop at the text/plain handler.

Thus in most cases the only change is that we don't get a "bodypart insertion error", but all the text looks the same. In a couple of cases the text/plain handler wraps lines or replaces ^M with Unix newlines, whereas the */* handler does not. This is an improvement.

There is one more "difference", but I think it is actually unrelated to the series. Sometimes when the part is application/tar or application/zip I get "Bodypart insert error: Symbol's function definition is void: gnus-recursive-directory-files". If I load gnus this goes away. In my first batch of tests this only occurred when using this series, but since then I have reproduced it on mainline. I think something else I did when setting up the test on mainline caused gnus to be loaded, but I have not worked out what is going on there.

Finally, the test was as follows.
I downloaded the performance corpus, configured a separate notmuch config file to use performance-test/corpus/mail/lkml as the mailstore, went into notmuch-emacs and to the inbox (which contained all messages) and ran the following lisp function:

    (defun my-save-all-show ()
      "Show every thread in the current search buffer and save each
    rendered show buffer to a file named OUTPUT-<thread-id>."
      (interactive)
      (goto-char (point-min))
      (let ((count 0))
        ;; Walk the search results until there is no thread at point.
        (while (notmuch-search-find-thread-id)
          (let ((thread-id (notmuch-search-find-thread-id)))
            (setq count (1+ count))
            (message "Thread %s: %s" count thread-id)
            (notmuch-show thread-id)
            ;; Write the rendered buffer out without any coding conversion.
            (let ((text (buffer-string))
                  (coding-system-for-write 'no-conversion))
              (with-temp-file (concat "OUTPUT-" thread-id)
                (insert text)))
            (kill-buffer))
          (notmuch-search-next-thread))))

I moved the OUTPUT files elsewhere, repeated the run with this series applied, and then ran diff on the two sets of output. This gave 7 threads with a change (each a single message) out of the 16000 threads / 100000 messages; I looked at each of them individually, as described above.

Best wishes

Mark

On Mon, 14 Mar 2016, David Bremner wrote:
> David Edmondson writes:
>
>> On Sun, Mar 13 2016, Mark Walters wrote:
>>> However, it would be sensible to get testing in a greater variety of
>>> charsets/encodings
>>
>> Agreed. Does anyone have suggestions on how we might achieve this? A
>> corpus of mail that we could use?
>
> Maybe the notmuch performance corpus, particularly the lkml sample.
>
> grep -R charset= performance-test/corpus/mail/lkml | sed -e 's/^.*charset=//' -e 's/;.*//' -e 's/"//g' | tr '[A-Z]' '[a-z]' | sort -u
>
> gives
>
> euc-kr
> gb2312
> iso-2022-jp
> iso-2022-jp-2
> iso-8859-1
> iso-8859-14
> iso 8859-15
> iso-8859-15
> iso-8859-1
> iso-8859-2
> iso-8859-6
> iso-8859-7
> iso-8859-9
> koi8-r
> koi8-u
> ks_c_5601-1987
> shift_jis
> unknown
> unknown-8bit
> us-ascii
> utf8
> utf-8
> windows-1250
> windows-1251
> windows-1252
> windows-1255
>
>
> to unpack the corpus
>
> cd performance-test
> make download-corpus
> ./T00-new.sh --large
>
> probably interrupt the test once notmuch-new starts running.