From: Pieter Praet
To: Austin Clements
Subject: Re: [PATCH v2] emacs: bad regexp @ `notmuch-search-process-filter'
In-Reply-To: <20110713185721.GI25558@mit.edu>
References: <20110705214234.GA15360@mit.edu>
 <1310416993-31031-1-git-send-email-pieter@praet.org>
 <20110711210532.GC25558@mit.edu> <878vs28dvo.fsf@praet.org>
 <20110713185721.GI25558@mit.edu>
User-Agent: Notmuch/0.6-60-ga0910f1 (http://notmuchmail.org) Emacs/23.1.50.1 (x86_64-pc-linux-gnu)
Date: Sat, 16 Jul 2011 17:07:12 +0200
Message-ID: <87oc0u6z8f.fsf@praet.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Cc: Notmuch Mail, David Edmondson
List-Id: "Use and development of the notmuch mail system."

--=-=-=

On Wed, 13 Jul 2011 14:57:21 -0400, Austin Clements wrote:
> Quoth Pieter Praet on Jul 13 at 4:16 pm:
> > On Mon, 11 Jul 2011 17:05:32 -0400, Austin Clements wrote:
> > > Quoth Pieter Praet on Jul 11 at 10:43 pm:
> > > > TL;DR: I can haz regex pl0x?
> > >
> > > Oof, what a pain. I'm happy to change the output format of search; I
> > > hadn't realized how difficult it would be to parse. In fact, I'm not
> > > sure it's even parsable by regexp, because the message ID's themselves
> > > could contain parens.
> > >
> > > So what would be a good format? One possibility would be to
> > > NULL-delimit the query part; as distasteful as I find that, this part
> > > of the search output isn't meant for user consumption. Though I fear
> > > this is endemic to the dual role the search output currently plays as
> > > both user and computer readable.
> > >
> > > I've also got the code to do everything using document ID's instead of
> > > message ID's. As a side-effect, it makes the search output clean and
> > > readily parsable since document ID's are just numbers. Hence, there
> > > are no quoting or escaping issues (plus the output is much more
> > > compact). I haven't sent this to the list yet because I haven't had a
> > > chance to benchmark it and determine if the performance benefits make
> > > exposing document ID's worthwhile.
> >
> > Jamie Zawinski once said/wrote [1]:
> >   'Some people, when confronted with a problem, think "I know,
> >   I'll use regular expressions." Now they have two problems.'
> >
> > With this in mind, I set out to get rid of this whole regex mess altogether,
> > by populating the search buffer using Notmuch's JSON output instead of doing
> > brittle text matching tricks.
> >
> > Looking for some documentation, I stumbled upon a long-forgotten gem [2].
> >
> > David's already done pretty much all of the work for us!
>
> Yes, similar thoughts were running through my head as I futzed with
> the formatting for this. My concern with moving to JSON for search
> buffers is that parsing it is about *30 times slower* than the current
> regexp-based approach (0.6 seconds versus 0.02 seconds for a mere 1413
> result search buffer). I think JSON makes a lot of sense for show
> buffers because there's generally less data and it has a lot of
> complicated structure. Search results, on the other hand, have a very
> simple, regular, and constrained structure, so JSON doesn't buy us
> nearly as much.

That seems about right. Using the entire Notmuch mailing list archive,
processing JSON ends up taking 23x longer (see the attached test).

> JSON is hard to parse because, like the text search output, it's
> designed for human consumption (of course, unlike the text search
> output, it's also designed for computer consumption). There's
> something to be said for the debuggability and generality of this and
> JSON is very good for exchanging small objects, but it's a remarkably
> inefficient way to exchange large amounts of data between two
> programs.
>
> I guess what I'm getting at, though it pains me to say it, is perhaps
> search needs a fast, computer-readable interchange format. The
> structure of the data is so simple and constrained that this could be
> altogether trivial.

I guess that's our only option then. Could you implement it for me?
I'll make sure to rebase my patch series in an acceptable time frame.

An extra output format shouldn't be that much of a problem though,
if we further compartmentalize the code.
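
To give a feel for how little parsing a "fast, computer-readable" format
might need, here is a minimal sketch. It is only an illustration under
stated assumptions: the record layout (one result per line, fields
separated by NUL) and the name `my-parse-search-records' are hypothetical
and are not something notmuch emits or defines.

```emacs-lisp
;; Hypothetical format: one search result per line, fields separated by
;; a NUL character.  Neither the layout nor the function name comes from
;; notmuch; both are assumptions made for this sketch.
(defun my-parse-search-records (output)
  "Split OUTPUT into a list of records, each a list of field strings."
  (mapcar (lambda (line)
            ;; Keep empty fields (e.g. a thread without tags).
            (split-string line "\000"))
          ;; The third argument drops the empty string that would
          ;; otherwise follow the final newline.
          (split-string output "\n" t)))

;; Usage with made-up data:
(my-parse-search-records
 "thread:0001\000Pieter Praet\000Re: bad regexp\000inbox unread\n")
;; => (("thread:0001" "Pieter Praet" "Re: bad regexp" "inbox unread"))
```

Because NUL cannot appear in a message ID, an author name or a subject,
a layout along these lines would sidestep the quoting and escaping
problems discussed above; whether it would actually be faster than the
regexp approach would still have to be measured.
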

What are your thoughts on (in the long term) moving to a plugin-based architecture?
E.g. enable something like this:
  ./input/{Maildir, ...}
  ./output/{plain, JSON, ...}
  ./filters/{crypto, ...}
  ./backends/{Xapian, ...}
  ./uis/{Emacs, VIM, web, ...}

> Or maybe I need a faster computer.

That's what M$ Tech Support would want you to believe :)

What we need is slower computers, so devs are forced to count cycles again.
The rise of netbooks has thankfully done wonders in this respect.

> If anyone is curious, here's how I timed the parsing.
>
> (defmacro time-it (code)
>   `(let ((start-time (get-internal-run-time)))
>      ,code
>      (float-time (time-subtract (get-internal-run-time) start-time))))
>
> (with-current-buffer "json"
>   (goto-char (point-min))
>   (time-it (json-read)))
>
> (with-current-buffer "text"
>   (goto-char (point-min))
>   (time-it
>    (while (re-search-forward "^\\(thread:[0-9A-Fa-f]*\\) \\([^][]*\\) \\(\\[[0-9/]*\\]\\) \\([^;]*\\); \\(.*\\) (\\([^()]*\\))$" nil t))))


Peace

-- 
Pieter

--=-=-=
Content-Type: application/octet-stream
Content-Disposition: attachment; filename=regexp-vs-json.org
Content-Transfer-Encoding: base64

KiBQYXJzaW5nIHBsYWluLXRleHQgdy8gcmVnZXhwIHZzLiBwYXJzaW5nIEpTT04uCgogICMrU09V
UkNFOiB0aW1lci1tYWNybwogICMrQkVHSU5fU1JDIGVtYWNzLWxpc3AKICAgIChkZWZtYWNybyB0
aW1lLWl0IChjb2RlKQogICAgICBgKGxldCAoKHN0YXJ0LXRpbWUgKGdldC1pbnRlcm5hbC1ydW4t
dGltZSkpKQogICAgICAgICAsY29kZQogICAgICAgICAoZmxvYXQtdGltZSAodGltZS1zdWJ0cmFj
dCAoZ2V0LWludGVybmFsLXJ1bi10aW1lKSBzdGFydC10aW1lKSkpKQogICMrRU5EX1NSQwoKICAj
K1NPVVJDRTogY291bnQtbXNncwogICMrQkVHSU5fU1JDIHNoCiAgICBub3RtdWNoIGNvdW50IC0t
IHRhZzp4L25vdG11Y2gKICAjK0VORF9TUkMKCiAgIytTT1VSQ0U6IHRpbWUtdGV4dAogICMrQkVH
SU5fU1JDIGVtYWNzLWxpc3AgOm5vd2ViIHllcwogICAgPDx0aW1lci1tYWNybz4+CiAgICAod2l0
aC10ZW1wLWJ1ZmZlcgogICAgICAoY2FsbC1wcm9jZXNzICJub3RtdWNoIiBuaWwgdCBuaWwgInNl
YXJjaCIgIi0tZm9ybWF0PXRleHQiICItLSIgInRhZzp4L25vdG11Y2giKQogICAgICAoZ290by1j
aGFyIChwb2ludC1taW4pKQogICAgICAodGltZS1pdAogICAgICAgKHdoaWxlIChyZS1zZWFyY2gt
Zm9yd2FyZCAiXlxcKHRocmVhZDpbMC05QS1GYS1mXSpcXCkgXFwoW15dW10qXFwpIFxcKFxcW1sw
LTkvXSpcXF1cXCkgXFwoW147XSpcXCk7CiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICBcXCguKlxcKSAoXFwoW14oKV0qXFwpKSQiIG5pbCB0KSkpKQogICMrRU5EX1NSQwoKICAjK1NP
VVJDRTogdGltZS1qc29uCiAgIytCRUdJTl9TUkMgZW1hY3MtbGlzcCA6bm93ZWIgeWVzCiAgICA8
PHRpbWVyLW1hY3JvPj4KICAgICh3aXRoLXRlbXAtYnVmZmVyCiAgICAgIChjYWxsLXByb2Nlc3Mg
Im5vdG11Y2giIG5pbCB0IG5pbCAic2VhcmNoIiAiLS1mb3JtYXQ9anNvbiIgIi0tIiAidGFnOngv
bm90bXVjaCIpCiAgICAgIChnb3RvLWNoYXIgKHBvaW50LW1pbikpCiAgICAgICh0aW1lLWl0IChq
c29uLXJlYWQpKSkKICAjK0VORF9TUkMKCiAgIytUQkxOQU1FOiByZXN1bHRzCiAgfC0tLS0tLS0t
LS0rLS0tLS0tLS0tLS0tKy0tLS0tLS0tLS0tLSstLS0tLS0tLS0tfAogIHwgbXNnY291bnQgfCB0
aW1lKHRleHQpIHwgdGltZShqc29uKSB8ICUgc2xvd2VyIHwKICB8LS0tLS0tLS0tLSstLS0tLS0t
LS0tLS0rLS0tLS0tLS0tLS0tKy0tLS0tLS0tLS18CiAgfCAgICAgNTI5NCB8ICAgICAgIDAuMDEg
fCAgICAgICAwLjIzIHwgICAgMjMwMC4gfAogIHwtLS0tLS0tLS0tKy0tLS0tLS0tLS0tLSstLS0t
LS0tLS0tLS0rLS0tLS0tLS0tLXwKICAjK1RCTEZNOiAkMT0nKHNiZSAiY291bnQtbXNncyIpJyA6
OiAkMj0nKHNiZSAidGltZS10ZXh0IiknIDo6ICQzPScoc2JlICJ0aW1lLWpzb24iKScgOjogJDQ9
KCQzLyQyKSoxMDAK
--=-=-=--