From: Olly Betts
To: notmuch@notmuchmail.org
Date: Fri, 4 Dec 2009 10:36:45 +0000 (UTC)
Subject: Re: [notmuch] Notmuch's search view sucks
References: <1259840063-sup-1478@sam.mediasupervision.de> <871vjbh98x.fsf@yoom.home.cworth.org>

Karl Wiberg writes:
> On Fri, Dec 4, 2009 at 1:29 AM, Carl Worth wrote:
> > And a step beyond that would support different languages for
> > different emails, but that sounds like something "hard" to identify.
>
> But probably not as hard as identifying spam. It could probably be
> done with a simple Bayesian filter counting word frequencies---but
> it'd be much better if somebody else had already solved the problem,
> since this smells suspiciously like something that ought to be a
> separate project and put in a library ... does anyone know if such a
> project already exists?

There's TextCat:

http://www.let.rug.nl/vannoord/TextCat/

It looks at n-gram frequencies, and can guess pretty reliably from even a
fairly small amount of text.

TextCat is in Perl. I don't know if there's a C or C++ implementation but it
isn't a huge piece of code - finding a good technique was the clever part of
it.

Cheers,
Olly
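
To give a rough idea of how little code the n-gram approach needs, here is a minimal Python sketch of rank-order n-gram profiling, the general family of technique that guessers like TextCat belong to. It is not TextCat's code. The function names, the `max_n` and `top` parameters, and the two toy training sentences are all assumptions made up for illustration, and real language profiles would be built from far more text.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams (1..max_n), most frequent first."""
    counts = Counter()
    # Pad each word with spaces so that word boundaries show up in the n-grams.
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile cost len(lang_profile)."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples, invented for this example only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog sleeps in the sun while the fox runs away",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann "
              "schläft der hund in der sonne während der fuchs wegläuft",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("the dog and the fox", PROFILES))
    print(guess_language("der hund und der fuchs", PROFILES))
```

Comparing ranks rather than raw counts means a short message can be matched against a profile built from a much larger corpus without any normalisation, which may be part of why the approach copes with small amounts of text.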