From: Samium Gromoff <_deepfire@feelingofgreen.ru>
To: notmuch@notmuchmail.org
Subject: Integration with training-based bayesian filters
User-Agent: Notmuch/0.3.1-58-g6607fd6 (http://notmuchmail.org) Emacs/23.2.1 (x86_64-pc-linux-gnu)
Date: Mon, 16 Aug 2010 19:38:43 +0400
Message-ID: <877hjqtvng.fsf@auriga.deep>

Good day folks,

My "+notmuch AND train" query on the local notmuch list archive didn't
yield anything relevant, so I have at least one excuse if the question
I'm about to pose has already been answered to death here.

So, how is a notmuch user supposed to integrate a training-based message
classifier like crm114[1], which operates as follows:

  - the filter->you information flow is established by prepending either
    "ADV: " or "UNS: " to the message subject, denoting, respectively,
    the "spam" and "please tell me if this is spam" categories.
    Non-spam messages, naturally, have their subject lines unmodified.

  - the you->filter information flow is established by taking the
    message file whose status you want to pin down (mostly those marked
    as UNS, because after a while crm114 gets really, really good), and
    piping it to the classifier executable.

One thing is certain: we're talking elisp territory here. Another is
certain as well: such questions appear, sooner or later, in the life of
every mail user agent.

Again, sorry if I failed the due diligence part of prior art discovery.

Now to some answers (the unexpected part):

The first part is handled easily by a combination of procmailing the
"ADV: "-prefixed messages out of one's sight, which becomes a plausible
strategy once the classifier becomes clueful enough, and adding a simple
xapian "subject:" rule for the "UNS: "-prefixed ones.

The second part can be solved either in a way pleasant to the user, or
easily. The easy way is to expect the user to enter the spam thread,
which contains exactly one message (never seen longer spam threads,
still wondering why...), and then press some key and confirm the
destination, station purple hell. Then you exit the thread. To enter
another one...

So, after a couple of minutes of processing the backlog, it becomes
painfully clear that you don't want to spend more effort on these
one-message spam threads than pressing 's' and then confirming with 'y',
avoiding the painful, distracting thread enter/exit sequence and its
redraws.

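The subject-prefix convention described above can be sketched in a few
lines of elisp. This is only an illustration under stated assumptions,
not part of the solution that follows: the names
my-crm114-subject-category and my-crm114-unsure-tag-args are
hypothetical, and both the choice of tag and the bare "subject:UNS"
search term are guesses at how such a rule might be set up.

```elisp
;; Hypothetical helpers; the names are illustrative, not part of notmuch.

(defun my-crm114-subject-category (subject)
  "Return the category crm114 encoded in SUBJECT: spam, unsure or good."
  (cond ((string-prefix-p "ADV: " subject) 'spam)
        ((string-prefix-p "UNS: " subject) 'unsure)
        (t 'good)))

(defun my-crm114-unsure-tag-args (tag)
  "Return notmuch CLI arguments that add TAG to \"UNS: \"-prefixed mail.
Assumption: a plain subject: term is selective enough for your mail."
  (list "tag" (concat "+" tag) "--" "subject:UNS"))

(my-crm114-subject-category "UNS: cheap watches")  ; => unsure
(my-crm114-unsure-tag-args "train")  ; => ("tag" "+train" "--" "subject:UNS")
```

The argument list is kept as plain data so that it can be handed to
notmuch with (apply 'notmuch-call-notmuch-process ...), or typed at a
shell prompt as "notmuch tag +train -- subject:UNS".
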
Note that this conveniently avoids the question of non-spam messages,
which actually often land within threads, but I'd like to set that
aside; sorry for the incomplete solution.

So, the crux is that to pipe the file to the classifier you need the
filename, and the filename appears to be easily available only in the
'show' mode. I've had to introduce some code to operate on
single-message threads, or actually, on threads with all messages
ignored except the first one.

So, here goes, the solution modulo the conveniently avoided question of
non-spam messages:

(require 'cl)  ; provides intersection and lexical-let used below

(defun notmuch-pipe-file (filename command)
  "Feed the contents of FILENAME to the shell COMMAND on its stdin."
  (start-process-shell-command
   "notmuch-pipe-command" "*notmuch-pipe*"
   (concat command " < " (shell-quote-argument filename))))

(defun notmuch-query (query)
  (notmuch-query-get-threads
   (append (list "\'") query (list "\'"))))

(defun notmuch-result-firstmsg-property (result property)
  (plist-get (caaar result) property))

(defun notmuch-result-backend-remove-tags (result tags)
  (apply 'notmuch-call-notmuch-process
         (append (cons "tag"
                       (mapcar (lambda (s) (concat "-" s)) tags))
                 (cons (concat "id:" (notmuch-result-firstmsg-property result :id))
                       nil))))

(defun notmuch-search-result-remove-tags (result tags)
  "Remove TAGS from the first message of RESULT.  RESULT is not updated."
  (let ((current-tags (notmuch-result-firstmsg-property result :tags)))
    (if (intersection current-tags tags :test 'string=)
        ;; new result tags are
        ;; (sort (set-difference current-tags tags :test 'string=) 'string<)
        ;; however, it's unlikely we'll need them, so no need to update
        (notmuch-result-backend-remove-tags result tags))))

(defun notmuch-search-query-current-thread ()
  (notmuch-query (list (notmuch-search-find-thread-id))))

(defun notmuch-show-pipe-current-message (command)
  "Pipe the message currently pointed at within the show mode, through COMMAND."
  (interactive "sPipe message to command: ")
  (notmuch-pipe-file (notmuch-show-get-filename) command))

(defun notmuch-search-pipe-current-message (command)
  "Pipe the first message of the thread currently pointed at within the search mode, through COMMAND."
  (interactive "sPipe message to command: ")
  (let* ((result (notmuch-search-query-current-thread))
         (filename (notmuch-result-firstmsg-property result :filename)))
    (notmuch-pipe-file filename command)
    result))

(setq mark-as-good-command "~/bin/stdin-is-good"
      mark-as-spam-command "~/bin/stdin-is-spam"
      spam-tagdrop-list '("inbox" "unread" "sent" "train"))

(defun make-mark-as-good (piper)
  "Return a command that marks the message as good by calling PIPER."
  (lexical-let ((piper piper))
    (lambda ()
      (interactive)
      (if (y-or-n-p "Mark as good? ")
          (progn
            (funcall piper mark-as-good-command)
            (forward-line 1))))))

(defun make-mark-as-spam (piper searchp)
  "Return a command that marks the message as spam by calling PIPER.
SEARCHP is non-nil when the command is bound in the search mode."
  (lexical-let ((piper piper)
                (searchp searchp))
    (lambda ()
      (interactive)
      (if (y-or-n-p "Mark as spam? ")
          (let ((maybe-result (funcall piper mark-as-spam-command)))
            (if searchp
                (progn
                  (notmuch-search-result-remove-tags maybe-result spam-tagdrop-list)
                  (forward-line 1))
              (notmuch-show-mark-read)))))))

(define-key notmuch-show-mode-map "g"
  (make-mark-as-good 'notmuch-show-pipe-current-message))
(define-key notmuch-show-mode-map "s"
  (make-mark-as-spam 'notmuch-show-pipe-current-message nil))
(define-key notmuch-search-mode-map "g"
  (make-mark-as-good 'notmuch-search-pipe-current-message))
(define-key notmuch-search-mode-map "s"
  (make-mark-as-spam 'notmuch-search-pipe-current-message t))

I'll leave it to the more qualified people to decide which part (and in
which form) is supposed to go into notmuch, and which is destined to
live in the end-user's init file.

-- 
regards,
  Samium Gromoff
--
1. http://crm114.sourceforge.net/
--
"Actually I made up the term 'object-oriented', and I can tell you I
did not have C++ in mind." - Alan Kay (OOPSLA 1997 Keynote)