From b7bf65c024d0d9d1dfe22f9834fcf0dae916057f Mon Sep 17 00:00:00 2001
From: David Bremner
Date: Mon, 27 Jun 2016 15:33:07 +0200
Subject: [PATCH] [PATCH] lib: regexp matching in 'subject' and 'from'

---
 0a/674f25433f27fd2be8c147e6b7ebc18e142455 | 509 ++++++++++++++++++++++
 1 file changed, 509 insertions(+)
 create mode 100644 0a/674f25433f27fd2be8c147e6b7ebc18e142455

diff --git a/0a/674f25433f27fd2be8c147e6b7ebc18e142455 b/0a/674f25433f27fd2be8c147e6b7ebc18e142455
new file mode 100644
index 000000000..c5b41260d
--- /dev/null
+++ b/0a/674f25433f27fd2be8c147e6b7ebc18e142455
@@ -0,0 +1,509 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by arlo.cworth.org (Postfix) with ESMTP id BDD866DE014D
+ for ; Mon, 27 Jun 2016 06:48:42 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at cworth.org
+X-Spam-Flag: NO
+X-Spam-Score: -0.005
+X-Spam-Level:
+X-Spam-Status: No, score=-0.005 tagged_above=-999 required=5
+ tests=[AWL=-0.006, HEADER_FROM_DIFFERENT_DOMAINS=0.001]
+ autolearn=disabled
+Received: from arlo.cworth.org ([127.0.0.1])
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id aPDJIOOf4Vrt for ;
+ Mon, 27 Jun 2016 06:48:34 -0700 (PDT)
+Received: from fethera.tethera.net (fethera.tethera.net [198.245.60.197])
+ by arlo.cworth.org (Postfix) with ESMTPS id F20FC6DE00CC
+ for ; Mon, 27 Jun 2016 06:48:33 -0700 (PDT)
+Received: from remotemail by fethera.tethera.net with local (Exim 4.84)
+ (envelope-from )
+ id 1bHWtZ-0001Xa-RZ; Mon, 27 Jun 2016 09:48:13 -0400
+Received: (nullmailer pid 17561 invoked by uid 1000);
+ Mon, 27 Jun 2016 13:33:20 -0000
+From: David Bremner
+To: notmuch@notmuchmail.org
+Subject: [PATCH] lib: regexp matching in 'subject' and 'from'
+Date: Mon, 27 Jun 2016 15:33:07 +0200
+Message-Id: <1467034387-16885-1-git-send-email-david@tethera.net>
+X-Mailer: git-send-email 2.8.1
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.20
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Mon, 27 Jun 2016 13:48:42 -0000
+
+the idea is that you can run
+
+% notmuch search re:subject:
+% notmuch search re:from:'
+
+or
+
+% notmuch search subject:"your usual phrase search"
+% notmuch search from:"usual phrase search"
+
+This should also work with bindings, since it extends the query parser.
+
+This is trivial to extend for other value slots, but currently the only
+value slots are date, message_id, from, subject, and last_mod. Date is
+already searchable, and message_id is not obviously useful to regex
+match.
+
+This was originally written by Austin Clements, and ported to Xapian
+field processors (from Austin's custom query parser) by yours truly.
+---
+
+This is the zero-th non-WIP version. Since the last version [1], I
+have added some better error reporting for regexp syntax errors, tests
+for two kinds of query syntax error, and some documentation for the
+query syntax.
+
+ doc/man7/notmuch-search-terms.rst |  17 +++++-
+ lib/Makefile.local                |   1 +
+ lib/database-private.h            |   1 +
+ lib/database.cc                   |   5 ++
+ lib/regexp-fields.cc              | 125 ++++++++++++++++++++++++++++++++++++++
+ lib/regexp-fields.h               |  77 +++++++++++++++++++++++
+ test/T630-regexp-query.sh         |  91 +++++++++++++++++++++++++++
+ 7 files changed, 316 insertions(+), 1 deletion(-)
+ create mode 100644 lib/regexp-fields.cc
+ create mode 100644 lib/regexp-fields.h
+ create mode 100755 test/T630-regexp-query.sh
+
+diff --git a/doc/man7/notmuch-search-terms.rst b/doc/man7/notmuch-search-terms.rst
+index 075f88c..6155406 100644
+--- a/doc/man7/notmuch-search-terms.rst
++++ b/doc/man7/notmuch-search-terms.rst
+@@ -58,6 +58,8 @@ indicate user-supplied values):
+ 
+ - query:
+ 
++- re:{subject,from}:
++
+ The **from:** prefix is used to match the name or address of the sender
+ of an email message.
+ 
+@@ -139,6 +141,12 @@ queries added with **notmuch-config(1)**. Named queries are only
+ available if notmuch is built with **Xapian Field Processors** (see
+ below).
+ 
++The **re::** prefix can be used to restrict the results to
++those whose matches the given regular expression (see
++**regex(7)**). Regular expression searches are only available if
++notmuch is built with **Xapian Field Processors** (see below), and
++currently only for the Subject and From fields.
++
+ Operators
+ ---------
+ 
+@@ -213,13 +221,19 @@ Boolean and Probabilistic Prefixes
+ ----------------------------------
+ 
+ Xapian (and hence notmuch) prefixes are either **boolean**, supporting
+-exact matches like "tag:inbox" or **probabilistic**, supporting a more flexible **term** based searching. The prefixes currently supported by notmuch are as follows.
++exact matches like "tag:inbox" or **probabilistic**, supporting a more
++flexible **term** based searching. Certain **special** prefixes are
++processed by notmuch in a way not strictly fitting either of Xapian's
++built-in styles. The prefixes currently supported by notmuch are as
++follows.
+ 
+ 
+ Boolean
+     **tag:**, **id:**, **thread:**, **folder:**, **path:**
+ Probabilistic
+     **from:**, **to:**, **subject:**, **attachment:**, **mimetype:**
++Special
++    **query:**, **re:**
+ 
+ Terms and phrases
+ -----------------
+@@ -389,6 +403,7 @@ Currently the following features require field processor support:
+ 
+ - non-range date queries, e.g. "date:today"
+ - named queries e.g. "query:my_special_query"
++- regular expression searches, e.g. "re:subject:^\\[SPAM\\]"
+ 
+ SEE ALSO
+ ========
+diff --git a/lib/Makefile.local b/lib/Makefile.local
+index beb9635..68771e6 100644
+--- a/lib/Makefile.local
++++ b/lib/Makefile.local
+@@ -51,6 +51,7 @@ libnotmuch_cxx_srcs = \
+     $(dir)/query.cc \
+     $(dir)/query-fp.cc \
+     $(dir)/config.cc \
++    $(dir)/regexp-fields.cc \
+     $(dir)/thread.cc
+ 
+ libnotmuch_modules := $(libnotmuch_c_srcs:.c=.o) $(libnotmuch_cxx_srcs:.cc=.o)
+diff --git a/lib/database-private.h b/lib/database-private.h
+index ca71a92..900a989 100644
+--- a/lib/database-private.h
++++ b/lib/database-private.h
+@@ -186,6 +186,7 @@ struct _notmuch_database {
+ #if HAVE_XAPIAN_FIELD_PROCESSOR
+     Xapian::FieldProcessor *date_field_processor;
+     Xapian::FieldProcessor *query_field_processor;
++    Xapian::FieldProcessor *re_field_processor;
+ #endif
+     Xapian::ValueRangeProcessor *last_mod_range_processor;
+ };
+diff --git a/lib/database.cc b/lib/database.cc
+index afafe88..b52b62d 100644
+--- a/lib/database.cc
++++ b/lib/database.cc
+@@ -21,6 +21,7 @@
+ #include "database-private.h"
+ #include "parse-time-vrp.h"
+ #include "query-fp.h"
++#include "regexp-fields.h"
+ #include "string-util.h"
+ 
+ #include
+@@ -1016,6 +1017,8 @@ notmuch_database_open_verbose (const char *path,
+     notmuch->query_parser->add_boolean_prefix("date", notmuch->date_field_processor);
+     notmuch->query_field_processor = new QueryFieldProcessor (*notmuch->query_parser, notmuch);
+     notmuch->query_parser->add_boolean_prefix("query", notmuch->query_field_processor);
++    notmuch->re_field_processor = new RegexpFieldProcessor (*notmuch->query_parser, notmuch);
++    notmuch->query_parser->add_boolean_prefix("re", notmuch->re_field_processor);
+ #endif
+     notmuch->last_mod_range_processor = new Xapian::NumberValueRangeProcessor (NOTMUCH_VALUE_LAST_MOD, "lastmod:");
+ 
+@@ -1112,6 +1115,8 @@ notmuch_database_close (notmuch_database_t *notmuch)
+     notmuch->date_field_processor = NULL;
+     delete notmuch->query_field_processor;
+     notmuch->query_field_processor = NULL;
++    delete notmuch->re_field_processor;
++    notmuch->re_field_processor = NULL;
+ #endif
+ 
+     return status;
+diff --git a/lib/regexp-fields.cc b/lib/regexp-fields.cc
+new file mode 100644
+index 0000000..4d3d972
+--- /dev/null
++++ b/lib/regexp-fields.cc
+@@ -0,0 +1,125 @@
++/* regexp-fields.cc - "re:" field processor glue
++ *
++ * This file is part of notmuch.
++ *
++ * Copyright © 2015 Austin Clements
++ * Copyright © 2016 David Bremner
++ *
++ * This program is free software: you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License as published by
++ * the Free Software Foundation, either version 3 of the License, or
++ * (at your option) any later version.
++ *
++ * This program is distributed in the hope that it will be useful,
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
++ * GNU General Public License for more details.
++ *
++ * You should have received a copy of the GNU General Public License
++ * along with this program. If not, see https://www.gnu.org/licenses/ .
++ *
++ * Author: Austin Clements
++ *         David Bremner
++ */
++
++#include "regexp-fields.h"
++#include "notmuch-private.h"
++
++#if HAVE_XAPIAN_FIELD_PROCESSOR
++RegexpPostingSource::RegexpPostingSource (Xapian::valueno slot, const std::string &regexp)
++    : slot_ (slot)
++{
++    int err = regcomp (&regexp_, regexp.c_str (), REG_EXTENDED | REG_NOSUB);
++
++    if (err != 0) {
++        size_t len = regerror (err, &regexp_, NULL, 0);
++        char *buffer = new char[len];
++        std::string msg;
++        (void) regerror (err, &regexp_, buffer, len);
++        msg.assign (buffer, len);
++        delete[] buffer;
++
++        throw Xapian::QueryParserError (msg);
++    }
++}
++
++RegexpPostingSource::~RegexpPostingSource ()
++{
++    regfree (&regexp_);
++}
++
++void
++RegexpPostingSource::init (const Xapian::Database &db)
++{
++    db_ = db;
++    it_ = db_.valuestream_begin (slot_);
++    end_ = db.valuestream_end (slot_);
++    started_ = false;
++}
++
++Xapian::doccount
++RegexpPostingSource::get_termfreq_min () const
++{
++    return 0;
++}
++
++Xapian::doccount
++RegexpPostingSource::get_termfreq_est () const
++{
++    return get_termfreq_max () / 2;
++}
++
++Xapian::doccount
++RegexpPostingSource::get_termfreq_max () const
++{
++    return db_.get_value_freq (slot_);
++}
++
++Xapian::docid
++RegexpPostingSource::get_docid () const
++{
++    return it_.get_docid ();
++}
++
++bool
++RegexpPostingSource::at_end () const
++{
++    return it_ == end_;
++}
++
++void
++RegexpPostingSource::next (unused (double min_wt))
++{
++    if (started_ && ! at_end ())
++        ++it_;
++    started_ = true;
++
++    for (; ! at_end (); ++it_) {
++        std::string value = *it_;
++        if (regexec (&regexp_, value.c_str (), 0, NULL, 0) == 0)
++            break;
++    }
++}
++
++static Xapian::valueno
++_find_slot (std::string prefix)
++{
++    if (prefix == "from")
++        return NOTMUCH_VALUE_FROM;
++    else if (prefix == "subject")
++        return NOTMUCH_VALUE_SUBJECT;
++    else
++        throw Xapian::QueryParserError ("unsupported regexp field '" + prefix + "'");
++}
++
++Xapian::Query
++RegexpFieldProcessor::operator() (const std::string & str)
++{
++    size_t pos = str.find_first_of (':');
++    std::string prefix = str.substr (0, pos);
++    std::string regexp = str.substr (pos + 1);
++
++    postings = new RegexpPostingSource (_find_slot (prefix), regexp);
++    return Xapian::Query (postings);
++}
++#endif
+diff --git a/lib/regexp-fields.h b/lib/regexp-fields.h
+new file mode 100644
+index 0000000..2c9c2d7
+--- /dev/null
++++ b/lib/regexp-fields.h
+@@ -0,0 +1,77 @@
++/* regexp-fields.h - xapian glue for semi-bruteforce regexp search
++ *
++ * This file is part of notmuch.
++ *
++ * Copyright © 2015 Austin Clements
++ * Copyright © 2016 David Bremner
++ *
++ * This program is free software: you can redistribute it and/or modify
++ * it under the terms of the GNU General Public License as published by
++ * the Free Software Foundation, either version 3 of the License, or
++ * (at your option) any later version.
++ *
++ * This program is distributed in the hope that it will be useful,
++ * but WITHOUT ANY WARRANTY; without even the implied warranty of
++ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
++ * GNU General Public License for more details.
++ *
++ * You should have received a copy of the GNU General Public License
++ * along with this program. If not, see https://www.gnu.org/licenses/ .
++ *
++ * Author: Austin Clements
++ *         David Bremner
++ */
++
++#ifndef NOTMUCH_REGEXP_FIELDS_H
++#define NOTMUCH_REGEXP_FIELDS_H
++#if HAVE_XAPIAN_FIELD_PROCESSOR
++#include
++#include
++#include
++#include "notmuch-private.h"
++
++/* A posting source that returns documents where a value matches a
++ * regexp.
++ */
++class RegexpPostingSource : public Xapian::PostingSource
++{
++ protected:
++    const Xapian::valueno slot_;
++    regex_t regexp_;
++    Xapian::Database db_;
++    bool started_;
++    Xapian::ValueIterator it_, end_;
++
++/* No copying */
++    RegexpPostingSource (const RegexpPostingSource &);
++    RegexpPostingSource &operator= (const RegexpPostingSource &);
++
++ public:
++    RegexpPostingSource (Xapian::valueno slot, const std::string &regexp);
++    ~RegexpPostingSource ();
++    void init (const Xapian::Database &db);
++    Xapian::doccount get_termfreq_min () const;
++    Xapian::doccount get_termfreq_est () const;
++    Xapian::doccount get_termfreq_max () const;
++    Xapian::docid get_docid () const;
++    bool at_end () const;
++    void next (unused (double min_wt));
++};
++
++
++class RegexpFieldProcessor : public Xapian::FieldProcessor {
++ protected:
++    Xapian::QueryParser &parser;
++    notmuch_database_t *notmuch;
++    RegexpPostingSource *postings = NULL;
++
++ public:
++    RegexpFieldProcessor (Xapian::QueryParser &parser_, notmuch_database_t *notmuch_)
++        : parser(parser_), notmuch(notmuch_) { };
++
++    ~RegexpFieldProcessor () { delete postings; };
++
++    Xapian::Query operator()(const std::string & str);
++};
++#endif
++#endif /* NOTMUCH_REGEXP_FIELDS_H */
+diff --git a/test/T630-regexp-query.sh b/test/T630-regexp-query.sh
+new file mode 100755
+index 0000000..3bbe47c
+--- /dev/null
++++ b/test/T630-regexp-query.sh
+@@ -0,0 +1,91 @@
++#!/usr/bin/env bash
++test_description='regular expression searches'
++. ./test-lib.sh || exit 1
++
++add_email_corpus
++
++
++if [ $NOTMUCH_HAVE_XAPIAN_FIELD_PROCESSOR -eq 1 ]; then
++
++    notmuch search --output=messages from:cworth > cworth.msg-ids
++
++    test_begin_subtest "regexp from search, case sensitive"
++    notmuch search --output=messages re:from:carl > OUTPUT
++    test_expect_equal_file /dev/null OUTPUT
++
++    test_begin_subtest "empty regexp or query"
++    notmuch search --output=messages re:from:carl or from:cworth > OUTPUT
++    test_expect_equal_file cworth.msg-ids OUTPUT
++
++    test_begin_subtest "non-empty regexp and query"
++    notmuch search re:from:cworth and subject:patch > OUTPUT
++    cat <<EOF > EXPECTED
++thread:0000000000000008 2009-11-18 [1/2] Carl Worth| Alex Botero-Lowry; [notmuch] [PATCH] Error out if no query is supplied to search instead of going into an infinite loop (attachment inbox unread)
++thread:0000000000000007 2009-11-18 [1/2] Carl Worth| Ingmar Vanhassel; [notmuch] [PATCH] Typsos (inbox unread)
++thread:0000000000000018 2009-11-18 [1/2] Carl Worth| Jan Janak; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
++thread:0000000000000017 2009-11-18 [1/2] Carl Worth| Keith Packard; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
++thread:0000000000000014 2009-11-18 [2/5] Carl Worth| Mikhail Gusarov, Keith Packard; [notmuch] [PATCH 1/2] Close message file after parsing message headers (inbox unread)
++thread:0000000000000001 2009-11-18 [1/1] Stewart Smith; [notmuch] [PATCH] Fix linking with gcc to use g++ to link in C++ libs. (inbox unread)
++EOF
++    test_expect_equal_file EXPECTED OUTPUT
++
++    test_begin_subtest "regexp from search, duplicate term search"
++    notmuch search --output=messages re:from:cworth > OUTPUT
++    test_expect_equal_file cworth.msg-ids OUTPUT
++
++    test_begin_subtest "long enough regexp matches only desired senders"
++    notmuch search --output=messages 're:"from:C.* Wo"' > OUTPUT
++    test_expect_equal_file cworth.msg-ids OUTPUT
++
++    test_begin_subtest "shorter regexp matches one more sender"
++    notmuch search --output=messages 're:"from:C.* W"' > OUTPUT
++    (echo id:1258544095-16616-1-git-send-email-chris@chris-wilson.co.uk ; cat cworth.msg-ids) > EXPECTED
++    test_expect_equal_file EXPECTED OUTPUT
++
++    test_begin_subtest "regexp subject search, non-ASCII"
++    notmuch search --output=messages re:subject:accentué > OUTPUT
++    echo id:877h1wv7mg.fsf@inf-8657.int-evry.fr > EXPECTED
++    test_expect_equal_file EXPECTED OUTPUT
++
++    test_begin_subtest "regexp subject search, punctuation"
++    notmuch search re:subject:\'X\' > OUTPUT
++    cat <<EOF > EXPECTED
++thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
++EOF
++    test_expect_equal_file EXPECTED OUTPUT
++
++    test_begin_subtest "regexp subject search, no punctuation"
++    notmuch search re:subject:X > OUTPUT
++    cat <<EOF > EXPECTED
++thread:0000000000000017 2009-11-18 [2/2] Keith Packard, Carl Worth; [notmuch] [PATCH] Make notmuch-show 'X' (and 'x') commands remove inbox (and unread) tags (inbox unread)
++thread:000000000000000f 2009-11-18 [4/4] Jjgod Jiang, Alexander Botero-Lowry; [notmuch] Mac OS X/Darwin compatibility issues (inbox unread)
++EOF
++    test_expect_equal_file EXPECTED OUTPUT
++
++    test_begin_subtest "combine regexp from and subject"
++    notmuch search re:subject:-C and re:from:.an.k > OUTPUT
++    cat <<EOF > EXPECTED
++thread:0000000000000018 2009-11-17 [1/2] Jan Janak| Carl Worth; [notmuch] [PATCH] Older versions of install do not support -C. (inbox unread)
++EOF
++    test_expect_equal_file EXPECTED OUTPUT
++
++    test_begin_subtest "bad subprefix"
++    notmuch search 're:unsupported:.*' 1>OUTPUT 2>&1
++    cat <<EOF > EXPECTED
++notmuch search: A Xapian exception occurred
++A Xapian exception occurred performing query: unsupported regexp field 'unsupported'
++Query string was: re:unsupported:.*
++EOF
++    test_expect_equal_file EXPECTED OUTPUT
++
++    test_begin_subtest "regexp error reporting"
++    notmuch search 're:from:unbalanced[' 1>OUTPUT 2>&1
++    cat <<EOF > EXPECTED
++notmuch search: A Xapian exception occurred
++A Xapian exception occurred performing query: Invalid regular expression
++Query string was: re:from:unbalanced[
++EOF
++    test_expect_equal_file EXPECTED OUTPUT
++fi
++
++test_done
+-- 
+2.8.1
+
-- 
2.26.2
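
The per-value matching step used by RegexpPostingSource can be tried without Xapian. The following is a minimal sketch rather than code from the patch: `ValueRegexp` and `split_field` are hypothetical names introduced only for illustration, and the sample sender strings in the usage note are made up. It assumes a POSIX `<regex.h>`, and it reports a missing colon as an error, which the patch itself does not do.

```cpp
#include <regex.h>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

/* Hypothetical helper: split "from:C.* Wo" into ("from", "C.* Wo") at
 * the first colon, as RegexpFieldProcessor::operator() does. Unlike the
 * patch, this sketch rejects input that has no colon at all. */
static std::pair<std::string, std::string>
split_field (const std::string &str)
{
    std::string::size_type pos = str.find (':');
    if (pos == std::string::npos)
        throw std::invalid_argument ("expected field:regexp, got '" + str + "'");
    return std::make_pair (str.substr (0, pos), str.substr (pos + 1));
}

/* Hypothetical wrapper: compile a POSIX extended regexp once, then test
 * plain strings against it. REG_NOSUB is enough because only a yes/no
 * answer is needed, never the match offsets. */
class ValueRegexp
{
    regex_t regexp_;

  public:
    explicit ValueRegexp (const std::string &pattern)
    {
        int err = regcomp (&regexp_, pattern.c_str (), REG_EXTENDED | REG_NOSUB);

        if (err != 0) {
            /* The first call returns the buffer size needed for the
             * message, including the terminating NUL. */
            size_t len = regerror (err, &regexp_, nullptr, 0);
            std::vector<char> buffer (len);
            (void) regerror (err, &regexp_, buffer.data (), len);
            throw std::runtime_error (std::string (buffer.data ()));
        }
    }

    ~ValueRegexp () { regfree (&regexp_); }

    /* No copying: regex_t owns resources released by regfree. */
    ValueRegexp (const ValueRegexp &) = delete;
    ValueRegexp &operator= (const ValueRegexp &) = delete;

    /* True if the pattern matches anywhere in value (unanchored). */
    bool matches (const std::string &value) const
    {
        return regexec (&regexp_, value.c_str (), 0, nullptr, 0) == 0;
    }
};
```

With this sketch, `ValueRegexp ("C.* Wo").matches ("Carl Worth")` is true while the same pattern does not match "Chris Wilson", and `ValueRegexp ("carl")` does not match "Carl Worth" because matching is case sensitive. A `std::vector<char>` holds the error text so that no manual `delete[]` is needed.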