From: Todd
To: David Bremner, notmuch@notmuchmail.org
Subject: Re: [PATCH v3 3/5] Add indexing for the mimetype term
In-Reply-To: <877fwlbfg1.fsf@maritornes.cs.unb.ca>
References: <1421368229-4360-1-git-send-email-todd@electricoding.com>
 <1421368229-4360-3-git-send-email-todd@electricoding.com>
 <877fwlbfg1.fsf@maritornes.cs.unb.ca>
User-Agent: Notmuch/0.19+17~gd8b219d (http://notmuchmail.org) Emacs/24.4.1
 (x86_64-unknown-linux-gnu)
Date: Sat, 17 Jan 2015 10:41:10 -0600
Message-ID: <871tmt5pi1.fsf@electricoding.com>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
 micalg=pgp-sha1; protocol="application/pgp-signature"
List-Id: "Use and development of the notmuch mail system."

--=-=-=
Content-Type: text/plain

>>>>> "DB" == David Bremner writes:

DB> Todd writes:
>> Adds the indexing and removes the broken test flag
>> ---
>> lib/database.cc | 1 +
>> lib/index.cc | 10 ++++++++++
>> test/T190-multipart.sh | 4 ----
>> 3 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/database.cc b/lib/database.cc
>> index 0d2c417..3974e2e 100644
>> --- a/lib/database.cc
>> +++ b/lib/database.cc
>> @@ -254,6 +254,7 @@ static prefix_t PROBABILISTIC_PREFIX[]= {
>> { "from", "XFROM" },
>> { "to", "XTO" },
>> { "attachment", "XATTACHMENT" },
>> + { "mimetype", "XMIMETYPE"},
>> { "subject", "XSUBJECT"},
>> };

DB> I think the commit message should articulate why we are indexing this as
DB> a probabilistic prefix, rather than as a boolean prefix. In particular,
DB> this gives people a last chance to complain.

DB> The reference I know is http://xapian.org/docs/queryparser.html

DB> If I understand correctly (it would be great if you could test this
DB> Todd) , with a probabilistic prefix,

DB> mimetime:pdf

DB> will match

DB> application/pdf
DB> image/pdf
DB> application/x-pdf
DB> application/x-ext-pdf

DB> but not

DB> application/x-bzpdf
DB> application/x-gzpdf
DB> application/x-xzpdf

I just tested, and it does work this way with your examples. I
*believe* from reading the docs, that xapian is treating the full
MIME-type queries as phrase searches anyway due to the embedded slashes.
From http://xapian.org/docs/queryparser.html:

    A phrase surrounded with double quotes ("") matches documents
    containing that exact phrase. Hyphenated words are also treated as
    phrases, as are cases such as filenames and email addresses
    (e.g. /etc/passwd or president@whitehouse.gov).

I think that we'll get good behavior from the types of queries that will
typically be performed due to this automatic phrasing.

DB> On the whole, this is probably more beneficial than bad. The downside
DB> of probabilistic prefixes/fields is that they are not "anchored", so
DB> there is no easy way to distinguish

DB> application/pdf

DB> from

DB> pdf
DB> application/x-pdf

DB> I guess in a perfect world this would also be explained in
DB> notmuch-search-terms(7), but that's pretty much orthogonal to this
DB> series.

If separate messages with application/pdf and application/x-pdf are
indexed, then:

    mimetype:application/x-pdf finds only the application/x-pdf
    mimetype:application/pdf finds only the application/pdf
    mimetype:pdf finds both of the messages

I am fairly sure that this behaviour is a result of the automatic
phrasing mentioned above.
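
To make the matching above concrete, here is a rough sketch in Python. It
is not Xapian and does not use the Xapian API; it only models my
assumption about what happens: the MIME type is split into word terms at
every non-alphanumeric character, and a query such as application/pdf is
treated as a phrase, i.e. its terms must appear next to each other and in
order. The names tokenize and phrase_match are made up for this
illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase terms at every non-alphanumeric character.

    Assumption: this only approximates how a MIME type is broken into terms.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(query, mime_type):
    """Return True if the query's terms occur contiguously, in order,
    among the terms of mime_type (a simple model of a phrase search)."""
    q = tokenize(query)
    d = tokenize(mime_type)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


if __name__ == "__main__":
    indexed = [
        "application/pdf",
        "image/pdf",
        "application/x-pdf",
        "application/x-ext-pdf",
        "application/x-bzpdf",
    ]
    for query in ["pdf", "application/pdf", "application/x-pdf"]:
        hits = [m for m in indexed if phrase_match(query, m)]
        print(f"mimetype:{query} -> {hits}")
```

Under this model "pdf" is a standalone term in application/x-pdf but not
in application/x-bzpdf, and application/pdf does not match
application/x-pdf because the "x" term sits between the two query terms.
Real Xapian tokenisation has more rules than this, so treat it as an aid
to reading the results and not as a description of the implementation.
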

- Todd

DB> d

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBAgAGBQJUupCnAAoJEEc0ULlfRYDu0f8QAJVtVpA9kQKjBgpTkrieYQnE
ADCWWrIwiI7rU8MyaWD5GqVBPVUdHvYaKCGoQhiirnqvNEk0CrsF4rrDB7UNcSVH
LKV5SDNIBGxw0EsMtukPXz0zgoJfKIWfqWieC97j832fI/2NZHetrs9VEWPHVLzJ
1VnPQpsAFt3dLXw8ff9WjkEZVcj/fbVBvHNZNX+YqY9RdzTRomJP4pqn0S1YKY9o
SohqbLpS7HVh7JFOdPMVyALOqs5dh44n0PJYe7FDazqNwb2w0PqEa2dQnHjGF/0e
8SRUSKCTpvYC9buRfcFmZj5KWGx/vgi9T17etXJYU2Vd/CQNPAZmliZS9gaYKlWt
8YasMJyDDRq79XmiFbJwao47HUig6IFBdgGCMVxzmUZPTlINO8lQyuP/O9DlHVo5
2PK2vf/d07k5VnH6tjukEY6fEMQqQFkXG5JIWw0VLKMbVBG8esFwfpeEx0KdW6Qi
oJfHxjmHMfAug9L/lukHotW7fH3mHZ2RQLWClaqhVBGgeGRfyMJEjnbLVCiZlk/0
0p4TDt5LTVAtopquwCMHpwJG7BA9CMOwGdOJB7hv/OTqVuj3ZSq1JP93jsrV4tO7
azEYOYW/VnrsOoGmsW/K3Hggl2OYej9aYmugTw3fodU9RV+xmfSrvvU/qkKWMSel
oTv39uIcY/R+dmhU8EO1
=5flc
-----END PGP SIGNATURE-----
--=-=-=--