Re: [PATCH v3 3/5] Add indexing for the mimetype term
diff --git a/34/e3f9eba8ed841f241dcdfebf19d7c6c657aa6a b/34/e3f9eba8ed841f241dcdfebf19d7c6c657aa6a
new file mode 100644
index 0000000..5e93974
--- /dev/null
@@ -0,0 +1,186 @@
+Return-Path: <todd@electricoding.com>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+       by olra.theworths.org (Postfix) with ESMTP id A977C431FC2\r
+       for <notmuch@notmuchmail.org>; Sat, 17 Jan 2015 08:41:47 -0800 (PST)\r
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: 2.438\r
+X-Spam-Level: **\r
+X-Spam-Status: No, score=2.438 tagged_above=-999 required=5\r
+       tests=[DNS_FROM_AHBL_RHSBL=2.438] autolearn=disabled\r
+Received: from olra.theworths.org ([127.0.0.1])\r
+       by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
+       with ESMTP id LuMEjxPDhkyZ for <notmuch@notmuchmail.org>;\r
+       Sat, 17 Jan 2015 08:41:44 -0800 (PST)\r
+Received: from s75.web-hosting.com (s75.web-hosting.com [198.187.31.9])\r
+       (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))\r
+       (No client certificate requested)\r
+       by olra.theworths.org (Postfix) with ESMTPS id 6847F431FAF\r
+       for <notmuch@notmuchmail.org>; Sat, 17 Jan 2015 08:41:44 -0800 (PST)\r
+Received: from user-69-73-37-128.knology.net ([69.73.37.128]:46736\r
+ helo=tz-lab)  by server75.web-hosting.com with esmtpsa\r
+       (UNKNOWN:DHE-RSA-AES128-SHA:128) (Exim 4.82)    (envelope-from\r
+ <todd@electricoding.com>)     id 1YCWRS-001OHm-HT; Sat, 17 Jan 2015 11:41:42\r
+ -0500\r
+From: Todd <todd@electricoding.com>\r
+To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org\r
+Subject: Re: [PATCH v3 3/5] Add indexing for the mimetype term\r
+In-Reply-To: <877fwlbfg1.fsf@maritornes.cs.unb.ca>\r
+References: <1421368229-4360-1-git-send-email-todd@electricoding.com>\r
+       <1421368229-4360-3-git-send-email-todd@electricoding.com>\r
+       <877fwlbfg1.fsf@maritornes.cs.unb.ca>\r
+User-Agent: Notmuch/0.19+17~gd8b219d (http://notmuchmail.org) Emacs/24.4.1\r
+       (x86_64-unknown-linux-gnu)\r
+Date: Sat, 17 Jan 2015 10:41:10 -0600\r
+Message-ID: <871tmt5pi1.fsf@electricoding.com>\r
+MIME-Version: 1.0\r
+Content-Type: multipart/signed; boundary="=-=-=";\r
+       micalg=pgp-sha1; protocol="application/pgp-signature"\r
+X-AntiAbuse: This header was added to track abuse,\r
+       please include it with any abuse report\r
+X-AntiAbuse: Primary Hostname - server75.web-hosting.com\r
+X-AntiAbuse: Original Domain - notmuchmail.org\r
+X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]\r
+X-AntiAbuse: Sender Address Domain - electricoding.com\r
+X-Get-Message-Sender-Via: server75.web-hosting.com: authenticated_id:\r
+       todd@electricoding.com\r
+X-Source: \r
+X-Source-Args: \r
+X-Source-Dir: \r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.13\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+       <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Sat, 17 Jan 2015 16:41:47 -0000\r
+\r
+--=-=-=\r
+Content-Type: text/plain\r
+Content-Transfer-Encoding: quoted-printable\r
+\r
+\r
+>>>>> "DB" =3D=3D David Bremner <david@tethera.net> writes:\r
+\r
+    DB> Todd <todd@electricoding.com> writes:\r
+    >> Adds the indexing and removes the broken test flag\r
+    >> ---\r
+    >> lib/database.cc        |  1 +\r
+    >> lib/index.cc           | 10 ++++++++++\r
+    >> test/T190-multipart.sh |  4 ----\r
+    >> 3 files changed, 11 insertions(+), 4 deletions(-)\r
+    >>\r
+    >> diff --git a/lib/database.cc b/lib/database.cc\r
+    >> index 0d2c417..3974e2e 100644\r
+    >> --- a/lib/database.cc\r
+    >> +++ b/lib/database.cc\r
+    >> @@ -254,6 +254,7 @@ static prefix_t PROBABILISTIC_PREFIX[]=3D {\r
+    >> { "from",                       "XFROM" },\r
+    >> { "to",                 "XTO" },\r
+    >> { "attachment",         "XATTACHMENT" },\r
+    >> +    { "mimetype",              "XMIMETYPE"},\r
+    >> { "subject",            "XSUBJECT"},\r
+    >> };\r
+\r
+    DB> I think the commit message should articulate why we are indexing th=\r
+is as\r
+    DB> a probabilistic prefix, rather than as a boolean prefix. In particu=\r
+lar,\r
+    DB> this gives people a last chance to complain.\r
+\r
+    DB> The reference I know is http://xapian.org/docs/queryparser.html\r
+\r
+    DB> If I understand correctly (it would be great if you could test this\r
+    DB> Todd) , with a probabilistic prefix,\r
+\r
+    DB>    mimetype:pdf\r
+\r
+    DB> will match\r
+\r
+    DB> application/pdf\r
+    DB> image/pdf\r
+    DB> application/x-pdf\r
+    DB> application/x-ext-pdf\r
+\r
+    DB> but not\r
+\r
+    DB> application/x-bzpdf\r
+    DB> application/x-gzpdf\r
+    DB> application/x-xzpdf\r
+\r
+    I just tested, and it does work this way with your examples.  I\r
+    *believe* from reading the docs, that xapian is treating the full\r
+    MIME-type queries as phrase searches anyway due to the embedded\r
+    slashes.\r
+\r
+    From http://xapian.org/docs/queryparser.html:\r
+\r
+         A phrase surrounded with double quotes ("") matches documents\r
+         containing that exact phrase. Hyphenated words are also treated\r
+         as phrases, as are cases such as filenames and email addresses\r
+         (e.g. /etc/passwd or president@whitehouse.gov).\r
+\r
+    I think that we'll get good behavior from the types of queries that\r
+    will typically be performed due to this automatic phrasing.\r
+\r
+\r
+\r
+    DB> On the whole, this is probably more beneficial than bad.  The downs=\r
+ide\r
+    DB> of probabilistic prefixes/fields is that they are not "anchored", so\r
+    DB> there is no easy way to distinguish\r
+\r
+    DB>       application/pdf\r
+\r
+    DB> from\r
+\r
+    DB>       pdf\r
+    DB>       application/x-pdf\r
+\r
+    DB> I guess in a perfect world this would also be explained in\r
+    DB> notmuch-search-terms(7), but that's pretty much orthogonal to this\r
+    DB> series.\r
+\r
+    If separate messages with application/pdf and application/x-pdf are\r
+    indexed, then:\r
+=20=20=20=20\r
+    mimetype:application/x-pdf finds only the application/x-pdf\r
+    mimetype:application/pdf finds only the application/pdf\r
+    mimetype:pdf finds both of the messages\r
+\r
+    I am fairly sure that this behaviour is a result of the automatic\r
+    phrasing mentioned above.\r
+\r
+    - Todd\r
+=20=20=20=20\r
+    DB> d\r
+\r
+--=-=-=\r
+Content-Type: application/pgp-signature; name="signature.asc"\r
+\r
+-----BEGIN PGP SIGNATURE-----\r
+Version: GnuPG v1\r
+\r
+iQIcBAEBAgAGBQJUupCnAAoJEEc0ULlfRYDu0f8QAJVtVpA9kQKjBgpTkrieYQnE\r
+ADCWWrIwiI7rU8MyaWD5GqVBPVUdHvYaKCGoQhiirnqvNEk0CrsF4rrDB7UNcSVH\r
+LKV5SDNIBGxw0EsMtukPXz0zgoJfKIWfqWieC97j832fI/2NZHetrs9VEWPHVLzJ\r
+1VnPQpsAFt3dLXw8ff9WjkEZVcj/fbVBvHNZNX+YqY9RdzTRomJP4pqn0S1YKY9o\r
+SohqbLpS7HVh7JFOdPMVyALOqs5dh44n0PJYe7FDazqNwb2w0PqEa2dQnHjGF/0e\r
+8SRUSKCTpvYC9buRfcFmZj5KWGx/vgi9T17etXJYU2Vd/CQNPAZmliZS9gaYKlWt\r
+8YasMJyDDRq79XmiFbJwao47HUig6IFBdgGCMVxzmUZPTlINO8lQyuP/O9DlHVo5\r
+2PK2vf/d07k5VnH6tjukEY6fEMQqQFkXG5JIWw0VLKMbVBG8esFwfpeEx0KdW6Qi\r
+oJfHxjmHMfAug9L/lukHotW7fH3mHZ2RQLWClaqhVBGgeGRfyMJEjnbLVCiZlk/0\r
+0p4TDt5LTVAtopquwCMHpwJG7BA9CMOwGdOJB7hv/OTqVuj3ZSq1JP93jsrV4tO7\r
+azEYOYW/VnrsOoGmsW/K3Hggl2OYej9aYmugTw3fodU9RV+xmfSrvvU/qkKWMSel\r
+oTv39uIcY/R+dmhU8EO1\r
+=5flc\r
+-----END PGP SIGNATURE-----\r
+--=-=-=--\r