Re: Searching messages by size with notmuch
author Daniel Kahn Gillmor <dkg@fifthhorseman.net>
Fri, 22 Jan 2016 17:02:35 +0000 (12:02 -0500)
committer W. Trevor King <wking@tremily.us>
Sat, 20 Aug 2016 23:20:55 +0000 (16:20 -0700)
2c/0c55fad0fb9c07b8093772eb5204c524853a42 [new file with mode: 0644]

diff --git a/2c/0c55fad0fb9c07b8093772eb5204c524853a42 b/2c/0c55fad0fb9c07b8093772eb5204c524853a42
new file mode 100644
index 0000000..28dd813
--- /dev/null
@@ -0,0 +1,209 @@
+Return-Path: <dkg@fifthhorseman.net>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+ by arlo.cworth.org (Postfix) with ESMTP id ADF3E6DE0B38\r
+ for <notmuch@notmuchmail.org>; Fri, 22 Jan 2016 09:02:43 -0800 (PST)\r
+X-Virus-Scanned: Debian amavisd-new at cworth.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: -0.013\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=-0.013 tagged_above=-999 required=5\r
+ tests=[AWL=-0.013] autolearn=disabled\r
+Received: from arlo.cworth.org ([127.0.0.1])\r
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)\r
+ with ESMTP id W4AgVm5pY6tx for <notmuch@notmuchmail.org>;\r
+ Fri, 22 Jan 2016 09:02:38 -0800 (PST)\r
+Received: from che.mayfirst.org (che.mayfirst.org [209.234.253.108])\r
+ by arlo.cworth.org (Postfix) with ESMTP id 4C4F16DE0B2F\r
+ for <notmuch@notmuchmail.org>; Fri, 22 Jan 2016 09:02:38 -0800 (PST)\r
+Received: from fifthhorseman.net (unknown [38.109.115.130])\r
+ by che.mayfirst.org (Postfix) with ESMTPSA id 000A3F984;\r
+ Fri, 22 Jan 2016 12:02:35 -0500 (EST)\r
+Received: by fifthhorseman.net (Postfix, from userid 1000)\r
+ id 7376C20112; Fri, 22 Jan 2016 12:02:35 -0500 (EST)\r
+From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>\r
+To: Antoine Amarilli <a3nm@a3nm.net>, notmuch@notmuchmail.org\r
+Subject: Re: Searching messages by size with notmuch\r
+In-Reply-To: <20160122151318.GA17099@mu.a3nm.net>\r
+References: <20160122151318.GA17099@mu.a3nm.net>\r
+User-Agent: Notmuch/0.21+67~g41ad7ff (http://notmuchmail.org) Emacs/24.5.1\r
+ (x86_64-pc-linux-gnu)\r
+Date: Fri, 22 Jan 2016 12:02:35 -0500\r
+Message-ID: <87twm5r1hw.fsf@alice.fifthhorseman.net>\r
+MIME-Version: 1.0\r
+Content-Type: multipart/mixed; boundary="=-=-="\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.20\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+ <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <https://notmuchmail.org/mailman/options/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch/>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <https://notmuchmail.org/mailman/listinfo/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Fri, 22 Jan 2016 17:02:43 -0000\r
+\r
+--=-=-=\r
+Content-Type: text/plain\r
+\r
+On Fri 2016-01-22 10:13:18 -0500, Antoine Amarilli wrote:\r
+\r
+> After chatting on #notmuch, I wanted to suggest a feature which would be\r
+> useful, at least to me: searching for messages by size.\r
+>\r
+> My use case would be to look for long messages, but dkg on IRC mentioned\r
+> that it could also be useful to clean up messages to save disk space.\r
+>\r
+> It is unclear whether the size of a message should be defined as that of\r
+> a single copy of the message, or that of all copies; and it is unclear\r
+> whether it should be the total size (for my purposes I would have been\r
+> interested in the size of the plaintext part of the message only).\r
+\r
+Note that "the plaintext part of the message" might mean the sum of\r
+multiple plaintext parts too -- consider this message as an example:\r
+\r
+\r
+--=-=-=\r
+Content-Type: image/png\r
+Content-Disposition: inline; filename=stock_smiley-7.png\r
+Content-Transfer-Encoding: base64\r
+\r
+iVBORw0KGgoAAAANSUhEUgAAADAAAAAwCAYAAABXAvmHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\r
+AAAN1wAADdcBQiibeAAAABl0RVh0U29mdHdhcmUAd3d3Lmlua3NjYXBlLm9yZ5vuPBoAAAAXdEVY\r
+dEF1dGhvcgBMYXBvIENhbGFtYW5kcmVp35EaKgAAACl0RVh0RGVzY3JpcHRpb24AQmFzZWQgb2Yg\r
+SmFrdWIgU3RlaW5lciBkZXNpZ26ghAVzAAAP7UlEQVRo3r2Ze4xcV3nAf+fce+exs++112snjhM7\r
+NiFxRIKSUPIghBSImiht1BpVqYpAIJAQIKqqoJKgFkSrQhCggugDaEJbpZVKVBISEoLCQylpAgmJ\r
+A9gkcezYjp+7OzM7szNz7z3nfF//uHftjbWxTQId6dPO7J255/t973OuUVVexcsclycsjFlYiKBr\r
+Ya2BtoFxhaAwEOgKjIQf0tU382YBtPj5K1fCvAqAJeUt7IqgH4MkoAlIDEkE1oBonyBDRB6Mg3kH\r
+eBgJsFtgm7waiFcKsGT1CEigUoW4lhHXDWnDUK0ZXAUSG4OCz3MkrWD74AZg+6A5uHwJwmDQVwAR\r
+v3Lld8SQJDA01DpycOa5x//n90KeviWJorNsUp2K4uqwjepV8QMnPu+J5PPO+YM2qf/wrC2X3jez\r
+ZdN+yHpwQVaY///FA8bAXxvYFtOs1L531+dvjjS8qzo6ffH68y5m1brNlaQxRVQZwURVUAEToSFF\r
+8h55f57moV1u/zPbGSwcfNaJvfOia9/6tTWbLlyAcz1Q5sVvL4QM7Eju/fvPXE+efeGcrZfNbNh6\r
+ZbU+uR5CHzTD2Bi1McbEYGNQQTVgxKHBg61ioiHS9ovs3/FItuPJR3tE0S033fSpOzj77LyE+O0A\r
+fOu2mzb5/uDOmY1bL3zdFdfV68PjSHoE4gq2PolJRjBxHWwFY6Miv1HQgIYcQoq6RSRtoT7HVleR\r
+9RbZ/uPvpC88s/0QSX3bzZ+46En4K/mNA/zHx992dVJN7r78rTeOrDrzXCvdPWAV25jBDk1iKiOY\r
+pAFRBWMTMLZMF1AVEA8hQ/0AzbvIoIn0joKAHd5A69Bu/d69dw36afau9332oW+ebijZ0/nSv3/s\r
+qvdVG0P3v+X6bWPjNW+z3fcgeQvianmHAMajODAOJV8mGZgcTI4aBxRV1FgwcYK6NvneBxhNcnPD\r
+TTcPjQwPf+MfPvzGT2OM+Y0A/NtfXPmescnpL1573Y31uPUU7uDDmCjBIKApaB/VHkgP6KHaR00h\r
+mD5qBig90D5oD9XFQqSH6gDUgbW4I48RtZ7khhv+oL52es1HvvKBy/7uVQN846NvuqJSa3zpyive\r
+VJdd9+DnnsZYi2oGoQuuhbgm6puItEA6GLpAD0MJQSFqFlHtoNJCfQtcE81bqO9A6KNGcHO/wD9/\r
+D9dc+5ahWr3xoS+8/5KbXzHA7X9+9Xqreu9Vb7yszoGH8c1fAQENfTRvI9kckh+BfBbjmyBdhB5C\r
+VobOUrjkQIZqitIH6UJoIu4omh+GbBbyFup6GPX49vPIvu9z/bVX1KMo/upn33vpZa8IwGj2X5ds\r
+fe1ovb+XfP/DhfK+j+YLaD6PZLNoPo+GFvc8+DhGUozmGHWgvpgWNAAeJWBw5bWMu7/zGBramLyJ\r
+ZLNINlt40/cw4sgPPEplcQ/X/c4FQxZ/1yc/eU38awF89SOXv314eHzr2tXjNn3mvrKK5KhfRPM2\r
+ZG3EtVC/yG3//Agf+PgD3HnXwyBZMSGow6gv4ls9RhxIDpLypX+8nw/e+l0+90+PIKGLug6aLyBZ\r
+G1wXlRRVIX3uftasmmTdqrGJ6gsL7z19AGOMlfDl129Z38j3/KhQAkHFQUiR0Eekh5Wirt/3gz3c\r
+eustfPv+JyH0IPQxMkBlUDQ2GRTJKgMIfb517xPceust3PPQLjRkRW+QPsigCE9xGBVQR7b7+1x5\r
+4aaGivztJ99/ydBpAXzlQ2/4w1WT4zPjiccd/QUYMGgxFojHiGCCR0qr7tq3wNe//i8889xh8F0k\r
+dIsElx5GuqA9TOiioYOGDr/adaT4/p6Fwqvqyh7hQAQklC3A4I7uZDj2nHfGaNX0wp+d1jBnxb93\r
+8/TocL7/p8cnXGPAaHnjgCIY9Swu9mku5DS3b2eoHqO+jY0EY3LUVMDGqATwGbgB3dYss3M9Zue2\r
+AzA7v8hUxS0bgaRYpxwXjYF8/0+54Mzzhn6xv/0e4G9O6oE73n1NTUxy9dT4KKG1d9kVfel7EUwI\r
+NJJArVLcYnqqhmZtNG+ibg5cE+PnwM8X4po0kj5DtcJmSWyYGBIIHhUpjHVC71UgtPYxNT6KGrP2\r
+E+983TknBViop2+bGBlydA8hLj12U10yvioqxQctE3TjGTUAtmxogGsVlSmbh3wWSWchmy0+Z/NY\r
+12Lz2cMAnLOuTowrCoTK8UWWQJbW9jmhc5iNqxqEnN8/aQhpcNdND8fDrrnvuAlecsNSRNAgqAS2\r
+bhpmePVWXveaJpp1MeRFAocYjMEoxRTqMkLuuOi8EaqjW9gwuhcJHoKUuwFZZrClNQslfGsfZ07O\r
+1HYeaP0R8MWX9YBIOLcaR8b128eU1WXKqwKihctFMAqf+8trOHdtxsc/fDXq+mjWQdMFZLCApm0k\r
+bUPaRrIuNgi33Xot56zJ+Pwtb4IQUA3Fllm0VPylBlMF129TTyJEzVkn94BwRhwZXN4nluLHZulG\r
+ophyEVNCqCirpldxx5ffgQyOElpZkY/BYSKLlB5ABHUeNTA6uYZ//co2wtwu8nYAX3iy8GwZoroM\r
+RhSX96kmljwwdVKAgJmOLORpn1qkGBV6qVKPlCQulDZBjv2VPCMc2o72D6GuDZIiscWEANaCKQ0p\r
+Wlg7pGj2FCYaIrRehPJeiKKlF5Y8HIKSOaWqikv7RFFEEKrvfvc5tdtv35OuDCCMqAq5d0U5E6Ve\r
+UfY3A1NjlvGk9ERQNAjGe0J3DslamCTGxBFIBJGgtiy/qiAUOeMD6o+gzqEuFP9bEjkuiwNhoSPM\r
+1BRxSu4dIoo1mtXmKlPAgZcBkO4gzap1WyWXAVVRrAjrxyxPHvaM9uC89RYTCQSDegEbwC/FYJHg\r
+RLY4USk3NIVVS4X9cgmFF4JAEJxTduz3iBMumALNlSCKs1V8npJ5raaL+fzJkvjoYpqTmwqZAKGw\r
+uBXh4nWG+U7gP/+3z479OeoEDaFQxElh0VIk92ju0dyhuUOcQ/NQiAulJ4rfiitgXjjq+O/HB6gI\r
+W9cUeUMQMgFnqywOBqhKdvsPjofPSkl8YDHNzx+yFTIPIoINggaLjZTLN0Ts7hge2jHgx8/lXPHa\r
+OhecXSSqZSnZLcYa1JjyvKos7yKFQY5BB8QHnn0x48ndKb1+4G2viZmuFZBL+ZEJuKjCQi8D0fmT\r
+JrGHXZ2B/93JpGYGwZAGZah0sQkBiQwbJ2KmL2vw/Wc99/+sx4+fSdl8ZpVNaytsWFfFeMVG5Rxg\r
+TmiApVKdrmf78wOe2uNI08D5Z1S54QKoBAdOES9oULIAg2BwtRrNvkNU9p28Col94GhX/2T92tHR\r
+QQcGAeplfGowGC+oFUaSnBsvHmfPYoMnn59n594BT+/JsNEiZ61OWDsVE8eWyEIUGWIDcx3PkaZn\r
+dsGTpkolUjauW8WFG0aYsbNI2kN8wHg51igHodAhVEbZf8ANnNhvnhSgWus8uDhoVPqSUDUV+iGn\r
+FpSGFzQyZcIW9d1mXTY2crZcfxUdX+NXO3/O3n37SNPArgOCyLLBRg2RUSzKqkbMea/fysZN51Br\r
+/xI3t4uQF2FVKB5QJ6Re6QdDqjGehIPtzDiN7z7lscrH3nHBAxsm9e3rojka2RyTCUxUlKQSYZII\r
+k8TYJII4QmMLNsI2JqiccSmMbiFXw2CxRdqdZ7F9GBEYnTqT4ckZao1JbNZGWzvJX3wUyVPwRXhS\r
+FgByT3CeZg5NZ+glkxwMkzy6x++548HdG085Todgvnaww+XTMxMjvf48FaskHsZNODZaF6dOpmhS\r
+EfjOPHn3QdQ+SDK1mZEz3sj4yNmYDVvBRhAc6aEn6O98jJD1MSqYsmlZ0aJQ+AC+SOxugF4w9IPB\r
+j07x3O68F8R87bQOtowx5iM3nffMxqmweSZuMpQ1mUhgLFFGKgYii63EaBxBbNHIIljEWkRBDagY\r
+pJyNjw1mxmBUMUaxZf2ORDA+YELA+IBmgb4X2s7QdtBLJjgSRnlkt2l1GtmZ99xzoH/KHZmqahDz\r
+gReaUS+Px+hJRNdD1xt6uR4rgSwTE47Xd5cHcufJs0CWheKvCzjn8U6KCuMFnAfnUe/BFz1i4IWu\r
+N3Q99ILFVcf5+QFdDIGPraT8y27qv3T3zodyL0/ta2nw9dUlAHS8oecUzYvFyZcgCkviBRuO13Dx\r
+xUyDF/CKCYLxgnEB6wMm95gy7nte6HjDgivWyoemeX42o92nedjsuf3lNvXxCnt6A5iFNPvTIJWf\r
+VeJkfKY6SSdrlps+Q1CloYEkUggWYsEaC6ZoYLEW4SJl9FiKySICIopB0FKChUDXw2Jp+Y6HNJng\r
+6MDw9IGo7zx//MNHEGOM0RUOco/lgDl+Flk+NiLedvU5V00Ox3efv7pXn0pSavkCjUgYjqERKUMx\r
+1CMwkUWtRY1BjEEwxWB5bJHiYCAqu3UkxcyUBaUfoOcNvVBYPo3HWNA6Dz8fD7opH7zvsf13Fj32\r
++LOD5SArAUSlZxIg2Xb1hndNDEWfPn96UJ9uWJKsTU36NCIYiqAWKTULVQvGFkmuAmrN8T2uKVxh\r
+pDjdyAOkAqkY+h76AfpicY01NAfCT16I+62+3PHtxw5+ojzeK5+rEU4FYJcBVJfkxjesf+fUaHLL\r
+5qledf1khboNRP05ajhqFmoRVKwSG0gsRKZMLnN8gyUKXsEJRa8IlKMC5JVRZGg1L853+eneJO0O\r
+wh33P37oM0AKZKUsHV/I6QAkywDqQO3SLVNv3nzGyOc2jKfVTdPWToyME/k+ZvEIcUipWKiUU3RS\r
+jkIvOV0QcGXY5wq5WEJ9AhrT9LOMZw50+OWRanZgNvvUT56d/RYwWAEg/DoAlRKgVkKMr5mqX3L5\r
+lsmPNmp25typQbxhqsLE5FqqsUUHLUK/iRksYBDsslluKR80HsLUx6E6itbG6HZavHCoxS8Px66b\r
+0d2+e+G2fbO9HwHt8lh7Sfn8dAFOzIFqCVIDGsAMsHHzutGrtm4YvWG4po2zxgbx2jHL5PgkI+Or\r
+qQ9PkiQVVAPqHcZaMBFBlbTfoddp0uk22XukzwvtxHcGkXvuUO+7O/a2vwvsAo4AvdL6+TLrn1YO\r
+LPUGW0LEJ4TTMLAKmAZmXnvW2BUbVtffUK/FZzaiENaMZJWhJFCLhOFazFC1Qpp7epmj7yD1EYcX\r
+K66bJzbN/ZED89kTO/cuPOREDgFHgTlg8YSw8cuqkLwswAl9YOkBV3yCR5Z7YwwYAUaSxE5snBm5\r
+eO1E9aKhip2OIjsWWzukxlRRdaLa8z50Uq9zR1vZ07sP9x7vZ24O6AKdUpZbfbnF/bFIPFkfOAFg\r
+qR8sz4toGUx8Qp7US7DKMuglI5SHnseUyUtFByfEt1+m9JIIy4+4TgfgJCBmWXjZFcDMMsWPP6I8\r
+XohkGYiuoOhy0VMpfkqAFUBOhGGZomaFayu9dAWRFa5xKsVPG+BlQJYDcYLSpwI48b2uNBFzmq//\r
+AxyL7Nqf76KTAAAAAElFTkSuQmCC\r
+--=-=-=\r
+Content-Type: text/plain\r
+\r
+\r
+Would you count the total of all related text/plain parts?  or all\r
+text/* parts?  or, if you have a multipart/alternative node in the MIME\r
+tree, would you report it as the maximum of any of the text/*\r
+alternatives? \r
+\r
+If you're really interested in textual analysis, then the "size of\r
+plaintext" might instead be better measured in words or paragraphs,\r
+rather than octets.  Also, you might want to ignore quoted text and only\r
+measure non-quoted text (this is particularly relevant for\r
+conversations where people top-post and don't trim, or else you're\r
+actually measuring just how deep in the thread a given message is).\r
+\r
+I'm not trying to say that these metrics are impossible, just pointing\r
+out that the underlying data formats can be much more complicated than\r
+most people think about with mail.  The decision about what to count and\r
+how to count it greatly affects the possible use cases.\r
+\r
+> Ideally I'd say that all of these could make sense.\r
+\r
+They could indeed, but I think we could motivate this much better as\r
+initial work by picking one particular use case, and implementing it.\r
+The work would be something like:\r
+\r
+ a) choose the metric we care about, and describe concretely how to\r
+    calculate it for a given rfc822 file.\r
+\r
+ b) assign and name a new notmuch_value_t to identify the metric\r
+\r
+ c) update notmuch_database_add_message to insert that new value when a\r
+    file is added\r
+\r
+ d) consider what workflows are available to update the database for\r
+    already-indexed documents that do not have this value.\r
+\r
+ e) resolve what to do about documents associated with multiple filenames\r
+\r
+ f) define how to include it in searches (this is probably a\r
+    NumberValueRangeProcessor, see\r
+    file:///usr/share/doc/xapian-doc/valueranges.html)\r
+\r
+ g) update documentation for notmuch cli tools.\r
+\r
+\r
+If you work through this process for one particular "message size" use\r
+case and document the steps, then we could presumably handle the other\r
+"message size" metrics in exactly the same way, modifying only steps (a)\r
+and (e) depending on the metric.\r
+\r
+> Would anyone else on the list be interested by such a feature?\r
+\r
+I'm definitely interested, but don't have a lot of time to work on it.\r
+\r
+    --dkg\r
+\r
+--=-=-=--\r
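
Step (a) in the message asks for a concrete description of how to calculate
the metric for a given rfc822 file. A minimal sketch of two of the candidate
metrics discussed there, written in Python with only the standard library
rather than in notmuch's own C code. The function names plain_text_octets and
unquoted_word_count are hypothetical and are not part of notmuch or Xapian.
The sketch assumes that text/plain parts are summed across most multipart
nodes, that only the largest alternative is counted under
multipart/alternative, and that a "quoted" line is simply one starting with
">".

```python
from email import message_from_bytes


def plain_text_octets(node):
    """Decoded size, in octets, of the text/plain content under `node`.

    Assumption: sibling parts are summed, except under
    multipart/alternative, where only the largest alternative counts.
    """
    if node.is_multipart():
        sizes = [plain_text_octets(child) for child in node.get_payload()]
        if node.get_content_subtype() == "alternative":
            return max(sizes, default=0)
        return sum(sizes)
    if node.get_content_type() == "text/plain":
        payload = node.get_payload(decode=True)  # bytes after transfer decoding
        return len(payload) if payload is not None else 0
    return 0  # non-text leaves (e.g. image/png) are ignored


def unquoted_word_count(text):
    """Count whitespace-separated words on lines that do not start with '>'.

    Assumption: attribution lines ("On ... wrote:") are not detected and
    therefore still count as unquoted text.
    """
    return sum(
        len(line.split())
        for line in text.splitlines()
        if not line.lstrip().startswith(">")
    )


def measure_file(path):
    """Parse one rfc822 file from disk and return its text/plain size."""
    with open(path, "rb") as handle:
        return plain_text_octets(message_from_bytes(handle.read()))
```

For a message shaped like the one above (text/plain, image/png, text/plain
inside multipart/mixed), plain_text_octets adds the two text parts and skips
the image. Which of these numbers is worth indexing remains the open question
from step (a).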