From 1c53d326b007b5352f356072da482d5ec549ba88 Mon Sep 17 00:00:00 2001 From: Daniel Kahn Gillmor Date: Sat, 23 Jan 2016 12:02:35 +1900 Subject: [PATCH] Re: Searching messages by size with notmuch --- 2c/0c55fad0fb9c07b8093772eb5204c524853a42 | 209 ++++++++++++++++++++++ 1 file changed, 209 insertions(+) create mode 100644 2c/0c55fad0fb9c07b8093772eb5204c524853a42 diff --git a/2c/0c55fad0fb9c07b8093772eb5204c524853a42 b/2c/0c55fad0fb9c07b8093772eb5204c524853a42 new file mode 100644 index 000000000..28dd813e0 --- /dev/null +++ b/2c/0c55fad0fb9c07b8093772eb5204c524853a42 @@ -0,0 +1,209 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by arlo.cworth.org (Postfix) with ESMTP id ADF3E6DE0B38 + for ; Fri, 22 Jan 2016 09:02:43 -0800 (PST) +X-Virus-Scanned: Debian amavisd-new at cworth.org +X-Spam-Flag: NO +X-Spam-Score: -0.013 +X-Spam-Level: +X-Spam-Status: No, score=-0.013 tagged_above=-999 required=5 + tests=[AWL=-0.013] autolearn=disabled +Received: from arlo.cworth.org ([127.0.0.1]) + by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id W4AgVm5pY6tx for ; + Fri, 22 Jan 2016 09:02:38 -0800 (PST) +Received: from che.mayfirst.org (che.mayfirst.org [209.234.253.108]) + by arlo.cworth.org (Postfix) with ESMTP id 4C4F16DE0B2F + for ; Fri, 22 Jan 2016 09:02:38 -0800 (PST) +Received: from fifthhorseman.net (unknown [38.109.115.130]) + by che.mayfirst.org (Postfix) with ESMTPSA id 000A3F984; + Fri, 22 Jan 2016 12:02:35 -0500 (EST) +Received: by fifthhorseman.net (Postfix, from userid 1000) + id 7376C20112; Fri, 22 Jan 2016 12:02:35 -0500 (EST) +From: Daniel Kahn Gillmor +To: Antoine Amarilli , notmuch@notmuchmail.org +Subject: Re: Searching messages by size with notmuch +In-Reply-To: <20160122151318.GA17099@mu.a3nm.net> +References: <20160122151318.GA17099@mu.a3nm.net> +User-Agent: Notmuch/0.21+67~g41ad7ff (http://notmuchmail.org) Emacs/24.5.1 + 
(x86_64-pc-linux-gnu) +Date: Fri, 22 Jan 2016 12:02:35 -0500 +Message-ID: <87twm5r1hw.fsf@alice.fifthhorseman.net> +MIME-Version: 1.0 +Content-Type: multipart/mixed; boundary="=-=-=" +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.20 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Fri, 22 Jan 2016 17:02:43 -0000 + +--=-=-= +Content-Type: text/plain + +On Fri 2016-01-22 10:13:18 -0500, Antoine Amarilli wrote: + +> After chatting on #notmuch, I wanted to suggest a feature which would be +> useful, at least to me: searching for messages by size. +> +> My use case would be to look for long messages, but dkg on IRC mentioned +> that it could also be useful to clean up messages to save disk space. +> +> It is unclear whether the size of a message should be defined as that of +> a single copy of the message, or that of all copies; and it is unclear +> whether it should be the total size (for my purposes I would have been +> interested in the size of the plaintext part of the message only). 
+ +Note that "the plaintext part of the message" might mean the sum of +multiple plaintext parts too -- consider this message as an example: + + +--=-=-= +Content-Type: image/png +Content-Disposition: inline; filename=stock_smiley-7.png +Content-Transfer-Encoding: base64 + +iVBORw0KGgoAAAANSUhEUgAAADAAAAAwCAYAAABXAvmHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz +AAAN1wAADdcBQiibeAAAABl0RVh0U29mdHdhcmUAd3d3Lmlua3NjYXBlLm9yZ5vuPBoAAAAXdEVY +dEF1dGhvcgBMYXBvIENhbGFtYW5kcmVp35EaKgAAACl0RVh0RGVzY3JpcHRpb24AQmFzZWQgb2Yg +SmFrdWIgU3RlaW5lciBkZXNpZ26ghAVzAAAP7UlEQVRo3r2Ze4xcV3nAf+fce+exs++112snjhM7 +NiFxRIKSUPIghBSImiht1BpVqYpAIJAQIKqqoJKgFkSrQhCggugDaEJbpZVKVBISEoLCQylpAgmJ +A9gkcezYjp+7OzM7szNz7z3nfF//uHftjbWxTQId6dPO7J255/t973OuUVVexcsclycsjFlYiKBr +Ya2BtoFxhaAwEOgKjIQf0tU382YBtPj5K1fCvAqAJeUt7IqgH4MkoAlIDEkE1oBonyBDRB6Mg3kH +eBgJsFtgm7waiFcKsGT1CEigUoW4lhHXDWnDUK0ZXAUSG4OCz3MkrWD74AZg+6A5uHwJwmDQVwAR +v3Lld8SQJDA01DpycOa5x//n90KeviWJorNsUp2K4uqwjepV8QMnPu+J5PPO+YM2qf/wrC2X3jez +ZdN+yHpwQVaY///FA8bAXxvYFtOs1L531+dvjjS8qzo6ffH68y5m1brNlaQxRVQZwURVUAEToSFF +8h55f57moV1u/zPbGSwcfNaJvfOia9/6tTWbLlyAcz1Q5sVvL4QM7Eju/fvPXE+efeGcrZfNbNh6 +ZbU+uR5CHzTD2Bi1McbEYGNQQTVgxKHBg61ioiHS9ovs3/FItuPJR3tE0S033fSpOzj77LyE+O0A +fOu2mzb5/uDOmY1bL3zdFdfV68PjSHoE4gq2PolJRjBxHWwFY6Miv1HQgIYcQoq6RSRtoT7HVleR +9RbZ/uPvpC88s/0QSX3bzZ+46En4K/mNA/zHx992dVJN7r78rTeOrDrzXCvdPWAV25jBDk1iKiOY +pAFRBWMTMLZMF1AVEA8hQ/0AzbvIoIn0joKAHd5A69Bu/d69dw36afau9332oW+ebijZ0/nSv3/s +qvdVG0P3v+X6bWPjNW+z3fcgeQvianmHAMajODAOJV8mGZgcTI4aBxRV1FgwcYK6NvneBxhNcnPD +TTcPjQwPf+MfPvzGT2OM+Y0A/NtfXPmescnpL1573Y31uPUU7uDDmCjBIKApaB/VHkgP6KHaR00h +mD5qBig90D5oD9XFQqSH6gDUgbW4I48RtZ7khhv+oL52es1HvvKBy/7uVQN846NvuqJSa3zpyive +VJdd9+DnnsZYi2oGoQuuhbgm6puItEA6GLpAD0MJQSFqFlHtoNJCfQtcE81bqO9A6KNGcHO/wD9/ +D9dc+5ahWr3xoS+8/5KbXzHA7X9+9Xqreu9Vb7yszoGH8c1fAQENfTRvI9kckh+BfBbjmyBdhB5C +VobOUrjkQIZqitIH6UJoIu4omh+GbBbyFup6GPX49vPIvu9z/bVX1KMo/upn33vpZa8IwGj2X5ds +fe1ovb+XfP/DhfK+j+YLaD6PZLNoPo+GFvc8+DhGUozmGHWgvpgWNAAeJWBw5bWMu7/zGBramLyJ 
+ZLNINlt40/cw4sgPPEplcQ/X/c4FQxZ/1yc/eU38awF89SOXv314eHzr2tXjNn3mvrKK5KhfRPM2 +ZG3EtVC/yG3//Agf+PgD3HnXwyBZMSGow6gv4ls9RhxIDpLypX+8nw/e+l0+90+PIKGLug6aLyBZ +G1wXlRRVIX3uftasmmTdqrGJ6gsL7z19AGOMlfDl129Z38j3/KhQAkHFQUiR0Eekh5Wirt/3gz3c +eustfPv+JyH0IPQxMkBlUDQ2GRTJKgMIfb517xPceust3PPQLjRkRW+QPsigCE9xGBVQR7b7+1x5 +4aaGivztJ99/ydBpAXzlQ2/4w1WT4zPjiccd/QUYMGgxFojHiGCCR0qr7tq3wNe//i8889xh8F0k +dIsElx5GuqA9TOiioYOGDr/adaT4/p6Fwqvqyh7hQAQklC3A4I7uZDj2nHfGaNX0wp+d1jBnxb93 +8/TocL7/p8cnXGPAaHnjgCIY9Swu9mku5DS3b2eoHqO+jY0EY3LUVMDGqATwGbgB3dYss3M9Zue2 +AzA7v8hUxS0bgaRYpxwXjYF8/0+54Mzzhn6xv/0e4G9O6oE73n1NTUxy9dT4KKG1d9kVfel7EUwI +NJJArVLcYnqqhmZtNG+ibg5cE+PnwM8X4po0kj5DtcJmSWyYGBIIHhUpjHVC71UgtPYxNT6KGrP2 +E+983TknBViop2+bGBlydA8hLj12U10yvioqxQctE3TjGTUAtmxogGsVlSmbh3wWSWchmy0+Z/NY +12Lz2cMAnLOuTowrCoTK8UWWQJbW9jmhc5iNqxqEnN8/aQhpcNdND8fDrrnvuAlecsNSRNAgqAS2 +bhpmePVWXveaJpp1MeRFAocYjMEoxRTqMkLuuOi8EaqjW9gwuhcJHoKUuwFZZrClNQslfGsfZ07O +1HYeaP0R8MWX9YBIOLcaR8b128eU1WXKqwKihctFMAqf+8trOHdtxsc/fDXq+mjWQdMFZLCApm0k +bUPaRrIuNgi33Xot56zJ+Pwtb4IQUA3Fllm0VPylBlMF129TTyJEzVkn94BwRhwZXN4nluLHZulG +ophyEVNCqCirpldxx5ffgQyOElpZkY/BYSKLlB5ABHUeNTA6uYZ//co2wtwu8nYAX3iy8GwZoroM +RhSX96kmljwwdVKAgJmOLORpn1qkGBV6qVKPlCQulDZBjv2VPCMc2o72D6GuDZIiscWEANaCKQ0p +Wlg7pGj2FCYaIrRehPJeiKKlF5Y8HIKSOaWqikv7RFFEEKrvfvc5tdtv35OuDCCMqAq5d0U5E6Ve +UfY3A1NjlvGk9ERQNAjGe0J3DslamCTGxBFIBJGgtiy/qiAUOeMD6o+gzqEuFP9bEjkuiwNhoSPM +1BRxSu4dIoo1mtXmKlPAgZcBkO4gzap1WyWXAVVRrAjrxyxPHvaM9uC89RYTCQSDegEbwC/FYJHg +RLY4USk3NIVVS4X9cgmFF4JAEJxTduz3iBMumALNlSCKs1V8npJ5raaL+fzJkvjoYpqTmwqZAKGw +uBXh4nWG+U7gP/+3z479OeoEDaFQxElh0VIk92ju0dyhuUOcQ/NQiAulJ4rfiitgXjjq+O/HB6gI +W9cUeUMQMgFnqywOBqhKdvsPjofPSkl8YDHNzx+yFTIPIoINggaLjZTLN0Ts7hge2jHgx8/lXPHa +OhecXSSqZSnZLcYa1JjyvKos7yKFQY5BB8QHnn0x48ndKb1+4G2viZmuFZBL+ZEJuKjCQi8D0fmT +JrGHXZ2B/93JpGYGwZAGZah0sQkBiQwbJ2KmL2vw/Wc99/+sx4+fSdl8ZpVNaytsWFfFeMVG5Rxg +TmiApVKdrmf78wOe2uNI08D5Z1S54QKoBAdOES9oULIAg2BwtRrNvkNU9p28Col94GhX/2T92tHR 
+QQcGAeplfGowGC+oFUaSnBsvHmfPYoMnn59n594BT+/JsNEiZ61OWDsVE8eWyEIUGWIDcx3PkaZn +dsGTpkolUjauW8WFG0aYsbNI2kN8wHg51igHodAhVEbZf8ANnNhvnhSgWus8uDhoVPqSUDUV+iGn +FpSGFzQyZcIW9d1mXTY2crZcfxUdX+NXO3/O3n37SNPArgOCyLLBRg2RUSzKqkbMea/fysZN51Br +/xI3t4uQF2FVKB5QJ6Re6QdDqjGehIPtzDiN7z7lscrH3nHBAxsm9e3rojka2RyTCUxUlKQSYZII +k8TYJII4QmMLNsI2JqiccSmMbiFXw2CxRdqdZ7F9GBEYnTqT4ckZao1JbNZGWzvJX3wUyVPwRXhS +FgByT3CeZg5NZ+glkxwMkzy6x++548HdG085Todgvnaww+XTMxMjvf48FaskHsZNODZaF6dOpmhS +EfjOPHn3QdQ+SDK1mZEz3sj4yNmYDVvBRhAc6aEn6O98jJD1MSqYsmlZ0aJQ+AC+SOxugF4w9IPB +j07x3O68F8R87bQOtowx5iM3nffMxqmweSZuMpQ1mUhgLFFGKgYii63EaBxBbNHIIljEWkRBDagY +pJyNjw1mxmBUMUaxZf2ORDA+YELA+IBmgb4X2s7QdtBLJjgSRnlkt2l1GtmZ99xzoH/KHZmqahDz +gReaUS+Px+hJRNdD1xt6uR4rgSwTE47Xd5cHcufJs0CWheKvCzjn8U6KCuMFnAfnUe/BFz1i4IWu +N3Q99ILFVcf5+QFdDIGPraT8y27qv3T3zodyL0/ta2nw9dUlAHS8oecUzYvFyZcgCkviBRuO13Dx +xUyDF/CKCYLxgnEB6wMm95gy7nte6HjDgivWyoemeX42o92nedjsuf3lNvXxCnt6A5iFNPvTIJWf +VeJkfKY6SSdrlps+Q1CloYEkUggWYsEaC6ZoYLEW4SJl9FiKySICIopB0FKChUDXw2Jp+Y6HNJng +6MDw9IGo7zx//MNHEGOM0RUOco/lgDl+Flk+NiLedvU5V00Ox3efv7pXn0pSavkCjUgYjqERKUMx +1CMwkUWtRY1BjEEwxWB5bJHiYCAqu3UkxcyUBaUfoOcNvVBYPo3HWNA6Dz8fD7opH7zvsf13Fj32 ++LOD5SArAUSlZxIg2Xb1hndNDEWfPn96UJ9uWJKsTU36NCIYiqAWKTULVQvGFkmuAmrN8T2uKVxh +pDjdyAOkAqkY+h76AfpicY01NAfCT16I+62+3PHtxw5+ojzeK5+rEU4FYJcBVJfkxjesf+fUaHLL +5qledf1khboNRP05ajhqFmoRVKwSG0gsRKZMLnN8gyUKXsEJRa8IlKMC5JVRZGg1L853+eneJO0O +wh33P37oM0AKZKUsHV/I6QAkywDqQO3SLVNv3nzGyOc2jKfVTdPWToyME/k+ZvEIcUipWKiUU3RS +jkIvOV0QcGXY5wq5WEJ9AhrT9LOMZw50+OWRanZgNvvUT56d/RYwWAEg/DoAlRKgVkKMr5mqX3L5 +lsmPNmp25typQbxhqsLE5FqqsUUHLUK/iRksYBDsslluKR80HsLUx6E6itbG6HZavHCoxS8Px66b +0d2+e+G2fbO9HwHt8lh7Sfn8dAFOzIFqCVIDGsAMsHHzutGrtm4YvWG4po2zxgbx2jHL5PgkI+Or +qQ9PkiQVVAPqHcZaMBFBlbTfoddp0uk22XukzwvtxHcGkXvuUO+7O/a2vwvsAo4AvdL6+TLrn1YO +LPUGW0LEJ4TTMLAKmAZmXnvW2BUbVtffUK/FZzaiENaMZJWhJFCLhOFazFC1Qpp7epmj7yD1EYcX +K66bJzbN/ZED89kTO/cuPOREDgFHgTlg8YSw8cuqkLwswAl9YOkBV3yCR5Z7YwwYAUaSxE5snBm5 
+eO1E9aKhip2OIjsWWzukxlRRdaLa8z50Uq9zR1vZ07sP9x7vZ24O6AKdUpZbfbnF/bFIPFkfOAFg +qR8sz4toGUx8Qp7US7DKMuglI5SHnseUyUtFByfEt1+m9JIIy4+4TgfgJCBmWXjZFcDMMsWPP6I8 +XohkGYiuoOhy0VMpfkqAFUBOhGGZomaFayu9dAWRFa5xKsVPG+BlQJYDcYLSpwI48b2uNBFzmq// +AxyL7Nqf76KTAAAAAElFTkSuQmCC +--=-=-= +Content-Type: text/plain + + +Would you count the total of all related text/plain parts? or all +text/* parts? or, if you have a multipart/alternative node in the MIME +tree, would you report it as the maximum of any of the text/* +alternatives? + +If you're really interested in textual analysis, then the "size of +plaintext" might instead be better measured in words or paragraphs, +rather than octets. Also, you might want to ignore quoted text and only +measure non-quoted text (this is particularly relevant for +conversations where people top-post and don't trim, or else you're +actually measuring just how deep in the thread a given message is). + +I'm not trying to say that these metrics are impossible, just pointing +out that the underlying data formats can be much more complicated than +most people think about with mail. The decision about what to count and +how to count it greatly affects the possible use cases. + +> Ideally I'd say that all of these could make sense. + +They could indeed, but I think we could motivate this much better as +initial work by picking one particular use case, and implementing it. +The work would be something like: + + a) choose the metric we care about, and describe concretely how to + calculate it for a given rfc822 file. + + b) assign and name a new notmuch_value_t to identify the metric + + c) update notmuch_database_add_message to insert that new value when a + file is added + + d) consider what workflows are available to update the database for + already-indexed documents that do not have this value. 
+ + e) resolve what to do about documents associated with multiple filenames + + f) define how to include it in searches (this is probably a + NumberValueRangeProcessor, see + file:///usr/share/doc/xapian-doc/valueranges.html) + + g) update documentation for notmuch cli tools. + + +If you work through this process for one particular "message size" use +case and document the steps, then we could presumably handle the other +"message size" metrics in exactly the same way, modifying only steps (a) +and (e) depending on the metric. + +> Would anyone else on the list be interested by such a feature? + +I'm definitely interested, but don't have a lot of time to work on it. + + --dkg + +--=-=-=-- -- 2.26.2