Return-Path:
X-Original-To: notmuch@notmuchmail.org
Delivered-To: notmuch@notmuchmail.org
Received: from localhost (localhost [127.0.0.1])
 by arlo.cworth.org (Postfix) with ESMTP id ADF3E6DE0B38
 for ; Fri, 22 Jan 2016 09:02:43 -0800 (PST)
X-Virus-Scanned: Debian amavisd-new at cworth.org
X-Spam-Flag: NO
X-Spam-Score: -0.013
X-Spam-Level:
X-Spam-Status: No, score=-0.013 tagged_above=-999 required=5
 tests=[AWL=-0.013] autolearn=disabled
Received: from arlo.cworth.org ([127.0.0.1])
 by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id W4AgVm5pY6tx for ; Fri, 22 Jan 2016 09:02:38 -0800 (PST)
Received: from che.mayfirst.org (che.mayfirst.org [209.234.253.108])
 by arlo.cworth.org (Postfix) with ESMTP id 4C4F16DE0B2F
 for ; Fri, 22 Jan 2016 09:02:38 -0800 (PST)
Received: from fifthhorseman.net (unknown [38.109.115.130])
 by che.mayfirst.org (Postfix) with ESMTPSA id 000A3F984;
 Fri, 22 Jan 2016 12:02:35 -0500 (EST)
Received: by fifthhorseman.net (Postfix, from userid 1000)
 id 7376C20112; Fri, 22 Jan 2016 12:02:35 -0500 (EST)
From: Daniel Kahn Gillmor
To: Antoine Amarilli , notmuch@notmuchmail.org
Subject: Re: Searching messages by size with notmuch
In-Reply-To: <20160122151318.GA17099@mu.a3nm.net>
References: <20160122151318.GA17099@mu.a3nm.net>
User-Agent: Notmuch/0.21+67~g41ad7ff (http://notmuchmail.org) Emacs/24.5.1
 (x86_64-pc-linux-gnu)
Date: Fri, 22 Jan 2016 12:02:35 -0500
Message-ID: <87twm5r1hw.fsf@alice.fifthhorseman.net>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: "Use and development of the notmuch mail system."
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 22 Jan 2016 17:02:43 -0000

--=-=-=
Content-Type: text/plain

On Fri 2016-01-22 10:13:18 -0500, Antoine Amarilli wrote:
> After chatting on #notmuch, I wanted to suggest a feature which would be
> useful, at least to me: searching for messages by size.
>
> My use case would be to look for long messages, but dkg on IRC mentioned
> that it could also be useful to clean up messages to save disk space.
>
> It is unclear whether the size of a message should be defined as that of
> a single copy of the message, or that of all copies; and it is unclear
> whether it should be the total size (for my purposes I would have been
> interested in the size of the plaintext part of the message only).

Note that "the plaintext part of the message" might mean the sum of
multiple plaintext parts too -- consider this message as an example:

--=-=-=
Content-Type: image/png
Content-Disposition: inline; filename=stock_smiley-7.png
Content-Transfer-Encoding: base64

[base64-encoded PNG data omitted]

--=-=-=
Content-Type: text/plain

Would you count the total of all related text/plain parts? or all text/*
parts? or, if you have a multipart/alternative node in the MIME tree,
would you report it as the maximum of any of the text/* alternatives?

If you're really interested in textual analysis, then the "size of
plaintext" might instead be better measured in words or paragraphs,
rather than octets. Also, you might want to ignore quoted text and only
measure non-quoted text (this is particularly relevant for conversations
where people top-post and don't trim, or else you're actually measuring
just how deep in the thread a given message is).

I'm not trying to say that these metrics are impossible, just pointing
out that the underlying data formats can be much more complicated than
most people think about with mail. The decision about what to count and
how to count it greatly affects the possible use cases.
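
To make the difference between these metrics concrete, here is a minimal sketch in Python using only the standard-library email package. It is not notmuch code: the function names plaintext_octets and plaintext_words and the SAMPLE message are made up for illustration, and it assumes the simplest policy (count every text/plain part, no special handling of multipart/alternative).

```python
from email import message_from_bytes
from email.message import Message

# Hypothetical three-part message shaped like this one:
# text/plain, then an inline image, then text/plain again.
SAMPLE = (
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="=-=-="\n'
    b"\n"
    b"--=-=-=\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"> quoted line\n"
    b"first reply\n"
    b"--=-=-=\n"
    b"Content-Type: image/png\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"iVBORw0KGgo=\n"
    b"--=-=-=\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"second reply\n"
    b"--=-=-=--\n"
)


def plaintext_octets(msg: Message) -> int:
    """Sum the decoded size in octets of every text/plain part."""
    total = 0
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes base64/quoted-printable transfer encoding.
            total += len(part.get_payload(decode=True) or b"")
    return total


def plaintext_words(msg: Message, skip_quoted: bool = True) -> int:
    """Count whitespace-separated words in every text/plain part.

    With skip_quoted=True, lines starting with ">" are ignored, which is
    only a crude approximation of "non-quoted text".
    """
    count = 0
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = part.get_content_charset() or "utf-8"
        raw = part.get_payload(decode=True) or b""
        text = raw.decode(charset, errors="replace")
        for line in text.splitlines():
            if skip_quoted and line.lstrip().startswith(">"):
                continue
            count += len(line.split())
    return count


if __name__ == "__main__":
    msg = message_from_bytes(SAMPLE)
    print("octets:", plaintext_octets(msg))
    print("words (non-quoted):", plaintext_words(msg))
    print("words (all):", plaintext_words(msg, skip_quoted=False))
```

Even on this tiny message the three numbers differ, and none of them handles multipart/alternative, text/html, or an unknown charset name, so whichever definition is chosen needs to be written down precisely.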
> Ideally I'd say that all of these could make sense.

They could indeed, but I think we could motivate this much better as
initial work by picking one particular use case, and implementing it.
The work would be something like:

 a) choose the metric we care about, and describe concretely how to
    calculate it for a given rfc822 file.

 b) assign and name a new notmuch_value_t to identify the metric

 c) update notmuch_database_add_message to insert that new value when a
    file is added

 d) consider what workflows are available to update the database for
    already-indexed documents that do not have this value.

 e) resolve what to do about documents associated with multiple
    filenames

 f) define how to include it in searches (this is probably a
    NumberValueRangeProcessor, see
    file:///usr/share/doc/xapian-doc/valueranges.html)

 g) update documentation for notmuch cli tools.

If you work through this process for one particular "message size" use
case and document the steps, then we could presumably handle the other
"message size" metrics in exactly the same way, modifying only steps (a)
and (e) depending on the metric.

> Would anyone else on the list be interested by such a feature?

I'm definitely interested, but don't have a lot of time to work on it.

      --dkg

--=-=-=--