Re: Deduplication ?
author Jani Nikula <jani@nikula.org>
Mon, 2 Jun 2014 17:06:09 +0000 (20:06 +0300)
committer W. Trevor King <wking@tremily.us>
Fri, 7 Nov 2014 18:03:08 +0000 (10:03 -0800)
26/d2cbe8e25ed585977669fb9846f65121db3763 [new file with mode: 0644]

diff --git a/26/d2cbe8e25ed585977669fb9846f65121db3763 b/26/d2cbe8e25ed585977669fb9846f65121db3763
new file mode 100644
index 0000000..5f14ba1
--- /dev/null
@@ -0,0 +1,109 @@
+Return-Path: <jani@nikula.org>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+       by olra.theworths.org (Postfix) with ESMTP id 1EAE1431FBC\r
+       for <notmuch@notmuchmail.org>; Mon,  2 Jun 2014 10:06:24 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: -0.7\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5\r
+       tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled\r
+Received: from olra.theworths.org ([127.0.0.1])\r
+       by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
+       with ESMTP id CCYUPGMmLCkc for <notmuch@notmuchmail.org>;\r
+       Mon,  2 Jun 2014 10:06:16 -0700 (PDT)\r
+Received: from mail-wi0-f171.google.com (mail-wi0-f171.google.com\r
+       [209.85.212.171]) (using TLSv1 with cipher RC4-SHA (128/128 bits))\r
+       (No client certificate requested)\r
+       by olra.theworths.org (Postfix) with ESMTPS id 4D7B2431FAE\r
+       for <notmuch@notmuchmail.org>; Mon,  2 Jun 2014 10:06:16 -0700 (PDT)\r
+Received: by mail-wi0-f171.google.com with SMTP id cc10so5008597wib.4\r
+       for <notmuch@notmuchmail.org>; Mon, 02 Jun 2014 10:06:13 -0700 (PDT)\r
+X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;\r
+       d=1e100.net; s=20130820;\r
+       h=x-gm-message-state:from:to:subject:in-reply-to:references\r
+       :user-agent:date:message-id:mime-version:content-type;\r
+       bh=O6RcBBKkVtQR85qqmLXWlxZHYEO+QkP6X9rWxRrWqw0=;\r
+       b=YxaW4/sNIOlBMj8C4BK4Nm5bWnh94sbXaZs+aPXBRopsxZ+uf42RFarFLkunl3NWD8\r
+       I07Y2PjtntByjEMPhGsbzrG38Ypn4PQANnij881RL6OJk9yhSKp50PGLmtZ0mS+0Mh7n\r
+       efZxO1Xrnd6XHST8Xyk7LUY5y+efUhmEA/nz/T2q0LWMStKhx9jM+wQMFdfgZgE9Yl9I\r
+       LqA80brb+oBO82cK5BBOQXjbV5+aKFrJYwjlbCxKTENz65pkhxjcm/9WspQOJ0atUPdN\r
+       oOqwg83Ix9T/upCPCLorxT6g6SbYrZ0QXSJYTAZxBcPAUBXTevdTGFWPGaIqLbnwFtAS\r
+       HnTg==\r
+X-Gm-Message-State:\r
+ ALoCoQkP4DDQBSF2cdVDqtxF4eswnUbYpGTcy9XPk6sDtL2keTojdTMIIDBk5eYC4zKVip4BBUPu\r
+X-Received: by 10.180.90.51 with SMTP id bt19mr24467825wib.22.1401728773691;\r
+       Mon, 02 Jun 2014 10:06:13 -0700 (PDT)\r
+Received: from localhost (dsl-hkibrasgw2-58c36f-91.dhcp.inet.fi.\r
+       [88.195.111.91])\r
+       by mx.google.com with ESMTPSA id m2sm36855357wjw.3.2014.06.02.10.06.12\r
+       for <multiple recipients>\r
+       (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);\r
+       Mon, 02 Jun 2014 10:06:13 -0700 (PDT)\r
+From: Jani Nikula <jani@nikula.org>\r
+To: Mark Walters <markwalters1009@gmail.com>,\r
+ Tomi Ollila <tomi.ollila@iki.fi>,     Vladimir Marek <Vladimir.Marek@oracle.com>,\r
+ notmuch@notmuchmail.org\r
+Subject: Re: Deduplication ?\r
+In-Reply-To: <87ppirqtfa.fsf@qmul.ac.uk>\r
+References: <20140602123212.GA12639@virt.cz.oracle.com>\r
+       <87d2ers9mi.fsf@qmul.ac.uk> <m2ppirs8ea.fsf@guru.guru-group.fi>\r
+       <87ppirqtfa.fsf@qmul.ac.uk>\r
+User-Agent: Notmuch/0.18+24~gfe8cd90 (http://notmuchmail.org) Emacs/24.3.1\r
+       (x86_64-pc-linux-gnu)\r
+Date: Mon, 02 Jun 2014 20:06:09 +0300\r
+Message-ID: <87y4xfz1fi.fsf@nikula.org>\r
+MIME-Version: 1.0\r
+Content-Type: text/plain\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.13\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+       <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Mon, 02 Jun 2014 17:06:24 -0000\r
+\r
+On Mon, 02 Jun 2014, Mark Walters <markwalters1009@gmail.com> wrote:\r
+> Tomi Ollila <tomi.ollila@iki.fi> writes:\r
+>\r
+>> On Mon, Jun 02 2014, Mark Walters <markwalters1009@gmail.com> wrote:\r
+>>\r
+>>> Vladimir Marek <Vladimir.Marek@oracle.com> writes:\r
+>>> If you want to save disk space then you could delete the duplicates\r
+>>> after with something like\r
+>>>\r
+>>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to\r
+>>> xargs -0\r
+>>\r
+>> What if there are 3 duplicates (or 4... ;)\r
+>\r
+> I was assuming that it was merging 2 duplicate-free bunches of messages,\r
+> but I guess the new 100000 might not be. In that case running the above\r
+> repeatedly (i.e. until it is a no-op) would be fine.\r
+\r
+With 'notmuch new' in between the runs, obviously.\r
+\r
+Alternatively, find the biggest --duplicate=N which still outputs\r
+something, and run the command for each N...2.\r
+\r
+\r
+>> One should also have some message content heuristics to determine that the\r
+>> content is indeed duplicate and not something totally different (not that\r
+>> we can see the different content anyway... but...)\r
+>\r
+> That would be nice.\r
+\r
+And quite hard.\r
+\r
+\r
+BR,\r
+Jani.\r
+\r
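
A minimal sketch of the "run the command for each N...2" approach from the message above, in Python. It uses only the `notmuch search` flags quoted in the thread (`--output=files`, `--format=text0`, `--duplicate=N`). The function names `notmuch_nth_files` and `extra_copies` are hypothetical, and the sketch only collects paths; it deliberately deletes nothing. It also assumes that if `--duplicate=N+1` prints something then `--duplicate=N` does too, so a linear probe finds the biggest N.

```python
import os
import subprocess
from typing import Callable, List


def notmuch_nth_files(n: int) -> List[str]:
    """Return the N-th file of every message that has at least N files.

    Runs the command quoted in the thread; requires notmuch on PATH.
    """
    out = subprocess.run(
        ["notmuch", "search", "--output=files", "--format=text0",
         f"--duplicate={n}", "*"],
        check=True,
        capture_output=True,
    ).stdout
    # --format=text0 separates entries with NUL bytes.
    return [os.fsdecode(p) for p in out.split(b"\0") if p]


def extra_copies(
    list_nth: Callable[[int], List[str]] = notmuch_nth_files,
) -> List[str]:
    """Collect every file beyond the first one for each message."""
    # Find the biggest N for which --duplicate=N still outputs something.
    biggest = 1
    while list_nth(biggest + 1):
        biggest += 1
    # Run the listing for each N, N-1, ..., 2.
    paths: List[str] = []
    for n in range(biggest, 1, -1):
        paths.extend(list_nth(n))
    return paths
```

Because every path is gathered before anything is removed, the index is queried in a single consistent state. After actually deleting the returned files, a `notmuch new` run would presumably still be needed, as noted in the message. Nothing here checks that the copies have identical content, which the thread describes as the hard part.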