From: Jani Nikula
To: Mark Walters, Tomi Ollila, Vladimir Marek, notmuch@notmuchmail.org
Subject: Re: Deduplication ?
In-Reply-To: <87ppirqtfa.fsf@qmul.ac.uk>
References: <20140602123212.GA12639@virt.cz.oracle.com> <87d2ers9mi.fsf@qmul.ac.uk> <87ppirqtfa.fsf@qmul.ac.uk>
User-Agent: Notmuch/0.18+24~gfe8cd90 (http://notmuchmail.org) Emacs/24.3.1 (x86_64-pc-linux-gnu)
Date: Mon, 02 Jun 2014 20:06:09 +0300
Message-ID: <87y4xfz1fi.fsf@nikula.org>
List-Id: "Use and development of the notmuch mail system."

On Mon, 02 Jun 2014, Mark Walters wrote:
> Tomi Ollila writes:
>
>> On Mon, Jun 02 2014, Mark Walters wrote:
>>
>>> Vladimir Marek writes:
>>> If you want to save disk space then you could delete the duplicates
>>> after with something like
>>>
>>> notmuch search --output=files --format=text0 --duplicate=2 '*' piped to
>>> xargs -0
>>
>> What if there are 3 duplicates (or 4... ;)
>
> I was assuming that it was merging 2 duplicate-free bunches of messages,
> but I guess the new 100000 might not be. In that case running the above
> repeatedly (ie until it is a no-op) would be fine.

With 'notmuch new' in between the runs, obviously.

Alternatively, find the biggest --duplicate=N which still outputs
something, and run the command for each N...2.

>> One should also have some message content heuristics to determine that the
>> content is indeed duplicate and not something totally different (not that
>> we can see the different content anyway... but...)
>
> That would be nice.

And quite hard.

BR,
Jani.
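
The "biggest --duplicate=N" suggestion above could be sketched roughly as follows. This is only an illustration, not a tested tool: the function names max_duplicate and dedupe are made up here, xargs -0 is assumed to be available (GNU or BSD xargs), and the sketch assumes that the first file notmuch lists for each message is the copy worth keeping. It performs no content comparison at all, so the caveat about files that share a Message-ID but differ in content still applies.

```shell
#!/bin/sh
# Sketch only: delete every copy of a message except the first one notmuch lists.

# Print the largest N for which --duplicate=N still lists a file (1 if none do).
max_duplicate() {
    n=1
    while [ -n "$(notmuch search --output=files --duplicate=$((n + 1)) '*')" ]; do
        n=$((n + 1))
    done
    echo "$n"
}

# Remove copies N, N-1, ..., 2 of every message, then let notmuch re-index.
# The index is not refreshed between passes, so each pass still sees the
# original file numbering; 'notmuch new' runs once at the end.
dedupe() {
    n=$(max_duplicate)
    while [ "$n" -ge 2 ]; do
        notmuch search --output=files --format=text0 --duplicate="$n" '*' |
            xargs -0 rm -f --
        n=$((n - 1))
    done
    notmuch new
}
```

Nothing runs until dedupe is called. For a dry run, one option is to replace `rm -f --` with `echo` first and check the file list by hand.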