Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.
author Michal Nazarewicz <mina86@mina86.com>
Tue, 4 Sep 2012 20:26:16 +0000 (22:26 +0200)
committer W. Trevor King <wking@tremily.us>
Fri, 7 Nov 2014 17:49:22 +0000 (09:49 -0800)
e3/c81474d9b504bbe5c3e15d2d1d90af941b8eed [new file with mode: 0644]

diff --git a/e3/c81474d9b504bbe5c3e15d2d1d90af941b8eed b/e3/c81474d9b504bbe5c3e15d2d1d90af941b8eed
new file mode 100644 (file)
index 0000000..95bb05d
--- /dev/null
@@ -0,0 +1,208 @@
+Return-Path: <mpn@google.com>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+       by olra.theworths.org (Postfix) with ESMTP id 2F826431FB6\r
+       for <notmuch@notmuchmail.org>; Tue,  4 Sep 2012 13:26:27 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: -0.7\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5\r
+       tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_LOW=-0.7]\r
+       autolearn=disabled\r
+Received: from olra.theworths.org ([127.0.0.1])\r
+       by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
+       with ESMTP id MfAPHEtIqKwR for <notmuch@notmuchmail.org>;\r
+       Tue,  4 Sep 2012 13:26:26 -0700 (PDT)\r
+Received: from mail-ee0-f53.google.com (mail-ee0-f53.google.com\r
+ [74.125.83.53])       (using TLSv1 with cipher RC4-SHA (128/128 bits))        (No client\r
+ certificate requested)        by olra.theworths.org (Postfix) with ESMTPS id\r
+ 10D48431FAF   for <notmuch@notmuchmail.org>; Tue,  4 Sep 2012 13:26:25 -0700\r
+ (PDT)\r
+Received: by eekb47 with SMTP id b47so2986891eek.26\r
+       for <notmuch@notmuchmail.org>; Tue, 04 Sep 2012 13:26:24 -0700 (PDT)\r
+DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;\r
+ s=20120113;   h=sender:from:to:subject:in-reply-to:organization:references\r
+       :user-agent:x-face:face:x-pgp:x-pgp-fp:date:message-id:mime-version\r
+       :content-type; bh=sNVDy4WyUKME6jjouDXoNP1/3NnfrybALBbSdrMLZO4=;\r
+       b=YvPAp7JMv6nOnGQUBdhbqBHE/jyyursJMQ9i7pabRiF8klWSG2zzY8fwQNhBMr9PeC\r
+       HnUorZOBkMDeMaPEqO/o1JbKLzBIuJbNOEq/1mvYf2tecXzfoutdzAxq1DEJU6gDUWVc\r
+       +0W93MfUIqqVvGJBGKKuUnUpfj0ONasYnTj/W2UMN4X+9DOjiyYOpVzTemWCIPzayEYa\r
+       7w9zhLDCoppXoQSUElkNABg66wxFvqfvkE8DnyLhYeZsjcH9OBtfR0qrE2veA5Vj4C+S\r
+       VnpidoOM+PmwOAG+07n4oxzDKzu0uhjRm2Fwaf1tBje5fQUwu2BD9GjXX/mGor2puO+u    GgjQ==\r
+X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;\r
+       d=google.com; s=20120113;\r
+       h=sender:from:to:subject:in-reply-to:organization:references\r
+       :user-agent:x-face:face:x-pgp:x-pgp-fp:date:message-id:mime-version\r
+       :content-type:x-gm-message-state;\r
+       bh=sNVDy4WyUKME6jjouDXoNP1/3NnfrybALBbSdrMLZO4=;\r
+       b=hHyT9HKQ4bIjJjIaPTFNtr7YKxB1i9XJ/L5/YD2eZopmcE8OevRTmHlJ1hpM1lICZQ\r
+       2FBAGqrHrMcRqIFsVwDshc/T7BcKQKyUugl2WKPYxEdb/24p6y1Y2A0urL2LgFnrZaDm\r
+       DsZMnc/whEyljoD+DDsXP3OD0uGl5ClzJtRgR6MRt4hXhSiFBfSusnu2Qhr3IPA6SRQU\r
+       m27lTrLMAl1OttXlx1pGKfXuJmKdzRTKS+3iA3tQObLblxIxe7kjvB6mOkOh7o0yxU4Z\r
+       daQhwTn+Ai+bbOP4mog3wjRiFRMdNBuuJLqBPiUVm9qRke4r7EYZ1EOFBqL8sxztf3G8\r
+       yfYQ==\r
+Received: by 10.14.218.134 with SMTP id k6mr27948267eep.14.1346790384901;\r
+       Tue, 04 Sep 2012 13:26:24 -0700 (PDT)\r
+Received: by 10.14.218.134 with SMTP id k6mr27948245eep.14.1346790384652;\r
+       Tue, 04 Sep 2012 13:26:24 -0700 (PDT)\r
+Received: from mpn-glaptop ([2620:0:105f:5:f2de:f1ff:fe35:1a72])\r
+       by mx.google.com with ESMTPS id 45sm48181447eeb.8.2012.09.04.13.26.22\r
+       (version=TLSv1/SSLv3 cipher=OTHER);\r
+       Tue, 04 Sep 2012 13:26:23 -0700 (PDT)\r
+Sender: Michal Nazarewicz <mpn@google.com>\r
+From: Michal Nazarewicz <mina86@mina86.com>\r
+To: Dmitry Kurochkin <dmitry.kurochkin@gmail.com>, notmuch@notmuchmail.org\r
+Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.\r
+In-Reply-To: <87d321sg20.fsf@gmail.com>\r
+Organization: http://mina86.com/\r
+References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>\r
+       <xa1tligpk1za.fsf@mina86.com> <87d321sg20.fsf@gmail.com>\r
+User-Agent: Notmuch/0.14+2~g416b120 (http://notmuchmail.org) Emacs/24.2.50.1\r
+       (x86_64-unknown-linux-gnu)\r
+X-Face: PbkBB1w#)bOqd`iCe"Ds{e+!C7`pkC9a|f)Qo^BMQvy\q5x3?vDQJeN(DS?|-^$uMti[3D*#^_Ts"pU$jBQLq~Ud6iNwAw_r_o_4]|JO?]}P_}Nc&"p#D(ZgUb4uCNPe7~a[DbPG0T~!&c.y$Ur,=N4RT>]dNpd;       KFrfMCylc}gc??'U2j,!8%xdD\r
+Face: iVBORw0KGgoAAAANSUhEUgAAADAAAAAwBAMAAAClLOS0AAAAJFBMVEWbfGlUPDDHgE57V0jUupKjgIObY0PLrom9mH4dFRK4gmjPs41MxjOgAAACQElEQVQ4jW3TMWvbQBQHcBk1xE6WyALX1069oZBMlq+ouUwpEQQ6uRjttkWP4CmBgGM0BQLBdPFZYPsyFUo6uEtKDQ7oy/U96XR2Ux8ehH/89Z6enqxBcS7Lg81jmSuujrfCZcLI/TYYvbGj+jbgFpHJ/bqQAUISj8iLyu4LuFHJTosxsucO4jSDNE0Hq3hwK/ceQ5sx97b8LcUDsILfk+ovHkOIsMbBfg43VuQ5Ln9YAGCkUdKJoXR9EclFBhixy3EGVz1K6eEkhxCAkeMMnqoAhAKwhoUJkDrCqvbecaYINlFKSRS1i12VKH1XpUd4qxL876EkMcDvHj3s5RBajHHMlA5iK32e0C7VgG0RlzFPvoYHZLRmAC0BmNcBruhkE0KsMsbEc62ZwUJDxWUdMsMhVqovoT96i/DnX/ASvz/6hbCabELLk/6FF/8PNpPCGqcZTGFcBhhAaZZDbQPaAB3+KrWWy2XgbYDNIinkdWAFcCpraDE/knwe5DBqGmgzESl1p2E4MWAz0VUPgYYzmfWb9yS4vCvgsxJriNTHoIBz5YteBvg+VGISQWUqhMiByPIPpygeDBE6elD973xWwKkEiHZAHKjhuPsFnBuArrzxtakRcISv+XMIPl4aGBUJm8Emk7qBYU8IlgNEIpiJhk/No24jHwkKTFHDWfPniR4iw5vJaw2nzSjfq2zffcE/GDjRC2dn0J0XwPAbDL84TvaFCJEU4Oml9pRyEUhR3Cl2t01AoEjRbs0sYugp14/4X5n4pU4EHHnMAAAAAElFTkSuQmCC\r
+X-PGP: 50751FF4\r
+X-PGP-FP: AC1F 5F5C D418 88F8 CC84 5858 2060 4012 5075 1FF4\r
+Date: Tue, 04 Sep 2012 22:26:16 +0200\r
+Message-ID: <xa1tipbtk00n.fsf@mina86.com>\r
+MIME-Version: 1.0\r
+Content-Type: multipart/mixed; boundary="=-=-="\r
+X-Gm-Message-State: ALoCoQk/xF9cupH17t4530QVOx1nvqEv5KEURZzPKFAr1FZehQJKvp2ihr10O2mg2NbAwjnv2j2jVTXYK7QNsO59WJVum/5rjfAvScIG+LE185k5oCmc3wu2Q4aJsKsBiicacCBawGDsYp9gj1eufS3q0tjCSvYThLzm2Bv2tFSVR05AqY6fD5NtrLOVZq/vW8cRUxIvYBO6zxN7mjXb+BaDxgBBIvrVKA==\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.13\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+       <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Tue, 04 Sep 2012 20:26:27 -0000\r
+\r
+--=-=-=\r
+Content-Type: text/plain; charset=utf-8\r
+Content-Transfer-Encoding: quoted-printable\r
+\r
+>> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:\r
+>>> +class MailComparator:\r
+>>> +    """Checks if mail files are duplicates."""\r
+>>> +    def __init__(self, filename):\r
+>>> +        self.filename =3D filename\r
+>>> +        self.mail =3D self.readFile(self.filename)\r
+>>> +\r
+>>> +    def isDuplicate(self, filename):\r
+>>> +        return self.mail =3D=3D self.readFile(filename)\r
+>>> +\r
+>>> +    @staticmethod\r
+>>> +    def readFile(filename):\r
+>>> +        with open(filename) as f:\r
+>>> +            data =3D ""\r
+>>> +            while True:\r
+>>> +                line =3D f.readline()\r
+>>> +                for header in IGNORED_HEADERS:\r
+>>> +                    if line.startswith(header):\r
+\r
+> Michal Nazarewicz <mina86@mina86.com> writes:\r
+>> Case of headers should be ignored, but this does not ignore it.\r
+\r
+On Tue, Sep 04 2012, Dmitry Kurochkin wrote:\r
+> It does.\r
+\r
+Wait, how?  If line is =E2=80=9Creceived:=E2=80=9D how does it start with =\r
+=E2=80=9CReceived:=E2=80=9D?\r
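
To illustrate the point, here is a rough sketch of a case-insensitive
check.  The contents of IGNORED_HEADERS below are made up for the
example (the real list from the patch is not quoted in this message),
and is_ignored is a hypothetical helper name, not something the script
defines:

```python
# Hypothetical values; the patch's actual IGNORED_HEADERS is not quoted here.
IGNORED_HEADERS = ("Received:", "X-Spam-Status:")


def is_ignored(line, ignored=IGNORED_HEADERS):
    """Return True if line starts with an ignored header name, in any case."""
    lowered = line.lower()
    return any(lowered.startswith(header.lower()) for header in ignored)


# A plain startswith() is case-sensitive, which is the problem raised above.
plain_match = "received: from localhost".startswith("Received:")
folded_match = is_ignored("received: from localhost")
```

With this, plain_match is False while folded_match is True, so a
lower-case header would no longer slip through the comparison.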
+\r
+>>> +            if os.path.realpath(comparator.filename) =3D=3D os.path.re=\r
+alpath(filename):\r
+>>> +                print "Message '%s' has filenames pointing to the\r
+>>> same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename,\r
+>>> filename)\r
+>>\r
+>> So why aren't those removed?\r
+>>\r
+>\r
+> Because it is the same file indexed twice (probably because of\r
+> symlinks).  We do not want to remove the only message file.\r
+\r
+Ah, right, with symlinks this is troublesome, but then again, we can\r
+check if there is at least one non-symlink.  If there is, delete\r
+everything else; if there is not, delete all but one arbitrarily chosen\r
+symlink.\r
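
Roughly what I have in mind, as an untested-in-anger sketch.  The name
split_keep_and_remove is made up, and it assumes the caller has already
established that all the paths resolve to the same file (e.g. by
comparing os.path.realpath results).  It only ever proposes symlinks
for removal, since a path that is not itself a symlink may still reach
the one real file through a symlinked directory:

```python
import os


def split_keep_and_remove(filenames):
    """Choose one path to keep and list the symlinks that can be removed.

    Assumes every path in filenames resolves to the same underlying file.
    """
    regular = [f for f in filenames if not os.path.islink(f)]
    # Prefer a non-symlink; otherwise keep an arbitrary (here: first) symlink.
    keep = regular[0] if regular else filenames[0]
    # Only symlinks are proposed for removal, so the real file is never touched.
    removable = [f for f in filenames if f != keep and os.path.islink(f)]
    return keep, removable
```

The function does not delete anything itself; the caller would pass the
second element of the result to os.remove once it is happy with the
choice.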
+\r
+>>> +            elif comparator.isDuplicate(filename):\r
+>>> +                os.remove(filename)\r
+>>> +                duplicates_count +=3D 1\r
+>>> +            else:\r
+>>> +                #print "Potential duplicates: %s" % msg.get_message_id=\r
+()\r
+>>> +                suspected_duplicates_count +=3D 1\r
+>>> +\r
+>>> +    new_timestamp =3D time.time()\r
+>>> +    if new_timestamp - timestamp > 1:\r
+>>> +        timestamp =3D new_timestamp\r
+>>> +        sys.stdout.write("\rProcessed %s messages, removed %s duplicat=\r
+es..." % (msg_count, duplicates_count))\r
+>>> +        sys.stdout.flush()\r
+>>> +\r
+>>> +print "\rFinished. Processed %s messages, removed %s duplicates." % (m=\r
+sg_count, duplicates_count)\r
+>>> +if duplicates_count > 0:\r
+>>> +    print "You might want to run 'notmuch new' now."\r
+>>> +\r
+>>> +if suspected_duplicates_count > 0:\r
+>>> +    print\r
+>>> +    print "Found %s messages with duplicate IDs but different content.=\r
+" % suspected_duplicates_count\r
+>>> +    print "Perhaps we should ignore more headers."\r
+>>\r
+>> Please consider the following instead (not tested):\r
+\r
+> Thanks for reviewing my poor python code :) I am afraid I do not have\r
+> enough interest in improving it.  I just implemented a simple solution\r
+> for my problem.  Though it looks like you already took time to rewrite\r
+> the script.  Would be great if you send it as a proper patch obsoleting\r
+> this one.\r
+\r
+Bah, I probably won't have time to properly test it.\r
+\r
+--=20\r
+Best regards,                                         _     _\r
+.o. | Liege of Serenely Enlightened Majesty of      o' \,=3D./ `o\r
+..o | Computer Science,  Micha=C5=82 =E2=80=9Cmina86=E2=80=9D Nazarewicz   =\r
+ (o o)\r
+ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--\r
+--=-=-=\r
+Content-Type: multipart/signed; boundary="==-=-=";\r
+       micalg=pgp-sha1; protocol="application/pgp-signature"\r
+\r
+--==-=-=\r
+Content-Type: text/plain\r
+\r
+\r
+--==-=-=\r
+Content-Type: application/pgp-signature\r
+\r
+-----BEGIN PGP SIGNATURE-----\r
+Version: GnuPG v1.4.10 (GNU/Linux)\r
+\r
+iQIbBAEBAgAGBQJQRmPoAAoJECBgQBJQdR/0kLEP+KCPbNE7PTqoYiHjOEc8QpFD\r
+LiKIHYNFdtx41eYbBuOMovNyBE4CS7F1WyFnDXSoXY2ajRgHFUjEwQxncakCGyD+\r
+OxJGUGsVWUo8Vq0Sb+cp5+a5Giz6iDU57XvUyXrqgdRZsGPpSPJVUtGpXCXSGJkX\r
+UA9X/Q/uUiUbZGRsLgwwRLI7NBkNMbHR8WHJBBEt2cIUPnGttRUNfhO5IVAZhr7q\r
+VUK06VXW6+dMWoaH4oOkkDzGOuDH41NEKXFxjtpCsKXUU0H5FG6XT5ertqGX6msB\r
+HMZpkSE6LYcuXMNHj4gqOtAUS7K6vao2LtLRQ0J/r8tvHCOyFeTdwcccoWZl3i8V\r
+sr5ZVGBWWTB3TAuRxD/ViTxH20f5EnbyoaJs1DNBQV8Df5TlqrmWl0f6WOMCs5GO\r
+TDN/93gF+KK1aHAVAXmsTOnkKRDYdk8NvjV8o/aoGvpvbhCVliWkARiYQFRA1X/h\r
+1MoHlcGDZUbJmCbhmlTun3rB8oXHfeQmqeIdmYRp5i/LwVW15TiEyw/Joa59exCi\r
+s3raOx7HU4Tke65S0JQ4tpTuWyBFMetmHoFH+ainb6FjGop5u6Obnl47NcxgtC5j\r
+yTeHT6iIgC3Y6sDnqjs7/UVH+FtDHm8nvhlBVqTacARUEsDkrScDLKuigcwQkT4E\r
++5qIEIK1Qqjcl2zNCNg=\r
+=o9Ln\r
+-----END PGP SIGNATURE-----\r
+--==-=-=--\r
+\r
+--=-=-=--\r