--- /dev/null
+Return-Path: <dmitry.kurochkin@gmail.com>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+ by olra.theworths.org (Postfix) with ESMTP id AD3E7431FB6\r
+ for <notmuch@notmuchmail.org>; Tue, 4 Sep 2012 13:33:14 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: -0.799\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=-0.799 tagged_above=-999 required=5\r
+ tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,\r
+ FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled\r
+Received: from olra.theworths.org ([127.0.0.1])\r
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
+ with ESMTP id rc2qtB-UN8JA for <notmuch@notmuchmail.org>;\r
+ Tue, 4 Sep 2012 13:33:14 -0700 (PDT)\r
+Received: from mail-ee0-f53.google.com (mail-ee0-f53.google.com\r
+ [74.125.83.53]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client\r
+ certificate requested) by olra.theworths.org (Postfix) with ESMTPS id\r
+ DCA24431FAF for <notmuch@notmuchmail.org>; Tue, 4 Sep 2012 13:33:13 -0700\r
+ (PDT)\r
+Received: by eekb47 with SMTP id b47so2989397eek.26\r
+ for <notmuch@notmuchmail.org>; Tue, 04 Sep 2012 13:33:12 -0700 (PDT)\r
+DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;\r
+ h=from:to:subject:in-reply-to:references:user-agent:date:message-id\r
+ :mime-version:content-type:content-transfer-encoding;\r
+ bh=adZvOEGHSOE5vQKYmPradQUCmd2HfnHq3QSYE2wFnv8=;\r
+ b=BuqAzy4EnNtgSmG6Zi8xZzE78jEPJXqQd2pa64UQFuO3cK1jY8iZN8U6yw6RR6bFqi\r
+ CsdZxxFUt3B6t61XB5aSrXkiwQwvOXhwJ5ej4lyQq23KWoWpTIoltWxKiygQ9gCIfoxt\r
+ qfHGF4sY2oRa1VBKRJBL/bdCNCELEG0MjxU8lpUgx6rP0eKTaka4srR7EqQVJbiH9J4q\r
+ 9rTwIJqVMky8pbb6w8wC3Noz2J07H+x3lB0zTjg+LJLPT6JSFamTbr5o8GJXuJqiKlKd\r
+ Aua4DcWnGkj3KILYhKbaPVOE7c76367SmwetPiipP8C+Qn8bESTKZ+RURxC/kIMQS6pQ\r
+ qd/g==\r
+Received: by 10.14.173.9 with SMTP id u9mr27733873eel.8.1346790792770;\r
+ Tue, 04 Sep 2012 13:33:12 -0700 (PDT)\r
+Received: from localhost ([2001:470:1f0b:14dd:224:d7ff:fee2:c588])\r
+ by mx.google.com with ESMTPS id k41sm48201821eep.13.2012.09.04.13.33.11\r
+ (version=TLSv1/SSLv3 cipher=OTHER);\r
+ Tue, 04 Sep 2012 13:33:12 -0700 (PDT)\r
+From: Dmitry Kurochkin <dmitry.kurochkin@gmail.com>\r
+To: Michal Nazarewicz <mina86@mina86.com>, notmuch@notmuchmail.org\r
+Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.\r
+In-Reply-To: <xa1tipbtk00n.fsf@mina86.com>\r
+References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>\r
+ <xa1tligpk1za.fsf@mina86.com> <87d321sg20.fsf@gmail.com>\r
+ <xa1tipbtk00n.fsf@mina86.com>\r
+User-Agent: Notmuch/0.14+18~g79a73cd (http://notmuchmail.org) Emacs/23.4.1\r
+ (x86_64-pc-linux-gnu)\r
+Date: Wed, 05 Sep 2012 00:33:10 +0400\r
+Message-ID: <87a9x5sf3t.fsf@gmail.com>\r
+MIME-Version: 1.0\r
+Content-Type: text/plain; charset=utf-8\r
+Content-Transfer-Encoding: quoted-printable\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.13\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+ <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Tue, 04 Sep 2012 20:33:14 -0000\r
+\r
+Michal Nazarewicz <mina86@mina86.com> writes:\r
+\r
+>>> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:\r
+>>>> +class MailComparator:\r
+>>>> + """Checks if mail files are duplicates."""\r
+>>>> + def __init__(self, filename):\r
+>>>> + self.filename =3D filename\r
+>>>> + self.mail =3D self.readFile(self.filename)\r
+>>>> +\r
+>>>> + def isDuplicate(self, filename):\r
+>>>> + return self.mail =3D=3D self.readFile(filename)\r
+>>>> +\r
+>>>> + @staticmethod\r
+>>>> + def readFile(filename):\r
+>>>> + with open(filename) as f:\r
+>>>> + data =3D ""\r
+>>>> + while True:\r
+>>>> + line =3D f.readline()\r
+>>>> + for header in IGNORED_HEADERS:\r
+>>>> + if line.startswith(header):\r
+>\r
+>> Michal Nazarewicz <mina86@mina86.com> writes:\r
+>>> Case of headers should be ignored, but this does not ignore it.\r
+>\r
+> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:\r
+>> It does.\r
+>\r
+> Wait, how? If line is =E2=80=9Creceived:=E2=80=9D how does it start wit=\r
+h =E2=80=9CReceived:=E2=80=9D?\r
+>\r
+\r
+Sorry, I misunderstood your comment. Indeed, it does not ignore case.\r
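
A case-insensitive check could look roughly like the sketch below. It is untested against the actual script: the header tuple is only a hypothetical stand-in for the IGNORED_HEADERS list in the patch, and is_ignored_header is a made-up helper name.

```python
# Hypothetical stand-in for the IGNORED_HEADERS list in the patch;
# the real list may contain different header names.
IGNORED_HEADERS = ("Received:", "Return-Path:")


def is_ignored_header(line, headers=IGNORED_HEADERS):
    """Return True if line starts with an ignored header name, ignoring case."""
    # Header field names are case-insensitive, so compare lowercased text.
    lowered = line.lower()
    return any(lowered.startswith(header.lower()) for header in headers)
```

In readFile, the inner loop's line.startswith(header) test could then be replaced by a single call to this helper.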
+\r
+>>>> + if os.path.realpath(comparator.filename) =3D=3D os.path.r=\r
+ealpath(filename):\r
+>>>> + print "Message '%s' has filenames pointing to the\r
+>>>> same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename,\r
+>>>> filename)\r
+>>>\r
+>>> So why aren't those removed?\r
+>>>\r
+>>\r
+>> Because it is the same file indexed twice (probably because of\r
+>> symlinks). We do not want to remove the only message file.\r
+>\r
+> Ah, right, with symlinks this is troublesome, but then again, we can\r
+> check if there is at least one non-symlink. If there is, delete\r
+> everything else, if there is not, delete all but one arbitrarily chosen\r
+> symlink.\r
+>\r
+\r
+Sure, we could do that.\r
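
One cautious way to sketch that idea is below. It assumes every path in the list already resolves to the same real file, and symlinks_to_remove is a hypothetical helper, not part of the patch. It deliberately returns only paths that are themselves symlinks, so two regular paths that reach the same file through a symlinked directory are both left alone.

```python
import os


def symlinks_to_remove(filenames):
    """Pick which of several paths to the same file could be removed.

    Assumes all filenames resolve to the same real path. Only paths that
    are themselves symlinks are returned, so the underlying message file
    is never deleted.
    """
    links = [f for f in filenames if os.path.islink(f)]
    regular = [f for f in filenames if not os.path.islink(f)]
    if regular:
        # At least one non-symlink path exists: every symlink is redundant.
        return links
    # Only symlinks are indexed: keep the first one, drop the rest.
    return links[1:]
```

The caller would still have to group filenames by os.path.realpath first, as the script already does for its "same file" check.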
+\r
+>>>> + elif comparator.isDuplicate(filename):\r
+>>>> + os.remove(filename)\r
+>>>> + duplicates_count +=3D 1\r
+>>>> + else:\r
+>>>> + #print "Potential duplicates: %s" % msg.get_message_i=\r
+d()\r
+>>>> + suspected_duplicates_count +=3D 1\r
+>>>> +\r
+>>>> + new_timestamp =3D time.time()\r
+>>>> + if new_timestamp - timestamp > 1:\r
+>>>> + timestamp =3D new_timestamp\r
+>>>> + sys.stdout.write("\rProcessed %s messages, removed %s duplica=\r
+tes..." % (msg_count, duplicates_count))\r
+>>>> + sys.stdout.flush()\r
+>>>> +\r
+>>>> +print "\rFinished. Processed %s messages, removed %s duplicates." % (=\r
+msg_count, duplicates_count)\r
+>>>> +if duplicates_count > 0:\r
+>>>> + print "You might want to run 'notmuch new' now."\r
+>>>> +\r
+>>>> +if suspected_duplicates_count > 0:\r
+>>>> + print\r
+>>>> + print "Found %s messages with duplicate IDs but different content=\r
+." % suspected_duplicates_count\r
+>>>> + print "Perhaps we should ignore more headers."\r
+>>>\r
+>>> Please consider the following instead (not tested):\r
+>\r
+>> Thanks for reviewing my poor Python code :) I am afraid I do not have\r
+>> enough interest in improving it. I just implemented a simple solution\r
+>> for my problem. Though it looks like you already took time to rewrite\r
+>> the script. It would be great if you sent it as a proper patch\r
+>> obsoleting this one.\r
+>\r
+> Bah, I probably won't have time to properly test it.\r
+>\r
+\r
+Same problem :)\r
+\r
+Regards,\r
+ Dmitry\r
+\r
+> --=20\r
+> Best regards, _ _\r
+> .o. | Liege of Serenely Enlightened Majesty of o' \,=3D./ `o\r
+> ..o | Computer Science, Micha=C5=82 =E2=80=9Cmina86=E2=80=9D Nazarewicz =\r
+ (o o)\r
+> ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--\r