   1 Return-Path: <dmitry.kurochkin@gmail.com>\r
   2 X-Original-To: notmuch@notmuchmail.org\r
   3 Delivered-To: notmuch@notmuchmail.org\r
   4 Received: from localhost (localhost [127.0.0.1])\r
   5         by olra.theworths.org (Postfix) with ESMTP id AD3E7431FB6\r
   6         for <notmuch@notmuchmail.org>; Tue,  4 Sep 2012 13:33:14 -0700 (PDT)\r
   7 X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
   8 X-Spam-Flag: NO\r
   9 X-Spam-Score: -0.799\r
  10 X-Spam-Level: \r
  11 X-Spam-Status: No, score=-0.799 tagged_above=-999 required=5\r
  12         tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,\r
  13         FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled\r
  14 Received: from olra.theworths.org ([127.0.0.1])\r
  15         by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
  16         with ESMTP id rc2qtB-UN8JA for <notmuch@notmuchmail.org>;\r
  17         Tue,  4 Sep 2012 13:33:14 -0700 (PDT)\r
  18 Received: from mail-ee0-f53.google.com (mail-ee0-f53.google.com\r
  19  [74.125.83.53])        (using TLSv1 with cipher RC4-SHA (128/128 bits))        (No client\r
  20  certificate requested) by olra.theworths.org (Postfix) with ESMTPS id\r
  21  DCA24431FAF    for <notmuch@notmuchmail.org>; Tue,  4 Sep 2012 13:33:13 -0700\r
  22  (PDT)\r
  23 Received: by eekb47 with SMTP id b47so2989397eek.26\r
  24         for <notmuch@notmuchmail.org>; Tue, 04 Sep 2012 13:33:12 -0700 (PDT)\r
  25 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;\r
  26         h=from:to:subject:in-reply-to:references:user-agent:date:message-id\r
  27         :mime-version:content-type:content-transfer-encoding;\r
  28         bh=adZvOEGHSOE5vQKYmPradQUCmd2HfnHq3QSYE2wFnv8=;\r
  29         b=BuqAzy4EnNtgSmG6Zi8xZzE78jEPJXqQd2pa64UQFuO3cK1jY8iZN8U6yw6RR6bFqi\r
  30         CsdZxxFUt3B6t61XB5aSrXkiwQwvOXhwJ5ej4lyQq23KWoWpTIoltWxKiygQ9gCIfoxt\r
  31         qfHGF4sY2oRa1VBKRJBL/bdCNCELEG0MjxU8lpUgx6rP0eKTaka4srR7EqQVJbiH9J4q\r
  32         9rTwIJqVMky8pbb6w8wC3Noz2J07H+x3lB0zTjg+LJLPT6JSFamTbr5o8GJXuJqiKlKd\r
  33         Aua4DcWnGkj3KILYhKbaPVOE7c76367SmwetPiipP8C+Qn8bESTKZ+RURxC/kIMQS6pQ\r
  34         qd/g==\r
  35 Received: by 10.14.173.9 with SMTP id u9mr27733873eel.8.1346790792770;\r
  36         Tue, 04 Sep 2012 13:33:12 -0700 (PDT)\r
  37 Received: from localhost ([2001:470:1f0b:14dd:224:d7ff:fee2:c588])\r
  38         by mx.google.com with ESMTPS id k41sm48201821eep.13.2012.09.04.13.33.11\r
  39         (version=TLSv1/SSLv3 cipher=OTHER);\r
  40         Tue, 04 Sep 2012 13:33:12 -0700 (PDT)\r
  41 From: Dmitry Kurochkin <dmitry.kurochkin@gmail.com>\r
  42 To: Michal Nazarewicz <mina86@mina86.com>, notmuch@notmuchmail.org\r
  43 Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.\r
  44 In-Reply-To: <xa1tipbtk00n.fsf@mina86.com>\r
  45 References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>\r
  46         <xa1tligpk1za.fsf@mina86.com> <87d321sg20.fsf@gmail.com>\r
  47         <xa1tipbtk00n.fsf@mina86.com>\r
  48 User-Agent: Notmuch/0.14+18~g79a73cd (http://notmuchmail.org) Emacs/23.4.1\r
  49         (x86_64-pc-linux-gnu)\r
  50 Date: Wed, 05 Sep 2012 00:33:10 +0400\r
  51 Message-ID: <87a9x5sf3t.fsf@gmail.com>\r
  52 MIME-Version: 1.0\r
  53 Content-Type: text/plain; charset=utf-8\r
  54 Content-Transfer-Encoding: quoted-printable\r
  55 X-BeenThere: notmuch@notmuchmail.org\r
  56 X-Mailman-Version: 2.1.13\r
  57 Precedence: list\r
  58 List-Id: "Use and development of the notmuch mail system."\r
  59         <notmuch.notmuchmail.org>\r
  60 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
  61         <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
  62 List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
  63 List-Post: <mailto:notmuch@notmuchmail.org>\r
  64 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
  65 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
  66         <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
  67 X-List-Received-Date: Tue, 04 Sep 2012 20:33:14 -0000\r
  68 \r
  69 Michal Nazarewicz <mina86@mina86.com> writes:\r
  70 \r
  71 >>> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:\r
  72 >>>> +class MailComparator:\r
  73 >>>> +    """Checks if mail files are duplicates."""\r
  74 >>>> +    def __init__(self, filename):\r
  75 >>>> +        self.filename = filename\r
  76 >>>> +        self.mail = self.readFile(self.filename)\r
  77 >>>> +\r
  78 >>>> +    def isDuplicate(self, filename):\r
  79 >>>> +        return self.mail == self.readFile(filename)\r
  80 >>>> +\r
  81 >>>> +    @staticmethod\r
  82 >>>> +    def readFile(filename):\r
  83 >>>> +        with open(filename) as f:\r
  84 >>>> +            data = ""\r
  85 >>>> +            while True:\r
  86 >>>> +                line = f.readline()\r
  87 >>>> +                for header in IGNORED_HEADERS:\r
  88 >>>> +                    if line.startswith(header):\r
  89 >\r
  90 >> Michal Nazarewicz <mina86@mina86.com> writes:\r
  91 >>> Case of headers should be ignored, but this does not ignore it.\r
  92 >\r
  93 > On Tue, Sep 04 2012, Dmitry Kurochkin wrote:\r
  94 >> It does.\r
  95 >\r
  96 > Wait, how?  If line is “received:” how does it start with “Received:”?\r
  98 >\r
  99 \r
 100 Sorry, I misunderstood your comment.  Indeed, it does not ignore case.\r
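
For reference, a case-insensitive check could look roughly like the sketch
below.  It is written for Python 3 and is untested against the script itself.
The contents of IGNORED_HEADERS and the helper name is_ignored_header are
assumptions made for illustration; the list in the patch is not quoted in
this thread.

```python
# Hypothetical list: the actual IGNORED_HEADERS from the patch is not shown here.
IGNORED_HEADERS = ["Received:", "X-Spam-Status:"]

# Lower-case the names once; str.startswith accepts a tuple of prefixes.
_IGNORED_LOWER = tuple(header.lower() for header in IGNORED_HEADERS)


def is_ignored_header(line):
    """Return True if line starts with an ignored header name, in any case."""
    return line.lower().startswith(_IGNORED_LOWER)
```

With this, both "received: ..." and "Received: ..." lines would be skipped.
The helper only looks at the start of a line, so continuation lines of a
folded header would still need separate handling.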
 101 \r
 102 >>>> +            if os.path.realpath(comparator.filename) == os.path.realpath(filename):\r
 104 >>>> +                print "Message '%s' has filenames pointing to the same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename, filename)\r
 107 >>>\r
 108 >>> So why aren't those removed?\r
 109 >>>\r
 110 >>\r
 111 >> Because it is the same file indexed twice (probably because of\r
 112 >> symlinks).  We do not want to remove the only message file.\r
 113 >\r
 114 > Ah, right, with symlinks this is troublesome, but then again, we can\r
 115 > check if there is at least one non-symlink.  If there is, delete\r
 116 > everything else; if there is not, delete all but one arbitrarily chosen\r
 117 > symlink.\r
 118 >\r
 119 \r
 120 Sure, we could do that.\r
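
A rough, untested sketch of that rule in Python 3 follows.  The function name
symlinks_to_remove is made up for illustration, and this is one cautious
reading of the suggestion: it only ever proposes symlinks for removal, so a
path that reaches the file through a symlinked directory is left alone.

```python
import os


def symlinks_to_remove(filenames):
    """Pick which of several paths to the same message file may be removed.

    filenames is assumed to hold paths that all resolve to the same real file.
    Only paths that are symlinks themselves are ever returned.
    """
    links = [name for name in filenames if os.path.islink(name)]
    if len(links) < len(filenames):
        # At least one non-symlink exists, so every symlink is redundant.
        return links
    # Only symlinks: keep one (here simply the first) and drop the rest.
    return links[1:]
```

Calling os.remove() on the returned paths deletes only the links, never the
file they point to.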
 121 \r
 122 >>>> +            elif comparator.isDuplicate(filename):\r
 123 >>>> +                os.remove(filename)\r
 124 >>>> +                duplicates_count += 1\r
 125 >>>> +            else:\r
 126 >>>> +                #print "Potential duplicates: %s" % msg.get_message_id()\r
 128 >>>> +                suspected_duplicates_count += 1\r
 129 >>>> +\r
 130 >>>> +    new_timestamp = time.time()\r
 131 >>>> +    if new_timestamp - timestamp > 1:\r
 132 >>>> +        timestamp = new_timestamp\r
 133 >>>> +        sys.stdout.write("\rProcessed %s messages, removed %s duplicates..." % (msg_count, duplicates_count))\r
 135 >>>> +        sys.stdout.flush()\r
 136 >>>> +\r
 137 >>>> +print "\rFinished. Processed %s messages, removed %s duplicates." % (msg_count, duplicates_count)\r
 139 >>>> +if duplicates_count > 0:\r
 140 >>>> +    print "You might want to run 'notmuch new' now."\r
 141 >>>> +\r
 142 >>>> +if suspected_duplicates_count > 0:\r
 143 >>>> +    print\r
 144 >>>> +    print "Found %s messages with duplicate IDs but different content." % suspected_duplicates_count\r
 146 >>>> +    print "Perhaps we should ignore more headers."\r
 147 >>>\r
 148 >>> Please consider the following instead (not tested):\r
 149 >\r
 150 >> Thanks for reviewing my poor python code :) I am afraid I do not have\r
 151 >> enough interest in improving it.  I just implemented a simple solution\r
 152 >> for my problem.  Though it looks like you already took time to rewrite\r
 153 >> the script.  Would be great if you send it as a proper patch obsoleting\r
 154 >> this one.\r
 155 >\r
 156 > Bah, I probably won't have time to properly test it.\r
 157 >\r
 158 \r
 159 Same problem :)\r
 160 \r
 161 Regards,\r
 162   Dmitry\r
 163 \r
 164 > -- \r
 165 > Best regards,                                         _     _\r
 166 > .o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o\r
 167 > ..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)\r
 169 > ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--\r