From: Dmitry Kurochkin <dmitry.kurochkin@gmail.com>
To: Michal Nazarewicz <mina86@mina86.com>, notmuch@notmuchmail.org
Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.
In-Reply-To: <xa1tipbtk00n.fsf@mina86.com>
References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>
 <xa1tligpk1za.fsf@mina86.com> <87d321sg20.fsf@gmail.com>
 <xa1tipbtk00n.fsf@mina86.com>
User-Agent: Notmuch/0.14+18~g79a73cd (http://notmuchmail.org) Emacs/23.4.1
 (x86_64-pc-linux-gnu)
Date: Wed, 05 Sep 2012 00:33:10 +0400
Message-ID: <87a9x5sf3t.fsf@gmail.com>
List-Id: "Use and development of the notmuch mail system."
 <notmuch.notmuchmail.org>

Michal Nazarewicz <mina86@mina86.com> writes:

>>> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
>>>> +class MailComparator:
>>>> + """Checks if mail files are duplicates."""
>>>> + def __init__(self, filename):
>>>> + self.filename = filename
>>>> + self.mail = self.readFile(self.filename)
>>>> + def isDuplicate(self, filename):
>>>> + return self.mail == self.readFile(filename)
>>>> + @staticmethod
>>>> + def readFile(filename):
>>>> + with open(filename) as f:
>>>> + line = f.readline()
>>>> + for header in IGNORED_HEADERS:
>>>> + if line.startswith(header):

>> Michal Nazarewicz <mina86@mina86.com> writes:
>>> Case of headers should be ignored, but this does not ignore it.

> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
> Wait, how? If line is “received:” how does it starts with “Received:”?

Sorry, I misunderstood your comment. It does not ignore the case indeed.
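
For reference, a minimal sketch of what a case-insensitive check could look
like. It is not part of the patch and has not been run against the script. The
header list is hypothetical, since the actual contents of IGNORED_HEADERS are
not shown in this thread, and is_ignored_header is an illustrative helper
name. It is written for Python 3, whereas the patch targets Python 2:

```python
# Hypothetical values: the real IGNORED_HEADERS list is not shown in this thread.
IGNORED_HEADERS = ["Received:", "X-Spam-Status:"]


def is_ignored_header(line, ignored_headers=IGNORED_HEADERS):
    """Return True if line starts with an ignored header name, in any case."""
    # Lower-case both sides so "received:" and "Received:" compare equal.
    lowered = line.lower()
    return any(lowered.startswith(header.lower()) for header in ignored_headers)
```

With this, both "received: ..." and "Received: ..." lines would be skipped,
while unrelated headers such as "Subject:" would still be compared.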

>>>> + if os.path.realpath(comparator.filename) == os.path.r
>>>> + print "Message '%s' has filenames pointing to the
>>>> same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename,

>>> So why aren't those removed?

>> Because it is the same file indexed twice (probably because of
>> symlinks). We do not want to remove the only message file.

> Ah, right, with symlinks this is troublesome, but than again, we can
> check if there is at least one non-symlink. If there is, delete
> everything else, if there is not, delete all but one arbitrarily chosen

Sure, we could do that.
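
A rough sketch of that idea, not tested on a real mail store. The function
name plan_removal is made up for illustration, and it assumes the caller
passes only filenames already known to resolve to the same message file. As an
extra precaution that goes beyond the suggestion above, it only ever proposes
symlinks for removal, so a regular file reached through a symlinked directory
is never deleted. Python 3 again:

```python
import os


def plan_removal(filenames):
    """Return the paths that could be removed for one message.

    Assumes all filenames resolve to the same underlying file.
    """
    symlinks = [f for f in filenames if os.path.islink(f)]
    if len(symlinks) < len(filenames):
        # At least one entry is not a symlink: keep it, drop the symlinks.
        return symlinks
    # Every entry is a symlink: keep the first one, drop the rest.
    return symlinks[1:]
```

The function only returns a list and leaves the actual os.remove() calls to
the caller, which makes it easy to print the plan before deleting anything.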

>>>> + elif comparator.isDuplicate(filename):
>>>> + os.remove(filename)
>>>> + duplicates_count += 1
>>>> + #print "Potential duplicates: %s" % msg.get_message_i
>>>> + suspected_duplicates_count += 1
>>>> + new_timestamp = time.time()
>>>> + if new_timestamp - timestamp > 1:
>>>> + timestamp = new_timestamp
>>>> + sys.stdout.write("\rProcessed %s messages, removed %s duplicates..." % (msg_count, duplicates_count))
>>>> + sys.stdout.flush()
>>>> +print "\rFinished. Processed %s messages, removed %s duplicates." % (msg_count, duplicates_count)
>>>> +if duplicates_count > 0:
>>>> + print "You might want to run 'notmuch new' now."
>>>> +if suspected_duplicates_count > 0:
>>>> + print "Found %s messages with duplicate IDs but different content." % suspected_duplicates_count
>>>> + print "Perhaps we should ignore more headers."

>>> Please consider the following instead (not tested):

>> Thanks for reviewing my poor python code :) I am afraid I do not have
>> enough interest in improving it. I just implemented a simple solution
>> for my problem. Though it looks like you already took time to rewrite
>> the script. Would be great if you send it as a proper patch obsoleting

> Bah, I'll probably won't have time to properly test it.

> Best regards, _ _
> .o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
> ..o | Computer Science, Michał “mina86” Nazarewicz
> ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--