54 Sender: Michal Nazarewicz <mpn@google.com>
\r
55 From: Michal Nazarewicz <mina86@mina86.com>
\r
56 To: Dmitry Kurochkin <dmitry.kurochkin@gmail.com>, notmuch@notmuchmail.org
\r
57 Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.
\r
58 In-Reply-To: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>
\r
59 Organization: http://mina86.com/
\r
60 References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>
\r
61 User-Agent: Notmuch/0.14+2~g416b120 (http://notmuchmail.org) Emacs/24.2.50.1
\r
62 (x86_64-unknown-linux-gnu)
\r
66 X-PGP-FP: AC1F 5F5C D418 88F8 CC84 5858 2060 4012 5075 1FF4
\r
67 Date: Tue, 04 Sep 2012 21:43:53 +0200
\r
68 Message-ID: <xa1tligpk1za.fsf@mina86.com>
\r
70 Content-Type: multipart/mixed; boundary="=-=-="
\r
87 Content-Type: text/plain; charset=utf-8
\r
88 Content-Transfer-Encoding: quoted-printable
\r
90 On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
\r
91 > The script removes duplicate message files. It takes no options.
\r
93 > Files are assumed duplicates if their content is the same except for
\r
94 > ignored headers. Currently, the only ignored header is Received:.
\r
96 > contrib/notmuch-remove-duplicates.py | 95 ++++++++++++++++++++++++++++=
\r
98 > 1 file changed, 95 insertions(+)
\r
99 > create mode 100755 contrib/notmuch-remove-duplicates.py
\r
101 > diff --git a/contrib/notmuch-remove-duplicates.py b/contrib/notmuch-remov=
\r
103 > new file mode 100755
\r
104 > index 0000000..dbe2e25
\r
106 > +++ b/contrib/notmuch-remove-duplicates.py
\r
108 > +#!/usr/bin/env python
\r
112 > +IGNORED_HEADERS =3D [ "Received:" ]
\r
114 > +if len(sys.argv) !=3D 1:
\r
115 > + print "Usage: %s" % sys.argv[0]
\r
117 > + print "The script removes duplicate message files. Takes no options=
\r
119 > + print "Requires notmuch python module."
\r
121 > + print "Files are assumed duplicates if their content is the same"
\r
122 > + print "except for the following headers: %s." % ", ".join(IGNORED_HE=
\r
It's much better to put this inside a main() function, which is then
called only if the script is run directly.
\r
134 > +class MailComparator:
\r
135 > + """Checks if mail files are duplicates."""
\r
136 > + def __init__(self, filename):
\r
137 > + self.filename =3D filename
\r
138 > + self.mail =3D self.readFile(self.filename)
\r
140 > + def isDuplicate(self, filename):
\r
141 > + return self.mail =3D=3D self.readFile(filename)
\r
144 > + def readFile(filename):
\r
145 > + with open(filename) as f:
\r
148 > + line =3D f.readline()
\r
149 > + for header in IGNORED_HEADERS:
\r
150 > + if line.startswith(header):
\r
Header names are case-insensitive, but startswith() is case-sensitive,
so a header spelled "received:" would not be ignored here.
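
To illustrate the point about case, here is a minimal sketch of a case-insensitive match. It is written for Python 3 (the patch itself is Python 2) and is not tested against the patch. `is_ignored_header` is a hypothetical helper name, and the list mirrors the patch's IGNORED_HEADERS without the trailing colon.

```python
import re

# Header names to ignore. Assumption: same list as in the patch,
# but without the trailing colon.
IGNORED_HEADERS = ["Received"]

# Matches "<name>:" at the very start of a line, regardless of case.
_ignored_re = re.compile(
    r"^(?:%s)\s*:" % "|".join(re.escape(h) for h in IGNORED_HEADERS),
    re.IGNORECASE,
)


def is_ignored_header(line):
    """Return True if `line` starts one of the ignored headers."""
    return _ignored_re.match(line) is not None
```

Escaping each name with re.escape() keeps the pattern safe if a header name containing regex metacharacters is ever added to the list.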
\r
154 > + # skip header continuation lines
\r
156 > + line =3D f.readline()
\r
157 > + if len(line) =3D=3D 0 or line[0] not in [" "=
\r
This will also drop the line immediately after the ignored header,
even though that line belongs to the next header.
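
One way to drop an ignored header together with its continuation lines, without losing the line that follows, is to track a "skipping" flag while walking the header block. This is a Python 3 sketch, not tested against the patch. `strip_ignored_headers` and its `ignored` parameter are hypothetical names.

```python
def strip_ignored_headers(lines, ignored=("received",)):
    """Return `lines` with ignored headers and their continuation
    lines removed. `lines` is an iterable of strings that keep their
    trailing newline; `ignored` holds lower-case header names."""
    kept = []
    skipping = False
    in_headers = True
    for line in lines:
        if in_headers:
            if line[:1] in (" ", "\t"):
                # Continuation line: shares the fate of its header.
                if skipping:
                    continue
            else:
                name = line.split(":", 1)[0].strip().lower()
                skipping = name in ignored
                if skipping:
                    continue
                if line in ("\n", "\r\n"):
                    # Blank line ends the header block.
                    in_headers = False
        kept.append(line)
    return kept
```

Because the decision is made per line, the first line after the skipped header is examined like any other header line and is never consumed by the skipping loop. Body lines are passed through untouched.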
\r
166 > + if line =3D=3D "\n":
\r
168 > + data +=3D f.read()
\r
171 > +db =3D notmuch.Database()
\r
172 > +query =3D db.create_query('*')
\r
173 > +print "Number of messages: %s" % query.count_messages()
\r
175 > +files_count =3D 0
\r
176 > +for root, dirs, files in os.walk(db.get_path()):
\r
177 > + if not root.startswith(os.path.join(db.get_path(), ".notmuch/")):
\r
178 > + files_count +=3D len(files)
\r
179 > +print "Number of files: %s" % files_count
\r
180 > +print "Estimated number of duplicates: %s" % (files_count - query.count_=
\r
183 > +msgs =3D query.search_messages()
\r
185 > +suspected_duplicates_count =3D 0
\r
186 > +duplicates_count =3D 0
\r
187 > +timestamp =3D time.time()
\r
188 > +for msg in msgs:
\r
189 > + msg_count +=3D 1
\r
190 > + if len(msg.get_filenames()) > 1:
\r
191 > + filenames =3D msg.get_filenames()
\r
192 > + comparator =3D MailComparator(filenames.next())
\r
193 > + for filename in filenames:
\r
Strictly speaking, you need to compare every file with every other
file, not just each file with the first one.
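
One way to get the effect of comparing every file with every other file, without a quadratic number of comparisons, is to group the files by their comparable content. This is a Python 3 sketch under that assumption. `group_by_content` and its `read` parameter are hypothetical names; `read` stands in for whatever function returns a message's content with the ignored headers removed.

```python
import collections


def group_by_content(filenames, read=None):
    """Group `filenames` by comparable content.

    `read` maps a filename to the content used for comparison; by
    default the raw bytes of the file are used.
    """
    if read is None:
        def read(name):
            with open(name, "rb") as f:
                return f.read()
    groups = collections.defaultdict(list)
    for name in filenames:
        groups[read(name)].append(name)
    # Only groups with more than one member contain duplicates.
    return [names for names in groups.values() if len(names) > 1]
```

With this grouping, files B and C that match each other but differ from the first file A still end up in the same group, which a first-file-only comparison would miss.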
\r
198 > + if os.path.realpath(comparator.filename) =3D=3D os.path.real=
\r
200 > + print "Message '%s' has filenames pointing to the
\r
201 > same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename,
\r
204 So why aren't those removed?
\r
206 > + elif comparator.isDuplicate(filename):
\r
207 > + os.remove(filename)
\r
208 > + duplicates_count +=3D 1
\r
210 > + #print "Potential duplicates: %s" % msg.get_message_id()
\r
211 > + suspected_duplicates_count +=3D 1
\r
213 > + new_timestamp =3D time.time()
\r
214 > + if new_timestamp - timestamp > 1:
\r
215 > + timestamp =3D new_timestamp
\r
216 > + sys.stdout.write("\rProcessed %s messages, removed %s duplicates=
\r
217 ..." % (msg_count, duplicates_count))
\r
218 > + sys.stdout.flush()
\r
220 > +print "\rFinished. Processed %s messages, removed %s duplicates." % (msg=
\r
221 _count, duplicates_count)
\r
222 > +if duplicates_count > 0:
\r
223 > + print "You might want to run 'notmuch new' now."
\r
225 > +if suspected_duplicates_count > 0:
\r
227 > + print "Found %s messages with duplicate IDs but different content." =
\r
228 % suspected_duplicates_count
\r
229 > + print "Perhaps we should ignore more headers."
\r
231 Please consider the following instead (not tested):
\r
234 #!/usr/bin/env python
\r
244 IGNORED_HEADERS =3D [ 'Received' ]
\r
247 isIgnoredHeadersLine =3D re.compile(
\r
248 r'^(?:%s)\s*:' % '|'.join(IGNORED_HEADERS),
\r
249 re.IGNORECASE).search
\r
251 doesStartWithWS =3D re.compile(r'^\s').search
\r
255 print """Usage: %s [<query-string>]
\r
The script removes duplicate message files. Takes no options.
Requires notmuch python module.

Files are assumed duplicates if their content is the same
\r
261 except for the following headers: %s.""" % (argv0, ', '.join(IGNORED_HEADER=
\r
265 def readMailFile(filename):
\r
266 with open(filename) as fd:
\r
268 skip_header =3D False
\r
270 if doesStartWithWS(line):
\r
271 if not skip_header:
\r
273 elif isIgnoredHeadersLine(line):
\r
274 skip_header =3D True
\r
277 if line =3D=3D '\n':
\r
279 data.append(fd.read())
\r
280 return ''.join(data)
\r
283 def dedupMessage(msg):
\r
284 filenames =3D msg.get_filenames()
\r
285 if len(filenames) <=3D 1:
\r
288 realpaths =3D collections.defaultdict(list)
\r
289 contents =3D collections.defaultdict(list)
\r
290 for filename in filenames:
\r
291 real =3D os.path.realpath(filename)
\r
292 lst =3D realpaths[real]
\r
293 lst.append(filename)
\r
294 if len(lst) =3D=3D 1:
\r
295 contents[readMailFile(real)].append(real)
\r
299 for filenames in contents.itervalues():
\r
300 if len(filenames) > 1:
\r
301 print 'Files with the same content:'
\r
302 print ' ', filenames.pop()
\r
303 duplicates +=3D len(filenames)
\r
304 for filename in filenames:
\r
305 del realpaths[filename]
\r
# os.remove(filename)
\r
308 for real, filenames in realpaths.iteritems():
\r
309 if len(filenames) > 1:
\r
310 print 'Files pointing to the same message:'
\r
311 print ' ', filenames.pop()
\r
312 duplicates +=3D len(filenames)
\r
313 # for filename in filenames:
\r
# os.remove(filename)
\r
316 return (duplicates, len(realpaths) - 1)
\r
319 def dedupQuery(query):
\r
320 print 'Number of messages: %s' % query.count_messages()
\r
322 suspected_count =3D 0
\r
323 duplicates_count =3D 0
\r
324 timestamp =3D time.time()
\r
325 msgs =3D query.search_messages()
\r
328 d, s =3D dedupMessage(msg)
\r
329 duplicates_count +=3D d
\r
suspected_count += s
\r
332 new_timestamp =3D time.time()
\r
333 if new_timestamp - timestamp > 1:
\r
334 timestamp =3D new_timestamp
\r
335 sys.stdout.write('\rProcessed %s messages, removed %s duplicate=
\r
337 % (msg_count, duplicates_count))
\r
340 print '\rFinished. Processed %s messages, removed %s duplicates.' % (
\r
341 msg_count, duplicates_count)
\r
342 if duplicates_count > 0:
\r
343 print 'You might want to run "notmuch new" now.'
\r
if suspected_count > 0:
\r
347 Found %d messages with duplicate IDs but different content.
\r
348 Perhaps we should ignore more headers.""" % suspected_count
\r
352 if len(argv) =3D=3D 1:
\r
354 elif len(argv) =3D=3D 2:
\r
360 db =3D notmuch.Database()
\r
361 query =3D db.create_query(query)
\r
dedupQuery(query)
\r
366 if __name__ =3D=3D '__main__':
\r
367 sys.exit(main(sys.argv))
\r
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--
\r