From: Michal Nazarewicz
To: Dmitry Kurochkin, notmuch@notmuchmail.org
Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.
In-Reply-To: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>
References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>
Date: Tue, 04 Sep 2012 21:43:53 +0200

On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
> The script removes duplicate message files.  It takes no options.
>
> Files are assumed duplicates if their content is the same except for
> ignored headers.  Currently, the only ignored header is Received:.
> ---
>  contrib/notmuch-remove-duplicates.py |   95 ++++++++++++++++++++++++++++++++++
>  1 file changed, 95 insertions(+)
>  create mode 100755 contrib/notmuch-remove-duplicates.py
>
> diff --git a/contrib/notmuch-remove-duplicates.py b/contrib/notmuch-remove-duplicates.py
> new file mode 100755
> index 0000000..dbe2e25
> --- /dev/null
> +++ b/contrib/notmuch-remove-duplicates.py
> @@ -0,0 +1,95 @@
> +#!/usr/bin/env python
> +
> +import sys
> +
> +IGNORED_HEADERS = [ "Received:" ]
> +
> +if len(sys.argv) != 1:
> +    print "Usage: %s" % sys.argv[0]
> +    print
> +    print "The script removes duplicate message files.  Takes no options."
> +    print "Requires notmuch python module."
> +    print
> +    print "Files are assumed duplicates if their content is the same"
> +    print "except for the following headers: %s." % ", ".join(IGNORED_HEADERS)
> +    exit(1)

This would be much better inside a main() function, which is then called
only if the script is run directly.

> +
> +import notmuch
> +import os
> +import time
> +
> +class MailComparator:
> +    """Checks if mail files are duplicates."""
> +    def __init__(self, filename):
> +        self.filename = filename
> +        self.mail = self.readFile(self.filename)
> +
> +    def isDuplicate(self, filename):
> +        return self.mail == self.readFile(filename)
> +
> +    @staticmethod
> +    def readFile(filename):
> +        with open(filename) as f:
> +            data = ""
> +            while True:
> +                line = f.readline()
> +                for header in IGNORED_HEADERS:
> +                    if line.startswith(header):

Header names are case-insensitive, but this comparison is case-sensitive.

> +                        # skip header continuation lines
> +                        while True:
> +                            line = f.readline()
> +                            if len(line) == 0 or line[0] not in [" ", "\t"]:
> +                                break
> +                        break

This also drops the line right after the ignored header: the inner loop
has already read it, and the outer loop then overwrites it with the next
readline() without ever appending it to data.
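
To illustrate both remarks, here is a minimal sketch (not the patch's code,
and not tested against a real mail store) of a filter that matches header
names case-insensitively and re-examines the line that ends an ignored
header instead of dropping it.  The name strip_ignored_headers is
hypothetical, and it works on a string rather than a file so that it can be
tried in isolation:

```python
import re

IGNORED_HEADERS = ["Received"]

# Match an ignored header name at the start of a line, in any letter case.
_is_ignored = re.compile(
    r"^(?:%s)\s*:" % "|".join(re.escape(h) for h in IGNORED_HEADERS),
    re.IGNORECASE,
).match


def strip_ignored_headers(text):
    """Return text without ignored headers and their continuation lines.

    Only the header block (everything before the first empty line) is
    filtered; the body is copied unchanged.
    """
    kept = []
    in_headers = True
    skipping = False  # True while inside an ignored header
    for line in text.splitlines(True):
        if in_headers:
            if line in ("\n", "\r\n"):
                # Blank line: end of the header block.
                in_headers = False
                skipping = False
            elif line[0] in " \t":
                # Continuation line: shares the fate of its header.
                if skipping:
                    continue
            else:
                # A new header starts here; decide again for this line
                # instead of discarding it.
                skipping = bool(_is_ignored(line))
                if skipping:
                    continue
        kept.append(line)
    return "".join(kept)
```

With this, two copies of a message that differ only in their Received:
headers (whatever the capitalisation) compare equal, while a line in the
body that happens to start with "Received:" is left alone.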
> +                else:
> +                    data += line
> +                    if line == "\n":
> +                        break
> +            data += f.read()
> +            return data
> +
> +db = notmuch.Database()
> +query = db.create_query('*')
> +print "Number of messages: %s" % query.count_messages()
> +
> +files_count = 0
> +for root, dirs, files in os.walk(db.get_path()):
> +    if not root.startswith(os.path.join(db.get_path(), ".notmuch/")):
> +        files_count += len(files)
> +print "Number of files: %s" % files_count
> +print "Estimated number of duplicates: %s" % (files_count - query.count_messages())
> +
> +msgs = query.search_messages()
> +msg_count = 0
> +suspected_duplicates_count = 0
> +duplicates_count = 0
> +timestamp = time.time()
> +for msg in msgs:
> +    msg_count += 1
> +    if len(msg.get_filenames()) > 1:
> +        filenames = msg.get_filenames()
> +        comparator = MailComparator(filenames.next())
> +        for filename in filenames:

Strictly speaking, you need to compare every file with every other file,
not just each file with the first one.  Otherwise two identical files
that both differ from the first are never detected.

> +            if os.path.realpath(comparator.filename) == os.path.realpath(filename):
> +                print "Message '%s' has filenames pointing to the same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename, filename)

So why aren't those removed?

> +            elif comparator.isDuplicate(filename):
> +                os.remove(filename)
> +                duplicates_count += 1
> +            else:
> +                #print "Potential duplicates: %s" % msg.get_message_id()
> +                suspected_duplicates_count += 1
> +
> +    new_timestamp = time.time()
> +    if new_timestamp - timestamp > 1:
> +        timestamp = new_timestamp
> +        sys.stdout.write("\rProcessed %s messages, removed %s duplicates..." % (msg_count, duplicates_count))
> +        sys.stdout.flush()
> +
> +print "\rFinished.  Processed %s messages, removed %s duplicates." % (msg_count, duplicates_count)
> +if duplicates_count > 0:
> +    print "You might want to run 'notmuch new' now."
> +
> +if suspected_duplicates_count > 0:
> +    print
> +    print "Found %s messages with duplicate IDs but different content." % suspected_duplicates_count
> +    print "Perhaps we should ignore more headers."

Please consider the following instead (not tested):

#!/usr/bin/env python

import collections
import notmuch
import os
import re
import sys
import time


IGNORED_HEADERS = [ 'Received' ]


isIgnoredHeadersLine = re.compile(
    r'^(?:%s)\s*:' % '|'.join(IGNORED_HEADERS),
    re.IGNORECASE).search

doesStartWithWS = re.compile(r'^\s').search


def usage(argv0):
    print """Usage: %s []

The script removes duplicate message files.  Takes no options.
Requires notmuch python module.

Files are assumed duplicates if their content is the same
except for the following headers: %s.""" % (argv0, ', '.join(IGNORED_HEADERS))


def readMailFile(filename):
    with open(filename) as fd:
        data = []
        skip_header = False
        # Use readline() rather than iterating over fd: Python 2 file
        # iteration reads ahead, so the fd.read() below could miss data.
        for line in iter(fd.readline, ''):
            if line == '\n':
                # End of headers: keep the separator and stop filtering.
                data.append(line)
                break
            if doesStartWithWS(line):
                # Continuation line: shares the fate of its header.
                if not skip_header:
                    data.append(line)
            elif isIgnoredHeadersLine(line):
                skip_header = True
            else:
                skip_header = False
                data.append(line)
        data.append(fd.read())
        return ''.join(data)


def dedupMessage(msg):
    # Copy into a list first; the original patch calls get_filenames()
    # twice, which suggests the returned iterator is one-shot.
    filenames = [f for f in msg.get_filenames()]
    if len(filenames) <= 1:
        return (0, 0)

    realpaths = collections.defaultdict(list)
    contents = collections.defaultdict(list)
    for filename in filenames:
        real = os.path.realpath(filename)
        lst = realpaths[real]
        lst.append(filename)
        if len(lst) == 1:
            contents[readMailFile(real)].append(real)

    duplicates = 0

    for filenames in contents.itervalues():
        if len(filenames) > 1:
            print 'Files with the same content:'
            print ' ', filenames.pop()
            duplicates += len(filenames)
            for filename in filenames:
                del realpaths[filename]
                # os.remove(filename)

    for real, filenames in realpaths.iteritems():
        if len(filenames) > 1:
            print 'Files pointing to the same message:'
            print ' ', filenames.pop()
            duplicates += len(filenames)
            # for filename in filenames:
            #     os.remove(filename)

    return (duplicates, len(realpaths) - 1)


def dedupQuery(query):
    print 'Number of messages: %s' % query.count_messages()
    msg_count = 0
    suspected_count = 0
    duplicates_count = 0
    timestamp = time.time()
    msgs = query.search_messages()
    for msg in msgs:
        msg_count += 1
        d, s = dedupMessage(msg)
        duplicates_count += d
        suspected_count += s

        new_timestamp = time.time()
        if new_timestamp - timestamp > 1:
            timestamp = new_timestamp
            sys.stdout.write('\rProcessed %s messages, removed %s duplicates...'
                             % (msg_count, duplicates_count))
            sys.stdout.flush()

    print '\rFinished.  Processed %s messages, removed %s duplicates.' % (
        msg_count, duplicates_count)
    if duplicates_count > 0:
        print 'You might want to run "notmuch new" now.'

    if suspected_count > 0:
        print """
Found %d messages with duplicate IDs but different content.
Perhaps we should ignore more headers.""" % suspected_count


def main(argv):
    if len(argv) == 1:
        query = '*'
    elif len(argv) == 2:
        query = argv[1]
    else:
        usage(argv[0])
        return 1

    db = notmuch.Database()
    query = db.create_query(query)
    dedupQuery(query)
    return 0


if __name__ == '__main__':
    sys.exit(main(sys.argv))

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +------------------ooO--(_)--Ooo--