From: Dmitry Kurochkin
To: Michal Nazarewicz, notmuch@notmuchmail.org
Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.
References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com> <87d321sg20.fsf@gmail.com>
Date: Wed, 05 Sep 2012 00:33:10 +0400
Message-ID: <87a9x5sf3t.fsf@gmail.com>

Michal Nazarewicz writes:

>>> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
>>>> +class MailComparator:
>>>> +    """Checks if mail files are duplicates."""
>>>> +    def __init__(self, filename):
>>>> +        self.filename = filename
>>>> +        self.mail = self.readFile(self.filename)
>>>> +
>>>> +    def isDuplicate(self, filename):
>>>> +        return self.mail == self.readFile(filename)
>>>> +
>>>> +    @staticmethod
>>>> +    def readFile(filename):
>>>> +        with open(filename) as f:
>>>> +            data = ""
>>>> +            while True:
>>>> +                line = f.readline()
>>>> +                for header in IGNORED_HEADERS:
>>>> +                    if line.startswith(header):
>
>> Michal Nazarewicz writes:
>>> Case of headers should be ignored, but this does not ignore it.
>
> On Tue, Sep 04 2012, Dmitry Kurochkin wrote:
>> It does.
>
> Wait, how?  If line is “received:” how does it starts with “Received:”?
>

Sorry, I misunderstood your comment.  It does not ignore the case indeed.
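For reference, a minimal sketch of how the check could ignore case.
Header field names are case-insensitive, so lowering both sides before
comparing should be enough.  The contents of IGNORED_HEADERS below and
the helper name is_ignored_header are assumptions for illustration, not
what the script actually has, and the sketch does not handle folded
continuation lines:

```python
# Hypothetical list; the actual IGNORED_HEADERS in the script may differ.
IGNORED_HEADERS = ("Received:", "X-Spam-Status:")

# Lower-case the prefixes once so every comparison is case-insensitive.
_IGNORED_LOWER = tuple(h.lower() for h in IGNORED_HEADERS)


def is_ignored_header(line):
    """Return True if line starts with an ignored header name, in any case."""
    # str.startswith accepts a tuple of prefixes.
    return line.lower().startswith(_IGNORED_LOWER)
```

With this, readFile could call is_ignored_header(line) instead of
looping over IGNORED_HEADERS with a case-sensitive startswith.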

>>>> +        if os.path.realpath(comparator.filename) == os.path.realpath(filename):
>>>> +            print "Message '%s' has filenames pointing to the same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename, filename)
>>>
>>> So why aren't those removed?
>>>
>>
>> Because it is the same file indexed twice (probably because of
>> symlinks).  We do not want to remove the only message file.
>
> Ah, right, with symlinks this is troublesome, but than again, we can
> check if there is at least one non-symlink.  If there is, delete
> everything else, if there is not, delete all but one arbitrarily chosen
> symlink.
>

Sure, we could do that.

>>>> +        elif comparator.isDuplicate(filename):
>>>> +            os.remove(filename)
>>>> +            duplicates_count += 1
>>>> +        else:
>>>> +            #print "Potential duplicates: %s" % msg.get_message_id()
>>>> +            suspected_duplicates_count += 1
>>>> +
>>>> +    new_timestamp = time.time()
>>>> +    if new_timestamp - timestamp > 1:
>>>> +        timestamp = new_timestamp
>>>> +        sys.stdout.write("\rProcessed %s messages, removed %s duplicates..." % (msg_count, duplicates_count))
>>>> +        sys.stdout.flush()
>>>> +
>>>> +print "\rFinished. Processed %s messages, removed %s duplicates." % (msg_count, duplicates_count)
>>>> +if duplicates_count > 0:
>>>> +    print "You might want to run 'notmuch new' now."
>>>> +
>>>> +if suspected_duplicates_count > 0:
>>>> +    print
>>>> +    print "Found %s messages with duplicate IDs but different content." % suspected_duplicates_count
>>>> +    print "Perhaps we should ignore more headers."
>>>
>>> Please consider the following instead (not tested):
>
>> Thanks for reviewing my poor python code :) I am afraid I do not have
>> enough interest in improving it.  I just implemented a simple solution
>> for my problem.  Though it looks like you already took time to rewrite
>> the script.  Would be great if you send it as a proper patch obsoleting
>> this one.
>
> Bah, I'll probably won't have time to properly test it.
>

Same problem :)

Regards,
  Dmitry

> -- 
> Best regards,                                         _     _
> .o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
> ..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
> ooo +------------------ooO--(_)--Ooo--
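
P.S. The symlink rule discussed above (keep a non-symlink if there is
one, otherwise keep one arbitrarily chosen symlink) could look roughly
like the sketch below.  It is untested against a real notmuch database,
and the function name links_to_remove is made up for illustration.  It
only ever returns symlinks, so paths that alias each other through a
symlinked parent directory are left alone rather than risk deleting the
only message file:

```python
import os


def links_to_remove(paths):
    """Given paths that resolve to the same file, return the symlinks that can go."""
    symlinks = [p for p in paths if os.path.islink(p)]
    non_symlinks = [p for p in paths if not os.path.islink(p)]
    if non_symlinks:
        # A real directory entry exists, so every symlink is redundant.
        return symlinks
    # Only symlinks: keep one (arbitrarily, the first) and drop the rest.
    return symlinks[1:]
```

The caller would pass the filenames whose os.path.realpath values are
equal and call os.remove on each returned path, which removes the link
itself and not its target.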