Re: Deduplication ?

author Vladimir Marek <Vladimir.Marek@oracle.com>

Fri, 6 Jun 2014 10:40:18 +0000 (12:40 +0200)

committer W. Trevor King <wking@tremily.us>

Fri, 7 Nov 2014 18:03:09 +0000 (10:03 -0800)
diff --git a/63/1be5dca412e2e156d02a19014e9b19f9d13654 b/63/1be5dca412e2e156d02a19014e9b19f9d13654
new file mode 100644
index 0000000..c6cb733
--- /dev/null
+++ b/63/1be5dca412e2e156d02a19014e9b19f9d13654
@@ -0,0 +1,359 @@
+Return-Path: <Vladimir.Marek@oracle.com>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+       by olra.theworths.org (Postfix) with ESMTP id 1D0AB40DAD6\r
+       for <notmuch@notmuchmail.org>; Fri,  6 Jun 2014 03:40:52 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: -2.299\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=-2.299 tagged_above=-999 required=5\r
+       tests=[RCVD_IN_DNSWL_MED=-2.3, UNPARSEABLE_RELAY=0.001]\r
+       autolearn=disabled\r
+Received: from olra.theworths.org ([127.0.0.1])\r
+       by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
+       with ESMTP id hcLz1mheg5DB for <notmuch@notmuchmail.org>;\r
+       Fri,  6 Jun 2014 03:40:44 -0700 (PDT)\r
+Received: from aserp1040.oracle.com (aserp1040.oracle.com [141.146.126.69])\r
+       (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))\r
+       (No client certificate requested)\r
+       by olra.theworths.org (Postfix) with ESMTPS id 22E7945499F\r
+       for <notmuch@notmuchmail.org>; Fri,  6 Jun 2014 03:40:44 -0700 (PDT)\r
+Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237])\r
+       by aserp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with\r
+       ESMTP id s56AeQ9N029868\r
+       (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);\r
+       Fri, 6 Jun 2014 10:40:27 GMT\r
+Received: from userz7021.oracle.com (userz7021.oracle.com [156.151.31.85])\r
+       by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id\r
+       s56AeMUW013677\r
+       (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL);\r
+       Fri, 6 Jun 2014 10:40:23 GMT\r
+Received: from abhmp0010.oracle.com (abhmp0010.oracle.com [141.146.116.16])\r
+       by userz7021.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id\r
+       s56AeLVH007113; Fri, 6 Jun 2014 10:40:21 GMT\r
+Received: from virt.cz.oracle.com (/10.163.102.127)\r
+       by default (Oracle Beehive Gateway v4.0)\r
+       with ESMTP ; Fri, 06 Jun 2014 03:40:20 -0700\r
+Date: Fri, 6 Jun 2014 12:40:18 +0200\r
+From: Vladimir Marek <Vladimir.Marek@oracle.com>\r
+To: David Edmondson <david.edmondson@oracle.com>\r
+Subject: Re: Deduplication ?\r
+Message-ID: <20140606104018.GJ2154@virt.cz.oracle.com>\r
+References: <20140602123212.GA12639@virt.cz.oracle.com>\r
+       <87d2ers9mi.fsf@qmul.ac.uk> <m2ppirs8ea.fsf@guru.guru-group.fi>\r
+       <87ppirqtfa.fsf@qmul.ac.uk> <87y4xfz1fi.fsf@nikula.org>\r
+       <cunegz71aw9.fsf@gargravarr.hh.sledj.net>\r
+MIME-Version: 1.0\r
+Content-Type: multipart/mixed; boundary="mJm6k4Vb/yFcL9ZU"\r
+Content-Disposition: inline\r
+In-Reply-To: <cunegz71aw9.fsf@gargravarr.hh.sledj.net>\r
+User-Agent: Mutt/1.5.22.1-rc1 (2013-10-16)\r
+X-Source-IP: acsinet21.oracle.com [141.146.126.237]\r
+Cc: Tomi Ollila <tomi.ollila@iki.fi>, notmuch@notmuchmail.org\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.13\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+       <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+       <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Fri, 06 Jun 2014 10:40:52 -0000\r
+\r
+\r
+--mJm6k4Vb/yFcL9ZU\r
+Content-Type: text/plain; charset=utf-8\r
+Content-Disposition: inline\r
+\r
+Hi,\r
+\r
+\r
+So I wrote some code which works well for me. I have erased ~40k\r
+messages out of 500k. It does not try to be a complete solution; it only\r
+detects and removes the obvious cases. The idea is to help me control\r
+the number of duplicates when I import big mail archives which surely\r
+contain many duplicates into my mail database.\r
+\r
+> Thinking about this a bit...\r
+\r
+> The headers are likely to be different, so you could remove them (get\r
+> rid of everything up to the first empty line).\r
+\r
+Yes, that's what I ended up doing. And I delete the files which have\r
+fewer 'Received:' headers.\r
+\r
+\r
+> Various mailing lists add footers, so you would need to remove them (a\r
+> regular expression based approach would catch most of them easily).\r
+\r
+I defined a list of known footers. Then I take the two mails with the\r
+same message-id, create a diff between them, and compare it to the list of\r
+footers.\r
+\r
+\r
+> The remaining content should be the same for identical messages, so a\r
+> sensible hash (md5) could be used to compare.\r
+> \r
+> Although, some MTAs modify the body of the message when manipulating\r
+> encoding. I don't know how to address this.\r
+\r
+I'm attaching my Perl script if anyone is interested. It's in no way\r
+a complete solution. It is supposed to be used as\r
+\r
+notmuch search --output=files --duplicate=2 '*' > dups\r
+./dedup # It opens the file 'dups'\r
+\r
+The attached version does not remove anything (the 'unlink' command is\r
+commented out).\r
+\r
+\r
+Interestingly this does not work (it seems to return all messages):\r
+notmuch search --output=messages --duplicate=2 '*'\r
+\r
+Also I have found that if I run 'notmuch search' and 'notmuch new' at\r
+the same time, the notmuch search crashes sometimes. That's why I don't\r
+use\r
+\r
+notmuch search ... | ./dedup\r
+\r
+Use with care :)\r
+\r
+Thank you for your help\r
+-- \r
+       Vlad\r
+\r
+--mJm6k4Vb/yFcL9ZU\r
+Content-Type: text/plain; charset=utf-8\r
+Content-Disposition: attachment; filename=dedup\r
+\r
+#!/usr/bin/perl\r
+\r
+use Data::Dumper;\r
+use List::Util;\r
+\r
+\r
+@TO_IGNORE= (\r
+\r
+<<'EOT'\r
+> _______________________________________________\r
+> notmuch mailing list\r
+> notmuch@notmuchmail.org\r
+> http://notmuchmail.org/mailman/listinfo/notmuch\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> _______________________________________________\r
+> Userland-perl mailing list\r
+> Userland-perl@userland.us.oracle.com\r
+> http://userland.us.oracle.com/mailman/listinfo/userland-perl\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> _______________________________________________\r
+> Mercurial mailing list\r
+> Mercurial@selenic.com\r
+> http://selenic.com/mailman/listinfo/mercurial\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> --    \r
+> To unsubscribe from this list go to the following URL and read the\r
+> instructions:  https://lists.samba.org/mailman/options/samba\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> \r
+EOT\r
+\r
+);\r
+\r
+sub rm($$) {\r
+       my ($file, $comment) = @_;\r
+       print "-> $file\n";\r
+       print $comment;\r
+       # unlink $file;\r
+}\r
+\r
+sub check_mail_id($) {\r
+       $ID = $_[0];\r
+\r
+       unless (open ID, "-|", "./notmuch", "search", "--output=files", "id:$ID") {\r
+               warn "Can not fork: $!";\r
+               return;\r
+       }\r
+       chomp(@FILES = <ID>);\r
+       close ID;\r
+\r
+       if (scalar @FILES <= 1) {\r
+               warn "Not enough files for ID:$ID\n";\r
+               return;\r
+       }\r
+\r
+       my ($F1, $F2) = @FILES;\r
+\r
+       unless (-r $F1) {\r
+               warn "Can not read $F1 in ID:$ID\n";\r
+               return;\r
+       }\r
+       unless (-r $F2) {\r
+               warn "Can not read $F2 in ID:$ID\n";\r
+               return;\r
+       }\r
+       if ($F1 eq $F2) {\r
+               warn "Same filename $F1 in ID:$ID\n";\r
+               return;\r
+       }\r
+\r
+       unless (open DIFF_WHOLE, "-|", $diff, $F1, $F2) {\r
+               warn "Can not fork $diff: $!\n";\r
+               return;\r
+       }\r
+       $DIFF_WHOLE = join "", <DIFF_WHOLE>;\r
+       close DIFF_WHOLE;\r
+\r
+       if ( length($DIFF_WHOLE) == 0 ) {\r
+               rm $F2, "deleting_1\nID:$ID\n\n";\r
+               return;\r
+       }\r
+\r
+       # 35a36\r
+       # > Content-Length: 893\r
+       if (\r
+               $DIFF_WHOLE =~ /^\d+a\d+\n> Content-Length: \d+$/\r
+               or\r
+               $DIFF_WHOLE =~ /^\d+d\d+\n< Content-Length: \d+$/\r
+       ) {\r
+               rm $F2, "deleting_2\nID:$ID\n\n";\r
+               return;\r
+       }\r
+\r
+\r
+\r
+       # $r="[a-zA-Z0-9 ()[\]\.\+:/=;,\t-]+";\r
+       # if (\r
+       #       $DIFF_WHOLE =~ /1,7d0\n< Received:$r\n< \t$r\n< \t$r\n< Received:$r\n< \t$r\n< \t$r\n< \t$r\n\d+a\d+,\d+\n> Content-Length:$r\n> Lines:$r/\r
+       # ) {\r
+       #       printf "deleting_3\nID:$ID\n$DIFF_WHOLE\n\n";\r
+       #       return;\r
+       # }\r
+\r
+       unless (open DIFF_BODY, "-|", "bash", "-c", "$diff <(sed -e 1,/^\$/d \"\$1\" ) <(sed -e 1,/^\$/d \"\$2\" )", "", $F1, $F2) {\r
+               warn "Can not fork $diff (2): $!\n";\r
+               return;\r
+       }\r
+       $DIFF_BODY = join "", <DIFF_BODY>;\r
+       close DIFF_BODY;\r
+\r
+       if ( length($DIFF_BODY) == 0 ) {\r
+               # The bodies are the same - let's find which one has fewer\r
+               # Received: headers and delete that\r
+               unless (open F, $F1) \r
+               {\r
+                       warn "Can't open F1 '$F1': $!";\r
+                       return;\r
+               }\r
+               my $count1 = grep { /^Received: / } <F>;\r
+               close F;\r
+               unless (open F, $F2) \r
+               {\r
+                       warn "Can't open F2 '$F2': $!";\r
+                       return;\r
+               }\r
+               my $count2 = grep { /^Received: / } <F>;\r
+               close F;\r
+\r
+               if ($count1 > $count2) {\r
+                       rm $F2, "deleting_4a\nID:$ID\n\n";\r
+               } else {\r
+                       rm $F1, "deleting_4b\nID:$ID\n\n";\r
+               }\r
+               return;\r
+       }\r
+\r
+\r
+       for (@TO_IGNORE) {\r
+               next unless $DIFF_BODY =~ $_;\r
+               # Remove the first one as the second is adding lines\r
+               rm $F1, "deleting_5\nID:$ID\n\n";\r
+               return;\r
+       }\r
+\r
+       for (@TO_IGNORE_REVERSE) {\r
+               next unless $DIFF_BODY =~ $_;\r
+               # Remove the second as it is removing some lines\r
+               rm $F2, "deleting_6\nID:$ID\n\n";\r
+               return;\r
+       }\r
+\r
+       #--------------------------------------------------\r
+       # '2c2\r
+       # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)\r
+       # ---\r
+       # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)\r
+       # 39c39\r
+       # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)\r
+       # ---\r
+       # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)\r
+       # 55c55\r
+       # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)--\r
+       # ---\r
+       # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)--\r
+       #-------------------------------------------------- \r
+       $re = qr/(\d+)c\1\n< --Boundary_\(\S+\)(?:--)?\n---\n> --Boundary_\(\S+\)(?:--)?\n/;\r
+       if ( $DIFF_BODY =~ m/^(?:$re+)$/ ) {\r
+               # Change in boundary strings\r
+               rm $F2, "deleting_7\nID:$ID\n\n";\r
+               return;\r
+       }\r
+\r
+       print "DIFF_BODY (ID: $ID):\n'$DIFF_BODY'\n\n" if length $DIFF_BODY < 300;\r
+}\r
+\r
+$diff = 'diff';\r
+$diff = 'gdiff' if -x '/usr/bin/gdiff'; # Solaris\r
+\r
+# First create reverse regexps (removing lines from the mail) so that we don't\r
+# overwrite the original @TO_IGNORE\r
+@TO_IGNORE_REVERSE = map {\r
+       $x = $_;                       # Make sure we don't change the @TO_IGNORE array\r
+       $x =~ s/^>/</mg;               # Turn added lines ('>') into removed lines ('<')\r
+       qr/^(?:\d+,)?\d+d\d+\n\Q$x\E$/ # 1,2d3 or 2d3\r
+} @TO_IGNORE;\r
+\r
+# Now map the positive regexp (adding lines to the mail)\r
+@TO_IGNORE = map {\r
+       s/^</>/mg;                      # Make sure all the lines are adding text\r
+       qr/^\d+a\d+?(?:,\d+)?\n\Q$_\E$/ # 115a116,119 or 114a116\r
+} @TO_IGNORE;\r
+\r
+# File 'dups' is created via\r
+# notmuch search --output=files --duplicate=2 '*' > dups\r
+\r
+open INPUT, "dups" or die "Can't open dups: $!\n";\r
+while (<INPUT>) {\r
+       chomp;\r
+       if (open FILE, $_) {\r
+               $id =  List::Util::first { s/^message-id:.*<(.*)>\n$/$1/i } <FILE>;\r
+               close FILE;\r
+               check_mail_id $id if defined $id;\r
+       } else {\r
+               print "Can't find '$_'\n";\r
+       }\r
+}\r
+close INPUT;\r
+\r
+--mJm6k4Vb/yFcL9ZU--\r
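
For readers skimming the attachment, the body-comparison step (the deleting_4a/4b branch) can be sketched in a few lines of POSIX shell. This is an illustration only, not part of the attached script: the function names body, received_count, and pick_duplicate are hypothetical, cksum stands in for the diff call, and the files are assumed to use LF line endings.

```shell
# Sketch only: these helper names are hypothetical, not from the attached script.

# Print the message body: drop everything up to and including the first
# empty line (the same sed expression the attached script passes to bash).
body() {
    sed -e '1,/^$/d' "$1"
}

# Count header lines that start with "Received: ".
received_count() {
    grep -c '^Received: ' "$1"
}

# If both bodies are identical, print the file with fewer Received: headers
# (the removal candidate). On a tie the first file is printed, mirroring the
# script's else branch. Prints nothing when the bodies differ.
pick_duplicate() {
    if [ "$(body "$1" | cksum)" = "$(body "$2" | cksum)" ]; then
        if [ "$(received_count "$1")" -gt "$(received_count "$2")" ]; then
            printf '%s\n' "$2"
        else
            printf '%s\n' "$1"
        fi
    fi
}
```

Like the attached script in its posted form, this only reports a candidate and deletes nothing; any removal would be a separate, deliberate step.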