--- /dev/null
+Return-Path: <Vladimir.Marek@oracle.com>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+ by olra.theworths.org (Postfix) with ESMTP id 1D0AB40DAD6\r
+ for <notmuch@notmuchmail.org>; Fri, 6 Jun 2014 03:40:52 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: -2.299\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=-2.299 tagged_above=-999 required=5\r
+ tests=[RCVD_IN_DNSWL_MED=-2.3, UNPARSEABLE_RELAY=0.001]\r
+ autolearn=disabled\r
+Received: from olra.theworths.org ([127.0.0.1])\r
+ by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
+ with ESMTP id hcLz1mheg5DB for <notmuch@notmuchmail.org>;\r
+ Fri, 6 Jun 2014 03:40:44 -0700 (PDT)\r
+Received: from aserp1040.oracle.com (aserp1040.oracle.com [141.146.126.69])\r
+ (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))\r
+ (No client certificate requested)\r
+ by olra.theworths.org (Postfix) with ESMTPS id 22E7945499F\r
+ for <notmuch@notmuchmail.org>; Fri, 6 Jun 2014 03:40:44 -0700 (PDT)\r
+Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237])\r
+ by aserp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with\r
+ ESMTP id s56AeQ9N029868\r
+ (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);\r
+ Fri, 6 Jun 2014 10:40:27 GMT\r
+Received: from userz7021.oracle.com (userz7021.oracle.com [156.151.31.85])\r
+ by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id\r
+ s56AeMUW013677\r
+ (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL);\r
+ Fri, 6 Jun 2014 10:40:23 GMT\r
+Received: from abhmp0010.oracle.com (abhmp0010.oracle.com [141.146.116.16])\r
+ by userz7021.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id\r
+ s56AeLVH007113; Fri, 6 Jun 2014 10:40:21 GMT\r
+Received: from virt.cz.oracle.com (/10.163.102.127)\r
+ by default (Oracle Beehive Gateway v4.0)\r
+ with ESMTP ; Fri, 06 Jun 2014 03:40:20 -0700\r
+Date: Fri, 6 Jun 2014 12:40:18 +0200\r
+From: Vladimir Marek <Vladimir.Marek@oracle.com>\r
+To: David Edmondson <david.edmondson@oracle.com>\r
+Subject: Re: Deduplication ?\r
+Message-ID: <20140606104018.GJ2154@virt.cz.oracle.com>\r
+References: <20140602123212.GA12639@virt.cz.oracle.com>\r
+ <87d2ers9mi.fsf@qmul.ac.uk> <m2ppirs8ea.fsf@guru.guru-group.fi>\r
+ <87ppirqtfa.fsf@qmul.ac.uk> <87y4xfz1fi.fsf@nikula.org>\r
+ <cunegz71aw9.fsf@gargravarr.hh.sledj.net>\r
+MIME-Version: 1.0\r
+Content-Type: multipart/mixed; boundary="mJm6k4Vb/yFcL9ZU"\r
+Content-Disposition: inline\r
+In-Reply-To: <cunegz71aw9.fsf@gargravarr.hh.sledj.net>\r
+User-Agent: Mutt/1.5.22.1-rc1 (2013-10-16)\r
+X-Source-IP: acsinet21.oracle.com [141.146.126.237]\r
+Cc: Tomi Ollila <tomi.ollila@iki.fi>, notmuch@notmuchmail.org\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.13\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+ <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Fri, 06 Jun 2014 10:40:52 -0000\r
+\r
+\r
+--mJm6k4Vb/yFcL9ZU\r
+Content-Type: text/plain; charset=utf-8\r
+Content-Disposition: inline\r
+\r
+Hi,\r
+\r
+\r
+So I wrote some code which works well for me. I have erased ~40k\r
+messages out of 500k. It does not try to be a complete solution; it only\r
+detects and removes the obvious cases. The idea is to help me control\r
+the number of duplicates when I import big mail archives, which surely\r
+contain many duplicates, into my mail database.\r
+\r
+> Thinking about this a bit...\r
+\r
+> The headers are likely to be different, so you could remove them (get\r
+> rid of everything up to the first empty line).\r
+\r
+Yes, that's what I ended up doing. And I delete the files which have\r
+fewer 'Received:' headers.\r
+\r
+\r
+> Various mailing lists add footers, so you would need to remove them (a\r
+> regular expression based approach would catch most of them easily).\r
+\r
+I defined a list of known footers. Then I take the two mails with the\r
+same message-id, create a diff between them and compare it to the list of\r
+footers.\r
+\r
+\r
+> The remaining content should be the same for identical messages, so a\r
+> sensible hash (md5) could be used to compare.\r
+> \r
+> Although, some MTAs modify the body of the message when manipulating\r
+> encoding. I don't know how to address this.\r
+\r
+I'm attaching my Perl script if anyone is interested. It's in no way\r
+a complete solution. It is supposed to be used as\r
+\r
+notmuch search --output=files --duplicate=2 '*' > dups\r
+./dedup # It opens the file 'dups'\r
+\r
+The attached version does not remove anything (the 'unlink' command is\r
+commented out).\r
+\r
+\r
+Interestingly, this does not work (it seems to return all messages):\r
+notmuch search --output=messages --duplicate=2 '*'\r
+\r
+Also I have found that if I run 'notmuch search' and 'notmuch new' at\r
+the same time, the notmuch search crashes sometimes. That's why I don't\r
+use\r
+\r
+notmuch search ... | ./dedup\r
+\r
+Use with care :)\r
+\r
+Thank you for your help\r
+-- \r
+ Vlad\r
+\r
+--mJm6k4Vb/yFcL9ZU\r
+Content-Type: text/plain; charset=utf-8\r
+Content-Disposition: attachment; filename=dedup\r
+\r
+#!/usr/bin/perl\r
+\r
+use Data::Dumper;\r
+use List::Util;\r
+\r
+\r
+@TO_IGNORE= (\r
+\r
+<<'EOT'\r
+> _______________________________________________\r
+> notmuch mailing list\r
+> notmuch@notmuchmail.org\r
+> http://notmuchmail.org/mailman/listinfo/notmuch\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> _______________________________________________\r
+> Userland-perl mailing list\r
+> Userland-perl@userland.us.oracle.com\r
+> http://userland.us.oracle.com/mailman/listinfo/userland-perl\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> _______________________________________________\r
+> Mercurial mailing list\r
+> Mercurial@selenic.com\r
+> http://selenic.com/mailman/listinfo/mercurial\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> -- \r
+> To unsubscribe from this list go to the following URL and read the\r
+> instructions: https://lists.samba.org/mailman/options/samba\r
+EOT\r
+\r
+,\r
+\r
+<<'EOT'\r
+> \r
+EOT\r
+\r
+);\r
+\r
+sub rm($$) {\r
+ my ($file, $comment) = @_;\r
+ print "-> $file\n";\r
+ print $comment;\r
+ # unlink $file;\r
+}\r
+\r
+sub check_mail_id($) {\r
+ $ID = $_[0];\r
+\r
+ unless (open ID, "-|", "./notmuch", "search", "--output=files", "id:$ID") {\r
+ warn "Can not fork: $!";\r
+ return;\r
+ }\r
+ chomp(@FILES = <ID>);\r
+ close ID;\r
+\r
+ if (scalar @FILES <= 1) {\r
+ warn "Not enough files for ID:$ID\n";\r
+ return;\r
+ }\r
+\r
+ my ($F1, $F2) = @FILES;\r
+\r
+ unless (-r $F1) {\r
+ warn "Can not read $F1 in ID:$ID\n";\r
+ return;\r
+ }\r
+ unless (-r $F2) {\r
+ warn "Can not read $F2 in ID:$ID\n";\r
+ return;\r
+ }\r
+ if ($F1 eq $F2) {\r
+ warn "Same filename $F1 in ID:$ID\n";\r
+ return;\r
+ }\r
+\r
+ unless (open DIFF_WHOLE, "-|", $diff, $F1, $F2) {\r
+ warn "Can not fork $diff: $!\n";\r
+ return;\r
+ }\r
+ $DIFF_WHOLE = join "", <DIFF_WHOLE>;\r
+ close DIFF_WHOLE;\r
+\r
+ if ( length($DIFF_WHOLE) == 0 ) {\r
+ rm $F2, "deleting_1\nID:$ID\n\n";\r
+ return;\r
+ }\r
+\r
+ # 35a36\r
+ # > Content-Length: 893\r
+ if (\r
+ $DIFF_WHOLE =~ /^\d+a\d+\n> Content-Length: \d+$/\r
+ or\r
+ $DIFF_WHOLE =~ /^\d+d\d+\n< Content-Length: \d+$/\r
+ ) {\r
+ rm $F2, "deleting_2\nID:$ID\n\n";\r
+ return;\r
+ }\r
+\r
+\r
+\r
+ # $r="[a-zA-Z0-9 ()[\]\.\+:/=;,\t-]+";\r
+ # if (\r
+ # $DIFF_WHOLE =~ /1,7d0\n< Received:$r\n< \t$r\n< \t$r\n< Received:$r\n< \t$r\n< \t$r\n< \t$r\n\d+a\d+,\d+\n> Content-Length:$r\n> Lines:$r/\r
+ # ) {\r
+ # printf "deleting_3\nID:$ID\n$DIFF_WHOLE\n\n";\r
+ # return;\r
+ # }\r
+\r
+ unless (open DIFF_BODY, "-|", "bash", "-c", "$diff <(sed -e 1,/^\$/d \"\$1\" ) <(sed -e 1,/^\$/d \"\$2\" )", "", $F1, $F2) {\r
+ warn "Can not fork $diff (2): $!\n";\r
+ return;\r
+ }\r
+ $DIFF_BODY = join "", <DIFF_BODY>;\r
+ close DIFF_BODY;\r
+\r
+ if ( length($DIFF_BODY) == 0 ) {\r
+ # The bodies are the same - let's find which one has fewer\r
+ # Received: headers and delete that\r
+ unless (open F, $F1) \r
+ {\r
+ warn "Can't open F1 '$F1': $!";\r
+ return;\r
+ }\r
+ my $count1 = grep { /^Received: / } <F>;\r
+ close F;\r
+ unless (open F, $F2) \r
+ {\r
+ warn "Can't open F2 '$F2': $!";\r
+ return;\r
+ }\r
+ my $count2 = grep { /^Received: / } <F>;\r
+ close F;\r
+\r
+ if ($count1 > $count2) {\r
+ rm $F2, "deleting_4a\nID:$ID\n\n";\r
+ } else {\r
+ rm $F1, "deleting_4b\nID:$ID\n\n";\r
+ }\r
+ return;\r
+ }\r
+\r
+\r
+ for (@TO_IGNORE) {\r
+ next unless $DIFF_BODY =~ $_;\r
+ # Remove the first one as the second is adding lines\r
+ rm $F1, "deleting_5\nID:$ID\n\n";\r
+ return;\r
+ }\r
+\r
+ for (@TO_IGNORE_REVERSE) {\r
+ next unless $DIFF_BODY =~ $_;\r
+ # Remove the second as it is removing some lines\r
+ rm $F2, "deleting_6\nID:$ID\n\n";\r
+ return;\r
+ }\r
+\r
+ #--------------------------------------------------\r
+ # '2c2\r
+ # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)\r
+ # ---\r
+ # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)\r
+ # 39c39\r
+ # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)\r
+ # ---\r
+ # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)\r
+ # 55c55\r
+ # < --Boundary_(ID_DK6KMxNlhttcScVv/QSi8A)--\r
+ # ---\r
+ # > --Boundary_(ID_dlReFj9tgdNWy+1SUxwTeQ)--\r
+ #-------------------------------------------------- \r
+ $re = qr/(\d+)c\1\n< --Boundary_\(\S+\)(?:--)?\n---\n> --Boundary_\(\S+\)(?:--)?\n/;\r
+ if ( $DIFF_BODY =~ m/^(?:$re+)$/ ) {\r
+ # Change in boundary strings\r
+ rm $F2, "deleting_7\nID:$ID\n\n";\r
+ return;\r
+ }\r
+\r
+ print "DIFF_BODY (ID: $ID):\n'$DIFF_BODY'\n\n" if length $DIFF_BODY < 300;\r
+}\r
+\r
+$diff = 'diff';\r
+$diff = 'gdiff' if -x '/usr/bin/gdiff'; # Solaris\r
+\r
+# First create reverse regexps (removing lines from the mail) so that we don't\r
+# overwrite the original @TO_IGNORE\r
+@TO_IGNORE_REVERSE = map {\r
+ $x = $_; # Make sure we don't change the @TO_IGNORE array\r
+ $x =~ s/^>/</mg; # Make sure all the lines are removing text\r
+ qr/^(?:\d+,)?\d+d\d+\n\Q$x\E$/ # 1,2d3 or 2d3\r
+} @TO_IGNORE;\r
+\r
+# Now map the positive regexp (adding lines to the mail)\r
+@TO_IGNORE = map {\r
+ s/^</>/mg; # Make sure all the lines are adding text\r
+ qr/^\d+a\d+?(?:,\d+)?\n\Q$_\E$/ # 115a116,119 or 114a116\r
+} @TO_IGNORE;\r
+\r
+# File 'dups' is created via\r
+# notmuch search --output=files --duplicate=2 '*' > dups\r
+\r
+open(INPUT, "dups") or die "Can't open dups: $!\n";\r
+while (<INPUT>) {\r
+ chomp;\r
+ if (open FILE, $_) {\r
+ $id = List::Util::first { s/^message-id:.*<(.*)>\n$/$1/i } <FILE>;\r
+ close FILE;\r
+ check_mail_id $id if defined $id;\r
+ } else {\r
+ print "Can't find '$_'\n";\r
+ }\r
+}\r
+close INPUT;\r
+\r
+--mJm6k4Vb/yFcL9ZU--\r