Re: [PATCH 9/9] add has: query prefix to search for specific properties
1 Return-Path: <mpn@google.com>\r
2 X-Original-To: notmuch@notmuchmail.org\r
3 Delivered-To: notmuch@notmuchmail.org\r
4 Received: from localhost (localhost [127.0.0.1])\r
5         by olra.theworths.org (Postfix) with ESMTP id 68D93431FB6\r
6         for <notmuch@notmuchmail.org>; Tue,  4 Sep 2012 12:44:04 -0700 (PDT)\r
7 X-Virus-Scanned: Debian amavisd-new at olra.theworths.org\r
8 X-Spam-Flag: NO\r
9 X-Spam-Score: -0.7\r
10 X-Spam-Level: \r
11 X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5\r
12         tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_LOW=-0.7]\r
13         autolearn=disabled\r
14 Received: from olra.theworths.org ([127.0.0.1])\r
15         by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)\r
16         with ESMTP id 2ysK6ajrBbF0 for <notmuch@notmuchmail.org>;\r
17         Tue,  4 Sep 2012 12:44:03 -0700 (PDT)\r
18 Received: from mail-ee0-f53.google.com (mail-ee0-f53.google.com\r
19  [74.125.83.53])        (using TLSv1 with cipher RC4-SHA (128/128 bits))        (No client\r
20  certificate requested) by olra.theworths.org (Postfix) with ESMTPS id\r
21  CD1AA431FAF    for <notmuch@notmuchmail.org>; Tue,  4 Sep 2012 12:44:02 -0700\r
22  (PDT)\r
23 Received: by eekb47 with SMTP id b47so2969538eek.26\r
24         for <notmuch@notmuchmail.org>; Tue, 04 Sep 2012 12:44:01 -0700 (PDT)\r
25 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;\r
26  s=20120113;    h=sender:from:to:subject:in-reply-to:organization:references\r
27         :user-agent:x-face:face:x-pgp:x-pgp-fp:date:message-id:mime-version\r
28         :content-type; bh=k3cSusr4oXo9pUf0EKe6avU/BtoWgLojYDKKQ1dCp/E=;\r
29         b=FbO3vYN4Wzgm/NDjOLlRLs3RwqNcWLVu/dHcnt43tBXCyk1oud8BrIRi3z41Yq9bYX\r
30         ee5/Ekq9tybfaLtzMOt6snW9H7qI+WEmK7PMOFuA3IQdt0REsb+bjNN5SxAxbvo46yec\r
31         M3rWauDcweoV9gB7WrvU+ElKHLpIsflfY408+LY/838DEcp2pIwAquL818pxuAs2RB7g\r
32         CE021F2BJ5AdkKKwZJICAr0ViNSl8l8N+5hTm5hT7iFsSx0Eu5qj05XOiycy3h4I75tx\r
33         adiFLOf12io527H0ZAN0nyyjXMW4OuE6fY0JSg15VGnlC5BIIQGRFHgt/EwRi0xnnlfu    93jA==\r
34 X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;\r
35         d=google.com; s=20120113;\r
36         h=sender:from:to:subject:in-reply-to:organization:references\r
37         :user-agent:x-face:face:x-pgp:x-pgp-fp:date:message-id:mime-version\r
38         :content-type:x-gm-message-state;\r
39         bh=k3cSusr4oXo9pUf0EKe6avU/BtoWgLojYDKKQ1dCp/E=;\r
40         b=QR50DENAyjo5YeT44qbzJR9oQkOfOQJlLoXLL8pciUICJ6lgUEaF0kq+Gv4CmmOPC0\r
41         +cgH8N/zjS5gas29+UiqxGW5YHnY/JFiTBNw9HK+tduO/dlKZihJgiwTwTq6NH0sZ5Lg\r
42         4PyWK0d21OTQ3IoTZp6Ckm0hYewPydhv9GSukrgy6qbD2YDcIfLIRXrXqXShME17OR+M\r
43         Izlff+IYp9BFPzu+tK1rpq+dyRkIz0dybTkRwZRxi+X0YKSw7b9wvnLnWlDSa6gUQxg4\r
44         gsDpmIebmWl7of+tYE1HlqZSdlzDeYTyUfJXXZoZ4o29R9tpElh3Te4KP+gYtZ2FvLsp\r
45         zORg==\r
46 Received: by 10.14.172.193 with SMTP id t41mr27637811eel.25.1346787841727;\r
47         Tue, 04 Sep 2012 12:44:01 -0700 (PDT)\r
48 Received: by 10.14.172.193 with SMTP id t41mr27637799eel.25.1346787841546;\r
49         Tue, 04 Sep 2012 12:44:01 -0700 (PDT)\r
50 Received: from mpn-glaptop ([2620:0:105f:5:f2de:f1ff:fe35:1a72])\r
51         by mx.google.com with ESMTPS id v3sm47922341eep.10.2012.09.04.12.43.59\r
52         (version=TLSv1/SSLv3 cipher=OTHER);\r
53         Tue, 04 Sep 2012 12:44:00 -0700 (PDT)\r
54 Sender: Michal Nazarewicz <mpn@google.com>\r
55 From: Michal Nazarewicz <mina86@mina86.com>\r
56 To: Dmitry Kurochkin <dmitry.kurochkin@gmail.com>, notmuch@notmuchmail.org\r
57 Subject: Re: [PATCH] Add notmuch-remove-duplicates.py script to contrib.\r
58 In-Reply-To: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>\r
59 Organization: http://mina86.com/\r
60 References: <1346784785-19746-1-git-send-email-dmitry.kurochkin@gmail.com>\r
61 User-Agent: Notmuch/0.14+2~g416b120 (http://notmuchmail.org) Emacs/24.2.50.1\r
62         (x86_64-unknown-linux-gnu)\r
63 X-Face: PbkBB1w#)bOqd`iCe"Ds{e+!C7`pkC9a|f)Qo^BMQvy\q5x3?vDQJeN(DS?|-^$uMti[3D*#^_Ts"pU$jBQLq~Ud6iNwAw_r_o_4]|JO?]}P_}Nc&"p#D(ZgUb4uCNPe7~a[DbPG0T~!&c.y$Ur,=N4RT>]dNpd;        KFrfMCylc}gc??'U2j,!8%xdD\r
64 Face: iVBORw0KGgoAAAANSUhEUgAAADAAAAAwBAMAAAClLOS0AAAAJFBMVEWbfGlUPDDHgE57V0jUupKjgIObY0PLrom9mH4dFRK4gmjPs41MxjOgAAACQElEQVQ4jW3TMWvbQBQHcBk1xE6WyALX1069oZBMlq+ouUwpEQQ6uRjttkWP4CmBgGM0BQLBdPFZYPsyFUo6uEtKDQ7oy/U96XR2Ux8ehH/89Z6enqxBcS7Lg81jmSuujrfCZcLI/TYYvbGj+jbgFpHJ/bqQAUISj8iLyu4LuFHJTosxsucO4jSDNE0Hq3hwK/ceQ5sx97b8LcUDsILfk+ovHkOIsMbBfg43VuQ5Ln9YAGCkUdKJoXR9EclFBhixy3EGVz1K6eEkhxCAkeMMnqoAhAKwhoUJkDrCqvbecaYINlFKSRS1i12VKH1XpUd4qxL876EkMcDvHj3s5RBajHHMlA5iK32e0C7VgG0RlzFPvoYHZLRmAC0BmNcBruhkE0KsMsbEc62ZwUJDxWUdMsMhVqovoT96i/DnX/ASvz/6hbCabELLk/6FF/8PNpPCGqcZTGFcBhhAaZZDbQPaAB3+KrWWy2XgbYDNIinkdWAFcCpraDE/knwe5DBqGmgzESl1p2E4MWAz0VUPgYYzmfWb9yS4vCvgsxJriNTHoIBz5YteBvg+VGISQWUqhMiByPIPpygeDBE6elD973xWwKkEiHZAHKjhuPsFnBuArrzxtakRcISv+XMIPl4aGBUJm8Emk7qBYU8IlgNEIpiJhk/No24jHwkKTFHDWfPniR4iw5vJaw2nzSjfq2zffcE/GDjRC2dn0J0XwPAbDL84TvaFCJEU4Oml9pRyEUhR3Cl2t01AoEjRbs0sYugp14/4X5n4pU4EHHnMAAAAAElFTkSuQmCC\r
65 X-PGP: 50751FF4\r
66 X-PGP-FP: AC1F 5F5C D418 88F8 CC84 5858 2060 4012 5075 1FF4\r
67 Date: Tue, 04 Sep 2012 21:43:53 +0200\r
68 Message-ID: <xa1tligpk1za.fsf@mina86.com>\r
69 MIME-Version: 1.0\r
70 Content-Type: multipart/mixed; boundary="=-=-="\r
71 X-Gm-Message-State: ALoCoQlVnkhgHJ0LZZPUpTWbE6G/J1gRct7k8MTReEQQx8SMsxxf8DVkw91sbuO04tPVb/dEgT2RYYt7cU2/CgwxISsOmNdZstU1qExzkrAiFdK3foWxXFdhYyYc4OdJoXmZ5744jTqBhqYU1dc2cBIRxLmdcu2ahUFQuIgKopDTwfAyfGsERq0GsHCaR0q8ha1GlZR6/kzGFvNu2JnzkVtnvHHgz2oWlA==\r
72 X-BeenThere: notmuch@notmuchmail.org\r
73 X-Mailman-Version: 2.1.13\r
74 Precedence: list\r
75 List-Id: "Use and development of the notmuch mail system."\r
76         <notmuch.notmuchmail.org>\r
77 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
78         <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
79 List-Archive: <http://notmuchmail.org/pipermail/notmuch>\r
80 List-Post: <mailto:notmuch@notmuchmail.org>\r
81 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
82 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
83         <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
84 X-List-Received-Date: Tue, 04 Sep 2012 19:44:04 -0000\r
85 \r
86 --=-=-=\r
87 Content-Type: text/plain; charset=utf-8\r
88 Content-Transfer-Encoding: quoted-printable\r
89 \r
90 On Tue, Sep 04 2012, Dmitry Kurochkin wrote:\r
91 > The script removes duplicate message files.  It takes no options.\r
92 >\r
93 > Files are assumed duplicates if their content is the same except for\r
94 > ignored headers.  Currently, the only ignored header is Received:.\r
95 > ---\r
96 >  contrib/notmuch-remove-duplicates.py |   95 ++++++++++++++++++++++++++++=\r
97 ++++++\r
98 >  1 file changed, 95 insertions(+)\r
99 >  create mode 100755 contrib/notmuch-remove-duplicates.py\r
100 >\r
101 > diff --git a/contrib/notmuch-remove-duplicates.py b/contrib/notmuch-remov=\r
102 e-duplicates.py\r
103 > new file mode 100755\r
104 > index 0000000..dbe2e25\r
105 > --- /dev/null\r
106 > +++ b/contrib/notmuch-remove-duplicates.py\r
107 > @@ -0,0 +1,95 @@\r
108 > +#!/usr/bin/env python\r
109 > +\r
110 > +import sys\r
111 > +\r
112 > +IGNORED_HEADERS =3D [ "Received:" ]\r
113 > +\r
114 > +if len(sys.argv) !=3D 1:\r
115 > +    print "Usage: %s" % sys.argv[0]\r
116 > +    print\r
117 > +    print "The script removes duplicate message files.  Takes no options=\r
118 ."\r
119 > +    print "Requires notmuch python module."\r
120 > +    print\r
121 > +    print "Files are assumed duplicates if their content is the same"\r
122 > +    print "except for the following headers: %s." % ", ".join(IGNORED_HE=\r
123 ADERS)\r
124 > +    exit(1)\r
125 \r
126 It's much better to put this inside a main() function, which is then called\r
127 only if the script is run directly.\r
128 \r
129 > +\r
130 > +import notmuch\r
131 > +import os\r
132 > +import time\r
133 > +\r
134 > +class MailComparator:\r
135 > +    """Checks if mail files are duplicates."""\r
136 > +    def __init__(self, filename):\r
137 > +        self.filename =3D filename\r
138 > +        self.mail =3D self.readFile(self.filename)\r
139 > +\r
140 > +    def isDuplicate(self, filename):\r
141 > +        return self.mail =3D=3D self.readFile(filename)\r
142 > +\r
143 > +    @staticmethod\r
144 > +    def readFile(filename):\r
145 > +        with open(filename) as f:\r
146 > +            data =3D ""\r
147 > +            while True:\r
148 > +                line =3D f.readline()\r
149 > +                for header in IGNORED_HEADERS:\r
150 > +                    if line.startswith(header):\r
151 \r
152 Header names are case-insensitive, but this comparison is case-sensitive.\r
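
A minimal sketch of a case-insensitive check, not part of the patch. The helper name is_ignored_header is hypothetical, and the header names are listed without the trailing colon, which is an assumption that differs from the quoted script:

```python
IGNORED_HEADERS = ["Received"]  # names without the trailing colon (assumption)


def is_ignored_header(line):
    """Return True if `line` starts an ignored header, regardless of case."""
    # Split at the first colon; everything before it is the field name.
    name, sep, _ = line.partition(":")
    ignored = [h.lower() for h in IGNORED_HEADERS]
    # rstrip() tolerates "Received :" but leaves leading whitespace alone,
    # so continuation lines never match.
    return bool(sep) and name.rstrip().lower() in ignored


print(is_ignored_header("Received: from localhost"))
print(is_ignored_header("RECEIVED: from localhost"))
print(is_ignored_header("Subject: Received: hello"))
```

Only the first two lines count as ignored headers; the third is a Subject header that merely mentions the word.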
153 \r
154 > +                        # skip header continuation lines\r
155 > +                        while True:\r
156 > +                            line =3D f.readline()\r
157 > +                            if len(line) =3D=3D 0 or line[0] not in [" "=\r
158 , "\t"]:\r
159 > +                                break\r
160 > +                        break\r
161 \r
162 This will also drop the line just after the ignored header.\r
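
To make the problem concrete: the inner loop reads ahead until it finds a line that is not a continuation, and the break then discards that line instead of examining it. A rough sketch of one way to avoid the read-ahead, tracking only whether the current header is being skipped (strip_ignored_headers is a hypothetical name, and the sample message is made up for illustration):

```python
import io
import re

IGNORED_HEADERS = ["Received"]

# Matches "<name>:" at the start of a line, in any case.
_ignored = re.compile(
    r"^(?:%s)[ \t]*:" % "|".join(map(re.escape, IGNORED_HEADERS)),
    re.IGNORECASE)


def strip_ignored_headers(fd):
    """Return the message text with ignored headers (and their
    continuation lines) removed; the body is copied untouched."""
    kept = []
    skipping = False
    # readline() avoids mixing file iteration with the read() below.
    for line in iter(fd.readline, ""):
        if line in ("\n", "\r\n"):
            kept.append(line)      # blank line: end of the header block
            break
        if line[0] not in " \t":
            # A new header starts here; decide once, and let its
            # continuation lines inherit the decision.
            skipping = bool(_ignored.match(line))
        if not skipping:
            kept.append(line)
    kept.append(fd.read())         # the rest is the body
    return "".join(kept)


sample = (
    "Received: from a\n"
    "\tby b\n"
    "Subject: hello\n"
    "received: from c\n"
    "To: someone\n"
    "\n"
    "Received: this is body text\n"
)
print(strip_ignored_headers(io.StringIO(sample)))
```

Here the Subject line, which directly follows an ignored header, is kept, and the body line that happens to start with "Received:" is left alone.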
163 \r
164 > +                else:\r
165 > +                    data +=3D line\r
166 > +                    if line =3D=3D "\n":\r
167 > +                        break\r
168 > +            data +=3D f.read()\r
169 > +            return data\r
170 > +\r
171 > +db =3D notmuch.Database()\r
172 > +query =3D db.create_query('*')\r
173 > +print "Number of messages: %s" % query.count_messages()\r
174 > +\r
175 > +files_count =3D 0\r
176 > +for root, dirs, files in os.walk(db.get_path()):\r
177 > +    if not root.startswith(os.path.join(db.get_path(), ".notmuch/")):\r
178 > +        files_count +=3D len(files)\r
179 > +print "Number of files: %s" % files_count\r
180 > +print "Estimated number of duplicates: %s" % (files_count - query.count_=\r
181 messages())\r
182 > +\r
183 > +msgs =3D query.search_messages()\r
184 > +msg_count =3D 0\r
185 > +suspected_duplicates_count =3D 0\r
186 > +duplicates_count =3D 0\r
187 > +timestamp =3D time.time()\r
188 > +for msg in msgs:\r
189 > +    msg_count +=3D 1\r
190 > +    if len(msg.get_filenames()) > 1:\r
191 > +        filenames =3D msg.get_filenames()\r
192 > +        comparator =3D MailComparator(filenames.next())\r
193 > +        for filename in filenames:\r
194 \r
195 Strictly speaking, you need to compare each file to each file, and not\r
196 just every file to the first file.\r
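
For illustration only, a small sketch of why comparing against the first file is not enough. Grouping by content compares every file with every other one implicitly. The paths and contents below are made up, and group_by_content is a hypothetical helper:

```python
import collections


def group_by_content(named_contents):
    """Group names whose contents are equal.

    `named_contents` is an iterable of (name, content) pairs; the result
    is a list of name lists, one list per distinct content.
    """
    groups = collections.OrderedDict()
    for name, content in named_contents:
        groups.setdefault(content, []).append(name)
    return list(groups.values())


# cur/1 == new/4 and cur/2 == new/3.  Comparing everything to cur/1 only
# would find the first pair and miss the second.
files = [("cur/1", "A"), ("cur/2", "B"), ("new/3", "B"), ("new/4", "A")]
for group in group_by_content(files):
    keep, extra = group[0], group[1:]
    print("keep %s, duplicates: %s" % (keep, extra))
```

With real mail files the dictionary key would be the file content with ignored headers stripped (or a hash of it, to save memory).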
197 \r
198 > +            if os.path.realpath(comparator.filename) =3D=3D os.path.real=\r
199 path(filename):\r
200 > +                print "Message '%s' has filenames pointing to the\r
201 > same file: '%s' '%s'" % (msg.get_message_id(), comparator.filename,\r
202 > filename)\r
203 \r
204 So why aren't those removed?\r
205 \r
206 > +            elif comparator.isDuplicate(filename):\r
207 > +                os.remove(filename)\r
208 > +                duplicates_count +=3D 1\r
209 > +            else:\r
210 > +                #print "Potential duplicates: %s" % msg.get_message_id()\r
211 > +                suspected_duplicates_count +=3D 1\r
212 > +\r
213 > +    new_timestamp =3D time.time()\r
214 > +    if new_timestamp - timestamp > 1:\r
215 > +        timestamp =3D new_timestamp\r
216 > +        sys.stdout.write("\rProcessed %s messages, removed %s duplicates=\r
217 ..." % (msg_count, duplicates_count))\r
218 > +        sys.stdout.flush()\r
219 > +\r
220 > +print "\rFinished. Processed %s messages, removed %s duplicates." % (msg=\r
221 _count, duplicates_count)\r
222 > +if duplicates_count > 0:\r
223 > +    print "You might want to run 'notmuch new' now."\r
224 > +\r
225 > +if suspected_duplicates_count > 0:\r
226 > +    print\r
227 > +    print "Found %s messages with duplicate IDs but different content." =\r
228 % suspected_duplicates_count\r
229 > +    print "Perhaps we should ignore more headers."\r
230 \r
231 Please consider the following instead (not tested):\r
232 \r
233 \r
234 #!/usr/bin/env python\r
235 \r
236 import collections\r
237 import notmuch\r
238 import os\r
239 import re\r
240 import sys\r
241 import time\r
242 \r
243 \r
244 IGNORED_HEADERS =3D [ 'Received' ]\r
245 \r
246 \r
247 isIgnoredHeadersLine =3D re.compile(\r
248     r'^(?:%s)\s*:' % '|'.join(IGNORED_HEADERS),\r
249     re.IGNORECASE).search\r
250 \r
251 doesStartWithWS =3D re.compile(r'^[ \t]').search\r
252 \r
253 \r
254 def usage(argv0):\r
255     print """Usage: %s [<query-string>]\r
256 \r
257 The script removes duplicate message files.  Takes an optional query string.\r
258 Requires notmuch python module.\r
259 \r
260 Files are assumed duplicates if their content is the same\r
261 except for the following headers: %s.""" % (argv0, ', '.join(IGNORED_HEADER=\r
262 S))\r
263 \r
264 \r
265 def readMailFile(filename):\r
266     with open(filename) as fd:\r
267         data =3D []\r
268         skip_header =3D False\r
269         for line in iter(fd.readline, ''):\r
270             if line =3D=3D '\n':\r
271                 # a blank line ends the headers\r
272                 data.append(line)\r
273                 break\r
274             if not doesStartWithWS(line):\r
275                 # new header; continuation lines inherit this decision\r
276                 skip_header =3D bool(isIgnoredHeadersLine(line))\r
277             if not skip_header:\r
278                 data.append(line)\r
279         data.append(fd.read())\r
280         return ''.join(data)\r
281 \r
282 \r
283 def dedupMessage(msg):\r
284     filenames =3D list(msg.get_filenames())\r
285     if len(filenames) <=3D 1:\r
286         return (0, 0)\r
287 \r
288     realpaths =3D collections.defaultdict(list)\r
289     contents =3D collections.defaultdict(list)\r
290     for filename in filenames:\r
291         real =3D os.path.realpath(filename)\r
292         lst =3D realpaths[real]\r
293         lst.append(filename)\r
294         if len(lst) =3D=3D 1:\r
295             contents[readMailFile(real)].append(real)\r
296 \r
297     duplicates =3D 0\r
298 \r
299     for filenames in contents.itervalues():\r
300         if len(filenames) > 1:\r
301             print 'Files with the same content:'\r
302             print ' ', filenames.pop()\r
303             duplicates +=3D len(filenames)\r
304             for filename in filenames:\r
305                 del realpaths[filename]\r
306             #     os.remove(filename)\r
307 \r
308     for real, filenames in realpaths.iteritems():\r
309         if len(filenames) > 1:\r
310             print 'Files pointing to the same message:'\r
311             print ' ', filenames.pop()\r
312             duplicates +=3D len(filenames)\r
313             # for filename in filenames:\r
314             #     os.remove(filename)\r
315 \r
316     return (duplicates, len(realpaths) - 1)\r
317 \r
318 \r
319 def dedupQuery(query):\r
320     print 'Number of messages: %s' % query.count_messages()\r
321     msg_count =3D 0\r
322     suspected_count =3D 0\r
323     duplicates_count =3D 0\r
324     timestamp =3D time.time()\r
325     msgs =3D query.search_messages()\r
326     for msg in msgs:\r
327         msg_count +=3D 1\r
328         d, s =3D dedupMessage(msg)\r
329         duplicates_count +=3D d\r
330         suspected_count +=3D s\r
331 \r
332         new_timestamp =3D time.time()\r
333         if new_timestamp - timestamp > 1:\r
334             timestamp =3D new_timestamp\r
335             sys.stdout.write('\rProcessed %s messages, removed %s duplicate=\r
336 s...'\r
337                              % (msg_count, duplicates_count))\r
338             sys.stdout.flush()\r
339 \r
340     print '\rFinished. Processed %s messages, removed %s duplicates.' % (\r
341         msg_count, duplicates_count)\r
342     if duplicates_count > 0:\r
343         print 'You might want to run "notmuch new" now.'\r
344 \r
345     if suspected_count > 0:\r
346         print """\r
347 Found %d messages with duplicate IDs but different content.\r
348 Perhaps we should ignore more headers.""" % suspected_count\r
349 \r
350 \r
351 def main(argv):\r
352     if len(argv) =3D=3D 1:\r
353         query =3D '*'\r
354     elif len(argv) =3D=3D 2:\r
355         query =3D argv[1]\r
356     else:\r
357         usage(argv[0])\r
358         return 1\r
359 \r
360     db =3D notmuch.Database()\r
361     query =3D db.create_query(query)\r
362     dedupQuery(query)\r
363     return 0\r
364 \r
365 \r
366 if __name__ =3D=3D '__main__':\r
367     sys.exit(main(sys.argv))\r
368 \r
369 \r
370 \r
371 --=20\r
372 Best regards,                                         _     _\r
373 .o. | Liege of Serenely Enlightened Majesty of      o' \,=3D./ `o\r
374 ..o | Computer Science,  Micha=C5=82 =E2=80=9Cmina86=E2=80=9D Nazarewicz   =\r
375  (o o)\r
376 ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--\r
377 --=-=-=\r
378 Content-Type: multipart/signed; boundary="==-=-=";\r
379         micalg=pgp-sha1; protocol="application/pgp-signature"\r
380 \r
381 --==-=-=\r
382 Content-Type: text/plain\r
383 \r
384 \r
385 --==-=-=\r
386 Content-Type: application/pgp-signature\r
387 \r
388 -----BEGIN PGP SIGNATURE-----\r
389 Version: GnuPG v1.4.10 (GNU/Linux)\r
390 \r
391 iQIcBAEBAgAGBQJQRln5AAoJECBgQBJQdR/0EQQP/17AJmk0zYfSsuIz4I7H3Ykm\r
392 9YMt/yK1hn/u6yDtylSbnjTgWT0t6OWbOplIzmW9q6iqwhMcdqP20HXMEkqSbNLa\r
393 7WJ+4VXKRLb6PC3bQmM8eqYNtflgEAEWyAuZSQv9f893e6vH/e+7yoFtxaUypcUW\r
394 Wf8qm3T1ljle+2S1xGbteoVDUSHY5epesXlWR6hVA9Qclc/4xpVLNapx3EKRkxBh\r
395 vOpe+u5ATa04DYvIOoGVl723PBIHpm25cGen5lc8vOjXKwqhG0G7di5E29BhAyVT\r
396 yZorKrfsBRTTIlYEErakrzGhiMP3zRnCQmFWvIj/ASbiOUnX8ktFMjfqe+DNW3zq\r
397 T/2jpdzhBdVyioLhBIsMGLdsW6yIk3LURcw4uTijEG2ITj9kdQspydGFahTJk1Ly\r
398 cIls19AMCK7xfGBt8o3xYMX6v/bOxpz/Hot0e+SdHQtiByIUKfJMF7gMo6YyxRfh\r
399 cq1mgoLm+L4/zdrf4IMZDUpoMM8q4yr3eJibINlLxAmRnD3CpnVE2wf6mExLnYxy\r
400 PTIQ9p3pRHsxbRuHvYylJfNNlGpjsRFSgKeRF50iFY+TnzUh+40Tp18BTbL9Dd7R\r
401 UGuIMScxZ6qKt5MQhfBw1F+JpaaIsLTMSh1MjdzCvNVVRvkx7MQuUxiEcOj7wyEf\r
402 0zip9GVi2hAsumsgL0Wz\r
403 =mXjE\r
404 -----END PGP SIGNATURE-----\r
405 --==-=-=--\r
406 \r
407 --=-=-=--\r