1 Return-Path: <thomas@schwinge.name>
\r
2 X-Original-To: notmuch@notmuchmail.org
\r
3 Delivered-To: notmuch@notmuchmail.org
\r
4 Received: from localhost (localhost [127.0.0.1])
\r
5 by olra.theworths.org (Postfix) with ESMTP id 410EC429E26
\r
6 for <notmuch@notmuchmail.org>; Sat, 29 Oct 2011 03:40:28 -0700 (PDT)
\r
7 X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
\r
11 X-Spam-Status: No, score=2.847 tagged_above=-999 required=5
\r
12 tests=[PERCENT_RANDOM=2.837, RCVD_IN_DNSWL_NONE=-0.0001,
\r
13 T_LOTS_OF_MONEY=0.01] autolearn=disabled
\r
14 Received: from olra.theworths.org ([127.0.0.1])
\r
15 by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
\r
16 with ESMTP id B88PK75O3n3c for <notmuch@notmuchmail.org>;
\r
17 Sat, 29 Oct 2011 03:40:27 -0700 (PDT)
\r
18 Received: from smtprelay03.ispgateway.de (smtprelay03.ispgateway.de
\r
20 by olra.theworths.org (Postfix) with ESMTP id D9AD1431FB6
\r
21 for <notmuch@notmuchmail.org>; Sat, 29 Oct 2011 03:40:26 -0700 (PDT)
\r
22 Received: from [87.180.87.168] (helo=stokes.schwinge.homeip.net)
\r
23 by smtprelay03.ispgateway.de with esmtpa (Exim 4.68)
\r
24 (envelope-from <thomas@schwinge.name>) id 1RK6Ku-0007g6-6v
\r
25 for notmuch@notmuchmail.org; Sat, 29 Oct 2011 12:40:24 +0200
\r
26 Received: (qmail 28875 invoked from network); 29 Oct 2011 10:40:15 -0000
\r
27 Received: from kepler.schwinge.homeip.net (192.168.111.7)
\r
28 by stokes.schwinge.homeip.net with QMQP; 29 Oct 2011 10:40:15 -0000
\r
29 Received: (nullmailer pid 7240 invoked by uid 1000);
\r
30 Sat, 29 Oct 2011 10:40:15 -0000
\r
31 From: Thomas Schwinge <thomas@schwinge.name>
\r
32 To: notmuch@notmuchmail.org
\r
33 Subject: [PATCH] restore: Be more liberal in which data to accept.
\r
34 Date: Sat, 29 Oct 2011 12:40:07 +0200
\r
35 Message-Id: <1319884807-7206-1-git-send-email-thomas@schwinge.name>
\r
36 X-Mailer: git-send-email 1.7.6.3
\r
38 Content-Type: text/plain; charset=UTF-8
\r
39 Content-Transfer-Encoding: 8bit
\r
40 X-Df-Sender: dGhvbWFzQHNjaHdpbmdlLm5hbWU=
\r
41 X-BeenThere: notmuch@notmuchmail.org
\r
42 X-Mailman-Version: 2.1.13
\r
44 List-Id: "Use and development of the notmuch mail system."
\r
45 <notmuch.notmuchmail.org>
\r
46 List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
\r
47 <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
\r
48 List-Archive: <http://notmuchmail.org/pipermail/notmuch>
\r
49 List-Post: <mailto:notmuch@notmuchmail.org>
\r
50 List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
\r
51 List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
\r
52 <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
\r
53 X-List-Received-Date: Sat, 29 Oct 2011 10:40:28 -0000
\r
55 From: Thomas Schwinge <thomas@schwinge.name>
\r
57 There are ``Message-ID''s out in the wild that contain spaces.
\r
64 Carl, the main question for you is: does this break sup-import
\r
68 Spammers are quite inventive at creating ``interesting'' Message-IDs.
\r
69 Apparently, notmuch handles these fine internally, but it breaks a
\r
72 $ notmuch restore < ~/tmp/Mail-notmuch_dump/dump
\r
73 No filename given. Reading dump from stdin.
\r
74 Warning: Ignoring invalid input line: 3791856948.991306994491@m0.net Received:fromdialup-62.215.274.4.dial1.stamford([62.215.274.4] ([...])
\r
75 Warning: Ignoring invalid input line: PM200010:29:54 AM ([...])
\r
76 Warning: Ignoring invalid input line: PM200010:51:48 AM ([...])
\r
77 Warning: Ignoring invalid input line: PM200011:47:35 AM ([...])
\r
78 Warning: Ignoring invalid input line: PM200011:48:46 AM ([...])
\r
79 Warning: Ignoring invalid input line: PM200011:50:10 AM ([...])
\r
80 Warning: Ignoring invalid input line: PM200012:21:05 AM ([...])
\r
81 Warning: Ignoring invalid input line: PM200012:21:17 AM ([...])
\r
82 Warning: Ignoring invalid input line: PM200012:21:18 AM ([...])
\r
83 Warning: Ignoring invalid input line: PM200012:21:32 AM ([...])
\r
84 Warning: Ignoring invalid input line: PM20001:48:38 PM ([...])
\r
85 Warning: Ignoring invalid input line: PM20001:53:07 PM ([...])
\r
86 Warning: Ignoring invalid input line: PM20004:01:48 AM ([...])
\r
87 Warning: Ignoring invalid input line: PM20004:01:59 AM ([...])
\r
88 Warning: Ignoring invalid input line: PM20004:10:44 AM ([...])
\r
89 Warning: Ignoring invalid input line: PM20004:20:00 AM ([...])
\r
90 Warning: Ignoring invalid input line: PM20005:06:50 PM ([...])
\r
91 Warning: Ignoring invalid input line: PM20005:14:17 AM ([...])
\r
92 Warning: Ignoring invalid input line: PM20005:32:15 PM ([...])
\r
93 Warning: Ignoring invalid input line: PM20005:32:22 PM ([...])
\r
94 Warning: Ignoring invalid input line: PM20005:33:05 PM ([...])
\r
95 Warning: Ignoring invalid input line: PM20005:33:57 AM ([...])
\r
96 Warning: Ignoring invalid input line: PM20006:24:12 AM ([...])
\r
97 Warning: Ignoring invalid input line: PM20006:25:04 AM ([...])
\r
98 Warning: Ignoring invalid input line: PM20006:25:49 AM ([...])
\r
99 Warning: Ignoring invalid input line: PM20006:26:11 AM ([...])
\r
100 Warning: Ignoring invalid input line: PM20007:05:34 PM ([...])
\r
101 Warning: Ignoring invalid input line: PM2000PM 04:09:15 ([...])
\r
102 Warning: Ignoring invalid input line: PM2000¿ÀÀü 11:07:41 ([...])
\r
103 Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 12:42:47 ([...])
\r
104 Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 12:42:48 ([...])
\r
105 Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 5:58:28 ([...])
\r
106 Warning: Ignoring invalid input line: PM2000¿ÀÈÄ 6:30:51 ([...])
\r
107 Warning: Ignoring invalid input line: Prospect Mailer 20000:37:04 ([...])
\r
108 Warning: Ignoring invalid input line: Prospect Mailer 20000:37:09 ([...])
\r
109 Warning: Ignoring invalid input line: Prospect Mailer 20000:37:11 ([...])
\r
110 Warning: Ignoring invalid input line: Prospect Mailer 20000:37:12 ([...])
\r
111 Warning: Ignoring invalid input line: Prospect Mailer 20000:37:45 ([...])
\r
112 Warning: Ignoring invalid input line: Prospect Mailer 20000:38:10 ([...])
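To make the failure concrete, here is a minimal sketch (not the actual notmuch-restore.c code) that runs a single dump line through a POSIX extended regular expression and extracts the message-id. The two patterns are the ones from the diff below; the helper name match_dump_line, the macro names, and the tag list in the usage note are made up for illustration, since the real tag lists are elided above.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Pattern that only allows non-space characters in the message-id. */
#define OLD_PATTERN "^([^ ]+) \\(([^)]*)\\)$"
/* Pattern that allows any characters, including spaces, in the message-id. */
#define NEW_PATTERN "^(.+) \\(([^)]*)\\)$"

/* Hypothetical helper: match one dump line against pattern.  On success,
 * copy the captured message-id into id (NUL-terminated, truncated to
 * id_size - 1 bytes) and return 1.  Return 0 if the line does not match
 * or the pattern fails to compile. */
int
match_dump_line (const char *pattern, const char *line,
		 char *id, size_t id_size)
{
    regex_t regex;
    regmatch_t match[3];
    size_t len;
    int matched;

    assert (pattern != NULL && line != NULL && id != NULL && id_size > 0);

    if (regcomp (&regex, pattern, REG_EXTENDED) != 0)
	return 0;

    matched = (regexec (&regex, line, 3, match, 0) == 0);
    regfree (&regex);
    if (! matched)
	return 0;

    /* match[1] is the first parenthesized group, i.e. the message-id. */
    len = (size_t) (match[1].rm_eo - match[1].rm_so);
    if (len >= id_size)
	len = id_size - 1;
    memcpy (id, line + match[1].rm_so, len);
    id[len] = '\0';
    return 1;
}
```

With this, match_dump_line (OLD_PATTERN, "PM200010:29:54 AM (spam)", ...) returns 0, because the text after the first space is not an opening parenthesis. NEW_PATTERN matches the same line and captures "PM200010:29:54 AM" as the message-id.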
\r
114 Thus, dump; remove all tags; restore is not nullipotent, which it should
\r
117 Especially noteworthy is probably the first one: it happens to have
\r
118 gotten a Received line mangled into the Message-ID, and it ends with a
\r
121 Some more from the freak show:
\r
123 $MESSAGE_ID ([...])
\r
124 %CUSTOM_CHAR[8-10]$%CUSTOM_CHAR[8-10]$%CUSTOM_CHAR[8-10]@%CUSTOM_DOMAIN.msn.com ([...])
\r
125 %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110%RNDLCCHAR13@ ([...])
\r
126 %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110ucp@yahoo.com ([...])
\r
127 %RNDDIGIT1025.%RNDDIGIT15%RNDLCCHAR15%RNDDIGIT110vs@yahoo.com ([...])
\r
128 %RNDDIGIT27eq52md1$9rg57p%RNDDIGIT14$277ts40lsh@%RNDWORD13ivo4068 ([...])
\r
129 %RNDDIGIT27g10u874$3cqh62f%RNDDIGIT14$7fgo121wnwt@%RNDWORD13quw32712 ([...])
\r
130 %RNDDIGIT27mog75vx711$541xqm480xc%RNDDIGIT14$031nq1pk@%RNDWORD13av2979 ([...])
\r
131 %RNDDIGIT27nqf761drk7$7l4mza%RNDDIGIT14$96ijq17zq@%RNDWORD13b1779 ([...])
\r
132 %RNDDIGIT27q0tcg10$94pcn1mw%RNDDIGIT14$7x77pztx@%RNDWORD13ny7619 ([...])
\r
133 %RNDDIGIT27uiw866tv49$5c3rg%RNDDIGIT14$6jl43vv@%RNDWORD13uwh17820 ([...])
\r
134 %RNDDIGIT27x966lug3$0pr016r%RNDDIGIT14$8ye15k@%RNDWORD13qps90907 ([...])
\r
135 %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@ ([...])
\r
136 %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@bambi ([...])
\r
137 %RNDDIGIT310%RNDLCCHAR15%RNDDIGIT15%RNDLCCHAR15$%RNDDIGIT17%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13$%RNDDIGIT15%RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13@wheelchair ([...])
\r
138 %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-%RNDLCCHAR13%RNDDIGIT13. ([...])
\r
139 %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-hi3.yahoo.com ([...])
\r
140 %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@-xz24.yahoo.com ([...])
\r
141 %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@lutanist-%RNDLCCHAR13%RNDDIGIT13.msn.com ([...])
\r
142 %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@millipede-jfq402.yahoo.com ([...])
\r
143 %RNDDIGIT520.%RNDDIGIT110.%RNDDIGIT110@referenda-sgw04.yahoo.com ([...])
\r
144 %RNDDIGIT715.h8OheY%RNDDIGIT28@proffer5.o'brien%RNDDIGIT2yahoo.com ([...])
\r
145 %RNDDIGIT715.jt36NNBvbF%RNDDIGIT28@schematic5.myers%RNDDIGIT2yahoo.com ([...])
\r
146 %RNDDIGIT715.wz394MICrdY%RNDDIGIT28@agriculture6.city%RNDDIGIT2yahoo.com ([...])
\r
147 %RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13-%RNDDIGIT520-%RNDDIGIT1035@%RNDDIGIT13 ([...])
\r
148 %RNDLCCHAR13%RNDDIGIT13%RNDLCCHAR13%RNDDIGIT13-%RNDDIGIT520-%RNDDIGIT1035@pontiac%RNDDIGIT13 ([...])
\r
150 Someone needs to improve their scripting language abilities... But on
\r
153 $ notmuch search --output=files -- 'id:"$MESSAGE_ID"' | wc -l
\r
156 This goes along the lines of ``notmuch as a spam filter'': these are
\r
157 different spam messages, but due to notmuch's Message-ID-based keying,
\r
158 they are all coalesced into one. ;-)
\r
160 000010ff21d1$00005c94$000024ca@smtp.mail.gr^M ([...])
\r
161 0000247e7459$0000617b$000030b1@mx1.777.net.cn^M ([...])
\r
162 200107261918.PAA15837@unix.harrisondigital.com^M ([...])
\r
163 20050131113558.GB4396@dragonfly.hU^S@hU^S@ ([...])
\r
164 5614105.1027079773228.JavaMail.à^U±@à^U± ([...])
\r
165 6428921.1027079772968.JavaMail.à^U±@à^U± ([...])
\r
166 6864195.1027080005012.JavaMail.à^U±@à^U± ([...])
\r
168 Yes, these are really embedded carriage returns (^M; and whatever ^S and
\r
169 ^U are). These are handled fine. (Replaced in this text by their ^x
\r
172 1IO\225y@-00094R-XB@BSN-77-184-114.dsl.siol.net ([...])
\r
173 1IP\225o@-000C29-BR@shcn-4.unm.edu ([...])
\r
174 SAK.2002.05.10.kmfogibc@\212ù\222è ([...])
\r
175 SAK.2002.05.11.ckbbpbpe@\212ù\222è ([...])
\r
176 SAK.2002.05.11.qmgoaoai@\212ù\222è ([...])
\r
177 SAK.2002.05.12.cfolrrgc@\212ù\222è ([...])
\r
178 SAK.2002.05.12.chpbngla@\212ù\222è ([...])
\r
179 SAK.2002.05.12.cooajnlj@\212ù\222è ([...])
\r
180 SAK.2002.05.12.folfrldb@\212ù\222è ([...])
\r
181 SAK.2002.05.12.ncphnarn@\212ù\222è ([...])
\r
182 SAK.2002.05.12.tcjbjsoo@\212ù\222è ([...])
\r
184 Embedded non-ASCII characters \212, \222, \225. These are handled fine.
\r
185 (Replaced in this text by their octal \xxx representation.)
\r
188 Another approach would be to detect invalid Message-IDs (only allow valid
\r
189 ones as per the standard) at notmuch new time, and replace these with a
\r
190 generated Message-ID (as if it's missing completely). But I don't think
\r
191 we should generate a Message-ID unless we really need to.
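For comparison, a validity check of that kind could look roughly like the sketch below. It is a simplification under stated assumptions: it accepts only the id-left "@" id-right form, with both sides being dot-atom-text as in RFC 5322, it expects the ID without the surrounding angle brackets (as it appears in the dump), and it ignores domain literals and the obsolete syntax. The function names are hypothetical and nothing like this is part of this patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if c is an "atext" character as listed in RFC 5322. */
static int
is_atext (unsigned char c)
{
    if (c == '\0' || c >= 0x80)
	return 0;
    return isalnum (c) || strchr ("!#$%&'*+-/=?^_`{|}~", c) != NULL;
}

/* Return 1 if s[0..len) is a dot-atom-text: one or more non-empty
 * runs of atext characters, separated by single dots. */
static int
is_dot_atom_text (const char *s, size_t len)
{
    size_t i, run = 0;	/* run: length of the current atext run */

    for (i = 0; i < len; i++) {
	if (s[i] == '.') {
	    if (run == 0)
		return 0;	/* leading dot or two dots in a row */
	    run = 0;
	} else if (is_atext ((unsigned char) s[i])) {
	    run++;
	} else {
	    return 0;		/* space, control character, '@', 8-bit, ... */
	}
    }
    return run > 0;		/* rejects empty input and a trailing dot */
}

/* Hypothetical check: does id (without angle brackets) look like
 * dot-atom-text "@" dot-atom-text? */
int
msgid_looks_valid (const char *id)
{
    const char *at;

    assert (id != NULL);

    at = strchr (id, '@');
    if (at == NULL)
	return 0;
    return is_dot_atom_text (id, (size_t) (at - id)) &&
	   is_dot_atom_text (at + 1, strlen (at + 1));
}
```

Such a check would reject "PM200010:29:54 AM" and "$MESSAGE_ID" (no '@'), as well as the IDs with an embedded carriage return. Note that '%' and '$' are legal atext characters, so many of the template-style IDs above would still pass; this only illustrates the syntax check, not a spam heuristic.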
\r
200 notmuch-restore.c | 7 +++----
\r
201 1 files changed, 3 insertions(+), 4 deletions(-)
\r
203 diff --git a/notmuch-restore.c b/notmuch-restore.c
\r
204 index e4a5355..122c3e7 100644
\r
205 --- a/notmuch-restore.c
\r
206 +++ b/notmuch-restore.c
\r
207 @@ -56,12 +56,11 @@ notmuch_restore_command (unused (void *ctx), int argc, char *argv[])
\r
211 - /* Dump output is one line per message. We match a sequence of
\r
212 - * non-space characters for the message-id, then one or more
\r
213 - * spaces, then a list of space-separated tags as a sequence of
\r
214 + /* The input data is one line per message. First comes the message-id,
\r
215 + * then one space, then a list of space-separated tags as a sequence of
\r
216 * characters within literal '(' and ')'. */
\r
218 - "^([^ ]+) \\(([^)]*)\\)$",
\r
219 + "^(.+) \\(([^)]*)\\)$",
\r
222 while ((line_len = getline (&line, &line_size, input)) != -1) {
\r
224 tg: (3bafdfc..) t/restore_liberal_regex (depends on: baseline)
\r