Return-Path: <bremner@unb.ca>
X-Original-To: notmuch@notmuchmail.org
Delivered-To: notmuch@notmuchmail.org
Received: from localhost (localhost [127.0.0.1])
    by olra.theworths.org (Postfix) with ESMTP id 46F99431FB6
    for <notmuch@notmuchmail.org>; Wed, 20 Feb 2013 17:29:44 -0800 (PST)
X-Virus-Scanned: Debian amavisd-new at olra.theworths.org
X-Spam-Status: No, score=0 tagged_above=-999 required=5 tests=[none]
Received: from olra.theworths.org ([127.0.0.1])
    by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024)
    with ESMTP id eid1EeS5z6Da for <notmuch@notmuchmail.org>;
    Wed, 20 Feb 2013 17:29:43 -0800 (PST)
Received: from tesseract.cs.unb.ca (tesseract.cs.unb.ca [131.202.240.238])
    (using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits))
    (No client certificate requested)
    by olra.theworths.org (Postfix) with ESMTPS id 346DE431FAE
    for <notmuch@notmuchmail.org>; Wed, 20 Feb 2013 17:29:43 -0800 (PST)
Received: from fctnnbsc30w-156034082078.dhcp-dynamic.fibreop.nb.bellaliant.net
    ([156.34.82.78] helo=zancas.localnet)
    by tesseract.cs.unb.ca with esmtpsa
    (TLS1.2:DHE_RSA_AES_128_CBC_SHA1:128) (Exim 4.80)
    (envelope-from <bremner@unb.ca>)
    id 1U8Kyg-0000v5-74; Wed, 20 Feb 2013 21:29:38 -0400
Received: from bremner by zancas.localnet with local (Exim 4.80)
    (envelope-from <bremner@unb.ca>)
    id 1U8KyZ-0007UO-1d; Wed, 20 Feb 2013 21:29:31 -0400
From: David Bremner <david@tethera.net>
To: notmuch mailing list <notmuch@notmuchmail.org>
Subject: Re: On disk tag storage format
In-Reply-To: <874nk8v9zw.fsf@zancas.localnet>
References: <874nk8v9zw.fsf@zancas.localnet>
User-Agent: Notmuch/0.15.2+32~g16aa65b (http://notmuchmail.org) Emacs/24.2.1
    (x86_64-pc-linux-gnu)
Date: Wed, 20 Feb 2013 21:29:30 -0400
Message-ID: <87vc9mtpxh.fsf@zancas.localnet>
Content-Type: multipart/mixed; boundary="=-=-="
X-BeenThere: notmuch@notmuchmail.org
X-Mailman-Version: 2.1.13
List-Id: "Use and development of the notmuch mail system."
    <notmuch.notmuchmail.org>
List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
    <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
List-Archive: <http://notmuchmail.org/pipermail/notmuch>
List-Post: <mailto:notmuch@notmuchmail.org>
List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
    <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
X-List-Received-Date: Thu, 21 Feb 2013 01:29:44 -0000

Content-Type: text/plain

David Bremner <david@tethera.net> writes:

> Austin outlined on IRC a way of representing tags on disk as hardlinks
> to messages. In order to make the discussion more concrete, I wrote a
> prototype in python to dump the notmuch database to this format. On my
> 250k messages, this creates 40k new hardlinks, and uses about 5M of
> diskspace. The dump process takes about 20s on
> my core i7 machine. With symbolic links, the same database takes about
> 150M of disk space; this isn't great but it isn't unbearable either.

I've been playing a bit with this script and it seems more or less
usable as a way of mirroring the notmuch tag database to a link farm.
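
To make the link-farm layout concrete, here is a minimal, self-contained sketch. It assumes the layout the attached script appears to use, `<tagroot>/<encoded tag>/cur/<encoded message id>`, and reuses the script's safe-character set and percent-encoding. The helpers `link_message`, `tags_on_disk` and `demo` are hypothetical names introduced for illustration. The sketch uses plain files and symlinks instead of a notmuch database and assumes a POSIX filesystem.

```python
import os
import re
import tempfile

# Same safe-character set as the attached script; '-' is last so it stays literal.
CHARSET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+_@=.,-'
ENCODE_RE = re.compile('([^{0}])'.format(CHARSET))
DECODE_RE = re.compile('%([0-7][0-9A-Fa-f])')


def encode_for_fs(text):
    """Percent-encode every character outside CHARSET."""
    return ENCODE_RE.sub(lambda m: '%{:02x}'.format(ord(m.group(1))), text)


def decode_from_fs(text):
    """Undo encode_for_fs for ASCII input."""
    return DECODE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


def link_message(tagroot, tag, message_id, source_path):
    """Create <tagroot>/<tag>/cur/<message id> as a symlink to source_path."""
    curdir = os.path.join(tagroot, encode_for_fs(tag), 'cur')
    os.makedirs(curdir, exist_ok=True)
    link = os.path.join(curdir, encode_for_fs(message_id))
    os.symlink(source_path, link)
    return link


def tags_on_disk(tagroot):
    """Rebuild {message id: set of tags} by walking the link farm."""
    found = {}
    for root, _dirs, files in os.walk(tagroot):
        for name in files:
            # root is <tagroot>/<tag>/cur, so the tag is the parent directory.
            tag = os.path.basename(os.path.dirname(root))
            msg_id = decode_from_fs(name.split(':')[0])
            found.setdefault(msg_id, set()).add(decode_from_fs(tag))
    return found


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        mail = os.path.join(tmp, 'mail.eml')
        with open(mail, 'w') as f:
            f.write('Subject: test\n\nbody\n')
        tagroot = os.path.join(tmp, 'tags')
        link_message(tagroot, 'todo', 'id1@example.org', mail)
        link_message(tagroot, 'lists/notmuch', 'id1@example.org', mail)
        return tags_on_disk(tagroot)


print(demo())
```

A tag such as `lists/notmuch` becomes the directory name `lists%2fnotmuch`, so each tag stays a single path component and decodes back to the original on the way in.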

It's a bit faster than my current dump/restore based approach, although
if you want to keep the results in a git repository then it takes up
more space. Of course the bonus with this approach is that it creates
"virtual" maildirs for each tag that can be browsed with the maildir

The current default is to use some mix of hard and symbolic links to try
to balance the space consumed in a git repo versus the inode
consumption/performance issues of using too many symlinks.
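
The hard/symbolic mix can be sketched in isolation. This follows the rule visible in the attached script: with `--link-style adaptive`, a message file larger than `--threshold` bytes (default 50000) gets a symbolic link, and anything smaller gets a hard link. The function name `make_link` and the `demo` helper are hypothetical, and the sketch assumes a filesystem that supports both link types.

```python
import os
import tempfile


def make_link(source, dest, link_style='adaptive', threshold=50000):
    """Link dest to source and report which kind of link was created."""
    if link_style == 'adaptive':
        # Large files are symlinked, small ones hard-linked.
        use_symlink = os.stat(source).st_size > threshold
    else:
        use_symlink = (link_style == 'symbolic')

    if use_symlink:
        os.symlink(source, dest)
        return 'symbolic'
    os.link(source, dest)
    return 'hard'


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        small = os.path.join(tmp, 'small')
        large = os.path.join(tmp, 'large')
        with open(small, 'wb') as f:
            f.write(b'x' * 10)
        with open(large, 'wb') as f:
            f.write(b'x' * 200)
        small_link = os.path.join(tmp, 'small.link')
        large_link = os.path.join(tmp, 'large.link')
        # A tiny threshold keeps the demo files small.
        styles = (make_link(small, small_link, threshold=100),
                  make_link(large, large_link, threshold=100))
        return styles + (os.path.islink(small_link), os.path.islink(large_link))


print(demo())
```

The trade-off behind the threshold: git has no notion of hard links and stores each one as a regular file, while a symlink is stored as just its target path but consumes an inode on the filesystem.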

It's still a prototype, and there is not much error checking, and there
are certain issues not dealt with at all (the ones I thought about are

Content-Type: text/x-python
Content-Disposition: inline; filename=linksync.py

# Copyright 2013, David Bremner <david@tethera.net>

# Licensed under the same terms as notmuch.

import argparse
import errno
import os
import re
from collections import defaultdict

import notmuch

# skip automatic and maildir tags

skiptags = re.compile(r"^(attachment|signed|encrypted|draft|flagged|passed|replied|unread)$")

# some random person on stack overflow suggests:

def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

CHARSET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+_@=.,-'

encode_re = '([^{0}])'.format(CHARSET)

decode_re = '[%]([0-7][0-9A-Fa-f])'

def encode_one_char(match):
    return('%{:02x}'.format(ord(match.group(1))))

def encode_for_fs(str):
    return re.sub(encode_re,encode_one_char, str,0)

def decode_one_char(match):
    return chr(int(match.group(1),16))

def decode_from_fs(str):
    return re.sub(decode_re,decode_one_char, str, 0)


def mk_tag_dir(tagdir):

    mkdir_p (os.path.join(tagdir, 'cur'))
    mkdir_p (os.path.join(tagdir, 'new'))
    mkdir_p (os.path.join(tagdir, 'tmp'))


flagpart = '(:2,[^:]*)'
flagre = re.compile(flagpart + '$')

def path_for_msg (dir, msg):
    filename = msg.get_filename()
    flagsmatch = flagre.search(filename)
    if flagsmatch is None:
        flags = ''
    else:
        flags = flagsmatch.group(1)

    return os.path.join(dir, 'cur', encode_for_fs(msg.get_message_id()) + flags)


def unlink_message(dir, msg):

    dir = os.path.join(dir, 'cur')

    # escape the encoded id: '.' and '+' are legal in file names but special in a regex
    filepattern = re.escape(encode_for_fs(msg.get_message_id())) + flagpart + '?$'

    filere = re.compile(filepattern)

    for file in os.listdir(dir):
        if filere.match(file):
            os.unlink(os.path.join(dir, file))

def dir_for_tag(tag):
    enc_tag = encode_for_fs (tag)
    return os.path.join(tagroot, enc_tag)

disk_tags = defaultdict(set)
disk_ids = set()

def read_tags_from_disk(rootdir):

    for root, subFolders, files in os.walk(rootdir):
        for filename in files:
            msg_id = filename.split(':')[0]
            # root is <rootdir>/<tag>/cur, so the tag is the second-to-last component
            tag = root.split('/')[-2]
            decoded_id = decode_from_fs(msg_id)
            disk_ids.add(decoded_id)
            disk_tags[decoded_id].add(decode_from_fs(tag))


parser = argparse.ArgumentParser(description='Sync notmuch tag database to/from link farm')
parser.add_argument('-l','--link-style',choices=['hard','symbolic', 'adaptive'],
                    default='adaptive',dest='link_style')
parser.add_argument('-d','--destination',choices=['disk','notmuch'], default='disk',
                    dest='destination')
parser.add_argument('-t','--threshold', default=50000, type=int, dest='threshold')

parser.add_argument('tagroot')

opts=parser.parse_args()

tagroot=opts.tagroot

sync_from_links = (opts.destination == 'notmuch')

read_tags_from_disk(tagroot)

if sync_from_links:
    db = notmuch.Database(mode=notmuch.Database.MODE.READ_WRITE)
else:
    db = notmuch.Database(mode=notmuch.Database.MODE.READ_ONLY)

dbtags = filter (lambda tag: not skiptags.match(tag), db.get_all_tags())

querystr = ' OR '.join(map (lambda tag: 'tag:'+tag, dbtags))

q_new = notmuch.Query(db, querystr)
q_new.set_sort(notmuch.Query.SORT.UNSORTED)
for msg in q_new.search_messages():

    # silently ignore empty tags
    db_tags = set(filter (lambda tag: tag != '' and not skiptags.match(tag),
                          msg.get_tags()))

    message_id = msg.get_message_id()

    disk_ids.discard(message_id)

    missing_on_disk = db_tags.difference(disk_tags[message_id])
    missing_in_db = disk_tags[message_id].difference(db_tags)

    if sync_from_links:
        msg.freeze()

    filename = msg.get_filename()

    if len(missing_on_disk) > 0:
        if opts.link_style == 'adaptive':
            statinfo = os.stat (filename)
            symlink = (statinfo.st_size > opts.threshold)
        else:
            symlink = opts.link_style == 'symbolic'

    for tag in missing_on_disk:

        if sync_from_links:
            msg.remove_tag(tag,sync_maildir_flags=False)
        else:
            tagdir = dir_for_tag (tag)
            mk_tag_dir (tagdir)

            newlink = path_for_msg (tagdir, msg)

            if symlink:
                os.symlink(filename, newlink)
            else:
                os.link(filename, newlink)

    for tag in missing_in_db:
        if sync_from_links:
            msg.add_tag(tag,sync_maildir_flags=False)
        else:
            tagdir = dir_for_tag (tag)
            unlink_message(tagdir,msg)

    if sync_from_links:
        msg.thaw()

# everything remaining in disk_ids is a deleted message
# unless we are syncing back to the database, in which case
# it just might not currently have any non maildir tags.

if not sync_from_links:
    for root, subFolders, files in os.walk(tagroot):
        for filename in files:
            msg_id = filename.split(':')[0]
            decoded_id = decode_from_fs(msg_id)
            if decoded_id in disk_ids:
                os.unlink(os.path.join(root, filename))

# currently empty directories are not pruned.
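
As the closing comment notes, the script leaves empty tag directories behind. One possible way to prune them is sketched below. This is not part of the attached script, and `prune_empty_tag_dirs` and `demo` are hypothetical names. The sketch treats each tag directory as a unit and removes it only when `cur`, `new` and `tmp` are all empty, so that a maildir still holding messages keeps its full structure.

```python
import os
import tempfile

MAILDIR_SUBDIRS = ('cur', 'new', 'tmp')


def prune_empty_tag_dirs(tagroot):
    """Remove tag maildirs under tagroot that hold no messages.

    Returns the names of the tag directories that were removed.
    """
    removed = []
    for name in sorted(os.listdir(tagroot)):
        tagdir = os.path.join(tagroot, name)
        if not os.path.isdir(tagdir):
            continue
        subdirs = [os.path.join(tagdir, sub) for sub in MAILDIR_SUBDIRS]
        # Keep the tag if any maildir subdirectory still has entries ...
        has_messages = any(os.path.isdir(d) and os.listdir(d) for d in subdirs)
        # ... or if the directory contains anything we did not expect.
        unexpected = set(os.listdir(tagdir)) - set(MAILDIR_SUBDIRS)
        if has_messages or unexpected:
            continue
        for d in subdirs:
            if os.path.isdir(d):
                os.rmdir(d)
        os.rmdir(tagdir)
        removed.append(name)
    return removed


def demo():
    with tempfile.TemporaryDirectory() as tagroot:
        for tag in ('inbox', 'stale'):
            for sub in MAILDIR_SUBDIRS:
                os.makedirs(os.path.join(tagroot, tag, sub))
        # Only 'inbox' holds a message link (a plain file stands in for it here).
        with open(os.path.join(tagroot, 'inbox', 'cur', 'id1@example.org'), 'w') as f:
            f.write('placeholder\n')
        removed = prune_empty_tag_dirs(tagroot)
        return removed, sorted(os.listdir(tagroot))


print(demo())
```

Such a pass would only make sense when syncing to disk, after the stale links have been unlinked.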