From: David Bremner
To: "notmuch mailing list"
Subject: On disk tag storage format
User-Agent: Notmuch/0.14+75~g984212d (http://notmuchmail.org) Emacs/24.1.1 (x86_64-pc-linux-gnu)
Date: Thu, 29 Nov 2012 09:01:23 -0400
Message-ID: <874nk8v9zw.fsf@zancas.localnet>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
List-Id: "Use and development of the notmuch mail system."
X-List-Received-Date: Thu, 29 Nov 2012 13:01:33 -0000

--=-=-=
Content-Type: text/plain

Austin outlined on IRC a way of representing tags on disk as hardlinks
to messages. To make the discussion more concrete, I wrote a prototype
in Python that dumps the notmuch database to this format. On my 250k
messages, this creates 40k new hardlinks and uses about 5M of disk
space. The dump takes about 20s on my Core i7 machine. With symbolic
links, the same database takes about 150M of disk space; this isn't
great, but it isn't unbearable either. The assumption in both cases is
that maildirsync is on, so most tags are stored in the original
maildirs.

In principle such a representation (or some variation) could be used to
interact with some external source of tagging information like Gmail.
It could also be used (with rsync --hard-links?) to synchronize notmuch
databases between machines.

I'm still unsure about the runtime performance impact of updating the
file system and the Xapian index with every tag operation, but I thought
I would see whether the representation itself is usable for most people
without bringing the filesystem to its knees. So I'd be interested to
hear about other people's experiences running this script. It _should_
be safe since it opens the database read-only, but the smart money is on
backups before running other people's experimental code. Especially
since I don't claim to actually know Python.

One technicality is that this hex-encodes ':' (unlike the other code
floating around); this is so that hex_encode(message_id)+maildir_flags
is a valid maildir name. The uniqueness of the names comes from the
(much discussed) keying of messages on message-ids.
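
To illustrate the naming scheme, here is a minimal standalone sketch of
how a link name could be built from a message-id and split apart again.
It assumes Python 3 and encodes non-ASCII characters as UTF-8 bytes;
decode_from_fs, link_name and split_link_name are hypothetical helpers
for illustration and are not part of dump-tags.py.

```python
import re

# Anything outside this set is written as %xx. ':' and '%' are both
# outside it, so an encoded message-id never contains a literal ':'.
ENCODE_RE = re.compile(r'[^A-Za-z0-9+_@=.,-]')
# Maildir "info" suffix, e.g. ':2,RS', at the end of a filename.
FLAGS_RE = re.compile(r'(:2,[^:]*)$')


def encode_for_fs(text):
    """Hex-encode every character outside the safe set (UTF-8 bytes)."""
    def enc(match):
        return ''.join('%{:02x}'.format(b)
                       for b in match.group(0).encode('utf-8'))
    return ENCODE_RE.sub(enc, text)


def decode_from_fs(name):
    """Inverse of encode_for_fs (hypothetical helper)."""
    raw = re.sub(rb'%([0-9a-f]{2})',
                 lambda m: bytes([int(m.group(1), 16)]),
                 name.encode('ascii'))
    return raw.decode('utf-8')


def link_name(message_id, maildir_filename):
    """Encoded message-id plus the flags suffix of the original file."""
    match = FLAGS_RE.search(maildir_filename)
    flags = match.group(1) if match else ''
    return encode_for_fs(message_id) + flags


def split_link_name(name):
    """Recover (message_id, flags) from a link name.

    The first literal ':' can only come from the flags suffix, because
    every ':' inside the message-id was encoded as %3a.
    """
    base, sep, rest = name.partition(':')
    return decode_from_fs(base), sep + rest
```

Because ':' in the message-id is always written as %3a, the first
literal ':' in a link name marks the start of the maildir flags, which
is what makes the split unambiguous.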
--=-=-=
Content-Type: text/x-python
Content-Disposition: inline; filename=dump-tags.py

import errno
import os
import re

import notmuch

# Tags that maildir flag synchronization already stores in the filename.
maildirish = re.compile(r"^(draft|flagged|passed|replied|unread)$")

# Set to True to create symbolic links instead of hardlinks.
symlink = False


# some random person on stack overflow suggests:
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise


# Characters that are passed through unchanged; everything else,
# including ':', is hex-encoded as %xx.
CHARSET = ('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
           '0123456789+_@=.,-')
encode_re = '([^{0}])'.format(CHARSET)


def encode_one_char(match):
    return '%{:02x}'.format(ord(match.group(1)))


def encode_for_fs(name):
    return re.sub(encode_re, encode_one_char, name, 0)


# Maildir flags suffix (e.g. ':2,RS') at the end of a filename.
flagre = re.compile("(:2,[^:]*)$")

tagroot = 'tags'

db = notmuch.Database(mode=notmuch.Database.MODE.READ_ONLY)

q_new = notmuch.Query(db, '*')
q_new.set_sort(notmuch.Query.SORT.UNSORTED)

for msg in q_new.search_messages():
    for tag in msg.get_tags():
        if tag == '':
            print('Dunno what to do about empty tag on ' +
                  msg.get_message_id())
        elif not maildirish.match(tag):
            # ignore multiple filenames
            filename = msg.get_filename()
            message_id = msg.get_message_id()

            flagsmatch = flagre.search(filename)
            if flagsmatch is None:
                flags = ''
            else:
                flags = flagsmatch.group(1)

            # Each tag gets its own maildir: tags/<tag>/{cur,new,tmp}
            tagdir = os.path.join(tagroot, encode_for_fs(tag))
            curdir = os.path.join(tagdir, 'cur')
            mkdir_p(os.path.join(tagdir, 'new'))
            mkdir_p(os.path.join(tagdir, 'tmp'))
            mkdir_p(curdir)

            newlink = os.path.join(curdir,
                                   encode_for_fs(message_id) + flags)
            if symlink:
                os.symlink(filename, newlink)
            else:
                os.link(filename, newlink)

--=-=-=--