Date: Tue, 14 Aug 2012 18:50:44 +0200
From: Vladimir Marek
To: Ciprian Dorin Craciun
Cc: notmuch@notmuchmail.org
Subject: Re: Alternative (raw) message store (i.e. instead of maildir)
Message-ID: <20120814165044.GP28321@pub.cz.oracle.com>
Mail-Followup-To: Ciprian Dorin Craciun, Stewart Smith,
	notmuch@notmuchmail.org
References: <20120811094635.GY28321@pub.cz.oracle.com>
	<874no613ms.fsf@flamingspork.com>
	<20120814160442.GO28321@pub.cz.oracle.com>
List-Id: "Use and development of the notmuch mail system."

> >> > - fuse zip stores all changes in memory until unmounted
> >> > - fuse zip (and libzip for that matter) creates a new temporary
> >> >   file when updating the archive, which takes considerable time
> >> >   when the archive is very big.
> >>
> >> This isn't much of a hassle if you have maildir per time period and
> >> archive off. Maybe if you sync flags it may be...
> >
> > That might be an interesting solution, maildir per time period.
>
> Although using a zip file through FUSE as a maildir store is not
> much better in my opinion.
>
> This is because it still doesn't solve the syscall overhead. For
> example just going through the list of files to find those that
> changed requires the following syscalls:
> * reading the next directory entry (which is amortized as it reads
>   them in a batch, but the batch size is limited, should we say 1
>   syscall per 10 files?);
> * stat-ing the file;
>
> Now by adding FUSE we add an extra context switch for each syscall...
>
> Although this issue would be problematic only for reindexing, but still...

That's a price I would be willing to pay to have a single file instead
of many.

> > But still
> > fuse zip caches all the data until unmounted.
> > So even with just reading
> > it keeps growing (I hope I'm not accusing fuse zip here, but this is my
> > understanding from the code). This could be simply alleviated by having
> > it periodically unmounted and mounted again (perhaps from cron).
>
> I think there is an option for FUSE mount to specify if the data
> should be cached by the kernel or not, as such this shouldn't be a
> problem for FUSE itself, except if the Zip FUSE handler does some
> extra caching.

To my understanding it's the handler itself.

> >> > Of course this solution would have some disadvantages too, but for me
> >> > the advantages would win. At the moment I'm not sure if I want to
> >> > continue working on that. Maybe if there would be more interested guys
> >>
> >> I'm *really* tempted to investigate making this work for archived
> >> mail. Of course, the list of mounted file systems could get insane
> >> depending on granularity I guess...
> >
> > Well, if your granularity will be one archive per year of mail, it
> > should not be that bad ...
>
> On the other hand I strongly support having a more optimized
> backend for emails, especially for such cases. For example a
> BerkeleyDB would perfectly fit such a use case, especially if we store
> the body and the headers in separate databases.
>
> Just a small experiment, below is the R `summary(emails)` of the
> sizes of my 700k emails:
> ~~~~
>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
>        8     4364     5374    11510     7042 31090000
> ~~~~
>
> As seen, 75% of the emails are at or below about 7k, and this without
> any compression...
>
> Moreover we could organize the keys so that in a B-Tree structure
> the emails in the same thread are closer together...

Now I'm not sure whether you are talking about some berkeley-db fuse
filesystem or direct support in notmuch. I don't have enough cycles to
modify notmuch, so I started to look at a simpler (codewise) solution ...
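
The key-ordering idea quoted above can be sketched roughly as follows.
This is only an illustration, not notmuch or BerkeleyDB code: a sorted
in-memory list stands in for the B-tree, and the key layout (thread id,
date, message id joined by NUL bytes) as well as the names `make_key`,
`SortedStore` and `thread_messages` are assumptions made up for the
example.

```python
import bisect


def make_key(thread_id: str, date: str, message_id: str) -> bytes:
    """Hypothetical key layout: thread id first, so thread-mates sort together.

    The date is assumed to be an ISO-8601 string so that byte order
    matches chronological order within a thread.
    """
    return f"{thread_id}\x00{date}\x00{message_id}".encode("utf-8")


class SortedStore:
    """Stand-in for a B-tree database: keeps keys in sorted byte order."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> raw message bytes

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, prefix: bytes):
        """Yield (key, value) for every key starting with prefix, in key order."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            key = self._keys[i]
            yield key, self._values[key]
            i += 1


def thread_messages(store: SortedStore, thread_id: str) -> list:
    """Return all messages of one thread via a single contiguous range scan."""
    prefix = f"{thread_id}\x00".encode("utf-8")
    return [value for _, value in store.scan_prefix(prefix)]
```

With keys built this way, reading a whole thread is one seek plus a
sequential walk over neighbouring keys, which is the locality the
B-Tree remark is after. A real ordered key-value store would replace
`SortedStore`; the prefix scan would then be a cursor positioned at the
first key greater than or equal to the prefix.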

To summarize, what I personally want from the mail storage:
 - ability to read and write mails
 - should work with mutt (or mutt-kz)
 - simple backup to a Windows drive (file names can't contain a
   colon ':')

-- 
	Vlad