In-Reply-To: <20120814160442.GO28321@pub.cz.oracle.com>
References: <20120811094635.GY28321@pub.cz.oracle.com>
 <874no613ms.fsf@flamingspork.com> <20120814160442.GO28321@pub.cz.oracle.com>
Date: Tue, 14 Aug 2012 19:38:22 +0300
Subject: Re: Alternative (raw) message store (i.e. instead of maildir)
From: Ciprian Dorin Craciun
To: Stewart Smith, notmuch@notmuchmail.org

On Tue, Aug 14, 2012 at 7:04 PM, Vladimir Marek wrote:
>> > - fuse zip stores all changes in memory until unmounted
>> > - fuse zip (and libzip for that matter) creates new temporary file when
>> >   updating archive, which takes considerable time when the archive is
>> >   very big.
>>
>> This isn't much of a hastle if you have maildir per time period and
>> archive off. Maybe if you sync flags it may be...
>
> That might be interesting solution, maildir per time period.

    Still, in my opinion using a zip file through FUSE as a maildir
store is not much better, because it doesn't solve the syscall
overhead. For example, just going through the list of files to find
those that changed requires the following syscalls:
    * reading the next directory entry (which is amortized, as entries
are read in batches, but the batch size is limited; say, 1 syscall per
10 files?);
    * stat-ing the file;

    Now by adding FUSE we add an extra round trip to the userspace
FUSE daemon, and so extra context switches, for each syscall... Granted,
this would be a problem only when reindexing, but still...

> But still
> fuse zip caches all the data until unmounted. So even with just reading
> it keeps growing (I hope I'm not accusing fuse zip here, but this is my
> understanding form the code). This could be simply alleviated by having
> it periodically unmounted and mounted again (perhaps from cron).

    I think FUSE has a mount option that controls whether file data
is cached by the kernel, so this shouldn't be a problem for FUSE
itself, unless the zip FUSE handler does some extra caching of its own.

>> > Of course this solution would have some disadvantages too, but for me
>> > the advantages would win. At the moment I'm not sure if I want to
>> > continue working on that. Maybe if there would be more interested guys
>>
>> I'm *really* tempted to investigate making this work for archived
>> mail. Of course, the list of mounted file systems could get insane
>> depending on granularity I guess...
>
> Well, if your granularity will be one archive per year of mail, it
> should not be that bad ...

    On the other hand, I strongly support having a more optimized
backend for emails, especially for such cases. For example, a
BerkeleyDB store would fit such a use case well, especially if we keep
the bodies and the headers in separate databases.

    As a small experiment, below is the R `summary(emails)` output for
the sizes of my 700k emails:
~~~~
   Min. 1st Qu.  Median    Mean 3rd Qu.     Max.
      8    4364    5374   11510    7042 31090000
~~~~

    As seen, 75% of the emails are at or below about 7k, and this is
without any compression...

    Moreover, we could organize the keys so that in a B-Tree structure
the emails in the same thread end up close together...

    Ciprian.
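
    A minimal sketch of the key layout idea above, assuming a store
that keeps its keys in byte order (as a B-Tree does). It is not
BerkeleyDB code: a sorted Python list stands in for the tree, and the
names `make_key` and `thread_range`, the field widths, and the choice
of (thread id, timestamp, sequence number) as key components are all
hypothetical, chosen only for illustration.

```python
import bisect
import struct

# Hypothetical key layout: thread id (8 bytes), timestamp (8 bytes),
# per-thread sequence number (4 bytes), all big-endian and unsigned.
# Big-endian fixed-width fields make byte order equal to numeric order,
# so keys of one thread sort next to each other.
KEY_FORMAT = ">QQI"
SUFFIX_LEN = struct.calcsize(">QI")  # bytes that follow the thread id


def make_key(thread_id: int, timestamp: int, seq: int) -> bytes:
    """Build a composite key that clusters messages by thread."""
    return struct.pack(KEY_FORMAT, thread_id, timestamp, seq)


def thread_range(sorted_keys: list, thread_id: int) -> list:
    """Return all keys of one thread from a byte-sorted key list.

    This mimics a B-Tree range scan: seek to the first key with the
    thread prefix, then read until the prefix changes.
    """
    prefix = struct.pack(">Q", thread_id)
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_right(sorted_keys, prefix + b"\xff" * SUFFIX_LEN)
    return sorted_keys[lo:hi]


if __name__ == "__main__":
    # (thread id, timestamp, sequence) triples, deliberately out of order.
    messages = [(2, 100, 0), (1, 300, 2), (1, 100, 0), (3, 50, 0), (1, 200, 1)]
    keys = sorted(make_key(*m) for m in messages)
    for key in thread_range(keys, 1):
        print(struct.unpack(KEY_FORMAT, key))
```

    With such keys, fetching a whole thread becomes one seek plus a
sequential read of neighbouring entries, rather than one lookup per
message. How thread ids would be assigned (and kept stable when
threads merge) is left open here.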