From: Sascha Silbe
To: Austin Clements, notmuch
Subject: Re: [PATCH 0/3] Speed up notmuch new for unchanged directories
Date: Tue, 26 Jun 2012 00:13:40 +0200
In-Reply-To: <87pq8n1de4.fsf@awakening.csail.mit.edu>
References: <1340555366-25891-1-git-send-email-sascha-pgp@silbe.org> <87pq8n1de4.fsf@awakening.csail.mit.edu>
User-Agent: Notmuch/0.13.2+51~gecf7cfe (http://notmuchmail.org) Emacs/23.2.1 (x86_64-pc-linux-gnu)
List-Id: "Use and development of the notmuch mail system."

Austin Clements writes:

> On Sun, 24 Jun 2012, Sascha Silbe wrote:

["notmuch new" listing every directory, even if it's unchanged]

> I haven't looked over your patches yet, but this result surprises me.
> Could you explain your setup a little more? How much mail do you have
> and across how many directories? What file system are you using?

As mentioned in passing already, I have a total of about 900k unique
mails (sometimes several copies of them, received over different paths,
e.g. mailing list and a direct CC). Most of that is "old" mail, in
directories that are not getting updated. If notmuch supported mbox,
I'd use that instead for those old mails.

The total number of directories in the mail store is about 29k and the
total number of files (including the git repository and mbox files that
sup used) is about 1.25M.

Since a housekeeping job last weekend, the number of mails in
directories that are still getting updated is about 4k, i.e. about 5‰
of the total number of mails or 3‰ of the total number of files. The
number of directories getting updated is 104, i.e. about 4‰ of the
total number of directories.

Ideally, we'd get the run-time of "notmuch new" down by a similar
factor. With just plain POSIX and no additional information, that won't
be possible, but providing a way to channel information about updates
into notmuch (rather than having it scan everything over and over
again) should help. That information is already available as output
from the mail fetching process (rsync in my case). Of course, it would
be purely optional: "notmuch new" without additional information would
simply continue to scan everything.

> I'm also surprised that your new approach helps.
> This directory listing has to be read off disk one way or the other,
> but listing directories is the bread-and-butter of file systems,
> whereas I would think that Xapian would require more IO to accomplish
> the same effect.

"notmuch new" needs to iterate over a list of all directories to find
those with new mails (and potentially new subdirectories). However, it
does not need to list the *contents* of those folders.

I'm surprised as well, but rather in the opposite direction: Based on a
naive calculation, we'd expect to see a speedup on the order of
(1.25M+29k)/29k = 44. The actual results suggest that stat()ing (done
29k times both before and after the patch) is taking about 19 times as
long as listing a directory entry (before the patch we listed about
1.25M entries, now we list none if nothing has changed). (*)

In practice, the speedup achieved by my patch is larger than what the
benchmark suggests because there are other processes running that use
RAM. If we need to read a lot from disk (like "notmuch new" did before
my patch), there's a good chance it's already been evicted from the
cache since the last run. The less we need to read, the more likely it
is to still be in the cache. Similarly, reading lots of data from disk
will displace other data in the cache. These effects are not covered by
the pure "hot cache" and "cold cache" timings.

> Does your patch win because you can specifically list subdirectories
> out of Xapian, making the IO proportional to the number of
> subdirectories instead of the number of subdirectories and files (even
> though the constant factors probably favor reading from the file
> system)?

It wins because the factor is the number of files in each directory,
not just some low constant based on file system overhead vs. Xapian
overhead.

> I like the idea of these patches, I just want to make sure I have a firm
> grip on what's being optimized and why it wins.

Certainly a good idea. Thanks for taking the time!

Sascha

(*) float(linsolve([29000*x + 1250000*y = 3.3 * 29000*x], [x]));
    in maxima, if you'd like to check the math.

-- 
http://sascha.silbe.org/
http://www.infra-silbe.de/
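
For anyone without maxima at hand, the footnote's calculation can be
reproduced with a few lines of Python. This is only a sketch of the
arithmetic: x stands for the cost of one stat() call and y for the cost
of listing one directory entry, as in the footnote, and the function
and parameter names are made up for illustration. The factor 3.3 is the
measured speedup taken from the footnote's equation.

```python
def naive_speedup(n_dirs, n_files):
    """Expected speedup if stat() and listing one entry cost the same.

    Before: stat every directory and list every file entry.
    After:  stat every directory only.
    """
    return (n_files + n_dirs) / n_dirs


def stat_to_list_cost_ratio(n_dirs, n_files, measured_speedup):
    """Solve n_dirs*x + n_files*y = measured_speedup * n_dirs*x for x/y.

    x is the cost of one stat(), y the cost of listing one entry.
    Rearranging gives x/y = n_files / ((measured_speedup - 1) * n_dirs).
    """
    return n_files / ((measured_speedup - 1.0) * n_dirs)


if __name__ == "__main__":
    n_dirs, n_files = 29_000, 1_250_000
    print(round(naive_speedup(n_dirs, n_files), 1))                 # 44.1
    print(round(stat_to_list_cost_ratio(n_dirs, n_files, 3.3), 1))  # 18.7
```

The second number matches the "about 19 times" quoted above: with a
measured speedup of only 3.3, a single stat() has to cost roughly as
much as listing 19 directory entries.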