From 5df6880d5dcf6800aa8107990c92a6ad1bb33d78 Mon Sep 17 00:00:00 2001 From: dm-list-email-notmuch Date: Thu, 24 Apr 2014 15:31:49 +1700 Subject: [PATCH] Re: [PATCH] Add configurable changed tag to messages that have been changed on disk --- de/d38ec5466c2078bebc914adca1c99c451dce9e | 153 ++++++++++++++++++++++ 1 file changed, 153 insertions(+) create mode 100644 de/d38ec5466c2078bebc914adca1c99c451dce9e diff --git a/de/d38ec5466c2078bebc914adca1c99c451dce9e b/de/d38ec5466c2078bebc914adca1c99c451dce9e new file mode 100644 index 000000000..1198d3fcd --- /dev/null +++ b/de/d38ec5466c2078bebc914adca1c99c451dce9e @@ -0,0 +1,153 @@ +Return-Path: + +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by olra.theworths.org (Postfix) with ESMTP id D26A3431FC4 + for ; Wed, 23 Apr 2014 15:31:58 -0700 (PDT) +X-Virus-Scanned: Debian amavisd-new at olra.theworths.org +X-Spam-Flag: NO +X-Spam-Score: -2.3 +X-Spam-Level: +X-Spam-Status: No, score=-2.3 tagged_above=-999 required=5 + tests=[RCVD_IN_DNSWL_MED=-2.3] autolearn=disabled +Received: from olra.theworths.org ([127.0.0.1]) + by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id PIBwrsjGS7ZV for ; + Wed, 23 Apr 2014 15:31:54 -0700 (PDT) +Received: from market.scs.stanford.edu (market.scs.stanford.edu [171.66.3.10]) + (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) + (No client certificate requested) + by olra.theworths.org (Postfix) with ESMTPS id E922B431FBF + for ; Wed, 23 Apr 2014 15:31:53 -0700 (PDT) +Received: from market.scs.stanford.edu (localhost.scs.stanford.edu + [127.0.0.1]) by market.scs.stanford.edu (8.14.7/8.14.7) with ESMTP id + s3NMVoYW028965; Wed, 23 Apr 2014 15:31:50 -0700 (PDT) +Received: (from dm@localhost) + by market.scs.stanford.edu (8.14.7/8.14.7/Submit) id s3NMVnq9018357; + Wed, 23 Apr 2014 15:31:49 -0700 (PDT) +X-Authentication-Warning: market.scs.stanford.edu: dm set sender to + return-qnmtsdtvmswvfn8meiysvc9qaw@ta.scs.stanford.edu using -f +From: dm-list-email-notmuch@scs.stanford.edu +To: Austin Clements +Subject: Re: [PATCH] Add configurable changed tag to messages that have been + changed on disk +In-Reply-To: <20140423205920.GM25817@mit.edu> +References: <1396800683-9164-1-git-send-email-eg@gaute.vetsj.com> + <87wqf2gqig.fsf@ta.scs.stanford.edu> <1397140962-sup-6514@qwerzila> + <87wqexnqvb.fsf@ta.scs.stanford.edu> <1397163239-sup-5101@qwerzila> + <87d2g9ja0h.fsf@maritornes.cs.unb.ca> <1398237865-sup-624@qwerzila> + <87ioq0l8th.fsf@ta.scs.stanford.edu> + <20140423205920.GM25817@mit.edu> +Date: Wed, 23 Apr 2014 15:31:49 -0700 +Message-ID: <87iopz4qzu.fsf@ta.scs.stanford.edu> +MIME-Version: 1.0 +Content-Type: text/plain +Cc: notmuch +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.13 +Precedence: list +Reply-To: David Mazieres expires 2014-07-22 PDT + +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Wed, 23 Apr 2014 22:31:59 -0000 + +Austin Clements writes: + +>> A middle ground might be to use the maximum of two values: 1) the +>> time-of-day at which notmuch started executing, and 2) the highest ctime +>> in the database plus 100 microseconds (leaving plenty of slop to store +>> timestamps as IEEE doubles with 52 significant bits). Since the values +>> will be Btree-indexed, computing the max plus one will be cheap. +> +> This makes me curious if you've considered how to fit this in to +> Xapian. The Xapian query syntax supports range queries over document +> "values", but within the Xapian B-tree, values are stored in docid +> order, not value order, so Xapian's range query operator is actually a +> full scan in implementation. I assume it does this so it doesn't have +> to store both forward and inverse indexes of values. (I spent some +> time figuring out the layout of the Xapian database and have fairly +> detailed notes if anyone's curious.) + +Aside from finding the previous max time, everything else should work +identically to the date query operator and NOTMUCH_VALUE_TIMESTAMP. + +Though I believe you, I'm a little surprised the values aren't indexed. +An alternative design might use terms like XCTIMExxxxxxxxxxxxxxxx where +the x's are hex digits. But this seems a bit clunky and not using +Xapian the way it is indented to be used. + +When I do a query with a giant result set ordered by date (notmuch +search --sort=oldest-first "*"), the first few results come back pretty +quickly, so I guess the full database scan is not an issue, at least for +~10^5 messages. + +> This is still reasonably fast in practice because it's a sequential +> scan and only requires a few bytes per message, but it's probably not +> what you'd expect. That said, Xapian does track per-value statistics +> that would suffice for the particular problem of monotonic time stamps +> (e.g., Database::get_value_upper_bound). + +Oh, well in that case there is no issue. That max is the only statistic +we need. Everything that requires a full database scan, like get me all +messages whose properties have changed since time X, is something that +you can't do at all right now. And in fact I'm already scanning all +100,000 message IDs AND diffing the results against a separate sqlite +database to detect changes in only 0.09 seconds (Linux) or 1.2 seconds +(32-bit OpenBSD). This will only make that faster, and additionally +allow other people to do what I'm doing without writing a bunch of C++ +code. + +> In principle it would be possible to use user metadata or even +> document terms to support true B-tree range scans by ctime order, but +> I don't think it's possible to express queries over this using +> Xapian's query parser. I've written about 90% of a (new) custom query +> parser for Notmuch that would enable this, but little things like my +> looming thesis deadline have interfered with me finishing it. + +Yeah, I've been avoiding the query parser and just scanning terms and +postlists directly. Since the lack of ctime forced me to scan the whole +database anyway, I found it much faster to scan each tag's posting list +and dump the results into sqlite than to extract tag terms on a +per-document basis the way notmuch dump does. + +>> Incidentally, if you are really this paranoid about time stamps, it +>> should bother you that notmuch's directory timestamps only have one +>> second granularity. +> +> This is historical (and, I agree, unfortunate). But nobody's +> complained, so it hasn't been worth changing the libnotmuch interface +> to support sub-second directory mtimes. However, notmuch new does +> correctly handle deliveries in the same second it runs. If the +> wall-clock time when it starts is the same as the on-disk directory +> mtime, it skips updating the in-database directory mtime at the end. +> Hence, on the next run, it will still consider the directory +> out-of-date. It's a bit of a hack, but it's a hack that would be +> necessary for supporting older file systems even if we did support +> sub-second timestamps. + +Yeah, is kind of a problem me. I currently scan the XFDIRENTRY terms +belonging to a directory only if the directory's notmuch mtime has +changed since the last time I examined Xapian's state. I used to scan +the actual directories, which was fine, but not so useful because I +don't actually want to deal with messages that notmuch has not yet +indexed. Conversely, if a directory has not changed since the last time +muchsync ran, but notmuch's idea of the directory has changed (because +someone ran notmuch new), then I do care about scanning for new/deleted +XFDIRENTRY terms. + +But couldn't notmuch fix the sub-second problem in a fully backwards +compatible way? After all, the database is already storing these mtimes +as doubles. Even for OSes that don't support st_mtim, notmuch could +just add 0.00001 seconds to the previous timestamp of a modified +directory. + +David -- 2.26.2