From: Austin Clements
To: Istvan Marko
Cc: notmuch@notmuchmail.org
Date: Wed, 4 May 2011 21:48:39 -0400
Subject: Re: storing From and Subject in xapian

This is awesome. What was your machine configuration?

As another data point, with a probably very different configuration (8
year old P4, new SSD), my test query was 1.9X faster uncached and 1.6X
faster cached. It also produced 60% fewer disk reads. I saw the same
1% increase in database size.

BTW, the reason you're missing some of the subjects is that the char*
returned from _notmuch_message_get_header_value goes out of scope as
soon as that function returns. A simple fix is to replace
  return value.c_str();
with
  return talloc_strdup (message, value.c_str ());

Values are probably the right place to store this information (though
I've never been completely clear on the difference between document
data and values). Terms would be indexed, which is both unnecessary
(unless there's a reason to do *exact* matches on from and subject?)
and would result in more database expansion.

On Tue, May 3, 2011 at 11:40 PM, Istvan Marko wrote:
>
> I have been looking at the I/O patterns of "notmuch search" with the
> default output format and noticed that it has to parse the maildir file
> of every matched message to get the From and Subject headers.
> I figured that this must be slowing things down, especially when the
> files are not in the filesystem cache.
>
> So I wanted to see how much difference would it make to have the From
> and Subject stored in xapian to avoid this parsing.
>
> With the attached patch I get a speedup of 2x with cached and almost 10x
> with uncached files for searches with many matches.
>
> The attached patch is only intended as proof of concept. I am not
> familiar with xapian so I wasn't sure if this kind of data should be
> stored as terms, values or data. I went with values simply because I saw
> that message-id and timestamp were already stored that way. Perhaps the
> data type would be more appropriate since the fields are not used for
> searching or sorting. Oh and for some reason I get blank Subject for
> about 1% of the matches.
>
> Is there a downside to this approach? The only one I see is that the
> xapian db size increases by about 1% but to me the speed increase would
> be well worth it.
>
> --
>        Istvan
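
For anyone following the thread, here is a minimal sketch of the lifetime problem behind the blank subjects and of the copy-based fix suggested in the reply. It is not the notmuch code: lookup_header, get_header_copy and dup_c_string are hypothetical names made up for illustration, and a plain malloc-based copy stands in for talloc_strdup (message, ...) so that the example needs only the standard library. In notmuch the talloc version would instead tie the copy's lifetime to the message object.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical stand-in for fetching a stored header value.
// It returns a std::string by value, i.e. a temporary owned by the caller.
static std::string lookup_header(const std::string &name) {
    return name == "subject" ? "Re: storing From and Subject in xapian" : "";
}

// BROKEN, shown only for contrast and left commented out: `value` is
// destroyed when the function returns, so the pointer handed back dangles.
//
// const char *get_header_dangling(const std::string &name) {
//     std::string value = lookup_header(name);
//     return value.c_str();
// }

// Copy a C string into freshly allocated storage (assumption: this plays
// the role of talloc_strdup here). The caller releases it with free().
static char *dup_c_string(const char *s) {
    std::size_t n = std::strlen(s) + 1;          // include the trailing NUL
    char *copy = static_cast<char *>(std::malloc(n));
    if (copy != nullptr)
        std::memcpy(copy, s, n);
    return copy;
}

// Fixed variant: the bytes are copied before the local string is destroyed,
// so the returned pointer stays valid after the function returns.
char *get_header_copy(const std::string &name) {
    std::string value = lookup_header(name);
    return dup_c_string(value.c_str());
}

// Small demonstration: the copy is still readable once the call is over.
bool demo() {
    char *subject = get_header_copy("subject");
    assert(subject != nullptr);
    bool ok = std::strcmp(subject, "Re: storing From and Subject in xapian") == 0;
    std::free(subject);
    return ok;
}
```

The dangling variant may appear to work most of the time, since the freed bytes are often still intact when read, which would be consistent with only a small fraction of subjects coming back blank.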