Return-Path: X-Original-To: notmuch@notmuchmail.org Delivered-To: notmuch@notmuchmail.org Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id 49D5B431FD0 for ; Sat, 28 May 2011 01:58:11 -0700 (PDT) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -0.799 X-Spam-Level: X-Spam-Status: No, score=-0.799 tagged_above=-999 required=5 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id s+9hG-axKvQx for ; Sat, 28 May 2011 01:58:10 -0700 (PDT) Received: from mail-wy0-f181.google.com (mail-wy0-f181.google.com [74.125.82.181]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id EDDC2431FB6 for ; Sat, 28 May 2011 01:58:09 -0700 (PDT) Received: by wyi11 with SMTP id 11so1991472wyi.26 for ; Sat, 28 May 2011 01:58:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=domainkey-signature:cc:subject:from:to:in-reply-to:references:date :message-id:user-agent:content-transfer-encoding:mime-version :content-type; bh=kXhSTCAZBxY1vsG8vZEAZpQoiaDbNh62gx2g2Bqg3JE=; b=o6YYAvTa4tC221hRR0/cDj2gr2XYfTr1xELHFIW5t+UvSQ8WX6ndyjIoRpoC3aut9E BUJDUdJaMf8AmaJD1gUy/wCnCUmH8AarcPWnZe6pTJtGTz8bT8zx/M8097uGAC+8GGIq 5YcpQ7NyKGTjG9QA6PCt/j8w7gBaEcHz/objw= DomainKey-Signature: a=rsa-sha1; c=nofws; d=googlemail.com; s=gamma; h=cc:subject:from:to:in-reply-to:references:date:message-id :user-agent:content-transfer-encoding:mime-version:content-type; b=dKDy5eFaHAGtiO15HVzq8L/30dNG3eSKNAbzksbqN0VIrk2u9gWsTi1jPDGSbhUO9U VTCOI1ugR6r3x8CHMrcU88xvujahXunVWmLeACMlbdGPcwUIgXICEd/+tDoslyRzZf52 ks1J3JGqmCyS8QWp/gQ9/fmqkdn3Oi1eDdlOc= Received: by 10.227.54.196 with SMTP id r4mr2863866wbg.68.1306573088356; Sat, 28 May 2011 01:58:08 -0700 (PDT) Received: from localhost (cpc1-sgyl2-0-0-cust47.sgyl.cable.virginmedia.com [80.192.18.48]) by mx.google.com with ESMTPS id ge4sm1712630wbb.47.2011.05.28.01.58.05 (version=TLSv1/SSLv3 cipher=OTHER); Sat, 28 May 2011 01:58:06 -0700 (PDT) Subject: Re: one-time-iterators From: Patrick Totzke To: Austin Clements In-reply-to: References: <1306397849-sup-3304@brick> <877h9d9y5m.fsf@yoom.home.cworth.org> <1306442683-sup-9315@brick> <20110526214302.GR29861@mit.edu> <1306446621-sup-3184@brick> <1306518628-sup-5396@brick> Date: Sat, 28 May 2011 09:58:03 +0100 Message-Id: <1306572273-sup-9995@brick> User-Agent: Sup/git Content-Transfer-Encoding: 8bit MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-1306573083-764234-30475-5249-1-="; protocol="application/pgp-signature" Cc: notmuch X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 28 May 2011 08:58:11 -0000 --=-1306573083-764234-30475-5249-1-= Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Excerpts from Austin Clements's message of Fri May 27 20:29:24 +0100 2011= : > On Fri, May 27, 2011 at 2:04 PM, Patrick Totzke > wrote: > > Excerpts from Austin Clements's message of Fri May 27 03:41:44 +0100 = 2011: > >> >> > > Have you tried simply calling list() on your thread > >> >> > > iterator to see how expensive it is? =C2=A0My bet is that it'= s quite cheap, > >> >> > > both memory-wise and CPU-wise. > >> >> > Funny thing: > >> >> > =C2=A0q=3DDatabase().create_query('*') > >> >> > =C2=A0time tlist =3D list(q.search_threads()) > >> >> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For so= me reason > >> >> > the list constructor must read mere than once from the iterator= . > >> >> > So this is not an option, but even if it worked, it would show > >> >> > the same behaviour as my above test.. > >> >> > >> >> Interesting. =C2=A0Looks like the Threads class implements __len_= _ and that > >> >> its implementation exhausts the iterator. =C2=A0Which isn't a gre= at idea in > >> >> itself, but it turns out that Python's implementation of list() c= alls > >> >> __len__ if it's available (presumably to pre-size the list) befor= e > >> >> iterating over the object, so it exhausts the iterator before eve= n > >> >> using it. > >> >> > >> >> That said, if list(q.search_threads()) did work, it wouldn't give= you > >> >> better performance than your experiment above. > > true. Nevertheless I think that list(q.search_threads()) > > should be equivalent to [t for t in q.search_threads()], which is > > something to be fixed in the bindings. Should I file an issue somehow= ? > > Or is enough to state this as a TODO here on the list? > = > Yes, they should be equivalent. > = > Sebastian was thinking about fixing the larger issue of generator > exhaustion, which would address this, though the performance would > depend on the cost of iterating twice. This is why generators > shouldn't support __len__. Unfortunately, it's probably hard to get > rid of at this point and I doubt there's a way to tell list() to > overlook the presence of a __len__ method. Why not simply removing support for __len__ in the Threads and Messages c= lasses? > >> >> > would it be very hard to implement a Query.search_thread_ids() = ? > >> >> > This name is a bit off because it had to be done on a lower lev= el. > >> >> > >> >> Lazily fetching the thread metadata on the C side would probably > >> >> address your problem automatically. =C2=A0But what are you doing = that > >> >> doesn't require any information about the threads you're manipula= ting? > >> > Agreed. Unfortunately, there seems to be no way to get a list of t= hread > >> > ids or a reliable iterator thereof by using the current python bin= dings. > >> > It would be enough for me to have the ids because then I could > >> > search for the few threads I actually need individually on demand.= > >> > >> There's no way to do that from the C API either, so don't feel left > >> out. =C2=A0]:--8) =C2=A0It seems to me that the right solution to yo= ur problem > >> is to make thread information lazy (effectively, everything gathered= > >> in lib/thread.cc:_thread_add_message). =C2=A0Then you could probably= > >> materialize that iterator cheaply. > > Alright. I'll put this on my mental notmuch wish list and > > hope that someone will have addressed this before I run out of > > ideas how to improve my UI and have time to look at this myself. > > For now, I go with the [t.get_thread_id for t in q.search_threads()] > > approach to cache the thread ids myself and live with the fact that > > this takes time for large result sets. > > > >> In fact, it's probably worth > >> trying a hack where you put dummy information in the thread object > >> from _thread_add_message and see how long it takes just to walk the > >> iterator (unfortunately I don't think profiling will help much here > >> because much of your time is probably spent waiting for I/O). > > I don't think I understand what you mean by dummy info in a thread > > object. > = > In _thread_add_message, rather than looking up the message's author, > subject, etc, just hard-code some dummy values. Performance-wise, > this would simulate making the thread metadata lookup lazy, so you > could see if making this lazy would address your problem. Thanks for the clarification. I did that, and also commented out the lower parts of _notmuch_thread_create and this did indeed improve the performance, but not so much as I had hoped: In [10]: q=3DDatabase().create_query('*') In [11]: time T=3D[t for t in q.search_threads()] CPU times: user 2.43 s, sys: 0.22 s, total: 2.65 s Wall time: 2.66 s And I have only about 8000 mails in my index. Making thread lookups lazy would help, but here one would still create a lot of unused (empty) thread objects. The easiest solution to my problem would in my opinion be a function that queries only for thread ids without instanciating them. But I can't think of any other use case than mine for this so I guess many of you would be against adding this to the API? /p --=-1306573083-764234-30475-5249-1-= Content-Disposition: attachment; filename="signature.asc" Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iEYEARECAAYFAk3guRsACgkQlDQDZ9fWxapp0wCgwm7Ua/eLv/tP+6sPaZiNRQs5 hYsAoM9M0WSNUF9Ja4OP1RLamMtNtZH8 =5W3J -----END PGP SIGNATURE----- --=-1306573083-764234-30475-5249-1-=--