From: Joey Hess
Date: Thu, 8 May 2008 23:42:33 +0000 (-0400)
Subject: design for a xapian search plugin
X-Git-Tag: 2.46~10
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=4eba3f631b91df2d250fa11c07260e97b7795adf;p=ikiwiki.git

design for a xapian search plugin
---

diff --git a/doc/todo/different_search_engine.mdwn b/doc/todo/different_search_engine.mdwn
index 39f3e3256..a6364b432 100644
--- a/doc/todo/different_search_engine.mdwn
+++ b/doc/todo/different_search_engine.mdwn
@@ -12,6 +12,75 @@ search so it understands what words are most important in a search. (So does
 Lucene..) Another nice thing is it supports "more documents like this one" kind
 of search. --[[Joey]]
 
+## xapian
+
+I've investigated xapian briefly. I think a custom xapian indexer and use
+of omega for cgi searches could work well for ikiwiki. --[[Joey]]
+
+### indexer
+
+A custom indexer is needed because omindex isn't good enough for ikiwiki's
+needs for incremental rendering. (And because, since ikiwiki has page info
+in memory, it's silly to write it to disk and have omindex read it back.)
+
+The indexer would run as an ikiwiki hook. It needs to be passed the page
+name, and the content. Which hook to use is an open question.
+Possibilities:
+
+* `filter` - Since this runs before preprocess, only the actual text
+  written on the page would be indexed. Not text generated by directives,
+  pulled in by inlining, etc. There's something to be said for that. And
+  something to be said against it. It would also get markdown formatted
+  content, mostly, though it would still need to strip html.
+* `sanitize` - Would get the htmlized content, so would need to strip html.
+  Preprocessor directive output would be indexed.
+* `format` - Would get the entire html page, including the page template.
+  Probably not a good choice as indexing the same template for each page
+  is unnecessary.
+
+Currently, a filter hook seems the best option.
+
+The hook would remove any html from the content, and index it.
+It would need to add the same document data that omindex would, as well as
+adding the same special terms (see
+http://xapian.org/docs/omega/overview.html "Boolean terms").
+
+(Note that the U term is a bit tricky because I'll have to replicate
+omindex's hash_string() to hash terms > 240 chars.)
+
+The indexer (and deleter) will need a way to figure out the ids in xapian
+of the documents to delete. One way is storing the id of each page in the
+ikiwiki index.
+
+The other way would be adding a special term to the xapian db that can be
+used with replace_document_by_term/delete_document_by_term. omindex uses
+U as a term, and I guess I could just use that, and then map page
+names to urls when deleting a page ... only real problem being the
+hashing; a collision would be bad.
+
+At the moment, storing xapian ids in the ikiwiki index file seems like the
+best approach.
+
+The hook should try to avoid re-indexing pages that have not changed since
+they were last indexed. One problem is that, if a page with an inline is
+built, every inlined item will get each hook run. And so a naive hook would
+index each of those items, even though none of them have necessarily
+changed. Date stamps are one possibility. Another would be to have the hook
+skip indexing when `%preprocessing` is set (IkiWiki.pm would need to expose
+that variable).
+
+#### cgi
+
+The cgi hook would exec omega to handle the searching, much as is done
+with estseek in the current search plugin.
+
+It would first set `OMEGA_CONFIG_FILE=.ikiwiki/omega.conf`; that omega.conf
+would set `database_dir=.ikiwiki/xapian` and probably also set a custom
+`template_dir`, which would have modified templates branded for ikiwiki. So
+the actual xapian db would be in `.ikiwiki/xapian/default/`.
+
+## lucene
+
 >> I've done a bit of prototyping on this. The current hip search library is [Lucene](http://lucene.apache.org/java/docs/). There's a Perl port called [Plucene](http://search.cpan.org/~tmtm/Plucene-1.25/). Given that it's already packaged, as `libplucene-perl`, I assumed it would be a good starting point. I've written a **very rough** patch against `IkiWiki/Plugin/search.pm` to handle the indexing side (there's no facility to view the results yet, although I have a command-line interface working). That's below, and should apply to SVN trunk.
 
 >> Of course, there are problems. ;-)
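
A note on the indexer design in the patch above: the two decisions it settles on (keep each page's xapian document id alongside the page in the ikiwiki index, and use date stamps to skip pages that have not changed) can be illustrated with a small sketch. This is not ikiwiki or Xapian code. `FakeXapianDb`, `Indexer`, `strip_html` and `page_state` are hypothetical names, the database is an in-memory stand-in, and Python is used only to keep the example self-contained; the real hook would be written in Perl against the Xapian bindings.

```python
import re
from dataclasses import dataclass, field

# Crude tag stripper for illustration; a real hook would use an HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(content):
    """Replace anything that looks like an HTML tag with a space."""
    return TAG_RE.sub(" ", content)


@dataclass
class FakeXapianDb:
    """In-memory stand-in for a writable search database (assumption)."""
    docs: dict = field(default_factory=dict)
    next_id: int = 1

    def add_document(self, text):
        docid = self.next_id
        self.next_id += 1
        self.docs[docid] = text
        return docid

    def replace_document(self, docid, text):
        self.docs[docid] = text

    def delete_document(self, docid):
        del self.docs[docid]


class Indexer:
    def __init__(self, db):
        self.db = db
        # page name -> (docid, mtime); stands in for the ikiwiki index file.
        self.page_state = {}

    def index_page(self, page, content, mtime):
        """Index a page unless its date stamp shows it is unchanged.

        Returns True if the database was touched, False if skipped.
        """
        state = self.page_state.get(page)
        if state is not None and state[1] >= mtime:
            return False  # already indexed at this (or a newer) mtime
        text = strip_html(content)
        if state is None:
            docid = self.db.add_document(text)
        else:
            docid = state[0]
            self.db.replace_document(docid, text)
        self.page_state[page] = (docid, mtime)
        return True

    def delete_page(self, page):
        """Remove a page's document using the stored id, if there is one."""
        state = self.page_state.pop(page, None)
        if state is not None:
            self.db.delete_document(state[0])
```

With this shape, a page that gets the hook run repeatedly because it is inlined elsewhere is indexed once and skipped afterwards, and deletion needs no term lookup or hashing because the id comes straight from the stored state.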