Indexer Improvement

The DokuWiki indexer is being rewritten to be more flexible and to support searching in metadata.

Source Repositories

Michael Hamann’s branch
Tom Harris’s branch

Testing

A copy of Pro Git by Scott Chacon has been formatted for testing. The book is licensed under Creative Commons BY-NC-SA 3.0.

Planning

The current fulltext indexer (release 2010-11-07 “Anteater”) uses this schema:

CREATE TABLE word ( wordid INTEGER PRIMARY KEY, word TEXT );
CREATE TABLE page ( pageid INTEGER PRIMARY KEY, pagename TEXT );
CREATE TABLE wordindex ( wordid INTEGER KEY, pageid INTEGER, count INTEGER );
CREATE TABLE pageword ( pageid INTEGER KEY, wordid INTEGER);
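To make the roles of the four tables concrete, the sketch below mirrors them with plain PHP arrays. It is only an illustration: the function names (sketch_new_index, sketch_get_id, sketch_add_page, sketch_find) and the naive tokenizer are assumptions made for this example, not the indexer's storage format. It assumes that pageword serves as a reverse index, so the old wordindex rows of a page can be found and dropped when the page is indexed again.

```php
<?php
// Illustrative only: in-memory stand-ins for the four tables above.
function sketch_new_index(): array {
    return [
        'word'      => [],  // wordid => word
        'page'      => [],  // pageid => pagename
        'wordindex' => [],  // wordid => [pageid => count]
        'pageword'  => [],  // pageid => [wordid, ...]
    ];
}

// Return the id of $value in $table, appending it if it is not there yet.
function sketch_get_id(array &$table, string $value): int {
    $id = array_search($value, $table, true);
    if ($id === false) {
        $table[] = $value;
        $id = count($table) - 1;
    }
    return $id;
}

// Replace the indexed words of one page.
function sketch_add_page(array &$idx, string $pagename, string $text): void {
    $pid = sketch_get_id($idx['page'], $pagename);
    // The reverse index tells us which wordindex rows to drop first.
    foreach ($idx['pageword'][$pid] ?? [] as $wid) {
        unset($idx['wordindex'][$wid][$pid]);
    }
    $idx['pageword'][$pid] = [];
    // Naive tokenizer (assumption): lowercase ASCII letters only.
    preg_match_all('/[a-z]+/', strtolower($text), $m);
    foreach (array_count_values($m[0]) as $w => $n) {
        $wid = sketch_get_id($idx['word'], (string)$w);
        $idx['wordindex'][$wid][$pid] = $n;
        $idx['pageword'][$pid][] = $wid;
    }
}

// Look up one word and return pagename => count.
function sketch_find(array $idx, string $w): array {
    $wid = array_search(strtolower($w), $idx['word'], true);
    if ($wid === false) {
        return [];
    }
    $result = [];
    foreach ($idx['wordindex'][$wid] ?? [] as $pid => $n) {
        $result[$idx['page'][$pid]] = $n;
    }
    return $result;
}

$idx = sketch_new_index();
sketch_add_page($idx, 'start', 'Wiki pages link to wiki pages');
sketch_add_page($idx, 'syntax', 'Wiki syntax');
print_r(sketch_find($idx, 'wiki'));
```

Indexing the page 'start' a second time with different text removes its earlier counts before the new ones are written.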

To search metadata as fulltext, these additional tables will be needed:

CREATE TABLE meta ( metaid INTEGER PRIMARY KEY, keyname TEXT );
CREATE TABLE metaindex ( wordid INTEGER KEY, pageid INTEGER, metaid INTEGER );
CREATE TABLE pagemeta ( pageid INTEGER KEY, metaid INTEGER, wordid INTEGER );
CREATE TABLE metaword ( metaid INTEGER KEY, pageid INTEGER, wordid INTEGER);

Metadata that is not suitable for fulltext indexing will need these tables:

CREATE TABLE keyword ( valueid INTEGER PRIMARY KEY, metaname STRING KEY, value TEXT );
CREATE TABLE keyindex ( metaname STRING KEY, valueid INTEGER KEY, pageid INTEGER );
CREATE TABLE pagekey ( pageid INTEGER PRIMARY KEY, metaname STRING KEY, valueid INTEGER );

In the file-based implementation, each key-value index consists of three files, and the name of the files takes the place of the metaname column.
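As a rough sketch of how three files per key could work, the example below stores one metadata key in three files inside a temporary directory. The file suffixes, the tab-separated line formats, and all function names (sketch_kv_paths, sketch_read_lines, sketch_kv_add, sketch_kv_lookup) are assumptions made for illustration; the actual on-disk layout is not specified here.

```php
<?php
// Illustrative only: file names and line formats are assumptions.
function sketch_kv_paths(string $dir, string $metaname): array {
    return [
        'keyword'  => "$dir/{$metaname}_keyword.idx",  // line N holds value N
        'keyindex' => "$dir/{$metaname}_keyindex.idx", // line N lists the pages having value N
        'pagekey'  => "$dir/{$metaname}_pagekey.idx",  // one "page<TAB>valueid" entry per line
    ];
}

function sketch_read_lines(string $file): array {
    return is_file($file) ? file($file, FILE_IGNORE_NEW_LINES) : [];
}

// Record that $page has $value for the metadata key $metaname.
function sketch_kv_add(string $dir, string $metaname, string $page, string $value): void {
    $paths  = sketch_kv_paths($dir, $metaname);
    $values = sketch_read_lines($paths['keyword']);
    $index  = sketch_read_lines($paths['keyindex']);

    $vid = array_search($value, $values, true);
    if ($vid === false) {
        $values[] = $value;
        $vid = count($values) - 1;
    }
    $line  = $index[$vid] ?? '';
    $pages = $line === '' ? [] : explode("\t", $line);
    if (!in_array($page, $pages, true)) {
        $pages[] = $page;
        file_put_contents($paths['pagekey'], "$page\t$vid\n", FILE_APPEND);
    }
    $index[$vid] = implode("\t", $pages);

    file_put_contents($paths['keyword'], implode("\n", $values) . "\n");
    file_put_contents($paths['keyindex'], implode("\n", $index) . "\n");
}

// Return the pages whose metadata key $metaname has exactly $value.
function sketch_kv_lookup(string $dir, string $metaname, string $value): array {
    $paths = sketch_kv_paths($dir, $metaname);
    $vid = array_search($value, sketch_read_lines($paths['keyword']), true);
    if ($vid === false) {
        return [];
    }
    $line = sketch_read_lines($paths['keyindex'])[$vid] ?? '';
    return $line === '' ? [] : explode("\t", $line);
}

$dir = sys_get_temp_dir() . '/idx_sketch_' . uniqid();
mkdir($dir);
sketch_kv_add($dir, 'subject', 'wiki:start', 'indexer');
sketch_kv_add($dir, 'subject', 'wiki:syntax', 'indexer');
sketch_kv_add($dir, 'subject', 'wiki:syntax', 'markup');
print_r(sketch_kv_lookup($dir, 'subject', 'indexer'));
```

Because the key name is part of the file name, no metaname column has to be stored inside the files themselves.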

The index for title metadata is stored in a single table with the pages as keys, because a page can have only one title and most titles will be unique within a wiki.

Configuration

Not all metadata keys collected by DokuWiki require indexing. Which keys are indexed, and how, can be determined in one of two ways:

An event hook
DokuWiki defines a set of keys for indexing. Plugins register an event handler to modify the set. Pros: flexible, and hooks are well established. Cons: hooks consume memory even when they are not needed, and choosing the set of keys doesn’t need complicated logic.
A configuration file
A file in DOKU_CONF defines the set of keys. Plugins provide their own file that extends the set. Pros: little runtime overhead, and the configuration doesn’t need to change very often. Cons: every plugin needs to be checked for such a file.

This implementation uses an event hook.
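The hook-based approach can be sketched as follows. The example is self-contained and does not use DokuWiki's own event system: the SketchHooks class, the event name SKETCH_INDEXER_KEYS, and the key names are all hypothetical stand-ins chosen for illustration.

```php
<?php
// Illustrative only: a tiny stand-in for an event system, not DokuWiki's.
final class SketchHooks {
    private array $handlers = [];

    public function register(string $event, callable $handler): void {
        $this->handlers[$event][] = $handler;
    }

    // Pass $data by reference to every handler registered for $event.
    public function trigger(string $event, array &$data): void {
        foreach ($this->handlers[$event] ?? [] as $handler) {
            $handler($data);
        }
    }
}

$hooks = new SketchHooks();

// A plugin adds its own key and removes one it does not want indexed.
$hooks->register('SKETCH_INDEXER_KEYS', function (array &$keys): void {
    $keys[] = 'subject';
    $keys = array_values(array_diff($keys, ['description']));
});

// Core defines the default set (example values), then lets plugins modify it.
$keys = ['title', 'description'];
$hooks->trigger('SKETCH_INDEXER_KEYS', $keys);
print_r($keys);
```

The handler receives the set by reference, so a plugin can both add and remove keys without the core knowing which plugins are installed.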

API

The public interface of the indexer class will be:

class Doku_Indexer {
 
    /**
     * Adds the contents of a page to the fulltext index
     * The added text replaces previous words for the same page. An empty value erases the page.
     */
    function addPageWords($page, $text);
    /**
     * Adds the contents of metadata to the fulltext index
     * $key can be an array to add more than one.
     */
    function addMetaKeys($page, $key, $value=null);
    /**
     * Remove a page from the index
     * Erases entries in all known indexes.
     */
    function deletePage($page);
    /**
     * Split the text into words for fulltext search
     * TODO: also need &$stopwords ?
     */
    function tokenizer($text, $wc=false);
    /**
     * Find pages in the fulltext index containing the words,
     * optionally using metadata keys, or both keys and text
     */
    function lookup($tokens);
    function lookupKey($key, &$value, $func=null);
    /**
     * Return a list of all pages
     */
    function getPages($key=null);
    /**
     * Return a list of words sorted by frequency
     * There is no maximum if $max < $min.
     * Will use a key-value index if one exists, otherwise uses the fulltext index.
     */
    function histogram($min=1, $max=0, $minlen=3, $key=null);
 
}
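The filtering rules described for histogram can be illustrated with a small stand-alone function. This is a sketch under assumptions: the name sketch_histogram, the input format (word => total count), the returned format, and treating both bounds as inclusive are choices made for this example and are not confirmed by the interface above.

```php
<?php
// Illustrative only: $counts maps word => total occurrences in the wiki.
function sketch_histogram(array $counts, int $min = 1, int $max = 0, int $minlen = 3): array {
    $result = [];
    foreach ($counts as $word => $n) {
        if (strlen((string)$word) < $minlen) {
            continue; // word is too short
        }
        if ($n < $min) {
            continue; // word is too rare
        }
        // The upper bound only applies when $max >= $min.
        if ($max >= $min && $n > $max) {
            continue; // word is too frequent
        }
        $result[$word] = $n;
    }
    arsort($result); // most frequent first, keys preserved
    return $result;
}

$counts = ['wiki' => 12, 'to' => 40, 'indexer' => 3, 'syntax' => 7];
print_r(sketch_histogram($counts));        // defaults: no maximum, words of 3+ characters
print_r(sketch_histogram($counts, 5, 10)); // counts from 5 to 10 only
```

With the default arguments $max (0) is smaller than $min (1), so no upper limit is applied.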

The global functions are:

function idx_get_version(); /* A string that describes the indexer implementation */
function idx_get_indexer(); /* Creates and returns a Doku_Indexer */
function idx_addPage($page, $verbose=false); /* Adds a wiki page and its metadata to the index */
function idx_lookup($words); /* Find tokens in the fulltext index (TODO: and metadata?) */
function idx_tokenizer($string, $wc=false); /* Split a string into tokens */

Aside from idx_get_indexer, these functions already exist and will be preserved for compatibility.
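One way to keep the existing functions while moving the logic into a class is to turn them into thin wrappers. The sketch below assumes a single shared indexer instance per request and uses a stub class, Sketch_Indexer, in place of the real Doku_Indexer; the whitespace tokenizer is a placeholder, and the $wc argument is accepted but ignored.

```php
<?php
// Illustrative stub standing in for the real Doku_Indexer class.
class Sketch_Indexer {
    public function tokenizer($text, $wc = false) {
        // Placeholder: lowercase and split on whitespace; $wc is ignored here.
        return preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    }
}

// Assumption: the indexer is created once and then reused.
function idx_get_indexer() {
    static $indexer = null;
    if ($indexer === null) {
        $indexer = new Sketch_Indexer();
    }
    return $indexer;
}

// The legacy function keeps its signature and forwards to the class.
function idx_tokenizer($string, $wc = false) {
    return idx_get_indexer()->tokenizer($string, $wc);
}

print_r(idx_tokenizer('Hello  Wiki World'));
```

Existing callers of idx_tokenizer keep working unchanged, while new code can call the indexer object directly.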

The fulltext search interface will also need to be converted to a class.