The DokuWiki indexer is being rewritten to be more flexible, and to support searching in metadata.
A copy of Pro Git by Scott Chacon has been formatted for testing. This book is under the Creative Commons By-NC-SA-3.0 license.
The current (2010-11-07 “Anteater”) fulltext indexer uses this schema.
CREATE TABLE word (
    wordid INTEGER PRIMARY KEY,
    word TEXT
);
CREATE TABLE page (
    pageid INTEGER PRIMARY KEY,
    pagename TEXT
);
CREATE TABLE wordindex (
    wordid INTEGER KEY,
    pageid INTEGER,
    COUNT INTEGER
);
CREATE TABLE pageword (
    pageid INTEGER KEY,
    wordid INTEGER
);
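To make the roles of the four tables concrete, here is a minimal in-memory sketch of the same structure. It is not DokuWiki code: the class `ToyFulltextIndex`, its methods, and the naive tokenizer are hypothetical and exist only for illustration. The point is that `wordindex` answers "which pages contain this word", while `pageword` is the reverse mapping that lets a page's old words be dropped when the page is re-indexed.

```php
<?php
// Hypothetical in-memory model of the four tables above; not DokuWiki code.
class ToyFulltextIndex {
    private $word = [];      // word text => wordid            (table: word)
    private $page = [];      // pagename  => pageid            (table: page)
    private $wordindex = []; // wordid    => [pageid => count] (table: wordindex)
    private $pageword = [];  // pageid    => [wordid, ...]     (table: pageword)

    // Return the id for $name, assigning the next free id if it is new.
    private function idFor(array &$table, $name) {
        if (!isset($table[$name])) {
            $table[$name] = count($table) + 1;
        }
        return $table[$name];
    }

    // Index (or re-index) a page; the new text replaces the old words.
    public function addPage($pagename, $text) {
        $pid = $this->idFor($this->page, $pagename);
        // Use the reverse index to remove the page's previous entries.
        foreach ($this->pageword[$pid] ?? [] as $wid) {
            unset($this->wordindex[$wid][$pid]);
        }
        $this->pageword[$pid] = [];
        // Naive tokenizer (an assumption): lowercase, split on non-alphanumerics.
        $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_count_values($tokens) as $token => $count) {
            $wid = $this->idFor($this->word, $token);
            $this->wordindex[$wid][$pid] = $count;
            $this->pageword[$pid][] = $wid;
        }
    }

    // Return pagename => occurrence count for one word.
    public function lookup($token) {
        $token = strtolower($token);
        if (!isset($this->word[$token])) {
            return [];
        }
        $wid = $this->word[$token];
        $names = array_flip($this->page);
        $result = [];
        foreach ($this->wordindex[$wid] ?? [] as $pid => $count) {
            $result[$names[$pid]] = $count;
        }
        return $result;
    }
}
```

Re-adding a page with new text first walks `pageword` to clear that page's stale `wordindex` entries, which is why the schema keeps both directions.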
In order to search metadata as fulltext, these schemas will be needed.
CREATE TABLE meta (
    metaid INTEGER PRIMARY KEY,
    keyname TEXT
);
CREATE TABLE metaindex (
    wordid INTEGER KEY,
    pageid INTEGER,
    metaid INTEGER
);
CREATE TABLE pagemeta (
    pageid INTEGER KEY,
    metaid INTEGER,
    wordid INTEGER
);
CREATE TABLE metaword (
    metaid INTEGER KEY,
    pageid INTEGER,
    wordid INTEGER
);
Metadata that is not suitable for fulltext indexing will need these tables:
CREATE TABLE keyword (
    valueid INTEGER PRIMARY KEY,
    metaname STRING KEY,
    VALUE TEXT
);
CREATE TABLE keyindex (
    metaname STRING KEY,
    valueid INTEGER KEY,
    pageid INTEGER
);
CREATE TABLE pagekey (
    pageid INTEGER PRIMARY KEY,
    metaname STRING KEY,
    valueid INTEGER
);
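The key-value tables can be pictured with a small in-memory sketch. Everything here is hypothetical (the class `ToyKeyIndex`, its method names, and the use of page names in place of numeric page ids); it only illustrates how a value table, a value-to-pages mapping, and a page-to-values mapping work together for metadata that is matched as whole values rather than tokenized.

```php
<?php
// Hypothetical sketch of a key-value metadata index; names are illustrative only.
class ToyKeyIndex {
    private $values = [];   // metaname => [value => valueid]            (keyword)
    private $keyindex = []; // metaname => [valueid => [page => true]]   (keyindex)
    private $pagekey = [];  // page     => [metaname => [valueid, ...]]  (pagekey)

    // Replace all values stored for $metaname on $page.
    public function set($page, $metaname, array $vals) {
        // Remove the page from the entries of its previous values.
        foreach ($this->pagekey[$page][$metaname] ?? [] as $vid) {
            unset($this->keyindex[$metaname][$vid][$page]);
        }
        $this->pagekey[$page][$metaname] = [];
        foreach (array_unique($vals) as $val) {
            if (!isset($this->values[$metaname][$val])) {
                // Assign the next free value id within this metaname.
                $this->values[$metaname][$val] = count($this->values[$metaname] ?? []) + 1;
            }
            $vid = $this->values[$metaname][$val];
            $this->keyindex[$metaname][$vid][$page] = true;
            $this->pagekey[$page][$metaname][] = $vid;
        }
    }

    // Return the pages whose $metaname contains exactly $val.
    public function pagesWith($metaname, $val) {
        $vid = $this->values[$metaname][$val] ?? null;
        if ($vid === null) {
            return [];
        }
        return array_keys($this->keyindex[$metaname][$vid] ?? []);
    }
}
```

Unlike the fulltext index, values are stored whole, so a lookup for `wiki:syntax` matches only that exact value and not pages that merely mention the word "syntax".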
In the file-based implementation, each key-value index consists of three files, with the name of the file serving as the metaname column.
The index for title metadata is stored in a single table with the pages as keys. Pages can only have one title, and most titles will be unique within a wiki.
Not all metadata keys collected by DokuWiki require indexing. How to index keys can be determined by one of these alternatives:
A configuration file in DOKU_CONF defines the set of keys. Plugins provide their own file that extends the set. Pros: little runtime overhead, and the configuration doesn't need to change very often. Cons: every plugin needs to be checked.
This implementation uses an event hook.
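As a rough illustration of the hook approach, the sketch below lets handlers extend a default set of keys at runtime. It is a self-contained stand-in and does not use DokuWiki's event system: the class `ToyKeyRegistry`, its methods, and the example key `subject` are all hypothetical.

```php
<?php
// Hypothetical stand-in for an event hook; it does not use DokuWiki's event system.
class ToyKeyRegistry {
    private $handlers = [];

    // A plugin would register a callback that receives the key list by reference.
    public function registerHook(callable $handler) {
        $this->handlers[] = $handler;
    }

    // Start from the core defaults and let every handler modify the list.
    public function collectKeys(array $defaults) {
        $keys = $defaults;
        foreach ($this->handlers as $handler) {
            $handler($keys);
        }
        // Drop duplicates in case two handlers add the same key.
        return array_values(array_unique($keys));
    }
}

$registry = new ToyKeyRegistry();
// Example handler: a plugin asks for its own metadata key to be indexed.
$registry->registerHook(function (array &$keys) {
    $keys[] = 'subject';
});
$keysToIndex = $registry->collectKeys(['title']);
```

Compared with a configuration file, this moves the cost to runtime, since every handler runs whenever the key set is collected, but plugins do not need to ship or maintain a separate file.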
The public interface of the indexer class will be
class Doku_Indexer {
    /**
     * Adds the contents of a page to the fulltext index
     * The added text replaces previous words for the same page. An empty value erases the page.
     */
    function addPageWords($page, $text);

    /**
     * Adds the contents of metadata to the fulltext index
     * $key can be an array to add more than one.
     */
    function addMetaKeys($page, $key, $value=null);

    /**
     * Remove a page from the index
     * Erases entries in all known indexes.
     */
    function deletePage($page);

    /**
     * Split the text into words for fulltext search
     * TODO: also need &$stopwords ?
     */
    function tokenizer($text, $wc=false);

    /**
     * Find pages in the fulltext index containing the words,
     * optionally using metadata keys, or both keys and text
     */
    function lookup($tokens);
    function lookupKey($key, &$value, $func=null);

    /**
     * Return a list of all pages
     */
    function getPages($key=null);

    /**
     * Return a list of words sorted by frequency
     * There is no maximum if $max<$min.
     * Will use a key-value index if one exists, otherwise uses the fulltext index.
     */
    function histogram($min=1, $max=0, $minlen=3, $key=null);
}
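The parameters of histogram() are only briefly documented, so the following sketch shows one possible reading. It assumes that $min and $max bound the word frequency and that $minlen is the minimum word length; the helper `toy_histogram` and its input format (word => count) are hypothetical and not part of the proposed class.

```php
<?php
// Hypothetical helper illustrating one reading of histogram()'s parameters.
// $counts maps word => number of occurrences across the index.
function toy_histogram(array $counts, $min = 1, $max = 0, $minlen = 3) {
    $result = [];
    foreach ($counts as $word => $count) {
        if (strlen((string) $word) < $minlen) {
            continue; // word is too short
        }
        if ($count < $min) {
            continue; // word is too rare
        }
        if ($max >= $min && $count > $max) {
            continue; // upper bound applies only when $max >= $min
        }
        $result[$word] = $count;
    }
    arsort($result); // most frequent words first
    return $result;
}

$counts = ['git' => 5, 'is' => 9, 'branch' => 3, 'commit' => 1];
$frequent = toy_histogram($counts, 2);    // no upper bound, because $max < $min
$bounded  = toy_histogram($counts, 2, 4); // frequencies between 2 and 4 only
```

With the default $max=0 the upper bound is disabled for any $min of 1 or more, which matches the "no maximum if $max<$min" note in the interface comment.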
Global functions are
function idx_get_version();                   /* A string that describes the indexer implementation */
function idx_get_indexer();                   /* Creates and returns a Doku_Indexer */
function idx_addPage($page, $verbose=false);  /* Adds a wiki page and its metadata to the index */
function idx_lookup($words);                  /* Find tokens in the fulltext index (TODO: and metadata?) */
function idx_tokenizer($string, $wc=false);   /* split a string into tokens */
Aside from idx_get_indexer, these functions already exist and will be preserved for compatibility.
The fulltext search interface will also need to be converted to a class.