AppString > AppStringDoc
The module contains an implementation of the technique presented in [1].
For compiling instructions, please see CompileDoc.
The module uses C++ STL TR1 library provided by GNU GCC.
An example of how to use the module is available in src/topk/example.cc.
The module is divided in three components:
The Top-k Index (Topk::Index) class is defined in src/topk/topkindex.h. Its main methods are:
Index();
Index(const std::string &filename);
template<class InputIterator>
void build(InputIterator begin, InputIterator end, const GramGen &gramGen);
void load(const std::string &filename);
void save(const std::string &filename)
const;
The Top-k Query Index (Topk::IndexQuery) class contains a part of the index that is relevant for a particular query. This class is used by the search algorithms. It is defined in src/topk/topkindex.h and its main method is:
IndexQuery(const Index &idx, const Query &query);
The Top-k Search Algorithms are defined in multiple files. All the algorithms define the getTopk method. Its prototype is:
template<
class RandomAccessIterator1,
class RandomAccessIterator2,
class OutputIterator>
void getTopk(
const RandomAccessIterator1 data,
const RandomAccessIterator2 weights,
const Index &idx,
const Query &que,
IndexQuery &idxQue,
OutputIterator topk);
The most popular algorithm is the heap-based algorithm. It is defined in the src/topk/topkheap.h.
The main idea is that a Topk::Index object can be created to hold the index. The index can be build by specifying an iterator over a sequence of strings and a gram generator. Additionally, the index can be saved to disk and then loaded from disk. When a query comes, a Topk::IndexQuery object is created using the Topk::Index instance and the query. Finally, the search algorithm of choice can applied.
[1] Rares Vernica, Chen Li: Efficient top-k algorithms for fuzzy search in string collections. KEYS 2009: 9-14. (Workshop on Keyword Search on Structured Data, collocated with SIGMOD 2009)