AppString > AppStringDoc
The module contains an implementation of the technique presented in [1].
For compiling instructions, please see CompileDoc.
The module uses C++ STL TR1 library provided by GNU GCC.
An example of how to use the module is available in src/topk/example.cc.
The module is divided in three components:
The Top-k Index (Topk::Index) class is defined in src/topk/topkindex.h. Its main methods are:
Index(); Index(const std::string &filename); template<class InputIterator> void build(InputIterator begin, InputIterator end, const GramGen &gramGen); void load(const std::string &filename); void save(const std::string &filename) const;
The Top-k Query Index (Topk::IndexQuery) class contains a part of the index that is relevant for a particular query. This class is used by the search algorithms. It is defined in src/topk/topkindex.h and its main method is:
IndexQuery(const Index &idx, const Query &query);
The Top-k Search Algorithms are defined in multiple files. All the algorithms define the getTopk method. Its prototype is:
template< class RandomAccessIterator1, class RandomAccessIterator2, class OutputIterator> void getTopk( const RandomAccessIterator1 data, const RandomAccessIterator2 weights, const Index &idx, const Query &que, IndexQuery &idxQue, OutputIterator topk);
The most popular algorithm is the heap-based algorithm. It is defined in the src/topk/topkheap.h.
The main idea is that a Topk::Index object can be created to hold the index. The index can be build by specifying an iterator over a sequence of strings and a gram generator. Additionally, the index can be saved to disk and then loaded from disk. When a query comes, a Topk::IndexQuery object is created using the Topk::Index instance and the query. Finally, the search algorithm of choice can applied.
[1] Rares Vernica, Chen Li: Efficient top-k algorithms for fuzzy search in string collections. KEYS 2009: 9-14. (Workshop on Keyword Search on Structured Data, collocated with SIGMOD 2009)