AppString > AppStringDoc
The module contains an implementation of the technique presented in [1]. The technique was invented in the Data Debugger Project at Microsoft, Research.
For compiling instructions, please see CompileDoc.
The module uses C++ STL TR1 library provided by GNU GCC and Boost 1.34.1 library.
For installing Boost library on a Debian (including Ubuntu) machine, you need to run:
$ sudo apt-get install libboost-dev
An example of how to use the module is available in src/partenum/example.cc.
The main class of the module is ParEnum which is declared in src/partenum/partenum.h.
The main methods of PartEnum are:
PartEnum(const vector<string> &data, unsigned q, unsigned editdist, unsigned n1, unsigned n2); PartEnum(const vector<string> &data, const string &filename); void build(); void saveIndex(const string &filename) const; void search(const string &query, vector<unsigned> &results); void search(const string &query, const unsigned editdist, vector<unsigned> &results);
The main idea is that the user can create a PartEnum object by specifying a vector of strings (dataset) and a few extra parameters (see [1] for details) or load an existing object from a file. If the object was not loaded, then it needs to be built. Next, the user has the option of saving the object to a file. In order to search approximately in the dataset for a given string, the user calls the function search.
Pentium D 3.4GHz Dual Core, 2GB memory, Linux (Ubuntu), g++. A data set of 54,000 person names.
Technique | Dataset Size | Ed Threshold | Q | Time (ms) | Index size (MB) | Comments |
Scan | 54k | 1 | - | 11.86 | 1.3 | |
Scan | 54k | 2 | - | 21.30 | 1.6 | |
Scan | 54k | 3 | - | 35.49 | 4.2 | |
- | ||||||
PartEnum | 54k | 1 | 2 | 1.21 | 57.3 | n1=2,n2=8 |
PartEnum | 54k | 2 | 2 | 12.04 | 60.2 | n1=3,n2=8 |
PartEnum | 54k | 3 | 1 | 35.24 | 34.8 | n1=2,n2=7 |
[1] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929