The module contains an implementation of the technique presented in Arasu et al. [1]. The technique was invented in the Data Cleaning Project at Microsoft Research.
For compiling instructions, please see CompileDoc.
The module uses the C++ STL TR1 library provided by GNU GCC and the Boost 1.34.1 library.
On systems with the aptitude package manager (e.g. Ubuntu, Debian), you can install all required packages by running the following command, either as the root user or with sudo as shown:
$ sudo apt-get install libboost-dev
An example of how to use the module is available in src/partenum/example.cc.
The main class of the module is PartEnum, which is declared in src/partenum/partenum.h.
The main methods of PartEnum are:
PartEnum(const vector<string> &data, unsigned q, unsigned editdist, unsigned n1, unsigned n2);
PartEnum(const vector<string> &data, const string &filename);
void build();
void saveIndex(const string &filename) const;
void search(const string &query, vector<unsigned> &results);
void search(const string &query, const unsigned editdist, vector<unsigned> &results);
The main idea is that the user creates a PartEnum object either by specifying a vector of strings (the dataset) and a few extra parameters (see [1] for details), or by loading an existing object from a file. If the object was not loaded from a file, it must be built by calling build(). The user can then optionally save the object to a file with saveIndex(). To search the dataset approximately for a given string, the user calls search().
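The call order described above can be sketched as follows. This is only an illustration under stated assumptions, not the module's code: PartEnumStandIn is a hypothetical class that mirrors the constructor, build() and search() signatures listed above but scans the whole dataset with a plain edit-distance check instead of using the PartEnum index. It ignores q, n1 and n2, leaves out saveIndex() and the file-loading constructor, and assumes that results receives positions in the data vector. The parameter values and sample names are arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Plain dynamic-programming edit (Levenshtein) distance between a and b.
unsigned editDistance(const std::string &a, const std::string &b) {
  std::vector<unsigned> prev(b.size() + 1), cur(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<unsigned>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<unsigned>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      unsigned subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0u : 1u);
      unsigned indel = std::min(prev[j], cur[j - 1]) + 1u;
      cur[j] = std::min(subst, indel);
    }
    prev.swap(cur);
  }
  return prev[b.size()];
}

// HYPOTHETICAL stand-in that mirrors the PartEnum method names listed above.
// It performs a linear scan; the real class is declared in
// src/partenum/partenum.h and answers queries through its index.
class PartEnumStandIn {
public:
  PartEnumStandIn(const std::vector<std::string> &data, unsigned q,
                  unsigned editdist, unsigned n1, unsigned n2)
      : data_(data), editdist_(editdist), built_(false) {
    (void)q; (void)n1; (void)n2;  // unused by this simplified stand-in
  }

  // The real build() constructs the index; here it only records the call.
  void build() { built_ = true; }

  // Search with the edit distance threshold given at construction time.
  void search(const std::string &query, std::vector<unsigned> &results) {
    search(query, editdist_, results);
  }

  // Appends the position of every string within editdist of query
  // (assumption: the real class also reports positions in the data vector).
  void search(const std::string &query, const unsigned editdist,
              std::vector<unsigned> &results) {
    assert(built_ && "call build() before search()");
    for (std::size_t i = 0; i < data_.size(); ++i)
      if (editDistance(query, data_[i]) <= editdist)
        results.push_back(static_cast<unsigned>(i));
  }

private:
  std::vector<std::string> data_;
  unsigned editdist_;
  bool built_;
};

// Create the object, build it, then search, in the order described above.
std::vector<unsigned> demoSearch() {
  std::vector<std::string> data;
  data.push_back("john smith");
  data.push_back("jon smith");
  data.push_back("jane smyth");
  data.push_back("alice jones");

  PartEnumStandIn index(data, 2, 1, 2, 4);  // q=2, editdist=1, n1=2, n2=4
  index.build();

  std::vector<unsigned> results;
  index.search("john smith", results);  // matches positions 0 and 1
  return results;
}
```

With the real module, the same sequence would use the PartEnum class from src/partenum/partenum.h, and a call to saveIndex() after build() would let a later run load the object from the file instead of rebuilding it; src/partenum/example.cc remains the authoritative example.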
Test environment: Pentium D 3.4GHz dual core, 2GB memory, Linux (Ubuntu), g++. Data set: 54,000 person names.
|Technique||Dataset Size||Edit Distance Threshold||Q||Time (ms)||Index Size (MB)||Comments|
[1] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929