Back to Index


AppString > AppStringDoc

PartEnum

Overview

The module contains an implementation of the technique presented in [1]. The technique was invented in the Data Debugger Project at Microsoft, Research.

Usage

For compiling instructions, please see CompileDoc.

The module uses C++ STL TR1 library provided by GNU GCC and Boost 1.34.1 library.

For installing Boost library on a Debian (including Ubuntu) machine, you need to run:

$ sudo apt-get install libboost-dev

An example of how to use the module is available in src/partenum/example.cc.

Interface

The main class of the module is ParEnum which is declared in src/partenum/partenum.h.

The main methods of PartEnum are:

  PartEnum(const vector<string> &data, 
           unsigned q, 
           unsigned editdist, 
           unsigned n1, 
           unsigned n2);

  PartEnum(const vector<string> &data, 
           const string &filename);

  void build();
  void saveIndex(const string &filename) const;  

  void search(const string &query, vector<unsigned> &results);
  void search(const string &query, const unsigned editdist,
              vector<unsigned> &results);

The main idea is that the user can create a PartEnum object by specifying a vector of strings (dataset) and a few extra parameters (see [1] for details) or load an existing object from a file. If the object was not loaded, then it needs to be built. Next, the user has the option of saving the object to a file. In order to search approximately in the dataset for a given string, the user calls the function search.

Performance

Pentium D 3.4GHz Dual Core, 2GB memory, Linux (Ubuntu), g++. A data set of 54,000 person names.

Technique Dataset Size Ed Threshold Q Time (ms) Index size (MB) Comments
Scan 54k 1 - 11.86 1.3
Scan 54k 2 - 21.30 1.6
Scan 54k 3 - 35.49 4.2
-
PartEnum 54k 1 2 1.21 57.3 n1=2,n2=8
PartEnum 54k 2 2 12.04 60.2 n1=3,n2=8
PartEnum 54k 3 1 35.24 34.8 n1=2,n2=7

Contributors


[1] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik: Efficient Exact Set-Similarity Joins. VLDB 2006: 918-929