The module implements five merging algorithms: Heap, MergeOpt, ScanCount, MergeSkip, and DivideSkip. Heap and MergeOpt were proposed in [1]; the other three were proposed in our work [2].
An example of how to use the module is available in source:codebase/appstring/trunk/listmerger/unittest.cc.
The main header file of the module is source:codebase/appstring/trunk/listmerger/listmerger.h.
The main API of the mergers is:
// used for the Heap, MergeOpt, MergeSkip and DivideSkip mergers
Merger() {}

// only used for the ScanCount merger
Merger(unsigned maxObjectId) { maxObjectID = maxObjectId; }

// the lists are assumed to be sorted in ascending order
virtual void merge(const vector<Array<unsigned>*> &arrays,
                   const unsigned threshold,  // threshold of count
                   vector<unsigned> &results) = 0;
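To make the contract of merge() concrete, here is a minimal sketch of a counting merger in the spirit of ScanCount. It is not the module's implementation: the function name scanCountSketch is hypothetical, std::vector stands in for the module's Array<unsigned>, and the result is returned rather than written to an output parameter. It assumes every ID is at most maxObjectId.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for merge(): returns, in ascending order, every ID
// that occurs at least `threshold` times across all lists.
std::vector<unsigned> scanCountSketch(
    const std::vector<std::vector<unsigned>> &lists,
    unsigned threshold,
    unsigned maxObjectId) {
  assert(threshold >= 1);
  // One counter per possible ID; this is why a maximum ID is needed up front.
  std::vector<unsigned> counts(static_cast<std::size_t>(maxObjectId) + 1, 0);
  for (const auto &list : lists) {
    for (unsigned id : list) {
      assert(id <= maxObjectId);
      ++counts[id];
    }
  }
  // Scan the counters once and keep the IDs that reach the threshold.
  std::vector<unsigned> results;
  for (std::size_t id = 0; id < counts.size(); ++id) {
    if (counts[id] >= threshold) results.push_back(static_cast<unsigned>(id));
  }
  return results;
}
```

Because the sketch allocates one counter per possible ID, it is only practical for small ID ranges; it does not rely on the lists being sorted.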
Example code for calling the various mergers:
//Merger *mergeLists = new Heap();
//Merger *mergeLists = new MergeOpt();
// set the max record ID to the maximal unsigned integer
//Merger *mergeLists = new ScanCount(~0);
//Merger *mergeLists = new MergeSkip();
Merger *mergeLists = new DivideSkip();
mergeLists->merge(lists, threshold, result);
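All of the mergers compute the same result and differ only in how they traverse the lists. As one illustration, the following is a generic heap-based merge over sorted lists. It is a sketch only: the name heapMergeSketch is hypothetical, std::vector replaces Array<unsigned>, and it is not claimed to match the module's Heap class in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical heap-based merge: each list must be sorted in ascending order.
// Returns every ID that appears at least `threshold` times across the lists.
std::vector<unsigned> heapMergeSketch(
    const std::vector<std::vector<unsigned>> &lists, unsigned threshold) {
  assert(threshold >= 1);
  using Entry = std::pair<unsigned, std::size_t>;  // (current ID, list index)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  std::vector<std::size_t> pos(lists.size(), 0);  // cursor into each list

  // Seed the min-heap with the first element of every non-empty list.
  for (std::size_t i = 0; i < lists.size(); ++i) {
    if (!lists[i].empty()) heap.emplace(lists[i][0], i);
  }

  std::vector<unsigned> results;
  while (!heap.empty()) {
    const unsigned id = heap.top().first;
    unsigned count = 0;
    // Pop every frontier entry equal to the smallest ID and advance its list.
    while (!heap.empty() && heap.top().first == id) {
      const std::size_t i = heap.top().second;
      heap.pop();
      ++count;
      if (++pos[i] < lists[i].size()) heap.emplace(lists[i][pos[i]], i);
    }
    if (count >= threshold) results.push_back(id);
  }
  return results;
}
```

Unlike a counting approach, this variant needs no maximum ID, but it pays a heap operation for every element of every list.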
The experiments use an edit distance threshold of 2, 3-grams, and three data sets: DBLP, IMDB, and Web Corpus. The following table shows the average running time (ms) over 100 calls.
Algorithm | DBLP (ms) | IMDB (ms) | Web Corpus (ms) |
Heap | 114.53 | 115.32 | 95.49 |
MergeOpt | 13.32 | 28.83 | 58.83 |
ScanCount | 30.01 | 26.40 | 77.44 |
MergeSkip | 9.22 | 10.89 | 45.16 |
DivideSkip | 1.34 | 4.20 | 10.98 |
DivideSkip achieves the best performance on all three data sets.
More performance results can be found in [2].
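The count threshold passed to merge() is typically derived from the edit distance threshold and the gram length. The sketch below assumes the standard count-filtering bound, which this page does not itself state: a string of length len has len - q + 1 unpadded q-grams, and one edit operation can destroy at most q of them. The function names qgrams and countThreshold are hypothetical, and the module may use padded grams, which would change the formula.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits s into its overlapping q-grams, without padding.
std::vector<std::string> qgrams(const std::string &s, std::size_t q) {
  assert(q >= 1);
  std::vector<std::string> grams;
  for (std::size_t i = 0; i + q <= s.size(); ++i) {
    grams.push_back(s.substr(i, q));
  }
  return grams;
}

// Lower bound on the number of q-grams of a query of length `len` that must
// also occur in any string within edit distance k of it.
// Returns 0 when the bound is vacuous (it would be zero or negative).
unsigned countThreshold(std::size_t len, std::size_t q, std::size_t k) {
  assert(q >= 1);
  const long long t = static_cast<long long>(len) - static_cast<long long>(q) +
                      1 - static_cast<long long>(k * q);
  return t > 0 ? static_cast<unsigned>(t) : 0;
}
```

With an edit distance threshold of 2 and 3-grams, a query of length 12 yields a count threshold of 12 - 3 + 1 - 2 * 3 = 4. A query of length 6 yields a vacuous bound, so count filtering alone cannot prune any candidates for it.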
[1] Sunita Sarawagi, Alok Kirpal: Efficient set joins on similarity predicates. SIGMOD 2004.
[2] Chen Li, Jiaheng Lu, Yiming Lu: Using Filtering and Merging Algorithms on Approximate String Searches. Submitted for publication.