Merger

Overview

The module implements five merging algorithms: Heap, MergeOpt, ScanCount, MergeSkip and DivideSkip. Heap and MergeOpt were proposed in [1]; the other three are proposed in our work [2].

Usage

An example of how to use the module is available in source:codebase/appstring/trunk/listmerger/unittest.cc.

Interface

The main header file of the module is source:codebase/appstring/trunk/listmerger/listmerger.h.

The main API of the mergers is:


  // Used for the Heap, MergeOpt, MergeSkip and DivideSkip mergers
  Merger() {}

  // Only used for the ScanCount merger
  Merger(unsigned maxObjectId)
  {
    maxObjectID = maxObjectId;
  }

  // The lists are assumed to be sorted in ascending order
  virtual void merge(const vector<Array<unsigned>*> &arrays,
                     const unsigned threshold, // threshold of count
                     vector<unsigned> &results) = 0;
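
As an illustration of what a call to merge is expected to produce, the following is a minimal sketch of a heap-based merge over sorted lists. It is not the module's Heap class: it assumes that the result should contain every ID whose number of occurrences across the lists is at least the threshold, it uses std::vector in place of the module's Array type, and the name heapMergeSketch is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper, not part of listmerger.h.
// Returns, in ascending order, every ID whose total number of occurrences
// across the ascending-sorted input lists is at least `threshold`.
std::vector<unsigned> heapMergeSketch(
    const std::vector<std::vector<unsigned>> &lists, unsigned threshold) {
  // Each heap entry is (current ID, index of the list it came from).
  using Entry = std::pair<unsigned, std::size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  std::vector<std::size_t> next(lists.size(), 0);  // read cursor per list

  // Seed the min-heap with the first element of every non-empty list.
  for (std::size_t i = 0; i < lists.size(); ++i) {
    if (!lists[i].empty()) {
      heap.push({lists[i][0], i});
      next[i] = 1;
    }
  }

  std::vector<unsigned> results;
  while (!heap.empty()) {
    const unsigned id = heap.top().first;
    unsigned count = 0;
    // Pop every entry carrying this ID and advance the list it came from.
    while (!heap.empty() && heap.top().first == id) {
      const std::size_t i = heap.top().second;
      heap.pop();
      ++count;
      if (next[i] < lists[i].size()) {
        heap.push({lists[i][next[i]], i});
        ++next[i];
      }
    }
    if (count >= threshold) results.push_back(id);
  }
  return results;
}
```

For the lists {1,3,5,7}, {3,5,9} and {2,3,7} with a threshold of 2, this sketch reports the IDs 3, 5 and 7.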

Example code for calling the various mergers:



  //Merger *mergeLists = new Heap();
  //Merger *mergeLists = new MergeOpt();
  //set the max record ID to the maximal unsigned integer
  //Merger *mergeLists = new ScanCount(~0);
  //Merger *mergeLists = new MergeSkip();  
  Merger *mergeLists = new DivideSkip();  

  mergeLists->merge(lists, threshold, result);
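
The ScanCount constructor is the only one that takes a maximum object ID. A minimal sketch of the counting idea, under the assumption that all IDs lie in [0, maxObjectId], shows why that bound is needed: one counter is kept per possible ID. The function scanCountSketch and its use of std::vector are hypothetical and do not reproduce the module's ScanCount class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the library's ScanCount class.
// Assumes threshold >= 1 and that every ID is at most maxObjectId.
std::vector<unsigned> scanCountSketch(
    const std::vector<std::vector<unsigned>> &lists,
    unsigned threshold, unsigned maxObjectId) {
  assert(threshold >= 1);
  // One counter per possible ID; this is why the maximum ID must be known.
  std::vector<unsigned> counts(static_cast<std::size_t>(maxObjectId) + 1, 0);
  std::vector<unsigned> results;

  for (const std::vector<unsigned> &list : lists) {
    for (unsigned id : list) {
      assert(id <= maxObjectId);
      // Report an ID exactly once, at the moment it reaches the threshold.
      if (++counts[id] == threshold) results.push_back(id);
    }
  }

  // IDs were collected in the order they reached the threshold; sort them.
  std::sort(results.begin(), results.end());
  return results;
}
```

Because this sketch allocates maxObjectId + 1 counters, it should be given the largest ID actually present in the lists rather than the maximal unsigned integer.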

Performance

The experiments use an edit distance threshold of 2 and 3-grams on three data sets: DBLP, IMDB and Web Corpus. The following table shows the average running time (ms) over 100 calls.

Algorithm    DBLP (ms)   IMDB (ms)   Web Corpus (ms)
Heap         114.53      115.32      95.49
MergeOpt     13.32       28.83       58.83
ScanCount    30.01       26.40       77.44
MergeSkip    9.22        10.89       45.16
DivideSkip   1.34        4.20        10.98

DivideSkip achieves the best performance on all three data sets.

More performance results can be found in [2].
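
The source does not describe how the timings were taken. As a rough sketch of one way to obtain a per-call average over 100 calls, the helper below assumes wall-clock measurement with std::chrono::steady_clock. The name averageMillis is hypothetical and not part of the module.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `call` `repetitions` times and returns the mean
// wall-clock time per call in milliseconds (0.0 if repetitions is 0).
double averageMillis(const std::function<void()> &call, unsigned repetitions) {
  using Clock = std::chrono::steady_clock;
  const Clock::time_point start = Clock::now();
  for (unsigned i = 0; i < repetitions; ++i) {
    call();
  }
  const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
  return repetitions == 0 ? 0.0 : elapsed.count() / repetitions;
}
```

A merger call can be wrapped in a lambda, for example averageMillis([&]() { result.clear(); mergeLists->merge(lists, threshold, result); }, 100), where result is cleared before each call on the assumption that merge appends to it.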


[1] Sunita Sarawagi, Alok Kirpal: Efficient Set Joins on Similarity Predicates. SIGMOD 2004.

[2] Chen Li, Jiaheng Lu, Yiming Lu: Using Filtering and Merging Algorithms on Approximate String Searches. Submitted for publication.