Back to Index
AppString > AppStringDoc
This module supports fuzzy string searching, i.e., finding strings similar to a given a string or a collection of strings, where similarity is defined using edit distance. It supports both join queries and string-string-search queries. The approach first maps strings into a high-dimensional Euclidean space. An R-tree is constructed from the mapped objects. The program will sample from the strings to get a new threshold to be used to do search in the Euclidean space. A similarity string search is converted to a spatial search in the new Euclidean space using R-trees. [1,2]
This package includes three data files of names:
We ran this package on a Linux platform with a 2.8GHZ Pentium 4 with 1GB RAM. The join on source1.txt and source2.txt (2000 x 2000 strings) took about 105 seconds (including the time to compute the recall, which was using nested loop and expensive), and with a recall of 98.9%. We ran 100 single-string-search queries on 100K name strings, and the average search time is 0.079 seconds, with the average recall 98.8%.
 Liang Jin, Chen Li, Sharad Mehrotra: Efficient Record
Linkage in Large Data Sets. DASFAA 2003: 137-
 Chen Li, Liang Jin, Sharad Mehrotra: Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques. World Wide Web 9(4): 557-584