AppString > AppStringDoc
Filtertree
Overview
Many applications need to answer approximate string queries efficiently. The following are a few examples:
- The "Did you mean" feature by many search engines partially relies on finding words similar to a given keyword.
- Spellchecking: suggest good words for a possibly mistyped word.
- Record linkage: identify records that could represent the same real-world entity.
Filtertree is an indexing tree structure to support approximate string search efficiently. In particular, it can be used to answer such queries: Given a collection of strings and a single string, how to find those strings in the collection that are "similar to" the given string? There are various string similarity functions. This structure mainly deals with edit distance. Its main idea is to use a tree structure, where each node or level * corresponds to a filter. In particular:
- It can take various filters.
- The nodes at each level could use different filters, depending on the data distribution. Different paths on the tree could even use different sequences of filters.
- The tree might not be balanced. The length of each path depends on the data distribution.
As a proof of concept, in the current implementation, we used three filters: length filter, position filter, count filter [1] [2] [3].
- Length filter: if two strings S1 and S2 are within edit distance K, their length difference should be at most K.
- Position filter: if two strings S1 and S2 are within edit distance K, we can ignore their q-grams that are more than k positions away.
- Count filter: if two string S1 and S2 are within edit distance K, they must share enough number common q-grams.
By using these filters, a filtertree could look like:
Notice that in general the tree structure can be more heterogeneous in terms of the depth of different paths and the types of filters in the nodes.
Files
-
gram.h/gram.cc: They define the 3 types of nodes of Filtertree, including StringPosition? (position level), GroupStringPosition? (length level), and LengthBucket? (gram level).
-
buckethead.h/buckethead.cc: Class Buckets provide the hash function for quickly searching the grams in Filtertree.
-
filtertree.h/filtertree.cc: They provide the basic functions for Filtertree. There are two kinds of constructors: one requires the user to set the q value explicitly, and it will build the index; the other one doesn’t require the q value, and it will load the index from a file.
-
unittest.cc: a test example file that shows two usages, one is to build the index first, and then do the approximate search queries; the other one is to load the index from an external file, and then do approximate search queries.
-
perftest.cc: It uses 200 queries and a dataset with 54K strings to test the performance of Filtertree, including the running time and memory requirement.
- Platform: Pentium D 3.4GHz Dual Core, 2GB memory, Linux (Ubuntu), g++.
- Dataset: 54K person names.
- Queries: 200 queries, and we computed the average performance.
Ed threshold |
Q |
Time (ms) |
Index size (MB)
|
1 |
1 |
1.30 |
1.45
|
2 |
1 |
4.05 |
1.45
|
3 |
1 |
14.10 |
1.45
|
1 |
2 |
0.30 |
2.16
|
2 |
2 |
0.80 |
2.16
|
3 |
2 |
5.05 |
2.16
|
1 |
3 |
0.25 |
4.36
|
2 |
3 |
1.08 |
4.36
|
3 |
3 |
13.30 |
4.36
|
[1] E. Sutinen and J. Tarhio. Filtration with q-Samples in Approximate String Matching. In CPM, 1996.
[2] E. Sutinen and J. Tarhio. On Using q-Grams Locations in Approximate String Matching. In ESA, pages 327–340, 1995.
[3] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free. VLDB 2001, pages 491-500.