AppString > AppStringDoc

Filtertree

Overview

Many applications need to answer approximate string queries efficiently. The following are a few examples:

Filtertree is an indexing tree structure to support approximate string search efficiently. In particular, it can be used to answer such queries: Given a collection of strings and a single string, how to find those strings in the collection that are "similar to" the given string? There are various string similarity functions. This structure mainly deals with edit distance. Its main idea is to use a tree structure, where each node or level * corresponds to a filter. In particular:

As a proof of concept, in the current implementation, we used three filters: length filter, position filter, count filter [1] [2] [3].

  1. Length filter: if two strings S1 and S2 are within edit distance K, their length difference should be at most K.
  2. Position filter: if two strings S1 and S2 are within edit distance K, we can ignore their q-grams that are more than k positions away.
  3. Count filter: if two string S1 and S2 are within edit distance K, they must share enough number common q-grams.

By using these filters, a filtertree could look like:

Filtertree

Notice that in general the tree structure can be more heterogeneous in terms of the depth of different paths and the types of filters in the nodes.

Files

Performance Results

Ed threshold Q Time (ms) Index size (MB)
1 1 1.30 1.45
2 1 4.05 1.45
3 1 14.10 1.45
1 2 0.30 2.16
2 2 0.80 2.16
3 2 5.05 2.16
1 3 0.25 4.36
2 3 1.08 4.36
3 3 13.30 4.36

[1] E. Sutinen and J. Tarhio. Filtration with q-Samples in Approximate String Matching. In CPM, 1996.
[2] E. Sutinen and J. Tarhio. On Using q-Grams Locations in Approximate String Matching. In ESA, pages 327–340, 1995.
[3] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Divesh Srivastava, Approximate String Joins in a Database (Almost) for Free. VLDB 2001, pages 491-500.