Filtertree

Introduction

This module supports efficient approximate string search on a collection of strings. An approximate query asks for all strings in the collection that are "similar" to the query string for a given similarity function and similarity threshold.

Approximate String Search

Many applications need to answer approximate string queries. The following are a few examples:

The "Did you mean" feature by many search engines partially relies on finding words similar to a given keyword.
Spellchecking: suggest good words for a possibly mistyped word.
Record linkage: identify records that could represent the same real-world entity.

Overview

Our implementation stores everything in main memory. That includes:
- The string collection
- The filtertree
- The inverted lists

We support the following similarity functions / distance measures:
- Levenshtein Distance (aka the Edit Distance) http://en.wikipedia.org/wiki/Levenshtein_distance
- Jaccard Similarity http://en.wikipedia.org/wiki/Jaccard_Similarity_Coefficient
- Cosine Similarity http://en.wikipedia.org/wiki/Cosine_similarity
- Dice Similarity http://en.wikipedia.org/wiki/Dice%27s_coefficient

We support the following single-signature filters:
- Length Filter
- Checksum Filter

The index structure (filtertree + inverted lists) can be saved/loaded to/from a file.

Answering Approximate String Queries

To answer queries efficiently this module uses an inverted-list index on the q-grams of the strings in the collection. That is, each string is decomposed into substrings (grams) of size q using a sliding window, and then for each gram we build a list of string ids containing that gram (the inverted list of that gram). The process of answering queries is based on the observation that if two strings are similar, then they must share a certain number of common grams (depending on the similarity function and similarity threshold). False-positives must be removed in a post-processing step, i.e. the true similarities are computed.

In addition to the above we can further increase the performance of queries by using filters. A single-signature filter partitions the string collection into disjoint subsets based on some criteria. For answering a query we only need to consider some of the subsets. For example, if we were looking for all strings in the collection within an edit-distance of 1 to the string "abcde", then we know that any answer string must have a length in [4,6]. So, if we partition the string collection using the length of the strings we can avoid processing irrelevant string ids during query answering. The checksum filter is very similar to the length filter. We partition the data strings based on their checksums. For query answering we can determine a range of checksums that answers must lie in.

The Filtertree Structure

The filtertree structure facilitates the use of multiple single-signature filters. Each level in the tree partitions the string collection based on one filter. Each leaf node contains an inverted-list structure on the subset of strings belonging to that leaf. For answering a query we traverse the tree to identify leaf nodes that could contain answers to the query, and process them independently. The following is an example of a filtertree with a fanout of 3 and both the length and checksum filter applied:

FilterTree Structure with two filters

High-Level Overview of Important Components

Includes components from core and from filtertree

Code overview of filtertree

StatsGen Output

The StatsGenerator allows collecting of performance data on the approximate string search library. For example, different filters, merging algorithms, datasets, query workloads can be tested. A good start is perftest.cc included in the filtertree folder. The performance numbers are written to an output file (e.g. perftest.cc writes to "perftest_search_stats.txt"). The StatsGenerator is intended for advanced users who are familiar with the algorithmic details of approximate string search. The numbers generated depict different steps in the process of query answering and will only be understood by people familiar with the subject. For getting an idea of the query performance using certain parameters it is sufficient to focus on field8 which measures the average query performance of the given workload.

The output generated is semicolon-separated and has the following format:

field1: size of dataset
field2: filtertree fanout
field3: avg time for calculating merging-threshold (ms)
field4: avg time for pre-processing, traversing the filtertree, etc. (ms)
field5: avg time for merging (ms)
field6: avg time for post-processing (ms)
field7: avg time for scans (ms)
field8: avg total time (ms)
field9: avg number of candidate strings to post-process
field10: total number of scans
field11: index build time (ms)

Contributors

Alexander Behm (design, main author)
Chen Li (design, project leader)
Rares Vernica (design)
Shengyue Ji (design, implementation)
Yiming Lu (design, implementation)