Filtertree

Introduction

This module supports efficient approximate string search on a collection of strings. An approximate query asks for all strings in the collection that are "similar" to the query string for a given similarity function and similarity threshold.

Approximate String Search

Many applications need to answer approximate string queries. The following are a few examples:

Spellchecking: suggest good words for a possibly mistyped word.
Record linkage: identify records that could represent the same real-world entity.
The "Did you mean" feature by many search engines partially relies on finding words similar to a given keyword.

Overview

Indexing Options

We provide three different flavors of indexing+searching, each of which provide specific optimizations for improving query performance:

All structures are in main memory, including:
- String collection
- Filtertree
- Inverted index

All structures are in main memory. We use lossy compression on the inverted index to save space.
- String collection in memory
- Filtertree in memory
- Inverted index in memory (compressed)

The data and inverted index is on disk.
- String collection on disk
- Filtertree in memory
- Inverted index on disk

Other Functionality

We support the following similarity functions / distance measures:

Levenshtein Distance (aka the Edit Distance) http://en.wikipedia.org/wiki/Levenshtein_distance
Jaccard Similarity http://en.wikipedia.org/wiki/Jaccard_Similarity_Coefficient
Cosine Similarity http://en.wikipedia.org/wiki/Cosine_similarity
Dice Similarity http://en.wikipedia.org/wiki/Dice%27s_coefficient

We support the following single-signature filters (e.g. for partitioning the string collection):

Length Filter
Charsum Filter

The index structure (filtertree + inverted index) can be saved/loaded to/from a file.

Answering Approximate String Queries

To answer queries efficiently this module uses an inverted-list index on the q-grams of the strings in the collection. That is, each string is decomposed into substrings (grams) of size q using a sliding window, and then for each gram we build a list of string ids containing that gram (the inverted list of that gram). The process of answering queries is based on the observation that if two strings are similar, then they must share a certain number of common grams (depending on the similarity function and similarity threshold). False-positives must be removed in a post-processing step, i.e. the true similarities are computed.

In addition to the above we can further increase the performance of queries by using filters. A single-signature filter partitions the string collection into disjoint subsets based on some criteria. For answering a query we only need to consider some of the subsets. For example, if we were looking for all strings in the collection within an edit-distance of 1 to the string "abcde", then we know that any answer string must have a length in [4,6]. So, if we partition the string collection using the length of the strings we can avoid processing irrelevant string ids during query answering. The charsum filter is similar to the length filter. We partition the data strings based on their charsums (sum of characters in the string). For query answering we can determine a range of charsums that answers must lie in.

The Filtertree Structure

The filtertree structure facilitates the use of filters. Each level in the tree partitions the string collection based on one filter. Each leaf node contains a gram inverted-index on the subset of strings belonging to that leaf. For answering a query we traverse the tree to identify leaf nodes that "survive" the filtering criteria, and probe the inverted indexes attached to those leaves to get a list of candidate answers. The following is an example of a filtertree with a fanout of 3 and both the length and charsum filter applied:

Filtertree structure with two partitioning filters

Recommendations

In most cases, using exactly one partitioning filter yields the best performance. An intuitive (but simplified) explanation is as follows:

For the in-memory indexes, there is trade-off between processing several inverted indexes with short lists or processing fewer inverted indexes with longer lists
For the disk-based indexes the layout of the inverted lists in the file can be optimized if exactly one partitioning filter is used

High-Level Overview of Important Components

The purpose of this overview is to give the reader a feeling for the code design
Many details, such as methods, method parameters, and attributes have been left out for simplicity
Blue components are meant for in-memory indexes
Yellow components are meant for disk-based indexes
Red components are meant for in-memory indexes compressed with CombineLists
Green components are meant for in-memory indexes compressed with DiscardLists

General Components

Code overview of general components

Indexers, Searchers, and ListMergers

Code overview of indexers, searchers and mergers

StatsGen Output

The StatsGenerator allows collecting of performance data on the approximate string search library. For example, different filters, merging algorithms, datasets, query workloads can be tested. A good start is perftest.cc included in the filtertree folder. The performance numbers are written to an output file (e.g. perftest.cc writes to "perftest_search_stats.txt"). The StatsGenerator is intended for advanced users who are familiar with the algorithmic details of approximate string search. The numbers generated depict different steps in the process of query answering and will only be understood by people familiar with the subject. For getting an idea of the query performance using certain parameters it is sufficient to focus on the field "total time" which measures the average query performance of the given workload.

Contributors

Alexander Behm (design, main author)
Chen Li (design, project leader)
Rares Vernica (design)
Shengyue Ji (design, implementation)
Yiming Lu (design, implementation)