Back to Index


AppString > AppStringDoc

  1. Filtertree
    1. Introduction
    2. Approximate String Search
    3. Overview
      1. Indexing Options
      2. Other Functionality
    4. Answering Approximate String Queries
    5. The Filtertree Structure
      1. Recommendations
    6. High-Level Overview of Important Components
      1. General Components
      2. Indexers, Searchers, and ListMergers
    7. StatsGen Output
    8. Contributors

Filtertree

Introduction

This module supports efficient approximate string search on a collection of strings. An approximate query asks for all strings in the collection that are "similar" to the query string for a given similarity function and similarity threshold.

Approximate String Search

Many applications need to answer approximate string queries. The following are a few examples:

Overview

Indexing Options

We provide three different flavors of indexing+searching, each of which provide specific optimizations for improving query performance:

  1. All structures are in main memory, including:
    • String collection
    • Filtertree
    • Inverted index
  1. All structures are in main memory. We use lossy compression on the inverted index to save space.
    • String collection in memory
    • Filtertree in memory
    • Inverted index in memory (compressed)
  1. The data and inverted index is on disk.
    • String collection on disk
    • Filtertree in memory
    • Inverted index on disk

Other Functionality

We support the following similarity functions / distance measures:

We support the following single-signature filters (e.g. for partitioning the string collection):

The index structure (filtertree + inverted index) can be saved/loaded to/from a file.

Answering Approximate String Queries

To answer queries efficiently this module uses an inverted-list index on the q-grams of the strings in the collection. That is, each string is decomposed into substrings (grams) of size q using a sliding window, and then for each gram we build a list of string ids containing that gram (the inverted list of that gram). The process of answering queries is based on the observation that if two strings are similar, then they must share a certain number of common grams (depending on the similarity function and similarity threshold). False-positives must be removed in a post-processing step, i.e. the true similarities are computed.

In addition to the above we can further increase the performance of queries by using filters. A single-signature filter partitions the string collection into disjoint subsets based on some criteria. For answering a query we only need to consider some of the subsets. For example, if we were looking for all strings in the collection within an edit-distance of 1 to the string "abcde", then we know that any answer string must have a length in [4,6]. So, if we partition the string collection using the length of the strings we can avoid processing irrelevant string ids during query answering. The charsum filter is similar to the length filter. We partition the data strings based on their charsums (sum of characters in the string). For query answering we can determine a range of charsums that answers must lie in.

The Filtertree Structure

The filtertree structure facilitates the use of filters. Each level in the tree partitions the string collection based on one filter. Each leaf node contains a gram inverted-index on the subset of strings belonging to that leaf. For answering a query we traverse the tree to identify leaf nodes that "survive" the filtering criteria, and probe the inverted indexes attached to those leaves to get a list of candidate answers. The following is an example of a filtertree with a fanout of 3 and both the length and charsum filter applied:

Filtertree structure with two partitioning filters

Recommendations

In most cases, using exactly one partitioning filter yields the best performance. An intuitive (but simplified) explanation is as follows:

High-Level Overview of Important Components

General Components

Code overview of general components

Indexers, Searchers, and ListMergers

Code overview of indexers, searchers and mergers

StatsGen Output

The StatsGenerator allows collecting of performance data on the approximate string search library. For example, different filters, merging algorithms, datasets, query workloads can be tested. A good start is perftest.cc included in the filtertree folder. The performance numbers are written to an output file (e.g. perftest.cc writes to "perftest_search_stats.txt"). The StatsGenerator is intended for advanced users who are familiar with the algorithmic details of approximate string search. The numbers generated depict different steps in the process of query answering and will only be understood by people familiar with the subject. For getting an idea of the query performance using certain parameters it is sufficient to focus on the field "total time" which measures the average query performance of the given workload.

Contributors