|
FLAMINGO Package (Approximate String Matching)
Release 2.0 (October 14, 2008)
Department of Computer Science, UC Irvine
This version is outdated. Our most recent release is here.
|
Contributors
« Back to Flamingo Main Page
Getting Started
Please refer to the Flamingo Getting Started Guide.
Introduction
This release (in C++) includes the source code of several
algorithms for approximate string matching developed at UC Irvine. It
includes algorithms for approximate selection queries, selectivity
estimation for approximate selection queries, approximate queries on
mixed types, and others. Although an implementation for approximate
joins is included, the focus of this release is on approximate
selection queries.
Here is a brief explanation of the terms used above:
- Approximate String Search: Given a collection of strings and a single string,
how to find those strings in the collection that are "similar to" the
given string?
This functionality is implemented by the modules
Common,
FilterTree,
Listmerger,
StringMap,
and
PartEnum.
We recommend getting started with
the FilterTree module for this purpose.
- Selectivity Estimation for Approximate String Search: Given
a collection of strings and a single string, how can we estimate the
number of strings that are "similar to" the given string? This
functionality is implemented in the SEPIA module.
- Approximate String Join: Given two collections of strings (possibly the same
collection), how to find those pairs of strings that are "similar to"
each other?
There are various string similarity functions, such as
Levenshtein Distance (aka the Edit Distance),
Jaccard Similarity,
Cosine Similarity, and
Dice
Similarity. The following is a description of the modules
corresponding to the source directory structure:
- Common: This module contains classes for supporting
the following similarity functions / distance measures:
Levenshtein Distance (aka Edit Distance), Jaccard Similarity, Cosine Similarity, Dice Similarity. It also
provides functionality for decomposing strings into grams.
- FilterTree: This module provides functionality for approximate string search
using an inverted-list index. Furthermore,
query performance can be improved by adding filters, i.e. partitioning the string collection into
disjoint subsets according to some property (e.g. the length of the strings).
The use of filters is facilitated by a hierarchical structure (the FilterTree),
in which each level in the tree corresponds to one filter.
We have implemented the length and checksum filter.
- ListMerger: Answering approximate string queries based on an inverted-list index
requires finding elements that occur at least T times on the inverted lists belonging
to the grams in the query string (T depends on the similarity metric and the similarity threshold).
This problem is commonly referred to as the T-occurrence problem.
This module implements several algorithms for solving the T-occurrence problem
as described in "Efficient Merging and Filtering Algorithms for Approximate String Searches",
Chen Li, Jiaheng Lu and Yiming Lu, ICDE 2008.
- MAT-Tree: MAT-tree is an indexing structure to support
queries on data with an approximate string predicate and a numeric
predicate. A typical query is: "Find employee records whose name is
similar to Speilberg and whose age is close to 45." The indexing
structure is proposed in the following paper: "Indexing Mixed Types
for Approximate Retrieval," Liang Jin, Nick Koudas, Chen Li, Anthony
K.H. Tung, VLDB 2005, Trondheim, Norway.
- Sepia: This technique solves the problem of estimating the
selectivity of an approximate string predicate. It can answer
questions such as: "From a collection of strings, how many of them
have an edit distance within 3 to a given string?". Such information
can be used in optimizing queries of approximate string matching. The
technique was published in the paper: "Selectivity Estimation for Fuzzy
String Predicates in Large Data Sets," Liang Jin and Chen Li, VLDB
2005, Trondheim, Norway.
- StringMap: This algorithm maps strings from the
edit-distance metric space to a high-dimensional Euclidean space, and
uses a multi-dimensional indexing structure to answer approximate
queries. The algorithm is published in the paper: "Efficient Record
Linkage in Large Data Sets," by Liang Jin, Chen Li, and Sharad
Mehrotra, in 8th International Conference on Database Systems for
Advanced Applications (DASFAA) 2003, Kyoto, Japan.
- PartEnum: This algorithm is published in the paper:
"Efficient Exact Set-Similarity Joins," Arvind Arasu, Venkatesh Ganti,
Raghav Kaushik, VLDB 2006. We implemented the algorithm to
support approximate string matching queries, excluding approximate joins.
In addition, we have provided some commonly used functions in the
util directory.
Changes in Version 2.0
- The code design of filtertree has been completely re-worked.
- Added separate module listmerger, which implements various list-merging algorithms.
- Added support for Jaccard, Cosine and Dice similarity functions.
- Moved gram generators and similarity functions to "common" module.
- Due to the new design, we removed the positional filter but added the checksum filter.
- Several bugfixes in all modules.
Acknowledgements: This release is partially
supported by the
NSF CAREER
Award
No. IIS-0238586,
the NSF
award No. IIS-0742960,
the NSF-funded RESCUE
project, a Google Research Award, a gift fund from Microsoft and a
fund from CalIt2.
Many thanks to Sattam Alsubaiee, Minh Doan, and Kensuke Ohta for their
valuable testing and feedback on the code and documentation.
License Agreement: Permission to use, copy,
modify, and distribute the implementations of MAT-Tree, SEPIA,
StringMap, and FilterTree is permitted under the terms of the BSD
license. The implementation of the PartEnum algorithm invented by
Microsoft researchers is limited to non commercial use, which would be
covered under the royalty free covenant that Microsoft made public.
For any questions regarding this release, please send email to
flamingo AT ics.uci.edu