The FLAMINGO Project on Data Cleaning
Department of Computer Science, UC Irvine
Objective
The Flamingo Project focuses on data cleaning, i.e., how to
deal with errors and inconsistencies in information systems. As an
example, in many applications such as data integration, commercial
organizations need to collect data from various sources to conduct
analysis and make decisions. Often, the data from these different
sources can have inconsistencies. For instance, we use first name,
last name, SSN, and birthday to identify a person. However, the same
name, e.g., "Schwarzenegger", may be misspelled as "Swarzzengaer" or
in other forms. Such errors make it more challenging to link records
from different places and to answer queries approximately. We are
developing algorithms to make query answering and information retrieval
efficient in the presence of such inconsistencies and errors.
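To illustrate the kind of approximate matching described above, the
following Python sketch computes the edit distance between two strings,
i.e., the minimum number of single-character insertions, deletions, and
substitutions needed to turn one into the other. It is only an
illustration: edit distance is one common similarity measure, the
function name edit_distance is hypothetical, and the code is not taken
from the Flamingo release.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (illustrative helper)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca from a
                cur[j - 1] + 1,            # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

Calling edit_distance("Schwarzenegger", "Swarzzengaer") returns a small
positive number, so a query that tolerates that many edits can still
link the two spellings to the same person.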
With the NSF
award IIS-0844574,
we plan to study the following problems. Supporting fuzzy queries is
becoming increasingly important in applications that need to deal
with a variety of data inconsistencies in structures, representations,
or semantics. Many existing algorithms require an offline analysis of
data sets to construct an efficient index structure to support online
query processing. Fuzzy join queries over data sets are even more
time-consuming because of their computational complexity. The PI is studying
three research problems: (1) constructing high-quality inverted lists
for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of
large data sets using Hadoop; and (3) using the developed techniques
to improve data quality of large collections of documents.
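As a rough single-machine illustration of the offline index / online
query split mentioned above, the Python sketch below builds inverted
lists of 2-grams and uses a count filter to select candidates before
verifying them with edit distance. It makes several assumptions: it does
not use Hadoop, all function names (grams, build_index, fuzzy_search,
edit_distance) are hypothetical and not the Flamingo API, and the
padding characters "#" and "$" are assumed not to occur in the data.

```python
from collections import Counter, defaultdict


def grams(s, q=2):
    """Bag of q-grams of s, padded so that s yields len(s) + q - 1 grams."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def build_index(strings, q=2):
    """Offline step: inverted lists mapping gram -> [(string id, count)]."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g, cnt in grams(s, q).items():
            index[g].append((sid, cnt))
    return index


def edit_distance(a, b):
    """Levenshtein distance, used to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_search(query, strings, index, k, q=2):
    """Online step: strings within edit distance k of query, sorted."""
    # Count filter: one edit destroys at most q grams, so any answer
    # shares at least this many grams (as a bag) with the query.
    threshold = len(query) + q - 1 - k * q
    if threshold <= 0:
        # The filter cannot prune anything; fall back to a full scan.
        candidates = range(len(strings))
    else:
        shared = Counter()
        for g, cnt in grams(query, q).items():
            for sid, scnt in index.get(g, ()):
                shared[sid] += min(cnt, scnt)
        candidates = [sid for sid, c in shared.items() if c >= threshold]
    # Verification removes the false positives that passed the filter.
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)
```

For example, with names = ["schwarzenegger", "swarzzengaer",
"schwarzeneger", "smith", "schwarz"] and index = build_index(names),
the call fuzzy_search("schwarzenegger", names, index, 1) returns
["schwarzeneger", "schwarzenegger"]. The filter is kept deliberately
simple; the quality of the inverted lists and the choice of grams are
exactly the kinds of issues the research problems above address.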
With the NSF
award 1030002,
we will study how to support powerful keyword search with efficient
indexing structures and algorithms in a cloud-computing
infrastructure. A main application is supporting
family reunification in disasters
such as the Haiti Earthquake. Check our portals for
the Haiti Earthquake
and the Chile Earthquake. The
main challenge is how to use limited programming primitives in the
cloud to implement index structures and search algorithms.
News
Fuzzy Keyword Search on Spatial Data
We present a solution to
support Fuzzy
Keyword Search on Spatial Data.
Releases
- Latest
- 4.1 (February 22nd, 2012)
- 4.0 (October 23rd, 2010)
- 3.0 (March 29th, 2010)
- 2.0.1 (November 7th, 2008)
- 2.0 (October 14th, 2008)
- 1.0 (April 17th, 2007)
- Toolkit (October 14th, 2008), UDF
functions for MySQL
People
Alumni and Visitors
- Guoliang Li, spring of 2008, visitor from Tsinghua University, China.
- Jiaheng Lu, postdoc, 2006-2008. Now a faculty member at Renmin University, China.
- Yiming Lu, graduated from UC Irvine in 2008
- Bin Wang and Xiaochun Yang, summers of 2006, 2007, and 2008, visitors from
Northeastern University, China
- Liang Jin, graduated from UC Irvine in 2005
Publications
- Answering Approximate String Queries on Large Data Sets Using External Memory
Alexander Behm, Chen Li, Michael J. Carey.
ICDE 2011 (accepted for publication).
- Supporting Location-Based Approximate-Keyword Queries
Sattam Alsubaiee, Alexander Behm, Chen Li.
PDF
PPTX
Source Code
ACM SIGSPATIAL GIS 2010.
- Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li.
PDF Full Version
Source Code
SIGMOD 2010.
- Fuzzy Keyword Search on Spatial Data (Demo)
Sattam Alsubaiee and Chen Li
PDF Demo
DASFAA 2010.
- Efficient top-k algorithms for fuzzy search in string
collections.
Rares Vernica, Chen Li.
PDF
PDF slides
Source Code
KEYS 2009: 9-14. (Workshop on Keyword Search on Structured
Data, collocated with SIGMOD 2009)
- Efficient Interactive Fuzzy Keyword Search
Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng
PDF
PPTX
ConferenceLink
WWW 2009.
- Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu
PDF
Full Version
PPTX
Source Code
ICDE 2009.
- Efficient Approximate Search on String Collections (Tutorial)
Marios Hadjieleftheriou, Chen Li
PPT Part1,
PPT Part2
ICDE 2009.
- Cost-Based Variable-Length-Gram Selection for String Collections to
Support Approximate Queries Efficiently
PDF
PPT
Xiaochun Yang, Bin Wang, Chen Li.
SIGMOD 2008.
- Efficient Merging and Filtering Algorithms for Approximate String Searches
PDF
PPT
Source Code
Chen Li, Jiaheng Lu, and Yiming Lu.
ICDE 2008.
- SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases
Source Code
Liang Jin, Chen Li, and Rares Vernica.
VLDB Journal 2007. It's an extended version of the SEPIA paper in VLDB05.
- VGRAM: Improving Performance of Approximate Queries on String
Collections Using Variable-Length Grams.
PDF
PPT
Chen Li, Bin Wang, and Xiaochun Yang.
VLDB 2007, Vienna, Austria
- Relaxing Join and Selection Queries.
PDF
PPT
Source Code
Nick Koudas, Chen Li, Anthony Tung, and Rares Vernica.
VLDB 2006, Seoul, Korea.
- Selectivity Estimation for Fuzzy String Predicates in Large
Data Sets.
PDF
PPT
Source Code
Liang Jin and Chen Li.
VLDB 2005, Trondheim, Norway.
- Indexing Mixed Types for Approximate Retrieval.
PDF
PPT
Source Code
Liang Jin, Nick Koudas, Chen Li, Anthony K.H. Tung.
VLDB 2005, Trondheim, Norway.
- NNH: Improving Performance of Nearest-Neighbor Searches Using
Histograms.
PDF
Full Version
PPT
Liang Jin, Nick Koudas, Chen Li.
EDBT 2004, Heraklion - Crete, Greece.
- Efficient Record Linkage in Large Data Sets.
PDF,
PPT
Source Code
Liang Jin, Chen Li, and Sharad Mehrotra.
8th International Conference on Database Systems for Advanced
Applications (DASFAA) 2003, Kyoto, Japan.
Received the 10-Year Best Paper Award at DASFAA 2013.
- Supporting Efficient Record Linkage for Large Data Sets Using
Mapping Techniques
Chen Li, Liang Jin, and Sharad Mehrotra
World Wide Web Journal, Volume 9, Number 4, pages 557-584, December 2006.
This journal article is an extended version of the DASFAA03 paper.
Acknowledgements: This release is partially
supported by the
NSF CAREER
Award
No. IIS-0238586,
the NSF
award No. IIS-0742960,
the NSF
award IIS-0844574,
the NSF
award 1030002,
the NSF-funded RESCUE project,
the NIH grant 1R21LM010143-01A1,
a Google Research Award, a gift fund from Microsoft,
a research grant from Amazon.com to allow us to use their MapReduce cluster, and a
fund from CalIt2.
Many thanks to Minh Doan and Kensuke Ohta for their valuable testing
and feedback on the code and documentation.
For any questions regarding this project, please
send an email to flamingo AT ics.uci.edu