The FLAMINGO Project on Data Cleaning
Department of Computer Science, UC Irvine
Objective
The Flamingo Project focuses on data cleaning, i.e., how to
deal with errors and inconsistencies in information systems. In
applications such as data integration, for example, commercial
organizations need to collect data from various sources to conduct
analysis and make decisions. The data from these different
sources is often inconsistent. For instance, a person may be identified
by first name, last name, SSN, and birthday. However, the same
name, e.g., "Schwarzenegger", may be misspelled as "Swarzzengaer" or
in other forms. Such errors make it harder to link records from
different places and to answer queries approximately. We are developing
algorithms that make query answering and information retrieval
efficient in the presence of such inconsistencies and errors.
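One common way to quantify how far a misspelling such as "Swarzzengaer" is from "Schwarzenegger" is edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. This page does not state which similarity measure is meant here, so the Python sketch below is only an illustration under that assumption. The function name edit_distance is hypothetical and is not taken from the Flamingo releases.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (illustrative sketch).

    Uses the classic dynamic program, keeping only two rows so that
    memory stays proportional to len(b).
    """
    # prev[j] = distance between the empty prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # curr[0] = distance between a[:i] and the empty string
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # The misspelling example from the text above.
    print(edit_distance("Schwarzenegger", "Swarzzengaer"))
```

A record-linkage step could then treat two names as the same person when their distance falls below a chosen threshold; the threshold is application dependent.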
With the new NSF award IIS-0844574, we plan to study the following problems. Supporting fuzzy queries is becoming
increasingly important in applications that need to deal with a
variety of data inconsistencies in structures, representations, or
semantics. Many existing algorithms require an offline analysis of
data sets to construct an efficient index structure that supports online
query processing. Fuzzy join queries over data sets are even more time
consuming because of their computational complexity. The PI is studying
three research problems: (1) constructing high-quality inverted lists
for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of
large data sets using Hadoop; and (3) using the developed techniques
to improve the data quality of large collections of documents.
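To illustrate what an inverted list for fuzzy search can look like, here is a minimal single-machine sketch in Python. It is not the Hadoop-based construction planned above. It assumes fixed-length grams (q = 2), edit distance as the similarity measure, and the standard count filter: a string within edit distance k of the query must share at least (len(query) - q + 1) - k*q grams with it. All function names are hypothetical and are not the API of the Flamingo releases.

```python
from collections import defaultdict


def qgrams(s, q=2):
    """All overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted lists: gram -> set of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index


def edit_distance(a, b):
    """Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_search(query, strings, index, k, q=2):
    """Return, in sorted order, the strings within edit distance k of query."""
    grams = qgrams(query, q)
    # Count filter: one edit destroys at most q grams of the query.
    threshold = len(grams) - k * q
    if threshold <= 0:
        # The filter cannot prune anything; fall back to checking every string.
        candidates = range(len(strings))
    else:
        counts = defaultdict(int)
        for g in grams:
            for sid in index.get(g, ()):
                counts[sid] += 1
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    # Candidates may still be false positives, so verify each one.
    return sorted(s for s in (strings[sid] for sid in candidates)
                  if edit_distance(query, s) <= k)


if __name__ == "__main__":
    names = ["schwarzenegger", "schwarzeneger", "stallone", "swarzenegger"]
    idx = build_index(names)
    print(fuzzy_search("schwarzenegger", names, idx, k=1))
```

The index is built once offline, and each query then touches only the inverted lists of its own grams. The filter can admit false positives but never drops a true answer, which is why the final edit-distance check is kept.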
News
Releases
- Latest
- 2.0.1 (November 7th, 2008)
- 2.0 (October 14th, 2008)
- 1.0 (April 17th, 2007)
- Toolkit (October 14th, 2008): user-defined functions (UDFs) for MySQL
People
- Alexander Behm (Ph.D. Student)
- Shengyue Ji (Ph.D. Student)
- Chen Li (Faculty)
- Rares Vernica (Ph.D. Student)
Alumni and Visitors
- Guoliang Li, spring of 2008, visitor from Tsinghua University, China.
- Jiaheng Lu, postdoc, 2006-2008. Now a faculty member at Renmin University, China.
- Yiming Lu, graduated from UC Irvine in 2008.
- Bin Wang and Xiaochun Yang, summers of 2006, 2007, and 2008, visitors from
Northeastern University, China.
- Liang Jin, graduated from UC Irvine in 2005.
Publications
- Efficient Interactive Fuzzy Keyword Search
Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng
(PDF, PPTX, ConferenceLink)
WWW 2009.
- Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu
(PDF, PPTX)
ICDE 2009.
- Efficient Approximate Search on String Collections (Tutorial)
Marios Hadjieleftheriou, Chen Li
(PPT: Part1, Part2)
ICDE 2009.
- Cost-Based Variable-Length-Gram Selection for String Collections to
Support Approximate Queries Efficiently
Xiaochun Yang, Bin Wang, and Chen Li
(PDF, PPT)
SIGMOD 2008.
- Efficient Merging and Filtering Algorithms for Approximate String Searches
Chen Li, Jiaheng Lu, and Yiming Lu
(PDF, PPT, Source Code)
ICDE 2008.
- SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases
Liang Jin, Chen Li, and Rares Vernica
(Source Code)
VLDB Journal 2007. This is an extended version of the SEPIA paper in VLDB 2005.
- VGRAM: Improving Performance of Approximate Queries on String
Collections Using Variable-Length Grams
Chen Li, Bin Wang, and Xiaochun Yang
(PDF, PPT)
VLDB 2007, Vienna, Austria.
- Relaxing Join and Selection Queries
Nick Koudas, Chen Li, Anthony Tung, and Rares Vernica
(PDF, PPT, Source Code)
VLDB 2006, Seoul, Korea.
- Selectivity Estimation for Fuzzy String Predicates in Large
Data Sets
Liang Jin and Chen Li
(PDF, PPT, Source Code)
VLDB 2005, Trondheim, Norway.
- Indexing Mixed Types for Approximate Retrieval
Liang Jin, Nick Koudas, Chen Li, and Anthony K.H. Tung
(PDF, PPT, Source Code)
VLDB 2005, Trondheim, Norway.
- NNH: Improving Performance of Nearest-Neighbor Searches Using
Histograms
Liang Jin, Nick Koudas, and Chen Li
(PDF, Full Version, PPT)
EDBT 2004, Heraklion, Crete, Greece.
- Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, and Sharad Mehrotra
(PDF, PPT, Source Code)
8th International Conference on Database Systems for Advanced
Applications (DASFAA) 2003, Kyoto, Japan.
- Supporting Efficient Record Linkage for Large Data Sets Using
Mapping Techniques
Chen Li, Liang Jin, and Sharad Mehrotra
World Wide Web Journal, Volume 9, Number 4, pages 557-584, December 2006.
This journal article is an extended version of the DASFAA 2003 paper.
Acknowledgements: This release is partially supported by the NSF CAREER
Award No. IIS-0238586, the NSF award No. IIS-0742960, the NSF award
No. IIS-0844574, the NSF-funded RESCUE project, a Google Research Award,
a gift fund from Microsoft, and a fund from CalIt2.
Many thanks to Sattam Alsubaiee, Minh Doan, and Kensuke Ohta for their
valuable testing and feedback on the code and documentation.
For any questions regarding this project, please send an email to
flamingo AT ics.uci.edu