The FLAMINGO Project on Data Cleaning
Department of Computer Science, UC Irvine
Objective
The Flamingo Project focuses on data cleaning, i.e., how to
deal with errors and inconsistencies in information systems. In many
applications, such as data integration, commercial organizations need
to collect data from various sources to conduct analysis and make
decisions. Data from these different sources is often inconsistent.
For instance, we may use first name, last name, SSN, and birthday to
identify a person, but the same name, e.g., "Schwarzenegger", may be
misspelled as "Swarzzengaer" or in other forms. Such errors make it
more challenging to link records from different places and to answer
queries approximately. We are developing algorithms that make query
answering and information retrieval efficient in the presence of such
inconsistencies and errors.
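To make the misspelling problem concrete, here is a minimal sketch of approximate name matching based on edit distance (Levenshtein distance). It is only an illustration of the general idea, not the project's own algorithms; the function names `edit_distance` and `approximate_lookup` and the threshold parameter `max_distance` are assumptions introduced for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def approximate_lookup(query, names, max_distance):
    """Hypothetical helper: return every name within max_distance edits
    of the query, using a plain linear scan over the collection."""
    return [name for name in names if edit_distance(query, name) <= max_distance]


if __name__ == "__main__":
    names = ["Schwarzenegger", "Swarzzengaer", "Smith"]
    # The threshold of 6 is an arbitrary choice for this illustration.
    print(approximate_lookup("Schwarzenegger", names, max_distance=6))
```

The linear scan compares the query against every stored string, which becomes expensive on large collections. Avoiding that cost, for example with indexes built on grams, is the kind of efficiency question the publications listed below address.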
News
Releases
People
- Alexander Behm (Ph.D. Student)
- Shengyue Ji (Ph.D. Student)
- Chen Li (Faculty)
- Rares Vernica (Ph.D. Student)
Alumni and Visitors
- Guoliang Li, spring of 2008, visitor from Tsinghua University, China.
- Jiaheng Lu, postdoc, 2006-2008
- Yiming Lu, graduated from UC Irvine in 2008
- Bin Wang and Xiaochun Yang, summers of 2006, 2007, and 2008, visitors from
Northeastern University, China
- Liang Jin, graduated from UC Irvine in 2005
Publications
- Cost-Based Variable-Length-Gram Selection for String Collections to
Support Approximate Queries Efficiently
PDF
PPT
Xiaochun Yang, Bin Wang, Chen Li.
SIGMOD 2008.
- Efficient Merging and Filtering Algorithms for Approximate String Searches
PDF
PPT
Chen Li, Jiaheng Lu, and Yiming Lu.
ICDE 2008.
- SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases
Liang Jin, Chen Li, and Rares Vernica.
VLDB Journal 2007. This journal article is an extended version of the SEPIA paper in VLDB 2005.
- VGRAM: Improving Performance of Approximate Queries on String
Collections Using Variable-Length Grams.
PDF
PPT
Chen Li, Bin Wang, and Xiaochun Yang.
VLDB 2007, Vienna, Austria.
- Relaxing Join and Selection Queries.
PDF
PPT
Nick Koudas, Chen Li, Anthony Tung, and Rares Vernica.
VLDB 2006, Seoul, Korea.
- Selectivity Estimation for Fuzzy String Predicates in Large
Data Sets.
PDF
PPT
Liang Jin and Chen Li.
VLDB 2005, Trondheim, Norway.
- Indexing Mixed Types for Approximate Retrieval.
PDF
PPT
Liang Jin, Nick Koudas, Chen Li, Anthony K.H. Tung.
VLDB 2005, Trondheim, Norway.
- NNH: Improving Performance of Nearest-Neighbor Searches Using
Histograms.
PDF
Full Version
PPT
Liang Jin, Nick Koudas, Chen Li.
EDBT 2004, Heraklion - Crete, Greece.
- Efficient Record Linkage in Large Data Sets.
PDF
PPT
Liang Jin, Chen Li, and Sharad Mehrotra.
8th International Conference on Database Systems for Advanced
Applications (DASFAA) 2003, Kyoto, Japan.
- Supporting Efficient Record Linkage for Large Data Sets Using
Mapping Techniques
Chen Li, Liang Jin, and Sharad Mehrotra.
World Wide Web Journal, Volume 9, Number 4, pages 557-584, December 2006.
This journal article is an extended version of the DASFAA03 paper.
Acknowledgements: This release is partially supported by the NSF CAREER
Award No. IIS-0238586, the NSF award No. IIS-0742960, the NSF-funded
RESCUE project, a Google Research Award, and a fund from CalIt2.
For any questions regarding this project, please
send email to flamingo AT ics.uci.edu