The FLAMINGO Project on Data Cleaning
Department of Computer Science, UC Irvine
The Flamingo Project focuses on data cleaning, i.e., how to
deal with errors and inconsistencies in information systems. As an
example, in many applications such as data integration, commercial
organizations need to collect data from various sources to conduct
analysis and make decisions. Often, the data from these different
sources can have inconsistencies. For instance, we use first name,
last name, SSN, and birthday to identify a person. However, the same
name, e.g., "Schwarzenegger", may be misspelled as "Swarzzengaer" or
other forms. Such errors make it more challenging to link records from
different places and answer queries approximately. We are developing
algorithms to make query answering and information retrieval
efficient in the presence of such inconsistencies and errors.
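To make the misspelling example concrete, here is a minimal sketch of approximate name matching with edit distance (Levenshtein distance), a common similarity measure for strings. It is an illustration only, not the project's released code; the function names `edit_distance` and `fuzzy_match`, the sample names, and the distance threshold are all assumptions chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def fuzzy_match(query, names, max_distance):
    """Return the names within max_distance edits of query (case-insensitive)."""
    q = query.lower()
    return [n for n in names if edit_distance(q, n.lower()) <= max_distance]


# Hypothetical records; the threshold of 5 is an assumption for illustration.
names = ["Swarzzengaer", "Stallone", "Schwarzenegger"]
print(fuzzy_match("Schwarzenegger", names, max_distance=5))
```

This brute-force scan compares the query with every record, which is what makes approximate matching expensive on large data sets and motivates index structures.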
With NSF support,
we plan to study the following problems. Supporting fuzzy queries is
becoming increasingly important in applications that need to deal
with a variety of data inconsistencies in structures, representations,
or semantics. Many existing algorithms require an offline analysis of
data sets to construct an efficient index structure to support online
query processing. Fuzzy join queries over data sets are more
time-consuming because of their computational complexity. The PI is studying
three research problems: (1) constructing high-quality inverted lists
for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of
large data sets using Hadoop; and (3) using the developed techniques
to improve data quality of large collections of documents.
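As a rough illustration of the first problem, the sketch below builds gram-based inverted lists in a map/reduce style and uses them to answer an edit-distance query. It is plain, single-machine Python, not Hadoop code, and it does not reproduce the project's algorithms. The gram length `Q`, the padding characters `#` and `$` (assumed not to occur in the data), and all function names are assumptions. Candidates come from the standard count filter: a string within `k` edits of the query must share at least `len(query) + Q - 1 - k * Q` padded grams with it. Each candidate is then verified with edit distance.

```python
from collections import Counter, defaultdict

Q = 2  # gram length; an assumed value for this sketch


def grams(s, q=Q):
    """Padded q-grams of s, each tagged with its occurrence number so that
    repeated grams are counted as a multiset."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    seen = Counter()
    tokens = []
    for i in range(len(padded) - q + 1):
        g = padded[i:i + q]
        tokens.append((g, seen[g]))
        seen[g] += 1
    return tokens


def map_phase(records):
    """Emit (gram token, record id) pairs, as a mapper would."""
    for rid, s in records:
        for token in grams(s):
            yield token, rid


def reduce_phase(pairs):
    """Group record ids by gram token into inverted lists."""
    index = defaultdict(list)
    for token, rid in pairs:
        index[token].append(rid)
    return index


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(query, k, strings, index):
    """Ids of strings within edit distance k of query.
    strings maps record id -> string; index comes from reduce_phase."""
    threshold = len(query) + Q - 1 - k * Q
    if threshold <= 0:
        candidates = list(strings)  # filter cannot prune; check everything
    else:
        counts = Counter()
        for token in grams(query):
            for rid in index.get(token, ()):
                counts[rid] += 1
        candidates = [rid for rid, c in counts.items() if c >= threshold]
    return sorted(r for r in candidates if edit_distance(query, strings[r]) <= k)


# Hypothetical data for illustration.
records = [(1, "schwarzenegger"), (2, "swarzzengaer"), (3, "smith"), (4, "smyth")]
index = reduce_phase(map_phase(records))
print(search("smith", 1, dict(records), index))
```

The count filter only narrows the candidate set; the final edit-distance check is what guarantees that every returned string is a true match.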
With NSF support,
we will study how to support powerful keyword search with efficient
indexing structures and algorithms in a cloud-computing
infrastructure. A main application is supporting family reunification
in disasters such as the Haiti Earthquake. Check our portals for
the Haiti Earthquake and Chile Earthquake. The main challenge is how
to use the limited programming primitives available in the cloud to
implement index structures and search algorithms.
Fuzzy Keyword Search on Spatial Data
We present a solution to Keyword Search on Spatial Data.
Releases
- 4.1 (February 22nd, 2012)
- 4.0 (October 23rd, 2010)
- 3.0 (March 29th, 2010)
- 2.0.1 (November 7th, 2008)
- 2.0 (October 14th, 2008)
- 1.0 (April 17th, 2007)
- Toolkit (October 14th, 2008), UDFs for MySQL
Alumni and Visitors
- Guoliang Li, spring of 2008, visitor from Tsinghua University, China.
- Jiaheng Lu, postdoc, 2006-2008. Now a faculty member at Renmin University, China.
- Yiming Lu, graduated from UC Irvine in 2008.
- Bin Wang and Xiaochun Yang, summers of 2006, 2007, and 2008, visitors from
Northeastern University, China.
- Liang Jin, graduated from UC Irvine in 2005.
Publications
- Answering Approximate String Queries on Large Data Sets Using External Memory
Alexander Behm, Chen Li, Michael J. Carey.
ICDE 2011 (accepted for publication).
- Supporting Location-Based Approximate-Keyword Queries
Sattam Alsubaiee, Alexander Behm, Chen Li.
ACM SIGSPATIAL GIS 2010.
- Efficient Parallel Set-Similarity Joins Using MapReduce
Rares Vernica, Michael J. Carey, Chen Li.
- Fuzzy Keyword Search on Spatial Data (Demo)
Sattam Alsubaiee and Chen Li
- Efficient top-k algorithms for fuzzy search in string
Rares Vernica, Chen Li.
KEYS 2009: 9-14. (Workshop on Keyword Search on Structured
Data, collocated with SIGMOD 2009)
- Efficient Interactive Fuzzy Keyword Search
Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng
- Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu
- Efficient Approximate Search on String Collections (Tutorial)
Marios Hadjieleftheriou, Chen Li
- Cost-Based Variable-Length-Gram Selection for String Collections to
Support Approximate Queries Efficiently
Xiaochun Yang, Bin Wang, Chen Li.
- Efficient Merging and Filtering Algorithms for Approximate String Searches
Chen Li, Jiaheng Lu, and Yiming Lu.
- SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases
Liang Jin, Chen Li, and Rares Vernica.
VLDB Journal 2007. This is an extended version of the SEPIA paper from VLDB 2005.
- VGRAM: Improving Performance of Approximate Queries on String
Collections Using Variable-Length Grams.
Chen Li, Bin Wang, and Xiaochun Yang.
VLDB 2007, Vienna, Austria
- Relaxing Join and Selection Queries.
Nick Koudas, Chen Li, Anthony Tung, and Rares Vernica.
VLDB 2006, Seoul, Korea.
- Selectivity Estimation for Fuzzy String Predicates in Large
Liang Jin and Chen Li.
VLDB 2005, Trondheim, Norway.
- Indexing Mixed Types for Approximate Retrieval.
Liang Jin, Nick Koudas, Chen Li, Anthony K.H. Tung.
VLDB 2005, Trondheim, Norway.
- NNH: Improving Performance of Nearest-Neighbor Searches Using
Liang Jin, Nick Koudas, Chen Li.
EDBT 2004, Heraklion - Crete, Greece.
- Efficient Record Linkage in Large Data Sets.
Liang Jin, Chen Li, and Sharad Mehrotra.
8th International Conference on Database Systems for Advanced
Applications (DASFAA) 2003, Kyoto, Japan.
Received 10-year Best Paper Award for DASFAA 2013.
- Supporting Efficient Record Linkage for Large Data Sets Using
Chen Li, Liang Jin, and Sharad Mehrotra
World Wide Web Journal, Volume 9, Number 4, pages 557-584, December 2006.
This journal article is an extended version of the DASFAA 2003 paper.
Acknowledgements: This release is partially supported by
award No. IIS-0742960, the NSF-funded RESCUE project,
the NIH grant 1R21LM010143-01A1, a Google Research Award,
a gift fund from Microsoft, a research grant from Amazon.com
that allows us to use their MapReduce cluster, and a fund from CalIt2.
Many thanks to Minh Doan and Kensuke Ohta for their valuable testing
and feedback on the code and documentation.
For any questions regarding this project, please
send email to flamingo AT ics.uci.edu