The FLAMINGO Project on Data Cleaning

Department of Computer Science, UC Irvine


The Flamingo Project focuses on data cleaning, i.e., how to deal with errors and inconsistencies in information systems. As an example, in many applications such as data integration, commercial organizations need to collect data from various sources to conduct analysis and make decisions. The data from these different sources are often inconsistent. For instance, a person may be identified by first name, last name, SSN, and birthday, yet the same name, e.g., "Schwarzenegger", may be misspelled as "Swarzzengaer" or in other forms. Such errors make it harder to link records from different places and require queries to be answered approximately. We are developing algorithms to make query answering and information retrieval efficient in the presence of such inconsistencies and errors.
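To make the notion of a "misspelled" name concrete, the following is a minimal sketch of edit distance (Levenshtein distance), one common way to measure how far apart two strings are. It is an illustration only and not the project's own code; the function names edit_distance and is_fuzzy_match, and the choice to ignore case, are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_fuzzy_match(a: str, b: str, k: int) -> bool:
    """Hypothetical helper: treat two strings as the same entity if
    their case-insensitive edit distance is at most k."""
    return edit_distance(a.lower(), b.lower()) <= k


if __name__ == "__main__":
    # Distance between the correct and the misspelled name from the text.
    print(edit_distance("Schwarzenegger", "Swarzzengaer"))
```

Computing this distance for every pair of records costs time proportional to the product of the two string lengths, which is why large data sets call for indexing rather than pairwise comparison.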

With the NSF award IIS-0844574, we plan to study the following problems. Supporting fuzzy queries is becoming increasingly important in applications that need to deal with a variety of data inconsistencies in structures, representations, or semantics. Many existing algorithms require an offline analysis of data sets to construct an efficient index structure that supports online query processing. Fuzzy join queries over data sets are even more time-consuming because of their computational complexity. The PI is studying three research problems: (1) constructing high-quality inverted lists for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of large data sets using Hadoop; and (3) using the developed techniques to improve data quality of large collections of documents.
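The idea of building inverted lists offline and using them to answer fuzzy queries online can be sketched as follows. This is a single-machine illustration under stated assumptions, not the project's Hadoop implementation: it indexes strings by their q-grams (substrings of length q) and applies the standard length and count filters to find candidates for an edit-distance threshold k. The names qgrams, build_index, and candidates are hypothetical.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the q-grams of s, each tagged with its occurrence number
    so that repeated grams stay distinct (bag semantics)."""
    seen = Counter()
    grams = []
    for i in range(len(s) - q + 1):
        g = s[i:i + q]
        grams.append((g, seen[g]))
        seen[g] += 1
    return grams


def build_index(strings, q=2):
    """Offline step: map each tagged q-gram to the ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].append(sid)
    return index


def candidates(index, strings, query, k, q=2):
    """Online step: ids of strings that may be within edit distance k of query.

    Length filter: the lengths can differ by at most k.
    Count filter: strings within distance k share at least
    max(len(s), len(query)) - q + 1 - k * q q-grams.
    Survivors still need verification with a real edit-distance computation.
    """
    counts = Counter()
    for g in qgrams(query, q):
        for sid in index.get(g, ()):
            counts[sid] += 1
    result = []
    # For simplicity this sketch scans all ids to apply the thresholds.
    for sid, s in enumerate(strings):
        if abs(len(s) - len(query)) > k:
            continue
        threshold = max(len(s), len(query)) - q + 1 - k * q
        # A non-positive threshold cannot prune anything.
        if threshold <= 0 or counts[sid] >= threshold:
            result.append(sid)
    return result


if __name__ == "__main__":
    names = ["schwarzenegger", "schwarzeneger", "stallone",
             "swarzenegger", "massachusetts"]
    idx = build_index(names)
    print(candidates(idx, names, "schwarzenegger", k=2))
```

The filters never discard a true match, so they trade a small number of false candidates for avoiding a full pairwise comparison; a fuzzy join can reuse the same index by issuing each string of one data set as a query against the other.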

With the NSF award 1030002, we will study how to support powerful keyword search with efficient indexing structures and algorithms in a cloud-computing infrastructure. A main application is supporting family reunification in disasters such as the Haiti Earthquake. Check our portals for the Haiti Earthquake and Chile Earthquake. The main challenge is how to use the limited programming primitives available in the cloud to implement index structures and search algorithms.

See our qSpeller project page for the Microsoft Speller Challenge.


Fuzzy Keyword Search on Spatial Data

We present a solution to support Fuzzy Keyword Search on Spatial Data.



Alumni and Visitors


Acknowledgements: This release is partially supported by the NSF CAREER Award No. IIS-0238586, the NSF award No. IIS-0742960, the NSF award IIS-0844574, the NSF award 1030002, the NSF-funded RESCUE project, the NIH grant 1R21LM010143-01A1, a Google Research Award, a gift fund from Microsoft, a research grant from to allow us to use their MapReduce cluster, and a fund from CalIt2.
Many thanks to Minh Doan and Kensuke Ohta for their valuable testing and feedback on the code and documentation.

For any questions regarding this project, please send email to flamingo AT