The FLAMINGO Project on Data Cleaning

Department of Computer Science, UC Irvine

Objective

The Flamingo Project focuses on data cleaning, i.e., how to deal with errors and inconsistencies in information systems. For example, in applications such as data integration, commercial organizations need to collect data from various sources to conduct analysis and make decisions. The data from these sources are often inconsistent. For instance, we may use first name, last name, SSN, and birthday to identify a person, but the same name, e.g., "Schwarzenegger", may be misspelled as "Swarzzengaer" or in other forms. Such errors make it harder to link records from different places and to answer queries approximately. We are developing algorithms that make query answering and information retrieval efficient in the presence of such inconsistencies and errors.
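To make the misspelling example concrete, the following is a minimal sketch of one common similarity measure for strings, the edit (Levenshtein) distance. It is an illustration only and is not taken from the Flamingo code; the function names `edit_distance` and `is_fuzzy_match` and the `max_distance` threshold are assumptions chosen for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_fuzzy_match(a: str, b: str, max_distance: int) -> bool:
    """Hypothetical helper: treat two strings as the same name when their
    case-insensitive edit distance is within max_distance."""
    return edit_distance(a.lower(), b.lower()) <= max_distance
```

A call such as `edit_distance("Schwarzenegger", "Swarzzengaer")` returns the number of edits separating the two spellings, which an application could compare against a threshold when linking records.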

With the new NSF award IIS-0844574, we plan to study the following problems. Supporting fuzzy queries is becoming increasingly important in applications that need to deal with a variety of data inconsistencies in structures, representations, or semantics. Many existing algorithms require an offline analysis of the data sets to construct an efficient index structure that supports online query processing. Fuzzy joins of data sets are even more time consuming because of their computational complexity. The PI is studying three research problems: (1) constructing high-quality inverted lists for fuzzy search queries using Hadoop; (2) supporting fuzzy joins of large data sets using Hadoop; and (3) using the developed techniques to improve the data quality of large collections of documents.
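As a rough illustration of what inverted lists for fuzzy search can look like, the sketch below builds an in-memory index from q-grams (substrings of length q) to the ids of the strings containing them, then retrieves candidates that share enough grams with a query. It is a single-machine simplification and does not use Hadoop; the function names and the `min_shared` threshold are assumptions made for this example, not the project's design. Candidates returned this way would still need to be verified with a similarity function.

```python
from collections import Counter, defaultdict


def qgrams(s: str, q: int = 2) -> set:
    """Return the set of distinct lowercase q-grams of s."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_inverted_lists(strings, q: int = 2):
    """Offline step: map each q-gram to the list of ids of strings containing it."""
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for gram in sorted(qgrams(s, q)):
            index[gram].append(string_id)
    return index


def candidates(index, query: str, q: int = 2, min_shared: int = 3):
    """Online step: return (string_id, shared_gram_count) pairs for strings
    sharing at least min_shared distinct q-grams with the query, best first.
    min_shared is a hypothetical tuning parameter for this sketch."""
    counts = Counter()
    for gram in qgrams(query, q):
        for string_id in index.get(gram, []):
            counts[string_id] += 1
    hits = [(sid, n) for sid, n in counts.items() if n >= min_shared]
    return sorted(hits, key=lambda h: (-h[1], h[0]))


names = ["Schwarzenegger", "Swarzzengaer", "Smith"]
index = build_inverted_lists(names)
result = candidates(index, "schwarzeneger")
```

Here both spellings of the name survive as candidates for the misspelled query, while "Smith" shares no 2-grams with it and is filtered out without any string comparison.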

News

Releases

People

Alumni and Visitors

Publications

Acknowledgements: This release is partially supported by the NSF CAREER Award No. IIS-0238586, the NSF award No. IIS-0742960, the NSF award No. IIS-0844574, the NSF-funded RESCUE project, a Google Research Award, a gift fund from Microsoft, and a fund from CalIt2.
Many thanks to Sattam Alsubaiee, Minh Doan, and Kensuke Ohta for their valuable testing and feedback on the code and documentation.


For any questions regarding this project, please send email to flamingo AT ics.uci.edu