Back to Index


AppString > AppStringDoc

  1. Getting Started
    1. Introduction
    2. Downloading the Package
    3. Requirements
    4. Wrappers (Simplest Way To Use The Library)
    5. Step-By-Step Guide
      1. Step1: Compiling The Library
      2. Step2: Using The Library In An Application
        1. Option1: In-Memory Index
        2. Option2: In-Memory, Compressed Index
        3. Option3: On-Disk Index
      3. Step3: Compiling And Running The Application
    6. Basic Usage
    7. Example Files

Getting Started

Introduction

This tutorial will guide through the basics steps needed to perform approximate string search on a collection of strings using this library. This guide focuses on how to use the FilterTree (FilterTreeDoc) module.

Downloading the Package

Requirements

Most modules in this release were developed and tested on Ubuntu Linux using the GNU GCC/G++ compiler.

In order to compile and run most modules you will need the following:

On systems with the aptitude package manager (e.g. Ubuntu, Debian) you can install all required packages by typing the following as root user (or using sudo):

$ apt-get install gcc g++ libboost-dev

The module MatTreeDoc was developed in Visual C++. No makefile is provided for that module. We recommend using Windows and Visual C++ for that module.

Wrappers (Simplest Way To Use The Library)

For your convenience, we have added wrappers that contain all necessary objects as described in section "Basic Usage". All you need to do to build an index and execute queries, is to create an instance of a wrapper. These wrappers initialize components with default values and are the simplest and fastest way to use our library - at the expense of being able to control tuning parameters (which filters are used, fanout, etc.).

We recommend browsing through the code in filtertree/wrappers/example.cc.

Step-By-Step Guide

In this guide we will use a wrapper to show you how to perform approximate string search using the edit distance.

Step1: Compiling The Library

Let us say you have extracted the archive to the following directory: /home/joe/flamingo-3.0

Then you need to edit /home/joe/flamingo-3.0/src/makefile.inc and set CODEBASEROOT to the root directory of the source files, i.e. /home/joe/flamingo-3.0/src

After the modifications, your /home/joe/flamingo-3.0/src/makefile.inc should look like this:

CODEBASEROOT = /home/abehm/group/codebase/appstring/tags/release-3.0
APPSTRINGROOT = $(CODEBASEROOT)

VPATH = $(APPSTRINGROOT)

CC = g++

### Cygwin:
# CC = g++-4
# CXX = g++-4

CPPFLAGS = -Wall -I$(APPSTRINGROOT) -O3 -DNDEBUG
# CPPFLAGS = -Wall -I$(APPSTRINGROOT) -O3
# CPPFLAGS = -Wall -I$(APPSTRINGROOT) -g -pg

ifndef CODEBASEROOT
$(error -- - EDIT MAKEFILE.INC - -- Please edit makefile.inc and set the CODEBASEROOT variable to the absolute path of the source code directory. 
e.g., if you put the code in /home/joe/flamingo-3.0/src do: CODEBASEROOT = /home/joe/flamingo-3.0/src)
endif

Now you can compile the wrapper library (and all other required libraries) by entering /home/joe/flamingo-3.0/src/filtertree/wrappers and running make, i.e.:

$ cd /home/joe/flamingo-3.0/src/filtertree/wrappers
$ make

There should now be a file libfiltertreewrappers.a in /home/joe/flamingo-3.0/src/filtertree/wrappers, i.e. for an ls -l you should get an output similar to the following:

$ ls -l
-rwxr-xr-x 1 joe joe 534818 2010-03-25 15:49 example
-rw-r--r-- 1 joe joe   8883 2010-03-24 16:06 example.cc
-rw-r--r-- 1 joe joe 255092 2010-03-25 15:48 example.o
-rw-r--r-- 1 joe joe   1476 2010-03-25 15:48 libfiltertreewrappers.a
-rw-r--r-- 1 joe joe   2117 2010-03-24 15:56 makefile
-rw-r--r-- 1 joe joe   1814 2010-03-24 16:06 wrapperabs.h
-rw-r--r-- 1 joe joe   2181 2010-03-24 16:06 wrappercombinelists.h
-rw-r--r-- 1 joe joe   4814 2010-03-24 16:06 wrapperdiscardlists.h
-rw-r--r-- 1 joe joe   2076 2010-03-24 15:58 wrapperondisk.h
-rw-r--r-- 1 joe joe    308 2010-03-24 16:06 wrappers.cc
-rw-r--r-- 1 joe joe   5034 2010-03-24 16:06 wrappers.h
-rw-r--r-- 1 joe joe    994 2010-03-24 16:06 wrappersimple.h
-rw-r--r-- 1 joe joe   1344 2010-03-25 15:48 wrappers.o

(note that the exact file sizes may differ from yours)

Step2: Using The Library In An Application

Now that we have compiled the library, we are ready to include it into an application.

Let us assume you wish to use the library in an application located in /home/joe/searchapp that consists of one source file /home/joe/searchapp/main.cc

We will discuss three options: using an in-memory index with and without compression, and a disk-based index.

Option1: In-Memory Index

You can copy and paste the following lines of source code into /home/joe/searchapp/main.cc for us to compile:

#include "filtertree/wrappers/wrappers.h"

int main() {
  StringContainerVector strContainer;
  // read 4k lines from the file specified
  strContainer.fillContainer("/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt", 4000); 
  
  WrapperSimpleEd wrapper(&strContainer, 2); // use a simple wrapper that uses the edit distance and 2-grams
  wrapper.buildIndex();

  float editDistance = 1.0f;
  string queryString = "kathrin";

  vector<unsigned> resultStringIDs; // where to store the result string ids
  wrapper.search(queryString, editDistance, resultStringIDs);

  // print out the result strings
  cout << "SIMILAR STRINGS: " << endl;
  for(unsigned i = 0; i < resultStringIDs.size(); i++) {
    string tmp;
    strContainer.retrieveString(tmp, resultStringIDs.at(i));
    cout << tmp << endl;
  }

  return 0;
}

In this example all data structures are stored in main memory.

This application will use the first 4000 lines of /home/joe/flamingo-3.0/src/filtertree/data/female_names.txt as the data strings.

It will build an index to support approximate string search and answer a query that asks for all data strings that are within an edit-distance of 1 to "kathrin".

Finally, the results will be displayed.

Option2: In-Memory, Compressed Index

You can copy and paste the following lines of source code into /home/joe/searchapp/main.cc for us to compile:

#include "filtertree/wrappers/wrappers.h"

int main() {
  StringContainerVector strContainer;
  // read 4k lines from the file specified
  strContainer.fillContainer("/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt", 4000); 
  
  // use wrapper for compressed index
  // index is compressed by discarding lists, using longest-lists-first (LLF) method
  // the reduction ratio is 0.5
  WrapperDiscardListsLLFEd wrapper(&strContainer, 2, true, 0.5);
  wrapper.buildIndex();

  float editDistance = 1.0f;
  string queryString = "kathrin";

  vector<unsigned> resultStringIDs; // where to store the result string ids
  wrapper.search(queryString, editDistance, resultStringIDs);

  // print out the result strings
  cout << "SIMILAR STRINGS: " << endl;
  for(unsigned i = 0; i < resultStringIDs.size(); i++) {
    string tmp;
    strContainer.retrieveString(tmp, resultStringIDs.at(i));
    cout << tmp << endl;
  }

  return 0;
}

In this example all data structures are stored in main memory.

This application will use the first 4000 lines of /home/joe/flamingo-3.0/src/filtertree/data/female_names.txt as the data strings.

It will build an index to support approximate string search and answer a query that asks for all data strings that are within an edit-distance of 1 to "kathrin".

Finally, the results will be displayed.

Option3: On-Disk Index

You can copy and paste the following lines of source code into /home/joe/searchapp/main.cc for us to compile:

#include "filtertree/wrappers/wrappers.h"

int main() {
  StringContainerRM strContainer;
  strContainer.createAndOpen("collection.rm");
  // read 4k lines from the file specified
  strContainer.fillContainer("/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt", 4000);
  
  // use a simple wrapper that uses the edit distance and 2-grams, 
  // inverted lists are saved to the file index.ix
  WrapperOnDiskSimpleEd wrapper(&strContainer, "index.ix", 2); 
  wrapper.buildIndex();

  float editDistance = 1.0f;
  string queryString = "kathrin";

  vector<unsigned> resultStringIDs; // where to store the result string ids
  wrapper.search(queryString, editDistance, resultStringIDs);

  // print out the result strings
  cout << "SIMILAR STRINGS: " << endl;
  for(unsigned i = 0; i < resultStringIDs.size(); i++) {
    string tmp;
    strContainer.retrieveString(tmp, resultStringIDs.at(i));
    cout << tmp << endl;
  }

  return 0;
}

In this example, both the inverted index and the data strings are stored on disk (the FilterTree remains in memory).

This application will use the first 4000 lines of /home/joe/flamingo-3.0/src/filtertree/data/female_names.txt as the data strings.

It will build an index to support approximate string search and answer a query that asks for all data strings that are within an edit-distance of 1 to "kathrin".

Finally, the results will be displayed.

Step3: Compiling And Running The Application

Since we decided to have every module produce it's own library (.a file) it is necessary to link several .a files with the main.o (produced by /home/joe/searchapp/main.cc).

The simplest way to achieve this is to create a makefile for the application, i.e. create a file /home/joe/searchapp/makefile with the following contents:

include /home/joe/flamingo-3.0/src/makefile.inc

LDFLAGS = -lrt

all: main

main:   main.o \
                $(APPSTRINGROOT)/common/libcommon.a \
                $(APPSTRINGROOT)/listmerger/liblistmerger.a \
                $(APPSTRINGROOT)/filtertree/libfiltertree.a \
                $(APPSTRINGROOT)/filtertree/wrappers/libfiltertreewrappers.a \
                $(APPSTRINGROOT)/sepia/libsepia.a \
                $(APPSTRINGROOT)/util/libutil.a

Now you should be able to compile the application using make, i.e.

$ make

If make was successful, you have compiled and linked the application! It is time to try and run it by typing

$ ./main

You should have the an output something similar to the following:

INPUTFILE: "/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt"
100% FILLING CONTAINER: 4000/4000; 0'0"/0'0"   
SIMILAR STRINGS: 
kathryn
kathrine
katherin
kathrin

Congratulations, you have successfully created your first application using The Flamingo Package!

Basic Usage

Approximate string search can be performed in two basic steps: (1) building the index, and (2) answering queries using the index. We will now discuss the basic components for each of the steps at a high-level.

  1. Building The Index, Needed Components
    • String Container (stores the data strings on which you want to perform queries)
    • Gram Generator (decomposes strings into grams)
    • Indexer (builds the filter tree and the inverted lists, needs a String Container and a Gram Generator)
  1. Answering Queries Using The Index, Needed Components:
    • List Merger (solves the T-occurrence problem given a set of inverted lists and a merging-threshold)
    • Indexer (builds the filtertree and the inverted-lists, needs a String Container and a Gram Generator)
    • Searcher (answers queries, needs a List Merger and an Indexer)
    • Similarity Metric (represents the similarity metric to be used)
    • Query (contains the query string, the similarity metric and the similarity threshold)

Refer to filtertree/example.cc for some advanced examples.

Example Files

Apart from reading this guide, we recommend you browse through the code of some example files. We have provided these files to help you understand how to use the library as quickly as possible.