Back to Index


AppString > AppStringDoc

Getting Started

Introduction

This tutorial will guide through the basics steps needed to perform approximate string search on a collection of strings using this library. This guide focuses on how to use the FilterTree (FilterTreeDoc) module.

Downloading the Package

Requirements

Most modules in this release were developed and tested on Ubuntu Linux using the GNU GCC/G++ compiler.

In order to compile and run most modules you will need the following:

On systems with the aptitude package manager (e.g. Ubuntu, Debian) you can install all required packages by typing the following as root user (or using sudo):

$ apt-get install gcc g++ libboost-dev

The module MatTreeDoc was developed in Visual C++. No makefile is provided for that module. We recommend using Windows and Visual C++ for that module.

Wrappers (Simplest Way To Use The Library)

For your convenience, we have added wrappers that contain all necessary objects as described in section "Basic Usage". All you need to do to build an index and execute queries, is to create an instance of a wrapper. These wrappers initialize components with default values and are the simplest and fastest way to use our library - at the expense of being able to control tuning parameters (which filters are used, fanout, etc.).
We recommend browsing through the code in filtertree/wrappers/example.cc.

Step-By-Step Guide

In this guide we will use a wrapper to show you how to perform approximate string search using the edit distance.

Step1: Compiling The Library

Let us say you have extracted the archive to the following directory: /home/joe/flamingo-2.0
Then you need to edit /home/joe/flamingo-2.0/src/makefile.inc and set CODEBASEROOT to the root directory of the source files, i.e. /home/joe/flamingo-2.0/src
After the modifications, your /home/joe/flamingo-2.0/src/makefile.inc should look like this:

CODEBASEROOT = /home/joe/flamingo-2.0/src
APPSTRINGROOT = $(CODEBASEROOT)

VPATH = $(APPSTRINGROOT)

CC = g++

CPPFLAGS = -Wall -I$(APPSTRINGROOT) -O3
# CPPFLAGS = -Wall -I$(APPSTRINGROOT) -g

ifndef CODEBASEROOT
$(error Please edit makefile.ini and set the CODEBASEROOT variable to the absolute path of the source code directory. e.g., if you put the code in /home/user/flamingo-2.0/src do: CODEBASEROOT = /home/user/flamingo-2.0/src)
 endif

Now you can compile the wrapper library (and all other required libraries) by entering /home/joe/flamingo-2.0/src/filtertree/wrappers and running make, i.e.:

$ cd /home/joe/flamingo-2.0/src/filtertree/wrappers
$ make

There should now be a file libfiltertreewrappers.a in /home/joe/flamingo-2.0/src/filtertree/wrappers, i.e. for an ls -l you should get an output similar to the following:

$ ls -l
-rwxr-xr-x 1 joe joe 139644 2008-09-17 16:36 example
-rw-r--r-- 1 joe joe   2913 2008-09-17 10:55 example.cc
-rw-r--r-- 1 joe joe  84668 2008-09-17 16:35 example.o
-rw-r--r-- 1 joe joe   1772 2008-09-17 16:35 libfiltertreewrappers.a
-rw-r--r-- 1 joe joe   1609 2008-09-17 10:55 makefile
-rw-r--r-- 1 joe joe   1822 2008-09-17 10:55 wrapperabs.h
-rw-r--r-- 1 joe joe    307 2008-09-17 10:55 wrappers.cc
-rw-r--r-- 1 joe joe    679 2008-09-17 10:55 wrappers.h
-rw-r--r-- 1 joe joe    985 2008-09-17 16:35 wrappersimple.h
-rw-r--r-- 1 joe joe   1640 2008-09-17 16:35 wrappers.o

(note that the exact file sizes may differ from yours)

Step2: Using The Library In An Application

Now that we have compiled the library, we are ready to include it into an application.
Let us assume you wish to use the library in an application located in /home/joe/searchapp that consists of one source file /home/joe/searchapp/main.cc
You can copy and paste the following lines of source code into /home/joe/searchapp/main.cc for us to compile:

#include "wrappers/wrappers.h"

#include <iostream>
#include <vector>


using namespace std;

int main() {
  StringContainerVector strContainer;
  strContainer.fillContainer("/home/joe/flamingo-2.0/src/filtertree/data/dummy.txt", 80); // read 80 lines from the file specified
  
  WrapperSimpleEd wrapper(&strContainer, 3); // use a simple wrapper that uses the edit distance and 3-grams
  wrapper.buildIndex();

  float editDistance = 2.0f;
  string queryString = "xample";

  vector<unsigned> resultStringIDs; // where to store the result string ids
  wrapper.search(queryString, editDistance, resultStringIDs);

  // print out the result strings
  cout << "SIMILAR STRINGS: " << endl;
  for(unsigned i = 0; i < resultStringIDs.size(); i++) {
    string tmp;
    strContainer.retrieveString(tmp, resultStringIDs.at(i));
    cout << tmp << endl;
  }

  return 0;
}

This application will use the first 80 lines of /home/joe/flamingo-2.0/src/filtertree/data/dummy.txt as the data strings.
It will build an index to support approximate string search and answer a query that asks for all data strings that are within an edit-distance of 2 to "xample".
Finally, the results will be displayed.

Step3: Compiling And Running The Application

Since we decided to have every module produce it's own library (.a file) it is necessary to link several .a files with the main.o (produced by /home/joe/searchapp/main.cc).
The simplest way to achieve this is to create a makefile for the application, i.e. create a file /home/joe/searchapp/makefile with the following contents:

include /home/joe/flamingo-2.0/src/makefile.inc

LDFLAGS = -lrt

all: main

main:   main.o \
        /home/joe/flamingo-2.0/common/libcommon.a \
        /home/joe/flamingo-2.0/listmerger/liblistmerger.a \
        /home/joe/flamingo-2.0/filtertree/libfiltertree.a \
        /home/joe/flamingo-2.0/util/libutil.a \
        /home/joe/flamingo-2.0/filtertree/wrappers/libfiltertreewrappers.a

Now you should be able to compile the application using make, i.e.

$ make

If make was successful, you have compiled and linked the application! It is time to try and run it by typing

$ ./main

You should have the an output something similar to the following:

INPUTFILE: "/home/joe/flamingo-2.0/src/filtertree/data/dummy.txt"
100% FILLING CONTAINER: 80/80; 0'0"/0'0"   
100% INSERTING INTO INDEX: 80/80; 0'0"/0'0"   
SIMILAR STRINGS: 
example1
example2
example3
example4
example5

Congratulations, you have successfully created your first application using this library!

Basic Usage

Approximate string search can be performed in two basic steps: (1) building the index, and (2) answering queries using the index. We will now discuss the basic components for each of the steps (at a high-level).

  1. Building The Index, Needed Components
    • String Container (stores the data strings on which you want to perform queries)
    • Gram Generator (decomposes strings into grams)
    • Indexer (builds the filter tree and the inverted lists, needs a String Container and a Gram Generator)
  1. Answering Queries Using The Index, Needed Components:
    • List Merger (solves the T-occurrence problem given a set of inverted lists and a merging-threshold)
    • Indexer (builds the filtertree and the inverted-lists, needs a String Container and a Gram Generator)
    • Searcher (answers queries, needs a List Merger and an Indexer)
    • Similarity Metric (represents the similarity metric to be used)
    • Query (contains the query string, the similarity metric and the similarity threshold)

Refer to filtertree/example.cc for some examples.

Example Files

Apart from reading this guide, we recommend you browse through the code of some example files. We have provided these files to help you understand how to use the library as quickly as possible.