AppString > AppStringDoc
This tutorial will guide through the basics steps needed to perform approximate string search on a collection of strings using this library. This guide focuses on how to use the FilterTree (FilterTreeDoc) module.
Most modules in this release were developed and tested on Ubuntu Linux using the GNU GCC/G++ compiler.
In order to compile and run most modules you will need the following:
On systems with the aptitude package manager (e.g. Ubuntu, Debian) you can install all required packages by typing the following as root user (or using sudo):
$ apt-get install gcc g++ libboost-dev
The module MatTreeDoc was developed in Visual C++. No makefile is provided for that module. We recommend using Windows and Visual C++ for that module.
For your convenience, we have added wrappers that contain all
necessary objects as described in section "Basic Usage". All you
need to do to build an index and execute queries, is to create an
instance of a wrapper. These wrappers initialize components with
default values and are the simplest and fastest way to use our
library - at the expense of being able to control tuning parameters
(which filters are used, fanout, etc.).
We recommend browsing through the code in
filtertree/wrappers/example.cc.
In this guide we will use a wrapper to show you how to perform approximate string search using the edit distance.
Let us say you have extracted the archive to the following
directory: /home/joe/flamingo-3.0
Then you need to edit
/home/joe/flamingo-3.0/src/makefile.inc and set
CODEBASEROOT to the root directory of the source files, i.e.
/home/joe/flamingo-3.0/src
After the modifications, your
/home/joe/flamingo-3.0/src/makefile.inc should
look like this:
CODEBASEROOT = /home/abehm/group/codebase/appstring/tags/release-3.0 APPSTRINGROOT = $(CODEBASEROOT) VPATH = $(APPSTRINGROOT) CC = g++ ### Cygwin: # CC = g++-4 # CXX = g++-4 CPPFLAGS = -Wall -I$(APPSTRINGROOT) -O3 -DNDEBUG # CPPFLAGS = -Wall -I$(APPSTRINGROOT) -O3 # CPPFLAGS = -Wall -I$(APPSTRINGROOT) -g -pg ifndef CODEBASEROOT $(error -- - EDIT MAKEFILE.INC - -- Please edit makefile.inc and set the CODEBASEROOT variable to the absolute path of the source code directory. e.g., if you put the code in /home/joe/flamingo-3.0/src do: CODEBASEROOT = /home/joe/flamingo-3.0/src) endif
Now you can compile the wrapper library (and all other required libraries) by entering /home/joe/flamingo-3.0/src/filtertree/wrappers and running make, i.e.:
$ cd /home/joe/flamingo-3.0/src/filtertree/wrappers $ make
There should now be a file libfiltertreewrappers.a in /home/joe/flamingo-3.0/src/filtertree/wrappers, i.e. for an ls -l you should get an output similar to the following:
$ ls -l -rwxr-xr-x 1 joe joe 534818 2010-03-25 15:49 example -rw-r--r-- 1 joe joe 8883 2010-03-24 16:06 example.cc -rw-r--r-- 1 joe joe 255092 2010-03-25 15:48 example.o -rw-r--r-- 1 joe joe 1476 2010-03-25 15:48 libfiltertreewrappers.a -rw-r--r-- 1 joe joe 2117 2010-03-24 15:56 makefile -rw-r--r-- 1 joe joe 1814 2010-03-24 16:06 wrapperabs.h -rw-r--r-- 1 joe joe 2181 2010-03-24 16:06 wrappercombinelists.h -rw-r--r-- 1 joe joe 4814 2010-03-24 16:06 wrapperdiscardlists.h -rw-r--r-- 1 joe joe 2076 2010-03-24 15:58 wrapperondisk.h -rw-r--r-- 1 joe joe 308 2010-03-24 16:06 wrappers.cc -rw-r--r-- 1 joe joe 5034 2010-03-24 16:06 wrappers.h -rw-r--r-- 1 joe joe 994 2010-03-24 16:06 wrappersimple.h -rw-r--r-- 1 joe joe 1344 2010-03-25 15:48 wrappers.o
(note that the exact file sizes may differ from yours)
Now that we have compiled the library, we are ready to include
it into an application.
Let us assume you wish to use the library in an application located
in /home/joe/searchapp that consists of one source
file /home/joe/searchapp/main.cc
We will discuss three options: using an in-memory index with and
without compression, and a disk-based index.
You can copy and paste the following lines of source code into /home/joe/searchapp/main.cc for us to compile:
#include "filtertree/wrappers/wrappers.h" int main() { StringContainerVector strContainer; // read 4k lines from the file specified strContainer.fillContainer("/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt", 4000); WrapperSimpleEd wrapper(&strContainer, 2); // use a simple wrapper that uses the edit distance and 2-grams wrapper.buildIndex(); float editDistance = 1.0f; string queryString = "kathrin"; vector<unsigned> resultStringIDs; // where to store the result string ids wrapper.search(queryString, editDistance, resultStringIDs); // print out the result strings cout << "SIMILAR STRINGS: " << endl; for(unsigned i = 0; i < resultStringIDs.size(); i++) { string tmp; strContainer.retrieveString(tmp, resultStringIDs.at(i)); cout << tmp << endl; } return 0; }
In this example all data structures are stored in main
memory.
This application will use the first 4000 lines of
/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt
as the data strings.
It will build an index to support approximate string search and
answer a query that asks for all data strings that are within an
edit-distance of 1 to "kathrin".
Finally, the results will be displayed.
You can copy and paste the following lines of source code into /home/joe/searchapp/main.cc for us to compile:
#include "filtertree/wrappers/wrappers.h" int main() { StringContainerVector strContainer; // read 4k lines from the file specified strContainer.fillContainer("/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt", 4000); // use wrapper for compressed index // index is compressed by discarding lists, using longest-lists-first (LLF) method // the reduction ratio is 0.5 WrapperDiscardListsLLFEd wrapper(&strContainer, 2, true, 0.5); wrapper.buildIndex(); float editDistance = 1.0f; string queryString = "kathrin"; vector<unsigned> resultStringIDs; // where to store the result string ids wrapper.search(queryString, editDistance, resultStringIDs); // print out the result strings cout << "SIMILAR STRINGS: " << endl; for(unsigned i = 0; i < resultStringIDs.size(); i++) { string tmp; strContainer.retrieveString(tmp, resultStringIDs.at(i)); cout << tmp << endl; } return 0; }
In this example all data structures are stored in main
memory.
This application will use the first 4000 lines of
/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt
as the data strings.
It will build an index to support approximate string search and
answer a query that asks for all data strings that are within an
edit-distance of 1 to "kathrin".
Finally, the results will be displayed.
You can copy and paste the following lines of source code into /home/joe/searchapp/main.cc for us to compile:
#include "filtertree/wrappers/wrappers.h" int main() { StringContainerRM strContainer; strContainer.createAndOpen("collection.rm"); // read 4k lines from the file specified strContainer.fillContainer("/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt", 4000); // use a simple wrapper that uses the edit distance and 2-grams, // inverted lists are saved to the file index.ix WrapperOnDiskSimpleEd wrapper(&strContainer, "index.ix", 2); wrapper.buildIndex(); float editDistance = 1.0f; string queryString = "kathrin"; vector<unsigned> resultStringIDs; // where to store the result string ids wrapper.search(queryString, editDistance, resultStringIDs); // print out the result strings cout << "SIMILAR STRINGS: " << endl; for(unsigned i = 0; i < resultStringIDs.size(); i++) { string tmp; strContainer.retrieveString(tmp, resultStringIDs.at(i)); cout << tmp << endl; } return 0; }
In this example, both the inverted index and the data strings
are stored on disk (the FilterTree remains in memory).
This application will use the first 4000 lines of
/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt
as the data strings.
It will build an index to support approximate string search and
answer a query that asks for all data strings that are within an
edit-distance of 1 to "kathrin".
Finally, the results will be displayed.
Since we decided to have every module produce it's own library
(.a file) it is necessary to link several .a files with the main.o
(produced by /home/joe/searchapp/main.cc).
The simplest way to achieve this is to create a makefile for the
application, i.e. create a file
/home/joe/searchapp/makefile with the following
contents:
include /home/joe/flamingo-3.0/src/makefile.inc LDFLAGS = -lrt all: main main: main.o \ $(APPSTRINGROOT)/common/libcommon.a \ $(APPSTRINGROOT)/listmerger/liblistmerger.a \ $(APPSTRINGROOT)/filtertree/libfiltertree.a \ $(APPSTRINGROOT)/filtertree/wrappers/libfiltertreewrappers.a \ $(APPSTRINGROOT)/sepia/libsepia.a \ $(APPSTRINGROOT)/util/libutil.a
Now you should be able to compile the application using make, i.e.
$ make
If make was successful, you have compiled and linked the application! It is time to try and run it by typing
$ ./main
You should have the an output something similar to the following:
INPUTFILE: "/home/joe/flamingo-3.0/src/filtertree/data/female_names.txt" 100% FILLING CONTAINER: 4000/4000; 0'0"/0'0" SIMILAR STRINGS: kathryn kathrine katherin kathrin
Congratulations, you have successfully created your first application using The Flamingo Package!
Approximate string search can be performed in two basic steps: (1) building the index, and (2) answering queries using the index. We will now discuss the basic components for each of the steps at a high-level.
Refer to filtertree/example.cc for some advanced examples.
Apart from reading this guide, we recommend you browse through the code of some example files. We have provided these files to help you understand how to use the library as quickly as possible.