Thursday 26 December 2013

Searching with Bloom Filters in Haskell

Overview


This post is primarily about my experience with the bloomfilter package in Haskell.

Coding it up

Coding it up was much of a muchness, aside from a late realisation that there's no Hashable instance for Text which caused a quick change from Text to ByteString, but aside form that, it was very easy.

Even querying the index is easy, I shall post up the code when I am back at home, since I'm with my parents over the Christmas period.

Overall, it's roughly 38 lines, but around 7 lines of that are simple module imports.

Testing

To test it, I grabbed roughly 2.5MB of data from Project Gutenberg, and simply started querying the index. I haven't done any "proper" performance testing so I've just been feeling it out with no timers.

I didn't bother forcing anything, so the first query can take a little while as the index is built, but it's nothing prohibitive. After that, every search feels instantaneous.

After building with -rtsopts, I had it generate a heap profile, which showed that the peak memory allocated with my test set was roughly 8MB, but that fell back down quite rapidly.

Improvements

I suspect that a carefully used deepseq, and ByteString.Lazy could really improve the memory foot print of the system, and remove the initial "hang" when doing the first search. I shall have to investigate this at a later date.

Extensions

If I could just get access to the underlying UArray Int Hash in a way that I could manipulate, I could make searching more probabalistic, but also much faster.

Similarly, I could extend the usage of a bloom filter into a similarity metric, but unfortunately, I'm not entirely sure how to use the underlying representation at this time.

I basically need access to the cardinality of the underlying representation, and the ability to union and intersect the underlying bit arrays. Once I have these (or worked out how to do it on the underlying representation) performing similarity matching becomes very simple.

Conclusion

Haskell's bloom filter library seems to be very fast whilst maintaining an easy to use facade.

However this is not flexible enough to really play with bloom filters in all the ways they can be very easily. Perhaps basing the underlying representation on the bitset package, or offering a "serialisable" version which can be easily manipulated would be all it takes.

Wednesday 25 December 2013

Client-Side Similarity Matching

Overview

This post is entirely inspired by a post I had recently seen on HackerNews, specifically, "Writing a full-text search engine using Bloom filters", by Stavros Korokithakis. I then went off and did a bit of Googling, and came up with a paper by Jain, Dahlin and Tewari.

The Full-Text Search

Korokithakis' method is actually reasonably ingenious, take your documents, tokenise them, and enter them into a bloom filter. This gives a highly compact representation of the set of tokens in your document which is extremely efficient to query. Take the bloom filter, and add it to a map of DocumentId ->  BloomFilter.

To search, one tokenises the query,  iterates the values (BloomFilters) of the map querying if the query tokens appear in the bloom filter. If they do, the document is considered a hit in the search.

It doesn't take much imagination to extend the tokenising to stop-word filtering, stemming, n-grams, and all the other lovely methods used by more traditional full-text search engines to make the search more usable by humans, and to extend the "relevance" beyond hit/no hit (e.g. more hits = better is one improvement to the described scheme that I can think of).

Similarity Searching

This is where Jain, Dahlin and Tewari's paper comes in handy. They used bloom filters for detecting similar web pages when crawling a site.

Their method was to take index the web pages using a bloom filter, but using a different tokenising method, and then to simply count the proportion of matching bits in the resulting bloom filters. They then have some nice analysis to show how "similar" two documents can be considered to be if their bloom filters have a high proportion of set bits.

Onto the Client

This is very simple, you parcel up your index, but instead of searching it, you simply take the current document's already computed bloom filter (no need to send over the pesky tokenisation code!) and do a bitwise AND on the other elements of the index. The ones with the highest cardinality are the ones which are most similar to your selected article.

This is also the kind of calculation that could be done once at at index generation time.