Wednesday 25 December 2013

Client-Side Similarity Matching

Overview

This post is entirely inspired by a post I had recently seen on HackerNews, specifically, "Writing a full-text search engine using Bloom filters", by Stavros Korokithakis. I then went off and did a bit of Googling, and came up with a paper by Jain, Dahlin and Tewari.

The Full-Text Search

Korokithakis' method is actually reasonably ingenious, take your documents, tokenise them, and enter them into a bloom filter. This gives a highly compact representation of the set of tokens in your document which is extremely efficient to query. Take the bloom filter, and add it to a map of DocumentId ->  BloomFilter.

To search, one tokenises the query,  iterates the values (BloomFilters) of the map querying if the query tokens appear in the bloom filter. If they do, the document is considered a hit in the search.

It doesn't take much imagination to extend the tokenising to stop-word filtering, stemming, n-grams, and all the other lovely methods used by more traditional full-text search engines to make the search more usable by humans, and to extend the "relevance" beyond hit/no hit (e.g. more hits = better is one improvement to the described scheme that I can think of).

Similarity Searching

This is where Jain, Dahlin and Tewari's paper comes in handy. They used bloom filters for detecting similar web pages when crawling a site.

Their method was to take index the web pages using a bloom filter, but using a different tokenising method, and then to simply count the proportion of matching bits in the resulting bloom filters. They then have some nice analysis to show how "similar" two documents can be considered to be if their bloom filters have a high proportion of set bits.

Onto the Client

This is very simple, you parcel up your index, but instead of searching it, you simply take the current document's already computed bloom filter (no need to send over the pesky tokenisation code!) and do a bitwise AND on the other elements of the index. The ones with the highest cardinality are the ones which are most similar to your selected article.

This is also the kind of calculation that could be done once at at index generation time.

No comments:

Post a Comment