MySQL Relevancy Search

If you have ever developed an interface for a data or content intensive application, you know that having a proper search mechanism is essential. With the new version of MySQL (4.0+), you can build ultra fast relevancy sorted searchs directly into your application. With a little PHP and some slick SQL commands, I can show you how to do this with minimal effort.

The most basic implementation of an SQL database search is to use the LIKE command after the WHERE clause, usually in conjunction with the percentage operator. Surrounding a string with percentage signs on both ends instructs MySQL to search for any text containing substrings of your query. The percentage sign acts as a wildcard operator, like the asterisk in Windows. The SQL statement for this usually looks like:

 
Of course, the returned resource from this query will be strings that contain the word waffles, such as "I like waffles", "I ate waffles for breakfast", or "waffles are nutritious and buttery".  You can also use the command NOT LIKE to exclude results that match your condition. Unfortunately — for a large dataset — you will often get thousands of results that match your criteria. Your user will most likely be looking for an entry that is buried 20 pages in, never to be discovered.

The solution to this problem is to use a relevancy algorithm. Relevancy is the measurement of how closely your result matches your search query. Google uses a very complex relevancy algorithm, much more complex than we need for the typical database web application. Luckily for us, MySQL has it’s own relevancy algorithm built in.

The actual behavior of this algorithm is not that important, just know that it takes into account many factors to determine a relevancy value beyond substring matching. The algorithm actually assigns a floating point number to each returned entry, allowing you to sort by descending relevancy or display the actual relevancy in your result. The first step to building this type of search is to build an index for your data. Indices maintain an easily searchable data structure allowing very fast search queries to be executed. Indices also have the advantage of keeping only the relevant fields of your database indexed and ignoring the rest. To build a fulltext index, you should issue a SQL command through the command line or a related frontend (phpMyAdmin). The command is simple, and looks like this:

Replace name_of_index with some arbitrary identifier for the index, table with the name of the table you wish to index, and replace the column names within the parenthesis with the fields you wish to index. These fields should only be relevant to the search query. Note that indices only work on MyISAM databases. Also remember that once you create an index, it becomes automatically liked to your dataset and will affect all subsequent queries. This may slow down INSERT and UPDATE commands, but the effect will be unnoticeable unless you have an enterprise level dataset. The index will be self updating, and if you wish to remove it later you may do so. Also be aware that an index can take up quite a bit of disk space for large data sets. This is a speed/space trade-off, so be prepared to use some resources.

Now that we have our index we can properly use the MySQL relevancy functionality. These commands are called MATCH and AGAINST, used in tandem. Here’s our new search query:

This will return a MySQL resource where all returned rows are relevant to $search_string in some way. This is not too useful in this unsorted form, so the trick to get the results to return in order of sorted decreasing relevancy is to add another MATCH AGAINST pair to the WHERE clause:

Now you can blast out your search results, sorted by relevancy, and even use the new score field that is returned to give each entry a relevancy score or percentage. In my experience the user does not really need to know this information, the order implies the score by itself. Displaying it is just one of those things a true geek can’t resist.

More information on this technique can be found in the MySQL manual.


About this entry


You may also enjoy


Recent comments