Dealing with Sphinx: encountered a bug?

The first reaction of most people when they hit a bug, either an unexpected result or a program crash, is to file a bug ticket or report it on the forum.

Is this the best way? Well, you need to do one thing before jumping in and reporting a bug: make sure you are using the latest version.

There are a lot of reports of crashes or "this doesn't work the way it's supposed to". When you ask, what do you find? In general, the reporter uses 2.0.2-beta, or 2.0.1, or, worse, 0.9.9.

0.9.9 was released THREE years ago. Sphinx doesn't even support this version anymore, except on express request (read: money). Some Linux distributions still ship 0.9.9 in their repositories; use a package from the Sphinx site instead. A version like 0.9.9 is old, may have bugs and misses a lot of features, and open-source support will never fix a bug reported against it. The same goes for 1.10. Bugs are fixed for the latest stable, the latest beta and the development version: 2.0.6, 2.1.1 and trunk at the moment of writing this post.

You should always run the latest (stable) version of Sphinx. Why? Two reasons: every maintenance release gets fixes and possible speed improvements, and Sphinx keeps backwards compatibility, so it's safe to upgrade. A newer version can handle indexes created with an older one, even across releases such as 2.0.6 and 2.1.1. Of course, it's wise to back up the indexes in case something goes wrong. Upgrading is very easy: if you use a binary package, just install the binary package for the newer version. Sphinx itself consists of several executables and a configuration file, and the binary packages will not overwrite your config on upgrade.


What Sphinx is and what it can do

From my experience with Sphinx integrations, I've seen that many people don't understand, both technically and literally, what SphinxSearch is, what it can do and, most importantly, what it is NOT and what it CANNOT do.

First, let's take the idea behind Sphinx. Sphinx probably appeared because its creator was not satisfied with the performance of MySQL (or any other database) when doing a search, more exactly a TEXT search. That main idea is, even now at version 2, what Sphinx does and what it is good at: FAST text search. How does it achieve that? Simply explained, it uses a kind of inverted index. What's that? Well, if you have a text, you can represent it in two ways: one is to store the text as it is; the other is to store all the words in that text, their number of occurrences and (maybe) the position of each word in the text. The first thing you need to know about Sphinx is that it never stores the full text.
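As a rough sketch of the idea (this only illustrates the concept, not Sphinx's actual storage format), an inverted index maps each word to the documents and positions where it occurs:

```php
<?php
// Toy inverted index: word => [docId => [positions...]].
// Conceptual sketch only, not how Sphinx stores data on disk.
function buildInvertedIndex(array $docs): array {
    $index = [];
    foreach ($docs as $docId => $text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $pos => $word) {
            $index[$word][$docId][] = $pos;
        }
    }
    return $index;
}

$docs = [1 => 'Techno music is music', 2 => 'A techno track'];
$index = buildInvertedIndex($docs);
// $index['techno'] => [1 => [0], 2 => [1]]
// $index['music']  => [1 => [1, 3]]
```

Searching then becomes a lookup in this word map instead of a scan over the full texts, which is where the speed comes from.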

Now you have some tables and you want Sphinx to do the search job. What exactly happens? Sphinx needs to get that data. Currently there are two ways: you give it either an SQL query or an XML file. Careful: an SQL query! It can be short or long, but it's a query. It can be simple or have 10 joins in it. What really matters is closely connected to what you want to search. Even in a classic SQL search, in the end you want something returned from the search: a book, a user, a message, whatever. The important thing is that this item must have an ID. In MySQL it's usually a primary key, called id or whatever_id. So, getting back to our query: you are throwing some data at Sphinx, and each item needs a unique id. Sphinx doesn't know about primary keys and doesn't care; the id you give it is just an integer to it. What really matters is that this integer is unique, otherwise you will have duplicates. So now Sphinx has your data. A very important difference between Sphinx and MySQL is that, unlike MySQL, Sphinx doesn't know how to join different types of data. You can't join apples with oranges like in MySQL and get back a row. Sphinx can mix results from different indexes, but that's all it does: MIX. When mixing, it only sees the document id; if it gets common ids, it does know to remove the duplicates. This is pretty much how distributed indexes and main+delta schemes work.
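As a sketch, the SQL way looks like this in sphinx.conf (database credentials, table and field names are made up for illustration; the first column of sql_query is the unique document id Sphinx requires):

```
source books
{
    type     = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db   = mydb

    # First column = unique document id; the rest are fields/attributes.
    sql_query = SELECT id, title, description FROM books
}

index books
{
    source = books
    path   = /var/lib/sphinx/books
}
```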

Another very, very important thing: Sphinx returns the document id, no text; we already said it doesn't keep the full text. Of course, besides the text fields, it can index integer, boolean and float attributes, which will be returned. The reason is simple: those are stored as they are. The latest releases also support string attributes which, in contrast to text fields, are returned as well. But be careful: there are some limitations, and string attributes need more memory.

Why did I say that Sphinx returns the document id? Simply because doing a search request in Sphinx alone is not enough (in most cases). Most likely you are displaying the text field(s) or other information, so you need to make a DB call using the ids you got from Sphinx. You don't get away from querying your database: Sphinx does not replace the database server, it complements it.
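A minimal sketch of that second step (the books table and its columns are made-up names; with the official PHP API, the ids come from array_keys($result['matches'])):

```php
<?php
// Sphinx returned only document ids; build the SQL that fetches the rows.
// Table and column names are illustrative, not from any real schema.
function buildFetchQuery(array $matchIds): string {
    // Cast every id to int so the list is safe to inline into the query.
    $ids = implode(',', array_map('intval', $matchIds));
    return "SELECT id, title, description FROM books WHERE id IN ($ids)";
}

$sql = buildFetchQuery(array(42, 7, 13));
// => SELECT id, title, description FROM books WHERE id IN (42,7,13)
```

You would then run this through PDO/mysqli and reorder the rows to match Sphinx's relevance order, since IN () does not preserve it.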

Now, there are some aspects of how Sphinx works in real life that need to be understood. When you have an application that uses a database, what do you actually do? You make requests to the database: to insert your data, to modify it, to extract it. Sphinx works the same way: you need to feed it data. There are two ways to do that, and they define the two index types Sphinx has:

The first is the so-called on-disk index. When you configure it, you give Sphinx that query. After that you need to run the indexer, the tool that runs the query, gets the data and puts it in the index. Now you have the data. OK, but the database will keep getting new data. Well, you need to run the indexer again to pick it up. Wait, what? How else could Sphinx know there's new data? So what can you do? The simple way is to run the indexer on a schedule: once per day, per hour, or per week if you don't have frequent updates. If your data gets big, or you want new info in the index faster, there is a scheme called main+delta. Basically, it consists of a big index, updated less frequently, and a small one, updated more often. But at its base the delta does the same thing as stated above: it runs a query to get some data. I will not go into the details of main+delta; I only want to emphasize one thing: with on-disk indexes, your updates will not appear magically in Sphinx.
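A sketch of the scheduled approach: a crontab that rebuilds a small delta index often and the main one rarely (paths and index names are illustrative; --rotate swaps the new index in without stopping searchd):

```
# Rebuild the small delta index every 5 minutes.
*/5 * * * * /usr/local/bin/indexer --rotate delta

# Rebuild the big main index once a day, at 03:00.
0 3 * * * /usr/local/bin/indexer --rotate main
```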

But there is a second index type, the RT (real-time) index, which can do exactly that. In short, it works almost like a MySQL table: you put data in it (using SQL queries) and the data is available within milliseconds. What's the catch? First, RT is considered a bit slower than on-disk indexes on big (very big) data sets. Second, and also very important: unlike on-disk indexes, where the indexer tool does the job, here you have to do it yourself. Added something new to the database? Great, now you immediately need to add it to Sphinx as well.
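Keeping an RT index in sync means issuing the SphinxQL INSERT yourself right after your database write. A minimal sketch (rt_books and its columns are hypothetical names; the statement would be sent over the MySQL protocol, typically on port 9306):

```php
<?php
// Build the SphinxQL statement that mirrors an application-side insert
// into a real-time index. Index and column names are made up.
function rtInsertSql(int $id, string $title, string $description): string {
    $t = addslashes($title);
    $d = addslashes($description);
    return "INSERT INTO rt_books (id, title, description) " .
           "VALUES ($id, '$t', '$d')";
}

// You would execute this with e.g. mysqli_connect('127.0.0.1', '', '', '', 9306).
$sql = rtInsertSql(42, 'Techno music', 'A book about techno');
```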

The way you retrieve the info is the same for both index types. So, on-disk: pro – fast and less code work; con – info not available immediately (not without some tricks). RT: pro – info available immediately; con – more code work and a bit slow on huge data.

The less/more work argument is very important for anyone who wants to add Sphinx to their application. Once again: on-disk is easier to integrate, but not real-time.

One more thing: people think Sphinx can do some Google-style magic on their searches. Well, no, not just like that. It has relevancy algorithms implemented, and in the latest releases you can even create your own ranking rules, BUT it will NOT perform a miracle search. Google does the magic because it doesn't just index some data and then know what you want. It records every search you make and which link you click from that search, adding a counter to that page so it can rank as more relevant next time. Even the suggestions it makes when you misspell something are not based only on an algorithm that detects the word. That part is not so complicated: it's called a thesaurus, morphology or whatever. A wrong word gets the correct suggestion in a phrase because Google recorded how many times a given meaning was the right one in association with the other words of THAT phrase. Sphinx has morphology too; it has prefix/infix options and word forms, but it cannot guess the real meaning of something. It follows the "robotic" rules you provide to it. Of course, with some work you can get very relevant results.

And in many, many cases you have a lot of particular situations. Rules, rules, rules. You want results to be relevant; you also want to show something close to relevant when nothing relevant is found. This is CUSTOM work, not something you just plug in. You need to analyze the data you have and HOW your users search (this deserves real attention: whatever you build won't help if your users simply don't type the search text the way you expected). You might add Sphinx to your project in a few hours, but reaching the level of relevancy you want can take many times that. And let's not forget that most likely you don't want to lose much performance, because improving relevancy can slow searches down. In that case, first try to optimize things. If that doesn't work, you need more power: the best option is to keep the indexes in memory; also, an index can be split into several chunks (each chunk is actually an index too, but Sphinx knows how to mix them), so a single search can use all the available cores.

Sphinx is fast, it's also pretty smart and, not least, it can scale very well. Still, there is a chance it might not fit your case, simply because it's not suited for the task or it's too complicated to achieve what you need. There are other alternatives: Lucene/Solr, even the Google search server, etc.

In the end, several things that are good to know:

– from the business point of view: Sphinx is free, but implementing it into a system is not. You have three choices: have your developer(s) learn it, find one who already knows how to work with it, or contract a specialized company (SphinxTech Inc. is the company behind the project) to help you.

– also from a business point of view: Sphinx will not do any magic if your database or your code is slow. I've seen situations where plain bad logic was the main cause of slowness. Of course, Sphinx is (a lot) faster than MySQL for some tasks.

– from a development point of view: it's not like plugging in a USB stick and it just works; it needs to be configured and integrated. It's also good to use the latest version, even if you have to compile it (the compilation is one of the easiest I've ever seen): Sphinx is continuously developed, and new features, fixes and improvements keep being added.

– it's a complementary tool to the database; it does not replace the database.

Word boosting in Sphinx

Disclaimer: I'm not an expert in Sphinx.

Boosting words

One feature Sphinx is missing, but which is found in Solr/Lucene, is word boosting: you define some words which, when found, give the document a weight boost. In Solr/Lucene you express this with the ^ operator, like word^x, where x is a number that multiplies that word's score.

How to do that in Sphinx? Well, it's another play with the query string. Please note that this procedure will not actually give you the word boosting found in Solr; it's more of a workaround.

Let's assume you have a title and a description indexed in Sphinx, and let's take the example "techno music". You will search this string in both title and description (you might have a different weight set for each field; in most cases title gets a field weight boost). Say you have a query like "^techno music$"|"techno music"|"techno music"~10 to get a generally good search (the first is an exact match on a field, the second is an exact phrase, the third is a 10-word proximity match). The reason you want this boosting is mainly that your titles might not contain both words, and you are interested in those that contain the word techno, since music is more general. To boost techno you need to add another |"techno"/1. What you achieve is that, besides the phrase searches that should match the content, you also match (if it exists) the word techno in the title. It could also match techno only in the description, but since you consider it an important term, that's no problem: you achieved the desired search. If it's a single word, you don't need the /1 operator; but if you had techno house as boosted words, your extra query part would be |"techno house"/1, to match at least one of them.
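A sketch of how that extra clause could be appended from a list of boosted words (the function and the word list are hypothetical; /1 is Sphinx's quorum operator, "match at least one"):

```php
<?php
// Append a quorum clause for whichever boosted words appear in the query,
// as a workaround for Solr-style word boosting.
function addBoostClause(string $query, array $boostWords): string {
    $words = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    $found = array_values(array_intersect($boostWords, $words));
    if (count($found) === 0) {
        return $query;
    }
    $clause = '"' . implode(' ', $found) . '"';
    if (count($found) > 1) {
        // /1: it is enough that at least one of the boosted words matches.
        $clause .= '/1';
    }
    return $query . '|' . $clause;
}

$q = addBoostClause('"techno music"', array('techno', 'house'));
// => "techno music"|"techno"
```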

The list of boosted words can be kept in a file that is read when doing a search. You can also use memory caching (memcached, APC, whatever).

UnBoosting words

What happens if you want the opposite? Take the example above: techno music. Music is a general term. You could apply the previous procedure and boost a list of music subgenres, but why not decrease the score of the term music instead? Sphinx has a stopwords list feature, but if you add music there, it won't be counted in the search at all. And you don't want that, because you might still want to match music if techno is not found. The idea is that you want first the docs that match techno, and then those that match music.

The implementation is similar to the one for boosting:

  • read the list of words that are not important
  • create a new string that contains the input string without the words in that list (use str_replace to delete them)
  • add |"new_string"/1 to your search query

What we have now is existing_query | "new_string"/1, so basically we made new_string a boosted list of words. If you searched for "techno house music" and music is among your unimportant words, the query becomes "techno house music" | "techno house"/1. The first results will contain techno, house and music, followed by those matching techno OR house. If you use a more relaxed query like |"techno house music"/10, then you might get more results (with less relevance).
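The steps above can be sketched like this (the function name and the unimportant-words list are hypothetical):

```php
<?php
// "Un-boost" unimportant words: strip them from the input, then OR the
// remaining important words back in as a quorum clause.
function addUnboostClause(string $query, array $unimportant): string {
    // Remove the unimportant words, then tidy up leftover whitespace.
    $stripped = trim(preg_replace('/\s+/', ' ', str_replace($unimportant, '', $query)));
    if ($stripped === '' || $stripped === $query) {
        return '"' . $query . '"';
    }
    // The stripped string acts as the boosted part: match at least one word.
    return '"' . $query . '" | "' . $stripped . '"/1';
}

$q = addUnboostClause('techno house music', array('music'));
// => "techno house music" | "techno house"/1
```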

Both procedures help your results especially when you are searching multiple fields with variable weights. In both cases you add an OR that searches only for some words (the ones considered more important), so they can match in smaller fields (like a title). Of course, you could also do | @title term, in which case you make a match only if the term exists in that field.

Happy Sphinxing 🙂

Synonyms in Sphinx

disclaimer : I’m not an expert in Sphinx

Sometimes when doing a search you want to search not only for the words in the query, but also for their synonyms, to increase the number of results. Sphinx doesn't offer this by default. Instead it comes with a feature named "wordforms". Yet wordforms is not a fully featured synonyms mechanism. As its name says, it takes care of the forms (variations) of a word and misspellings, OR maps several uncommon words onto a single one. Bear in mind: a single one. You can't declare dog > cat and then make cat > dog. So you could do dog > cat and mouse > cat, and both will be replaced by cat when searching, but you can't make it search for all three.
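For reference, a wordforms file uses one mapping per line, in "source > destination" form; as said above, the mapping only goes one way:

```
dog > cat
mouse > cat
```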

So, how to do it? Since we don't have a built-in feature, the only option left is to work with the query string. Let's say we search for "black cat" and we have the synonym dog for cat. Our query transforms from "black cat" into "black cat|dog", and Sphinx will return both "black cat" and "black dog" matches.

How to do that :

– first, we create a file (let's call it synonyms.txt) in which we put one synonym list per line

– when we receive a query string, we split it into words and, for every word, we look in this file for a match

– if a match is found, we modify the query string by replacing the word with the words found on that line

– do the search

Problems :

– obviously, response time grows with the length of the search query (and filtering etc.). This new query with OR operators shouldn't increase the response time very much, though you might notice it if your data collection is big and you get a lot of traffic

– searching for the synonyms. Here you can get into trouble, especially if the file grows. The simple way is to read each line, explode it and check whether the word is in the resulting array. This is not very efficient, and for a big file it becomes a problem, since you will consume a lot of memory. Alternatives might be:

  • use grep: it's pretty fast and will return the matched line;
  • use memcached for the matched line of a given word. You can store a key like 'synonym_cat' with the content 'cat|dog'. It's best to have memcached on the same server, to avoid network lag;
  • use APC instead of memcached. You could also cache the function that does the search in the file.
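The grep alternative from the first bullet can be sketched like this (the sample file contents are made up):

```shell
# Create a sample synonyms file (one synonym list per line).
printf 'cat | mouse | dog\nfish | shark\n' > synonyms.txt

# -w matches whole words only, so 'cat' will not hit 'category'.
grep -w 'cat' synonyms.txt
# prints: cat | mouse | dog
```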

Here is an example of how to do it (please note that this solution is not optimal for large files):

Each line of synonyms.txt will look like this:

cat | mouse | dog

If you use another separator, be careful to replace it with | (the OR operator) when inserting it into the query string.

$lines = array();
$synofile = file("synonyms.txt");
foreach ($synofile as $line) {
    // Normalize "cat | mouse | dog" to "cat|mouse|dog" for the query string.
    $lines[] = str_replace(' | ', '|', trim($line));
}
// Split the input into words, treating '-' and '+' as separators too.
$tmp_string = strtolower(str_replace(array('-', '+'), " ", $input_string));
$words = preg_split('/\s+/', $tmp_string, -1, PREG_SPLIT_NO_EMPTY);
foreach ($words as $word) {
    foreach ($lines as $line) {
        // Compare whole words only, so 'cat' does not match 'category'.
        if (in_array($word, explode('|', $line))) {
            $input_string = str_replace($word, $line, $input_string);
            break;
        }
    }
}

As I said, this is not a perfect solution; for example, it should also check the words for a minimum length.