What Sphinx is and what it can do

From my experience with Sphinx integrations, I’ve seen that many people don’t understand, technically or even literally, what SphinxSearch is, what it can do and, most importantly, what it is NOT and what it CANNOT do.

First, let’s look at the idea behind Sphinx. Sphinx probably appeared because its creator was not satisfied with the performance of MySQL (or any other database) when doing a search, more exactly a TEXT search. This main idea is, even now at version 2, what Sphinx does and what it is good at: FAST text search. How does it do that? Simply put, it uses a kind of inverted index. What’s that? Well, a text can be represented in two ways: one is to store the text as it is, the other is to store all the words in that text along with the number of occurrences and (maybe) the position of each word in the text. The first thing you need to know about Sphinx is that it never stores the full text.
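
To make the inverted index idea a bit more concrete, here is a minimal Python sketch. It only illustrates the general technique and is not how Sphinx actually builds its index; the names build_inverted_index and search, and the sample documents, are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]}; the original text is not kept."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the sorted ids of the documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)


# Hypothetical sample data: document id -> text
docs = {
    1: "fast text search",
    2: "slow text scan",
    3: "fast database",
}
index = build_inverted_index(docs)
```

Calling search(index, "fast text") gives back only document ids, never the text itself, which is exactly the behaviour described above.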

Now you have some tables and you want Sphinx to do the search job. What exactly happens? Sphinx needs to get that data. Currently there are 2 ways: giving it either an SQL query or an XML file. Careful: an SQL query! It can be short or long, but it is a query. It can be simple or it can have 10 joins in it. What really matters is closely tied to what you want to search for. Even in a classic SQL search, in the end you want something returned from the search: a book, a user, a message, whatever. The important thing is that the item you want has an ID. In MySQL it’s usually a primary key, called id or whatever_id.

So, getting back to our query, you are throwing some data at Sphinx. Each item needs to have a unique id. Sphinx doesn’t know about primary keys and doesn’t care: it sees the id you give it as an integer. What really matters is that this integer is unique, otherwise you will have duplicates.

So now Sphinx has your data. A very important difference between Sphinx and MySQL is that, unlike MySQL, Sphinx doesn’t know how to join different types of data. You can’t join apples with oranges and get back a row, like you can in MySQL. Sphinx can mix results from different indexes, but that’s all it will do: MIX. When mixing, it only looks at the document id. If it gets the same id more than once, however, it knows to remove the duplicates. This is pretty much how distributed indexes and main+delta schemes work.
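
The "mixing" of results from several indexes can be pictured with a small Python sketch. This is a toy model and not Sphinx code: merge_results and the sample matches are hypothetical, and the rule that the later index wins when an id appears twice is an assumption made for this illustration.

```python
def merge_results(*result_sets):
    """Merge matches from several indexes, keeping one entry per document id.

    Each result set is a list of (doc_id, weight) pairs.
    Assumption for this sketch: if an id shows up in more than one index,
    the entry from the later index replaces the earlier one.
    """
    merged = {}
    for matches in result_sets:
        for doc_id, weight in matches:
            merged[doc_id] = weight
    # Highest weight first; ties are ordered by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical matches from a big "main" index and a small "delta" index
main_matches = [(1, 10), (2, 7), (3, 5)]
delta_matches = [(2, 9), (4, 6)]
merged = merge_results(main_matches, delta_matches)
```

Note that the only thing the merge looks at is the id: document 2 appears once in the output, and nothing is "joined" into a combined row.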

Another very, very important thing: Sphinx returns the document id, not texts; as we already said, it doesn’t keep the full text. Of course, besides the text fields, you can also index integer, boolean and float attributes, and these are returned with the results. The reason is simple: they are stored as they are. Recent releases also support string attributes which, in contrast to text fields, are returned as well. But be careful: there are some limitations and string attributes need more memory.

Why did I stress that Sphinx returns the document id? Simply because a search request to Sphinx alone is not enough (in most cases). Most likely you are displaying the text field(s) or other information, so you need to make a DB call using the ids you got from Sphinx. You don’t get away without querying your database. Sphinx does not replace the database server, it complements it.
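
Here is a rough Python sketch of that second step, using an in-memory SQLite database so it can run on its own. The books table, the fetch_documents helper and the ids_from_sphinx list are all hypothetical; in a real application the ids would come from the Sphinx search and the rows from your own database.

```python
import sqlite3


def fetch_documents(conn, ids):
    """Load the rows for the ids returned by the search, keeping the search order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM books WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database does not return rows in relevance order, so reorder them here
    # and skip ids that no longer exist in the database.
    return [by_id[doc_id] for doc_id in ids if doc_id in by_id]


# Stand-in for the application database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [(1, "Dune"), (2, "Emma"), (3, "Ulysses")],
)

ids_from_sphinx = [3, 1]  # hypothetical search result, best match first
results = fetch_documents(conn, ids_from_sphinx)
```

The reordering matters because an IN (...) query returns rows in whatever order the database likes, while the ids from the search come sorted by relevance.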

Now, there are some aspects of how Sphinx works in real life that need to be understood. When you have an application that uses a database, what do you actually do? You make requests to the database: to insert your data, to modify it, to extract it. Sphinx works the same way: you need to throw data at it. There are 2 ways to do that, and they define the index types Sphinx has:

The first way is the so-called on-disk index. When you configure it, you give it that query. After that you need to run the indexer, the tool that runs the query, gets the data and puts it in the index. Now you have the data. OK, but your database will keep getting new data. Well, you need to run the indexer again to pick it up. Wait, what? Well, how could Sphinx know there’s new data? What can you do? The simple way is to run the indexer on a schedule: once per hour, per day, or per week if you don’t have frequent updates. If your data gets big or you want new info to reach the index faster, there’s a scheme called main+delta. Basically it consists of a big index, updated less frequently, and a small one, updated more often. But at its base the delta does the same as described above: it runs a query to get some data. I will not go into the details of main+delta, I only want to emphasize one thing: with on-disk indexes, your updates will not magically appear in Sphinx.
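
Without going into the details, the core trick of main+delta can be shown with two queries. The Python sketch below simulates it against an in-memory SQLite database; the documents and counter tables and the rebuild_main and rebuild_delta helpers are hypothetical names chosen for this example, and the sketch only picks up new rows, not changes to rows that are already indexed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
# Single-row helper table remembering the highest id covered by the main index
conn.execute("CREATE TABLE counter (max_id INTEGER)")
conn.execute("INSERT INTO counter VALUES (0)")


def rebuild_main(conn):
    """Big, rare reindex: store the current highest id, then read up to it."""
    conn.execute(
        "UPDATE counter SET max_id = (SELECT COALESCE(MAX(id), 0) FROM documents)"
    )
    return conn.execute(
        "SELECT id, body FROM documents "
        "WHERE id <= (SELECT max_id FROM counter) ORDER BY id"
    ).fetchall()


def rebuild_delta(conn):
    """Small, frequent reindex: read only rows added since the last main rebuild."""
    return conn.execute(
        "SELECT id, body FROM documents "
        "WHERE id > (SELECT max_id FROM counter) ORDER BY id"
    ).fetchall()


conn.executemany(
    "INSERT INTO documents VALUES (?, ?)", [(1, "first"), (2, "second")]
)
main_rows = rebuild_main(conn)    # covers ids 1 and 2

conn.execute("INSERT INTO documents VALUES (3, 'third')")
delta_rows = rebuild_delta(conn)  # covers only the new id 3
```

Both functions still have to be run by someone (typically a scheduled job), which is the point made above: nothing reaches the index until a query is run to fetch it.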

But there is a second index type, the RT (realtime) index, that can do that. In short, it works almost like a MySQL table: you put data in it (using SQL queries) and the data is available within a few milliseconds. What’s the catch? First, RT is considered a bit slower than on-disk on big (very big) indexes. Second, and also very important: unlike on-disk, where you can leave the job to the indexer tool, here you have to do it yourself. You added something new to the database? Great, now you immediately need to add it to Sphinx too.

The way you retrieve the info is the same for both index types. So, on-disk: the pros are speed and less code work, the con is that info is not available immediately (not without some tricks). RT: the pro is that info is available immediately, the cons are more code work and being a bit slow on huge data.

The less/more work argument is very important for anyone who wants to add Sphinx to their application. Once again: on-disk is easier to integrate, but it is not real time.

One more thing: people think Sphinx can do some Google-search magic on their searches. Well, no. Not as easy as 1, 2, 3. It has relevancy algorithms implemented, and in the latest releases you can even create your own ranking rules, BUT it will NOT make a miracle search. Google does its magic because it doesn’t just index some data and then know what you want. It records every search you make and which link you clicked in the results, and adds a counter to that page so it can be more relevant next time. Even the suggestions it makes when you misspell something are not based only on some algorithm that detects the word. That part is not so complicated, it’s called thesaurus, morphology or whatever. A wrong word gets the correct suggestion in a phrase because Google recorded how many times that meaning was the right one in combination with the words from THAT phrase.

Sphinx has morphology too, it has prefix/infix options, it has word forms, but it cannot guess the real meaning of something. It follows the “robotic” rules that are provided to it. Of course, with some work you can get some very relevant results. And in many, many cases you have a lot of particular cases. Rules, rules, rules. You want results to be relevant. You also want to show something close to relevant if nothing relevant is found. This is called CUSTOM, it’s not something that you plug in. You need to analyze the data you have and HOW the users search; this is something to pay attention to. You can do whatever you want, but it won’t help if your users simply don’t type the search text the way you expect.

You might add Sphinx to your project in several hours, but reaching the level of relevancy you want could take many times more hours. Let’s not forget that you most likely don’t want to lose too much performance, because improving relevancy can lead to slower searches. In this case, you first need to try to optimize things. If that doesn’t work… you need more power (the best is to keep the indexes in memory; an index can also be split into several chunks [each chunk is actually an index too, but Sphinx knows how to mix them] so that a search uses the available cores).

Sphinx is fast, it’s also pretty smart and, last but not least, it can scale very well. There is also a chance that it might not be a fit for you, simply because it’s not suited to your problem or because it’s too complicated to achieve what you need. There are alternatives: Lucene/Solr, even the Google search server, etc.

In the end, several things that are good to know:

– from a business point of view: Sphinx is free, but implementing it into a system is not. You have 3 choices: have your developer(s) learn it, find one who knows how to work with it, or contract a specialized company (SphinxTech Inc. is the company behind the project) to help you.

– also from a business point of view: Sphinx will not do any magic if your database or your code is slow. I’ve seen situations where plain bad logic was the main cause of slowness. Of course, Sphinx is (a lot) faster than MySQL for some tasks.

– from a development point of view: it’s not like plugging in a USB stick and it just works; it needs to be configured and integrated. It’s also good to use the latest version, even if you have to compile it (the compilation is one of the easiest I have ever seen): it’s continuously developed, and new features, fixes and improvements keep being added.

– It’s a complementary tool to the database; it does not replace the database.