disclaimer : I’m not an expert in Sphinx
Boosting words
One feature that Sphinx is missing , but it’s found in Solr/Lucene is word boosting – a.k.a you want some words which if are found, the document gets a weight boost . in Solr/Lucene you can define this by using ^ operator , like word^x , where x is a number and a multiplicator for that word’s score .
How to do that in Sphinx ? Well , it’s another play with the query string . Please note that the procedure will not actually give you the word boosting found on Solr ,it’s more a workaround .
Let’s assume you have a title and a description that are indexed in sphinx . Let’s take an example “techno music” . You will search this string in both title and description ( you might have different weight set for each field , in most cases title gets a field weight boost ) . Let’s say you have a query like “^techno music$”|”techno music”|”techno music”~10 to have a general good search ( first is exact match on a field , second is exact phase , third is a 10 words promixity match ) . The reason you want this boosting is mainly because your titles might not have both words and you are intested of those who have techno word – since music is more general . To boost techno you need to add another |”techno”/1 . What you will achieve is , beside the phase search that should match content , you are matching ( if exist ) the word tehno in the title . It could also match only techno in description , but since you consider it an important term , then it’s no problem , you achieved the desired search. If it’s only one word , then you don’t need the /1 operator . But if you would have techno house as boosted words , your extra query would be |”techno house”/1 , to give a match of at least one .
The list of boosted words can be held in a file which is readed when doing a search . You can also use a memory caching ( memcached , apc ,whatever ) .
UnBoosting words
What happends if you want the opposite ? Let’s take the above example : techno music . Music is a general term . You could implement the above procedure and make a list with music subgenres and boost them , but why not decrease the score for music term ? Sphinx has the stopwords list feature , but if you add music there , it won’t be counted in search . And you don’t want that , because you might want to match music too , if techno is not found . The idea is you want first the docs that match techno and then those that match music .
The implementation is similar with the one for boosting :
- read the list of words that are not important
- create a new string that contains the input string but without the words in the list above ( use str_replace to delete them)
- to your query search add a |”new_string”/1
What we have now it will be existing_query | “new_string”/1 , so we basicly made the new_string to be a boosted list of words . So if you searched for “techno house music” and music is in your not important words , the query will be now “techno house music” | “techno house”/1 . First results will contain techno,house and music , then techno OR house . If you use a more relaxed query like |”techno house music”/10 , then you might get more results ( with less relevance ) .
Both procedures will help your results especially when you are searching in multiple fields with variable weight . In both cases you will do an OR which search only for some words ( which are considered more important ) in case they can match in smaller fields ( like a title ) . Of course , you could do | @title term and in this case you are making a match if exist only in that field .
Happy Sphixing