Word boosting in Sphinx

Disclaimer: I’m not an expert in Sphinx.

Boosting words

One feature that Sphinx is missing, but which is found in Solr/Lucene, is word boosting: you pick some words and, if they are found, the document gets a weight boost. In Solr/Lucene you can define this with the ^ operator, as in word^x, where x is a number used as a multiplier for that word’s score.

How do you do that in Sphinx? Well, it’s another play with the query string. Please note that this procedure will not actually give you the word boosting found in Solr; it’s more of a workaround.

Let’s assume you have a title and a description that are indexed in Sphinx, and let’s take “techno music” as an example. You will search for this string in both title and description (you might have a different weight set for each field; in most cases the title gets a field weight boost). Let’s say you have a query like "^techno music$" | "techno music" | "techno music"~10 to get a generally good search (the first part is an exact match on a whole field, the second is an exact phrase, the third is a 10-word proximity match).

The reason you want this boosting is mainly that your titles might not contain both words, and you are interested in those that have the word techno, since music is more general. To boost techno you need to add another | "techno"/1. What you achieve is that, besides the phrase searches that should match the content, you also match (if it exists) the word techno in the title. It could also match techno only in the description, but since you consider it an important term, that’s no problem: you achieved the desired search. If it’s only one word, you don’t actually need the /1 operator. But if you had techno house as boosted words, your extra query would be | "techno house"/1, to match at least one of them.
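Here is a rough sketch of how that query string could be put together in PHP. The function name build_boosted_query and the boost list are made up for the example, and it assumes the input has already been cleaned of Sphinx operators:

```php
<?php
// Hypothetical helper: appends an extra OR clause for the boosted words
// that appear in the input string.
function build_boosted_query($input, array $boost_words)
{
    // Sketch only: a real implementation should also escape Sphinx operators.
    $phrase = trim(strtolower($input));
    $words  = preg_split('/\s+/', $phrase, -1, PREG_SPLIT_NO_EMPTY);

    // The general query from the text: whole-field match, exact phrase,
    // 10-word proximity.
    $query = '"^' . $phrase . '$" | "' . $phrase . '" | "' . $phrase . '"~10';

    // Keep only the input words that are on the boost list.
    $boosted = array_values(array_intersect($words, $boost_words));
    if (count($boosted) === 1) {
        // A single word does not need the quorum operator.
        $query .= ' | ' . $boosted[0];
    } elseif (count($boosted) > 1) {
        // Quorum of 1: match at least one of the boosted words.
        $query .= ' | "' . implode(' ', $boosted) . '"/1';
    }
    return $query;
}

echo build_boosted_query('techno music', array('techno', 'house')), "\n";
// "^techno music$" | "techno music" | "techno music"~10 | techno
```

If no word of the input is on the boost list, the general query is returned unchanged.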

The list of boosted words can be kept in a file which is read when doing a search. You can also use in-memory caching (memcached, APC, whatever).
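A minimal sketch of reading such a file, assuming one boosted word per line. The function name load_boost_words and the file name boost_words.txt are invented for the example, and the static array is only a per-request stand-in for memcached or APC:

```php
<?php
// Hypothetical loader: one boosted word per line in the given file.
function load_boost_words($path)
{
    // Per-request cache; memcached or APC would play this role across requests.
    static $cache = array();
    if (isset($cache[$path])) {
        return $cache[$path];
    }

    $words = array();
    if (is_readable($path)) {
        $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        foreach ($lines as $line) {
            // Normalize the same way the search input is normalized.
            $word = strtolower(trim($line));
            if ($word !== '') {
                $words[] = $word;
            }
        }
    }
    // A missing or unreadable file simply yields an empty list.
    $cache[$path] = $words;
    return $words;
}

// boost_words.txt is an assumed file name.
$boost_words = load_boost_words('boost_words.txt');
```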

UnBoosting words

What happens if you want the opposite? Let’s take the above example: techno music. Music is a general term. You could implement the above procedure, make a list of music subgenres and boost them, but why not decrease the score for the term music instead? Sphinx has the stopwords feature, but if you add music there, it won’t be counted in the search at all. And you don’t want that, because you might still want to match music if techno is not found. The idea is that you want first the docs that match techno and then those that match music.

The implementation is similar to the one for boosting:

  • read the list of words that are not important
  • create a new string that contains the input string but without the words in the list above (use str_replace to delete them)
  • add | "new_string"/1 to your search query

What we have now is existing_query | "new_string"/1, so we basically turned new_string into a boosted list of words. If you searched for “techno house music” and music is in your list of unimportant words, the query will now be "techno house music" | "techno house"/1. The first results will contain techno, house and music, followed by those with techno OR house. If you use a more relaxed query like | "techno house music"/10, then you might get more results (with less relevance).
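The steps above could look roughly like this in PHP. The function name build_unboosted_query is made up, and I swapped str_replace for a word-by-word array_diff so that removing music does not also chop up a word like musical:

```php
<?php
// Hypothetical helper: ORs the query with a quorum clause built from the
// words that remain after dropping the unimportant ones.
function build_unboosted_query($input, array $weak_words)
{
    $words = preg_split('/\s+/', strtolower(trim($input)), -1, PREG_SPLIT_NO_EMPTY);
    $query = '"' . implode(' ', $words) . '"';

    // Whole-word removal of the unimportant words.
    $strong = array_values(array_diff($words, $weak_words));

    // Nothing was removed, or nothing is left: keep the original query.
    if (count($strong) === 0 || count($strong) === count($words)) {
        return $query;
    }
    if (count($strong) === 1) {
        // A single remaining word does not need the quorum operator.
        return $query . ' | ' . $strong[0];
    }
    return $query . ' | "' . implode(' ', $strong) . '"/1';
}

echo build_unboosted_query('techno house music', array('music')), "\n";
// "techno house music" | "techno house"/1
```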

Both procedures will help your results, especially when you are searching in multiple fields with different weights. In both cases you add an OR branch which searches only for some words (the ones considered more important), in case they can match in smaller fields (like a title). Of course, you could also add | @title term, in which case the match only counts if the term exists in that field.

Happy Sphinxing 🙂
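For the field-restricted variant, a tiny sketch could be the following. It assumes the extended query syntax is in use and that the index has a field called title; add_title_clause is a name invented here:

```php
<?php
// Hypothetical helper: ORs an existing query with a match limited to one field.
function add_title_clause($query, $term, $field = 'title')
{
    // Parentheses keep the field limit from applying to the rest of the query.
    return '(' . $query . ') | (@' . $field . ' ' . $term . ')';
}

echo add_title_clause('"techno music"', 'techno'), "\n";
// ("techno music") | (@title techno)
```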

Synonyms in Sphinx

Disclaimer: I’m not an expert in Sphinx.

Sometimes when doing a search you want to search not only for the words included in the query, but also for their synonyms, to increase the number of results. Sphinx doesn’t come with this by default. Instead it comes with a feature named “wordforms”. Yet wordforms is not a full synonyms feature. As its name suggests, it takes care of forms (variations) of a word and misspellings, or it directs several uncommon words to a single one. Bear in mind: a single one. You can’t declare dog > cat and then cat > dog. You could do dog > cat and mouse > cat, and both will be replaced by cat when searching, but you can’t make it search for all 3.

So, how to do it? Since the feature isn’t implemented, the only option left is to use the query string. Let’s say we search for “black cat” and we have dog as a synonym for cat. Our query will be transformed from “black cat” into “black cat|dog”. In Sphinx’s extended syntax OR binds tighter than AND, so this reads as black AND (cat OR dog), and Sphinx will return both “black cat” and “black dog” matches.

How to do that :

– first we create a file (let’s call it synonyms.txt) in which we put one list of synonyms per line

– when we receive a query string, we split it into words and, for every word, we search this file for a match

– if a match is found, we modify the query string by replacing the word with the words found on that line

– do the search

Problems :

– obviously, response time always grows with the length of the search query (and with filtering etc.). The new query with OR operators shouldn’t increase the response time very much (well, you might notice it if your data collection is big and you get a lot of traffic)

– searching for the synonyms. Here you could run into trouble, especially if the file grows. The simple way is to read each line, explode it and check for every word whether it is in the array. This is not very efficient, and for a big file it is a problem, since you will consume a lot of memory. Alternatives might be:

  • use grep – it’s pretty fast and will return the matched line;
  • use memcached to store the matched line for a certain word. You can store a key like ‘synonym_cat’ with content ‘cat|dog’. Best would be to have memcached on the same server, to avoid network lag.
  • use APC instead of memcached. You could also cache the function that does the search in the file
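To give an idea of what would go into such a cache, here is a small sketch that builds a word => expansion map from the synonym lines, so each lookup is a single array access. The function name build_synonym_index is hypothetical, and the actual memcached/APC calls are left out on purpose; each map entry corresponds to one key/value pair like ‘synonym_cat’ => ‘cat|dog’:

```php
<?php
// Hypothetical helper: turns synonym lines into a word => "a|b|c" map.
function build_synonym_index(array $lines)
{
    $index = array();
    foreach ($lines as $line) {
        // Split the line on the separator and drop empty pieces.
        $group = array();
        foreach (explode('|', strtolower($line)) as $word) {
            $word = trim($word);
            if ($word !== '') {
                $group[] = $word;
            }
        }
        if (count($group) < 2) {
            continue; // a lone word has no synonyms to add
        }
        // Every word of the group points at the same OR expression.
        $expansion = implode('|', $group);
        foreach ($group as $word) {
            $index[$word] = $expansion;
        }
    }
    return $index;
}

$index = build_synonym_index(array('cat | mouse | dog', 'car | automobile'));
echo $index['dog'], "\n";                              // cat|mouse|dog
echo isset($index['category']) ? 'hit' : 'miss', "\n"; // miss
```

Because the lookup is by whole word, category does not accidentally pick up the synonyms of cat.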

Here is an example of how to do it (please note that this solution is not optimal for large files):

Each line of synonyms.txt will look like this:

cat | mouse | dog

If you use another separator, be careful to replace it with | (the OR operator) when inserting it into the query string.

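// load synonyms.txt: one group per line, e.g. "cat | mouse | dog"
$groups = array();
foreach (file("synonyms.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line){
   // split on the separator and trim every synonym
   $groups[] = array_map('trim', explode('|', strtolower($line)));
}
// normalize the input and split it into words
$tmp_string = strtolower(str_replace(array('-','+'), " ", $input_string));
$words = preg_split('/\s+/', $tmp_string, -1, PREG_SPLIT_NO_EMPTY);
foreach ($words as $i => $word){
  foreach ($groups as $group){
    // compare whole words, so "cat" does not match inside "category"
    if (in_array($word, $group, true)){
       $words[$i] = '(' . implode(' | ', $group) . ')';
       break;
    }
  }
}
// rebuild the query string with the synonyms in place
$input_string = implode(' ', $words);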

As I said, this is not a perfect solution; for example, it should test the words for a minimum length.