Disclaimer: I’m not an expert in Sphinx.

Sometimes when doing a search you want to match not only the words in the query but also their synonyms, to increase the number of results. Sphinx doesn’t come with this by default. Instead it comes with a feature named “wordforms”, but wordforms is not a full synonyms feature. As its name suggests, it takes care of the forms (variations) of a word and of misspellings, or maps several uncommon words to a single one. Bear in mind: a single one. You can’t declare dog > cat and then cat > dog. You can declare dog > cat and mouse > cat, and both will be replaced by cat when searching, but you can’t make a search for one of them match all three.

So, how to do it? Since there is no built-in feature, the only option left is to rewrite the query string. Let’s say we search for “black cat” and cat has the synonym dog. Our query changes from “black cat” into “black cat|dog”, and Sphinx returns both “black cat” and “black dog” matches. (In Sphinx’s extended query syntax the OR operator | binds tighter than the implicit AND, so this reads as black AND (cat OR dog).)

How to do that:

- first we create a file (let’s call it synonyms.txt) in which every line holds one list of synonyms

- when we receive a query string, we split it into words and look for each word in this file

- when a match is found, we modify the query string by replacing the word with the words found on that line

- do the search
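
The steps above can be sketched in PHP roughly as follows. This is only a sketch under a few assumptions: the function names build_synonym_index and expand_query are made up for illustration, every synonym is a single word, and the lines are passed in as an array (for example the result of file("synonyms.txt")). It indexes every word once, so each query word becomes a single array lookup.

```php
<?php
// Hypothetical helper: map every word to the OR-group of its synonym line.
function build_synonym_index(array $lines): array
{
    $index = array();
    foreach ($lines as $line) {
        // "cat | mouse | dog" -> array("cat", "mouse", "dog")
        $group = array_filter(array_map('trim', explode('|', strtolower($line))), 'strlen');
        if (count($group) < 2) {
            continue; // a single word has nothing to expand to
        }
        $or_group = implode('|', $group);
        foreach ($group as $word) {
            $index[$word] = $or_group;
        }
    }
    return $index;
}

// Hypothetical helper: replace every known word with its OR-group.
function expand_query(string $query, array $index): string
{
    $words = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $i => $word) {
        if (isset($index[$word])) {
            $words[$i] = $index[$word];
        }
    }
    return implode(' ', $words);
}

$index = build_synonym_index(array("cat | dog"));
echo expand_query("black cat", $index), "\n"; // black cat|dog
```

The expanded string is what you would then pass to Sphinx as the query.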

Problems:

- obviously, response time grows with the length of the query (and with filtering etc.). The new query with OR operators shouldn’t increase the response time very much (though you might notice it if your data collection is big and you get a lot of traffic)

- searching for the synonyms. Here you could run into trouble, especially as the file grows. The simple way is to read each line, explode it and check whether the word is in the resulting array. This is not very efficient, and for a big file it is a problem, since you will consume a lot of memory. Alternatives might be:

  • use grep – it’s pretty fast and returns the matched line;
  • use memcached to store the matched line for a given word. You can store a key like ‘synonym_cat’ with content ‘cat|dog’. Ideally memcached runs on the same server, to avoid network latency.
  • use APC instead of memcached. You could also cache the function that does the search in the file
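
As a rough illustration of the lookup options above, the sketch below scans the file one line at a time (so the whole file is never held in memory) and stores each result under a key such as synonym_cat. A plain PHP array stands in for memcached or APC here; the function name find_synonyms and the cache passed by reference are assumptions for the example, not part of any library.

```php
<?php
// Hypothetical helper: return the OR-group for $word, or null if none exists.
// $cache is a plain array standing in for memcached/APC in this sketch.
function find_synonyms(string $word, string $file, array &$cache): ?string
{
    $word = strtolower($word);
    $key = 'synonym_' . $word;
    if (array_key_exists($key, $cache)) {
        return $cache[$key]; // cache hit, no file access
    }
    $result = null;
    $handle = fopen($file, 'r');
    if ($handle !== false) {
        // Read one line at a time instead of loading the whole file.
        while (($line = fgets($handle)) !== false) {
            $group = array_map('trim', explode('|', strtolower($line)));
            if (in_array($word, $group, true)) {
                $result = implode('|', $group);
                break;
            }
        }
        fclose($handle);
    }
    $cache[$key] = $result; // misses are cached too
    return $result;
}

// Usage with a temporary file so the example runs on its own.
$file = tempnam(sys_get_temp_dir(), 'syn');
file_put_contents($file, "cat | dog\nbig | large\n");
$cache = array();
echo find_synonyms('cat', $file, $cache), "\n"; // cat|dog
unlink($file);
```

With a real cache you would swap the array reads and writes for the get and set calls of your cache client; the key scheme stays the same.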

Here is an example of how to do it (please note that this solution is not optimal for large files):

Each line of synonyms.txt will look like this:

cat | mouse | dog

If you use another separator, be careful to replace it with | (the OR operator) when inserting into the query string.
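
For instance, if the file used commas instead, a line could be normalized like this before it goes into the query. The function name to_or_group and the comma separator are assumptions for the example.

```php
<?php
// Hypothetical helper: turn a line with a custom separator into a Sphinx OR-group.
function to_or_group(string $line, string $separator = ','): string
{
    $words = array_map('trim', explode($separator, $line));
    // Drop empty entries left by trailing separators or blank lines.
    $words = array_filter($words, 'strlen');
    return implode('|', $words);
}

echo to_or_group("cat, mouse, dog"), "\n"; // cat|mouse|dog
```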

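// Load the synonym groups, one per line, e.g. "cat | mouse | dog".
$groups = array();
foreach (file("synonyms.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $groups[] = array_map('trim', explode('|', strtolower($line)));
}
// Normalise the query, then split it into words (foreach cannot walk a string).
$tmp_string = strtolower(str_replace(array('-', '+'), " ", $input_string));
$words = preg_split('/\s+/', $tmp_string, -1, PREG_SPLIT_NO_EMPTY);
foreach ($words as $i => $word) {
    foreach ($groups as $group) {
        // Compare whole words only; strpos() would also match "at" inside "cat".
        if (in_array($word, $group, true)) {
            $words[$i] = implode('|', $group);
            break;
        }
    }
}
// The rebuilt query is lower-cased, with '-' and '+' replaced by spaces.
$input_string = implode(' ', $words);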

As I said, this is not a perfect solution; for example, it should check that words have a minimum length before expanding them.
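
A minimum-length check could look roughly like this. The threshold of 3 and the function name is_expandable are arbitrary choices for the sketch; the right value depends on your data.

```php
<?php
// Hypothetical helper: only words of at least $min_length bytes get expanded.
function is_expandable(string $word, int $min_length = 3): bool
{
    // strlen() counts bytes, which equals characters for plain ASCII words.
    return strlen(trim($word)) >= $min_length;
}

$words = array('a', 'black', 'cat');
// Keeps only "black" and "cat"; "a" is too short to be worth a synonym lookup.
print_r(array_values(array_filter($words, 'is_expandable')));
```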