Disclaimer: I'm not an expert in Sphinx.
Sometimes when doing a search you want to search not only for the words included in the query, but also for their synonyms, to increase the number of results. Sphinx doesn't come with this by default. Instead it comes with a feature named “wordforms”. Yet wordforms is not a fully featured synonyms feature. As its name suggests, it takes care of forms (variations) of a word and misspellings, or directs several uncommon words to a single one. Bear in mind: a single one. You can't declare dog > cat and then cat > dog. You could declare dog > cat and mouse > cat, and both will be replaced by cat when searching, but you can't make Sphinx search for all three.
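As a rough illustration of the limitation, a wordforms file holding the two mappings mentioned above would look something like this (the file is referenced from the index definition through the `wordforms` directive; the file name here is only an assumption):

```
# wordforms.txt (hypothetical file name), one "source > destination" mapping per line
dog > cat
mouse > cat
```

With such a file, a search for dog or mouse is treated as a search for cat; there is no way to keep the three words distinct and still match all of them.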
So, how do we do it? Since the feature isn't implemented, the only option left is to rewrite the query string. Let's say we search for “black cat” and dog is a synonym for cat. Our query is transformed from “black cat” into “black cat|dog”. Because | (OR) binds tighter than the implicit AND in Sphinx's extended query syntax, Sphinx will return both “black cat” and “black dog” matches.
How to do that:
- first we create a file (let's call it synonyms.txt) in which every line holds one list of synonyms
- when we receive a query string, we split it into words and look for each word in this file
- if a match is found, we modify the query string by replacing the word with the words found on that line
- we run the search
Problems:
- Obviously, response time grows with the length of the query (and with filtering, etc.). The extra OR operators shouldn't increase the response time by much, although you might notice the difference if your collection of data is big and you get a lot of traffic.
- Looking up the synonyms. This is where you can run into trouble, especially as the file grows. The simple way is to read each line, explode it and check whether each query word is in the resulting array. This is not very efficient, and for a big file it is a problem, since you will consume a lot of memory. Alternatives might be:
  - use grep: it's pretty fast and returns the matched line;
  - use memcached to store the matched line for a given word. You can store a key like ‘synonym_cat’ with content ‘cat|dog’. It is best to run memcached on the same server, to avoid network lag;
  - use APC instead of memcached. You could also cache the result of the function that searches the file.
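Whichever cache you pick, the idea is the same: turn the file into a table keyed by word, so that each query word costs a single lookup instead of a scan of every line. Here is a minimal sketch of building such a table in plain PHP; `buildSynonymIndex` is a hypothetical helper name, and the input is assumed to be the lines of synonyms.txt with `|` as the separator.

```php
<?php
// Hypothetical helper: turn the lines of synonyms.txt into a word => group map.
function buildSynonymIndex(array $lines): array
{
    $index = array();
    foreach ($lines as $line) {
        // Split "cat | mouse | dog" into clean, lower-cased words.
        $group = array_filter(array_map('trim', explode('|', strtolower($line))), 'strlen');
        if (count($group) < 2) {
            continue; // a word alone on a line has no synonyms
        }
        $expanded = implode('|', $group);
        foreach ($group as $word) {
            // Every word of the group points to the full OR expression.
            $index[$word] = $expanded;
        }
    }
    return $index;
}

$index = buildSynonymIndex(array('cat | mouse | dog', 'car | automobile'));
echo $index['dog'], "\n";        // cat|mouse|dog
echo $index['automobile'], "\n"; // car|automobile
```

Each entry of the resulting array is what you would store under a key like ‘synonym_cat’ in memcached or APC, so the file only has to be parsed when it changes.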
Here is an example of how to do it (please note that this solution is not optimal for large files).
Each line of synonyms.txt will look like this:
cat | mouse | dog
If you use another separator, be careful to replace it with | (the OR operator) when inserting the words into the query string.
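    // Load the synonym groups, one per line, e.g. "cat | mouse | dog".
    $lines = array();
    foreach (file("synonyms.txt") as $line) {
        $line = trim($line);
        if ($line !== '') {
            $lines[] = $line;
        }
    }

    // Normalise the query and split it into words.
    $tmp_string = strtolower(str_replace(array('-', '+'), " ", $input_string));
    $words = preg_split('/\s+/', $tmp_string, -1, PREG_SPLIT_NO_EMPTY);

    $expanded = array();
    foreach ($words as $word) {
        $replacement = $word;
        foreach ($lines as $line) {
            // Compare whole words, so "cat" does not match "category".
            $group = array_map('trim', explode('|', strtolower($line)));
            if (in_array($word, $group, true)) {
                $replacement = implode('|', $group);
                break;
            }
        }
        $expanded[] = $replacement;
    }
    // "black cat" becomes "black cat|mouse|dog".
    $input_string = implode(' ', $expanded);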
As I said, this is not a perfect solution; for example, it should check that each word has a minimum length before looking it up.
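A minimal sketch of that minimum-length check follows. The function name `expandQuery`, the threshold of 3 characters and the in-memory `$synonyms` map (word => OR expression) are all assumptions made for the illustration, not something Sphinx provides.

```php
<?php
// Hypothetical helper: expand a query using a word => "a|b|c" map,
// skipping words shorter than $minLength (an assumed threshold).
function expandQuery(string $query, array $synonyms, int $minLength = 3): string
{
    $words = preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY);
    $out = array();
    foreach ($words as $word) {
        if (strlen($word) >= $minLength && isset($synonyms[$word])) {
            $out[] = $synonyms[$word]; // replace the word with its OR expression
        } else {
            $out[] = $word;            // too short or no synonyms: keep as typed
        }
    }
    return implode(' ', $out);
}

$synonyms = array('cat' => 'cat|dog', 'dog' => 'cat|dog', 'a' => 'a|an');
echo expandQuery('a black cat', $synonyms), "\n"; // a black cat|dog
```

Short words such as “a” are left alone even though the map has an entry for them, which keeps very common words from bloating the query.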
