Relevanssi and languages

Relevanssi is language-agnostic in itself. It does not know any language and doesn’t care about which language the site uses.

However, there are a few things that you need to consider when using Relevanssi in languages other than English.

Characters: use UTF-8

As long as your site uses UTF-8 characters, Relevanssi can handle just about anything you throw at it; you can even search for emojis. UTF-8 is the standard in WordPress, so you generally don’t have to worry about it.

Words: bad news for Chinese and Japanese

While Relevanssi can read Chinese, Japanese and many other characters without problems, the lack of spaces between words in these languages is a problem for Relevanssi.

Relevanssi works by splitting the posts into words at spaces and then counting how many times those words appear. Since Chinese and Japanese texts don’t have spaces separating words, Relevanssi can’t do this.
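
To make the problem concrete, here is a minimal sketch of space-based splitting and counting. It is an illustration only, not the actual Relevanssi tokenizer, and the function name rlv_demo_count_words is made up for this example.

```php
/**
 * Illustration only, not Relevanssi's actual tokenizer: split text at
 * whitespace and count how often each token appears.
 *
 * @param string $text Text to tokenize.
 *
 * @return array Token => number of occurrences.
 */
function rlv_demo_count_words( $text ) {
    // Split on any run of whitespace; the "u" flag treats the text as UTF-8.
    $tokens = preg_split( '/\s+/u', trim( $text ), -1, PREG_SPLIT_NO_EMPTY );
    return array_count_values( $tokens );
}

// Four distinct words, with "the" counted twice.
$english = rlv_demo_count_words( 'the cat saw the dog' );

// No spaces, so the whole sentence becomes one single token.
$chinese = rlv_demo_count_words( '我喜欢搜索引擎' );
```

With only one giant token per sentence, there are no meaningful word counts to base the weights on.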

As a result, Relevanssi can search for Chinese and Japanese characters or character sequences, especially if you enable one-character words and inside-word matching in Relevanssi settings. Still, without proper word counts, the weights for the posts are essentially random, so the results won’t be of high quality.

Unfortunately, making the search work well in Chinese, Japanese and other languages with similar characteristics requires advanced linguistics and is far beyond our capabilities.

Update 25.11.2020: Matthew Wang has suggested using a Chinese language segmentation tool like phpjieba. If you have the jieba() function installed on your site, you can use it for tokenizing Chinese text like this:

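add_filter( 'relevanssi_remove_punctuation', 'rlv_use_jieba' );
/**
 * Segments Chinese text into space-separated words with phpjieba.
 */
function rlv_use_jieba( $string ) {
    $words = jieba( $string, 1, 1500 );
    // Keep the original text if jieba() does not return a word list.
    return is_array( $words ) ? implode( ' ', $words ) : $string;
}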

For Japanese, there’s Limelight.

Did you mean suggestions: limited to Latin characters

While Relevanssi can search Arabic, Russian or other non-Latin character sets, the “Did you mean” suggestions in Relevanssi Premium only support Latin characters.

Relevanssi generates these suggestions by modifying the search term in different ways, adding or removing letters. By default, Relevanssi makes these modifications with the Latin alphabet (mainly the English alphabet, with a few extra umlauts thrown in), which restricts the Premium “Did you mean” feature to text in the Latin alphabet.
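
The following sketch shows why the alphabet matters. It is an assumption-laden illustration, not the Premium implementation: the function name rlv_demo_candidates is made up, and the real feature may modify the term in other ways as well.

```php
/**
 * Illustration only: build spelling candidates by removing one letter
 * or inserting one letter from the given alphabet.
 *
 * @param string $word     The search term.
 * @param string $alphabet The letters available for insertions.
 *
 * @return array Unique candidate strings.
 */
function rlv_demo_candidates( $word, $alphabet ) {
    // Split both strings into UTF-8 characters.
    $chars   = preg_split( '//u', $word, -1, PREG_SPLIT_NO_EMPTY );
    $letters = preg_split( '//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY );
    $count   = count( $chars );
    $result  = array();

    // Deletions: drop the character at each position.
    for ( $i = 0; $i < $count; $i++ ) {
        $copy = $chars;
        unset( $copy[ $i ] );
        $result[] = implode( '', $copy );
    }

    // Insertions: add each alphabet letter at each position.
    for ( $i = 0; $i <= $count; $i++ ) {
        foreach ( $letters as $letter ) {
            $copy = $chars;
            array_splice( $copy, $i, 0, $letter );
            $result[] = implode( '', $copy );
        }
    }

    return array_unique( $result );
}

$latin_alphabet   = 'abcdefghijklmnopqrstuvwxyz';
$russian_alphabet = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя';

// Includes "at", "cart" and "chat" among others.
$latin = rlv_demo_candidates( 'cat', $latin_alphabet );

// A Latin alphabet cannot turn "кот" into "крот"; a Russian one can.
$mixed   = rlv_demo_candidates( 'кот', $latin_alphabet );
$russian = rlv_demo_candidates( 'кот', $russian_alphabet );
```

With a Latin alphabet, every insertion into a Cyrillic word produces a mixed-script string that matches nothing, which is why the alphabet has to fit the language of the content.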

The simpler “Did you mean” feature in the free version of Relevanssi should work with most character sets, as it relies on the searches users have made, but it’s less reliable in other ways.

Relevanssi has a filter hook relevanssi_didyoumean_alphabet for replacing the alphabet used. Here are some replacement alphabets:

Russian

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'; } );

Arabic

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'ابتثجحخدذرزسشصضطظعغفقكلمنهويآإأؤئى'; } );

Polish

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'aąbcćdeęfghijklłmnńoóprsśtuwyzźż'; } );

Vietnamese

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'aáàâậăằảbcdđeẹêệềghiịklmnoóọôộơớợpqrstuụủưựứvxyỹ'; } );

Hebrew

add_filter( 'relevanssi_didyoumean_alphabet', function() { return 'אבגדהוזחטיכלמנסעפצקרשתםןףךץ'; } );
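
The snippets above replace the alphabet outright. On a site that mixes two scripts, appending to the alphabet may be a better fit. The sketch below assumes that the hook passes the current alphabet to the callback as a string, which is how WordPress filters normally work but is not confirmed here for this hook; the function name rlv_append_russian_alphabet is made up for the example.

```php
/**
 * Appends Russian letters to whatever alphabet the filter receives.
 * Assumption: the hook passes the current alphabet as a string.
 *
 * @param string $alphabet The current alphabet.
 *
 * @return string The alphabet with the Russian letters added.
 */
function rlv_append_russian_alphabet( $alphabet ) {
    return $alphabet . 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя';
}

// Register the filter only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_didyoumean_alphabet', 'rlv_append_russian_alphabet' );
}
```

A longer alphabet means more candidate words to try for each suggestion, so it is probably best to add only the scripts the site really uses.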

Stemming and suffix stripping

Relevanssi Premium includes a simple English-language stemmer that reduces word forms to more basic forms, making searching less dependent on the exact word form.

To enable the English stemmer, add this to your site and rebuild the index:

add_filter( 'relevanssi_stemmer', 'relevanssi_simple_english_stemmer' );
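
To give an idea of what suffix stripping means, here is a toy sketch. It is not the code of the bundled stemmer and is far cruder than a real one; the function name rlv_demo_strip_suffix and the suffix list are made up for the illustration.

```php
/**
 * Toy suffix stripper for illustration; not the Relevanssi stemmer.
 *
 * @param string $word A single lowercase English word.
 *
 * @return string The word with one common suffix removed, if any.
 */
function rlv_demo_strip_suffix( $word ) {
    foreach ( array( 'ing', 'es', 's' ) as $suffix ) {
        $stem_length = strlen( $word ) - strlen( $suffix );
        // Keep at least three characters so that short words stay intact.
        if ( $stem_length >= 3 && substr( $word, -strlen( $suffix ) ) === $suffix ) {
            return substr( $word, 0, $stem_length );
        }
    }
    return $word;
}

// All three forms reduce to "search", so they match each other.
$forms = array_map( 'rlv_demo_strip_suffix', array( 'search', 'searches', 'searching' ) );
```

Because the same reduction has to apply to both the indexed content and the search terms, the index needs rebuilding after a stemmer is enabled.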

Other languages:

Finnish: Simple Finnish stemmer.
French: Simple French plural support.
German: Simple German stemmer.
Korean: Korean postposition stripping.

These simple stemmers are not very good, though, so I recommend using a proper Snowball stemmer. It’s available as an add-on plugin and is slightly harder to set up, but the results are better, and the plugin supports over a dozen languages.

Get the Snowball Stemmer add-on plugin here.

Arabic diacritics

You can improve the Relevanssi Arabic support by removing diacritics with this function. Add this to your site:

add_filter( 'relevanssi_remove_punctuation', 'rlv_arabic_remap', 9 );

/**
 * Remove Arabic diacritics.
 *
 * @param string $a The text to remove punctuation from.
 *
 * @return string The same text with punctuation and diacritics removed.
 */
function rlv_arabic_remap( $a ) {
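    // Map letter variants to their base forms and drop the tatweel (ـ).
    $remap = array(
        'إ' => 'ا',
        'آ' => 'ا',
        'أ' => 'ا',
        'ئ' => 'ى',
        'ة' => 'ه',
        'ؤ' => 'و',
        'ـ' => '',
    );

    // Unicode ranges to strip from the text.
    $diacritics = array(
        '~[\x{0600}-\x{061F}]~u', // Signs and punctuation.
        '~[\x{063B}-\x{063F}]~u', // Extended letter forms.
        '~[\x{064B}-\x{065E}]~u', // Diacritics (tashkil).
        '~[\x{066A}-\x{06FF}]~u', // Rest of the block, including Persian and Urdu letters.
    );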

    $a = preg_replace( $diacritics, '', $a );
    $a = str_replace( array_keys( $remap ), array_values( $remap ), $a );

    return $a;
}

After adding the code, make sure you rebuild the index. This function will remove the diacritics and map some characters to their simpler forms in the index and user searches, enabling the search to find more results.

17 comments Relevanssi and languages

Aleksandr says:

April 19, 2024 at 12:34 pm

Hi guys, please help me figure this out. I use your plugin (free version) on my site, which has two languages, English and Estonian. For some reason, the search only finds English content, and I can’t find the reason. The translation plugin is TranslatePress – Multilingual. Please point me in the right direction. Thank you very much!

1. Mikko Saari says:
  
  April 19, 2024 at 12:47 pm
  
  Aleksandr, Relevanssi can’t work with TranslatePress. It uses a method that is unfortunately incompatible with Relevanssi.
  
Harry Wang says:

June 11, 2024 at 3:27 pm

Hi, I would like to know if there is a detailed tutorial for the Chinese Segmentation Tool? I don’t quite understand how to install and use the phpjieba tool. A detailed way of using it would be much appreciated!

1. Mikko Saari says:
  
  June 11, 2024 at 3:42 pm
  
  Harry, unfortunately not. It’s a complex tool, and I don’t have the resources to produce a detailed tutorial for it at the moment.
  
Reza says:

October 9, 2024 at 11:59 am

Hey Mikko,
We use the Persian language. It uses the Arabic characters plus four more.
Is there any way for Relevanssi to guess the Persian equivalent of an English character and vice versa?

Especially with names and brands, people search in either Persian or English.

For example, SAMSUNG is written “سامسونگ” in Persian:
S=س
A=ا
M=م
S=س
U=و
N=ن
G=گ

And for phrases, they can search like:
SAMSUNG mobile=
موبایل سامسونگ =
موبایل SAMSUNG

Is there any way for Relevanssi to understand these equivalent characters?
Or should I create a synonym for every brand and name in the plugin settings?

1. Mikko Saari says:
  
  October 9, 2024 at 12:16 pm
  
  Yes, it’s possible – you could have Relevanssi transliterate everything so all the content would be indexed in Latin or Persian characters. I suppose in your case, it would make more sense to transliterate Latin characters to Persian.
  
  You can find examples of such transliteration functions in the relevanssi_remove_punctuation documentation. If you create a function that converts each Latin character to the matching Persian character, your search will work in Persian and Latin.
  
  1. Reza says:
    
    October 9, 2024 at 1:17 pm
    
    For the SAMSUNG example above I should use this:
    
    add_filter( ‘relevanssi_remove_punctuation’, function( $string ) {
    $chars = array(
    ‘S’ => ‘س’,
    ‘A’ => ‘ا’,
    ‘M’ => ‘م’,
    ‘U’ => ‘و’,
    ‘N’ => ‘ن’,
    ‘G’ => ‘گ’,
    ‘س’ => ‘S’,
    ‘ا’ => ‘A’,
    ‘م’ => ‘M’,
    ‘و’ => ‘U’,
    ‘ن’ => ‘N’,
    ‘گ’ => ‘G’,
    
    );
    return strtr( $string, $chars );
    }, 8 );
    
    Is it right?
    
    And can I assign one character to more than one? Like:
    ‘S’ => ‘س’,
    ‘S’ => ‘ث’,
    ‘S’ => ‘ص’,
    
    Because all these characters are pronounced as S in English.
    And what’s that number “8” at the end of the code?
    
    I appreciate your help.
    
    1. Mikko Saari says:
      
      October 9, 2024 at 1:31 pm
      
      Do it one way only. So in this case, you’d use 'S' => 'س' but not 'س' => 'S'. You can only map each letter to one letter; after the first replacement is made, there are no S characters remaining in the text, so the other replacements won’t happen.
      
      The number 8 is the priority of the filter function. 8 is good there.
      
      1. Reza says:
        
        October 9, 2024 at 1:40 pm
        
        You mean if I add
        “‘S’ => ‘س’”
        in the filter, it will work in both ways, I mean:
        
        If there is a word in text with “س” , plugin will see it as a “S”
        If there is a word in text with “S” , plugin will see it as a “س”
        
        Is it right?
Mikko Saari says:

October 9, 2024 at 1:47 pm

No, it means that if there’s an “S” in the text, the plugin will see it as “س”, and if there’s an “S” in the search terms, it will also be seen as “س”. This way it’ll work out fine: it doesn’t matter which form is in the text and in the search terms, internally it’s all Persian.
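
The one-way mapping described above can be sketched roughly like this. It is a minimal illustration that uses only the six letters from the SAMSUNG example; the function name rlv_demo_latin_to_persian is made up for the sketch, and priority 8 is the value discussed earlier in this thread.

```php
/**
 * Maps Latin letters to Persian ones, in one direction only.
 * The six-letter map covers just the SAMSUNG example from this thread.
 *
 * @param string $string The text or the search terms.
 *
 * @return string The text with the mapped Latin letters replaced.
 */
function rlv_demo_latin_to_persian( $string ) {
    $chars = array(
        'S' => 'س',
        'A' => 'ا',
        'M' => 'م',
        'U' => 'و',
        'N' => 'ن',
        'G' => 'گ',
    );
    return strtr( $string, $chars );
}

// Both spellings end up as the same Persian string.
$from_latin   = rlv_demo_latin_to_persian( 'SAMSUNG' );
$from_persian = rlv_demo_latin_to_persian( 'سامسونگ' );

// Register the filter only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'relevanssi_remove_punctuation', 'rlv_demo_latin_to_persian', 8 );
}
```

Note that strtr() is case-sensitive, so a complete map would need lowercase keys as well unless the text is already in one case, and that PHP requires straight quotes (') around the strings.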

Reza says:

October 10, 2024 at 8:47 am

I’ve added this snippet:

add_filter( ‘relevanssi_remove_punctuation’, function( $string ) {
$chars = array(
‘A’ => ‘ا’,
‘B’ => ‘ب’,
‘C’ => ‘س’,
‘D’ => ‘د’,
‘E’ => ‘ی’,
‘F’ => ‘ف’,
‘G’ => ‘ج’,
‘H’ => ‘ح’,
‘I’ => ‘ی’,
‘J’ => ‘ج’,
‘K’ => ‘ک’,
‘L’ => ‘ل’,
‘M’ => ‘م’,
‘N’ => ‘ن’,
‘O’ => ‘و’,
‘P’ => ‘پ’,
‘Q’ => ‘ک’,
‘R’ => ‘ر’,
‘S’ => ‘س’,
‘T’ => ‘ت’,
‘U’ => ‘ئ’,
‘V’ => ‘و’,
‘W’ => ‘و’,
‘X’ => ‘ز’,
‘Y’ => ‘ی’,
‘Z’ => ‘ز’,
);
return strtr( $string, $chars );
}, 8 );

And my WP got down with this error in error_log:

[10-Oct-2024 05:43:24 UTC] PHP Fatal error: Uncaught Error: Undefined constant “‘relevanssi_remove_punctuation’” in /functions.php:207
Stack trace:
#0 /wp-settings.php(668): include()
#1 /wp-config.php(113): require_once(‘/home/kazemibi/…’)
#2 /wp-load.php(50): require_once(‘/home/kazemibi/…’)
#3 /wp-blog-header.php(13): require_once(‘/home/kazemibi/…’)
#4 //index.php(17): require(‘/home/kazemibi/…’)
#5 {main}
thrown in /functions.php on line 207

1. Mikko Saari says:
  
  October 10, 2024 at 8:48 am
  
  Replace the curly quotes (‘ ’) around relevanssi_remove_punctuation with simple apostrophes (').
  
Reza says:

November 10, 2024 at 6:09 pm

How can I make these characters work? (Because of the special characters, I get a PHP error.)

‘,’ => ‘و’,
‘;’ => ‘ک’,
‘’’ => ‘گ’,
‘[’ => ‘ج’,
‘]’ => ‘چ’,
‘\’ => ‘پ’,

The Persian keyboard map looks like this: https://kbdlayout.info/KBDFA/
I want to replace these characters so that the search is meaningful when a user types Persian on an English keyboard layout.

1. Mikko Saari says:
  
  November 12, 2024 at 6:46 am
  
  Reza, what’s the exact error? In general, you don’t need to make punctuation work – it’s ignored in searching anyway, so you should instead remove the matching Persian characters.
  
Yair Sonic says:

December 20, 2024 at 6:57 am

As you say, for Japanese, there’s Limelight.

But setting up Limelight requires MeCab, and its official download link is dead.
Is there any other workaround, please?

OR

What if I search (2 katakana)(3 kanji)(2 katakana)?

I want to make the 3 kanji searchable, which is currently not possible.

1. Mikko Saari says:
  
  December 20, 2024 at 6:59 am
  
  Yair, sorry, but I have no idea.
  
  You can set up Relevanssi to search within words.
  
  1. Yair Sonic says:
    
    December 20, 2024 at 7:15 am
    
    No way... I am able to search in the middle of words now. Truly golden 👍
    