How is the relevancy (or hitcount) calculated?
Query relevancy is calculated from the Boolean operators AND and OR as follows:
AND = permutation (the hitcounts of the operands are multiplied)
OR = addition (the hitcounts of the operands are added)
Simple Example: Assume the query (cook* AND plan) OR (fish AND chef*) is run on the following text:
“All the cooks were planning a dish that would encompass a variety of ingredients. Once the culinary competition was underway each chef began cooking his or her respective dish. Many of the chefs planned to prepare fish, but there was not enough to go around. Instead some were left to cook poultry and meats.”
For the above text, each wildcard term (a query term ending in an asterisk, here cook* and chef*) receives a double score of 2 for every word it matches, unless the word exactly matches the prefix (e.g. “cook”), in which case it receives a regular score of 1.
For the non-wildcard query terms “plan” and “fish”, only stemmed forms of the word contribute to the hitcount, each with a regular score of 1.
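The two scoring rules above can be sketched in Python. This is only an illustration under stated assumptions: `tokenize`, `toy_stem`, and `term_score` are hypothetical names, and `toy_stem` is a crude stand-in for whatever stemmer the search engine actually uses.

```python
import re

TEXT = (
    "All the cooks were planning a dish that would encompass a variety of "
    "ingredients. Once the culinary competition was underway each chef began "
    "cooking his or her respective dish. Many of the chefs planned to prepare "
    "fish, but there was not enough to go around. Instead some were left to "
    "cook poultry and meats."
)

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(word):
    """Very rough stemmer (assumption: not the engine's real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "plann" -> "plan".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def term_score(term, words):
    """Score one query term against a list of words."""
    if term.endswith("*"):
        prefix = term[:-1]
        # An exact prefix match scores 1; any longer match scores 2.
        return sum(1 if w == prefix else 2 for w in words if w.startswith(prefix))
    # Plain term: every word sharing its stem scores 1.
    return sum(1 for w in words if toy_stem(w) == toy_stem(term))

words = tokenize(TEXT)
scores = {t: term_score(t, words) for t in ("cook*", "plan", "fish", "chef*")}
print(scores)  # {'cook*': 5, 'plan': 2, 'fish': 1, 'chef*': 3}
```

These per-term totals are the values that the block calculation combines.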
Therefore, for the first block of the query, the hitcount will be:
cook*: (cooks=2) + (cooking=2) + (cook=1) = 5
plan: (planning=1) + (planned=1) = 2
Permutation of (cook* AND plan) = 5 * 2 = 10
The second block of the query will have a hitcount of:
fish: (fish=1) = 1
chef*: (chef=1) + (chefs=2) = 3
Permutation of (fish AND chef*) = 1 * 3 = 3
Because the OR operator combines the two blocks of the query, the final hitcount is the sum of the hitcounts of the two blocks:
10 + 3 = 13.
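The same arithmetic can be expressed as a small recursive evaluator. This is a minimal sketch: the nested-tuple query representation and the `hitcount` function are hypothetical and chosen only for illustration, and the per-term scores are the ones from the example above.

```python
from math import prod

def hitcount(node, term_scores):
    """Evaluate a query tree: AND multiplies, OR adds, leaves are term scores."""
    if isinstance(node, str):
        return term_scores[node]
    op, *children = node
    values = [hitcount(child, term_scores) for child in children]
    if op == "AND":
        return prod(values)
    if op == "OR":
        return sum(values)
    raise ValueError(f"unsupported operator: {op}")

# Per-term scores taken from the worked example.
term_scores = {"cook*": 5, "plan": 2, "fish": 1, "chef*": 3}

# (cook* AND plan) OR (fish AND chef*)
query = ("OR", ("AND", "cook*", "plan"), ("AND", "fish", "chef*"))

total = hitcount(query, term_scores)
print(total)  # 13
```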
Exclusionary operators such as NOT and EXCLUDE do not affect the hitcount calculation.
The NEAR operator can yield different permutations depending on the word order and the distance between the words. For example, for the query A NEAR B, every close pair of A and B is included in the hitcount. Therefore, in a sentence such as:
A B B B A.
the hitcount is 6 because there are 3 Bs, each near 2 As (3 * 2 = 6). In a longer sentence or document such as:
A B... A B.
the hitcount is 2 because each A is near only one B, giving only 2 pairings.
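The pair counting for NEAR can be sketched as follows. The function `near_hitcount` and the `max_distance` window of 4 tokens are assumptions made for illustration; the actual proximity window is not stated here, and this sketch ignores word order.

```python
def near_hitcount(tokens, a, b, max_distance):
    """Count every (a, b) pair whose positions are within max_distance tokens."""
    a_positions = [i for i, t in enumerate(tokens) if t == a]
    b_positions = [i for i, t in enumerate(tokens) if t == b]
    return sum(
        1
        for i in a_positions
        for j in b_positions
        if abs(i - j) <= max_distance  # hypothetical notion of "near"
    )

# "A B B B A": each of the 3 Bs is near both As.
short = "A B B B A".split()

# "A B... A B": 20 filler tokens stand in for the elided text.
long_doc = ["A", "B"] + ["x"] * 20 + ["A", "B"]

print(near_hitcount(short, "A", "B", max_distance=4))     # 6
print(near_hitcount(long_doc, "A", "B", max_distance=4))  # 2
```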