In Semantria terms, taxonomy nodes correspond to categories (the concept matrix). Sample taxonomy nodes look like this:
|Taxonomy Node Name||Taxonomy Examples|
|Product_Toyota||Matrix, Corolla, Camry|
|Product_Volkswagen||Golf, Jetta, Phaeton, Passat|
|Product_Ford||Focus, F-150|
Is there a specific format or procedure to install a taxonomy file for a Semantria user?
No, there is no special format or procedure. Customers just need to pass their categories to the server using the “update categories” endpoint.
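As a rough illustration, the sample taxonomy above could be turned into a category payload like the one below. The field names `name`, `weight`, and `samples`, the JSON layout, and the weight of 1.0 are assumptions made for this sketch; check the Semantria API reference for the exact schema and endpoint URL before sending anything.

```python
import json

# Taxonomy nodes from the sample table: node name -> example terms.
TAXONOMY = {
    "Product_Toyota": ["Matrix", "Corolla", "Camry"],
    "Product_Volkswagen": ["Golf", "Jetta", "Phaeton", "Passat"],
    "Product_Ford": ["Focus", "F-150"],
}


def build_categories(taxonomy, weight=1.0):
    """Turn a {node: [examples]} mapping into a list of category dicts.

    The keys "name", "weight" and "samples" are assumed, not confirmed.
    """
    return [
        {"name": name, "weight": weight, "samples": list(samples)}
        for name, samples in taxonomy.items()
    ]


# Serialize the categories; this string would be the request body.
payload = json.dumps(build_categories(TAXONOMY), indent=2)
print(payload)
```

The sketch only builds and prints the request body. Sending it to the “update categories” endpoint is left out because the URL and authentication details are not covered here.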
What are the best practices for developing a concept topic taxonomy? Should topics be narrow or broad? How well do they work with a deep taxonomy, and are there any limits to the concept matrix?
Using concept topics for broad categorization seems to work best. Because concept topics are based on the conceptual relationships found in Semantria’s model of Wikipedia, the same level of very granular control that you might have with a Boolean query language is not available. That said, it is possible to develop fairly large and complex concept topic queries that are able to achieve some rather specific classifications.
What should I set the weight to for my category?
The category weight affects the relevancy score of a given category. Out of the box, the threshold is 0.45: any category with a relevancy score lower than 0.45 will not be reported in the output. If the weight of a category is set to 0, its relevancy score will also be 0 and the category will never be returned. It is therefore recommended to set the weight between 0.85 and 2, depending on the importance of the category in the given text.
If the weight is left undefined, Semantria will report the category's relevancy score as long as it is above the default threshold of 0.45.
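The interaction between weight and threshold can be sketched as follows. Multiplying a raw score by the weight is an assumption made for illustration: it is consistent with a weight of 0 producing a score of 0, but the actual formula is not stated here. The names `raw_scores`, `weights`, and `reported_categories` are hypothetical, as is the default weight of 1.0.

```python
DEFAULT_THRESHOLD = 0.45  # out-of-the-box cutoff mentioned in the text


def reported_categories(raw_scores, weights, threshold=DEFAULT_THRESHOLD):
    """Return {category: score} for categories that clear the threshold.

    Assumption: final score = raw score * weight. Categories without an
    explicit weight are treated as weight 1.0 (also an assumption).
    """
    results = {}
    for name, raw in raw_scores.items():
        score = raw * weights.get(name, 1.0)
        if score >= threshold:  # scores lower than the threshold are dropped
            results[name] = round(score, 3)
    return results


raw_scores = {"Product_Toyota": 0.5, "Product_Ford": 0.4, "Product_Volkswagen": 0.9}
weights = {"Product_Ford": 2.0, "Product_Volkswagen": 0.0}
print(reported_categories(raw_scores, weights))
```

Under these assumptions, Product_Ford clears the threshold only because of its weight of 2, while Product_Volkswagen is dropped despite a high raw score because its weight is 0.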
What are the best ways to improve precision in the concept matrix?
The best way to improve precision in the concept matrix is to use the underscore, NOT, and CONTEXT operators in concept queries. They filter out terms that should not contribute to a concept match and add terms that give meaning to the general subject area in which to find the match.
When the concept matrix is given a phrase, it matches both the phrase form and the individual words. Thus 'power plant', while matching stories about electric generation most strongly, may also pull in articles about plant life. In most queries the individual words in a phrase are related and contribute positively. But in cases where the individual words mean something different on their own, the underscore instructs the engine to use only the phrase form. Thus 'power_plant' will not match articles about flowers at all.
NOT excludes certain ideas from consideration. This operator is primarily intended for narrowing down the meanings of words and phrases, or otherwise limiting the scope implied by a word or phrase.
For example, 'food NOT meat, animal' will not match an article about hamburgers. However, it will still match an article that talks about both salads and hamburgers, just not as strongly as it otherwise would have.
While NOT excludes certain ideas implied by a query, CONTEXT highlights certain ideas. Consider the case where you are interested in automobile manufacturing. The query 'automobile, manufacturing' is likely to get relevant results, but may also pull in articles about manufacturing in general. The query 'automobile_manufacturing' is highly specific, but possibly overly so. The query 'automobile CONTEXT manufacturing' will result in a search for automobiles in general, with a focus on manufacturing. It will not return results that are only about manufacturing. Specifically, the text to the left of CONTEXT supplies the general idea being searched for, and the text to the right supplies the ideas you want that topic to be discussed with.
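The three operators can be summarized with a small sketch that builds the query strings used in the examples above. The helper names `phrase`, `without`, and `in_context` are hypothetical and are not part of any Semantria SDK. They only assemble text in the syntax described here, and the resulting strings would then be supplied as concept queries for a category.

```python
def phrase(text):
    """Join words with underscores so only the phrase form is matched."""
    return "_".join(text.split())


def without(query, *excluded):
    """Append a NOT clause listing the ideas to exclude from the query."""
    return f"{query} NOT {', '.join(excluded)}"


def in_context(topic, context):
    """Search for `topic` in general, focused on discussion of `context`."""
    return f"{topic} CONTEXT {context}"


queries = [
    phrase("power plant"),                       # phrase form only
    without("food", "meat", "animal"),           # narrow the meaning of "food"
    in_context("automobile", "manufacturing"),   # automobiles, focused on manufacturing
]
for query in queries:
    print(query)
```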