Life is slowly returning to order once again. I am attempting to slog through almost 1,000 messages in my MySQL folder, most of which are list questions that have already been answered, so it does not take long to get through them. However, occasionally I find a question that has not been answered, or a gem of a question that I want to expose to a wider audience.
This question fell under both categories. Basically, someone wanted stopwords in other languages, and wondered if there was a place to get them. (English stopwords can be found at http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html.)
I did a quick web search and found a site that has a bunch of language stopwords:
http://www.ranks.nl/stopwords/
Catalan stopwords
Czech stopwords
Danish stopwords
Dutch stopwords
French stopwords
English stopwords (default)
German stopwords
Hungarian stopwords
Italian stopwords
Norwegian stopwords
Polish stopwords
Portuguese stopwords
Spanish stopwords
Turkish stopwords
I hope this helps some folks...
And now, a trickier question: if there are folks doing fulltext matching in other languages, what is your list of stopwords, and where did you get it? (I am sure there are plenty of sites in each native language that list stopwords, but admins who do not speak every language their application supports can have a hard time finding them!)
What about folks doing searches within a field that may contain multiple languages? Have you created a file that includes the stopwords for all the languages your application supports? If you have not, should you?
(The documentation on how to change the stopword file parameter is at http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html)
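For anyone who wants to try the combined-file idea, here is a minimal Python sketch of merging several per-language lists into one file. The function names, the sample words, and the output filename are all made up for illustration. It writes one stopword per line, which, as I understand the documentation, fits the free-form stopword file format (words separated by non-alphanumeric characters such as newlines). Check which character encoding your server expects before relying on anything like this.

```python
import os
import tempfile


def merge_stopwords(lists_by_language):
    """Merge per-language stopword lists into one sorted, de-duplicated list.

    `lists_by_language` is a hypothetical mapping such as
    {"en": ["the", "and"], "fr": ["le", "et"]}.
    """
    merged = set()
    for words in lists_by_language.values():
        for word in words:
            word = word.strip().lower()
            if word:  # skip blank entries
                merged.add(word)
    return sorted(merged)


def write_stopword_file(words, path):
    """Write one stopword per line (UTF-8 is an assumption, not a requirement)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


if __name__ == "__main__":
    # Tiny made-up samples; real lists would come from a site like the one above.
    sample = {
        "en": ["the", "and", "a"],
        "fr": ["le", "la", "et", "a"],
        "de": ["der", "die", "und"],
    }
    combined = merge_stopwords(sample)
    target = os.path.join(tempfile.gettempdir(), "stopwords_combined.txt")
    write_stopword_file(combined, target)
    print(len(combined), "stopwords written to", target)
```

You would then point the stopword file parameter at the resulting file and rebuild the fulltext indexes, as the documentation linked above describes. One thing to watch with a combined list: a stopword in one language can be a meaningful word in another, so a merged file may hide terms you actually want searchable.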
Wikipedia happens to be grouped by language, so it can be a nice test corpus. What I actually did for Wikipedia when we still used MySQL fulltext is documented at http://meta.wikimedia.org/wiki/Stop_word_list and a consolidated list for en, de, es, fr, it, lt, pt and some MediaWiki markup is at http://meta.wikimedia.org/wiki/Stop_word_list/consolidated_stop_word_lis... .
One handy approach can be to set the MySQL minimum word length (ft_min_word_len) to 2 letters, then use the fulltext index analysis tool (myisam_ftdump) to show you the most popular words. Then you can consider making the most common words stop words. You can see the output for the top 5000 words in en Wikipedia at that time at http://meta.wikimedia.org/wiki/Fulltext_index_statistics_for_en_Wikipedi...
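As a rough stand-in for that analysis when the text is not yet in a fulltext index, the same idea can be sketched in Python: count every word of at least two characters and look at the most frequent ones. This is not the MySQL tool. The name `top_words`, its parameters, and the `\w+` tokenizer are assumptions for illustration, and MySQL's own word splitting will differ in the details.

```python
import re
from collections import Counter

# Approximation: treat runs of letters, digits, and underscores as words.
WORD_RE = re.compile(r"\w+")


def top_words(texts, min_len=2, limit=10):
    """Return the `limit` most frequent words of at least `min_len` characters."""
    counts = Counter()
    for text in texts:
        for word in WORD_RE.findall(text.lower()):
            if len(word) >= min_len:
                counts[word] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    # Made-up sample rows; in practice these would be the indexed column values.
    rows = [
        "The cat sat on the mat",
        "The dog and the cat",
        "A bird in the hand",
    ]
    for word, count in top_words(rows, limit=5):
        print(count, word)
```

The words at the top of such a list are only candidates. Whether a frequent word should become a stopword still depends on whether anyone needs to search for it.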
Wikipedia has moved on since then and now uses Lucene, so those lists are not being maintained.