Detect Chinese Character in Unicode String
Recently, while converting some directory/file names between Chinese and English, I needed to detect whether a Unicode string contains Chinese characters. Unfortunately, detecting Chinese, or any language in general, is not easy. There are several options:
- Use the Microsoft Language Detection API in Extended Linguistic Services
- Use the Detect API of Microsoft Translator
- Use Microsoft's sample C# package for language identification
- Take the character range of East Asian languages (CJK Unified Ideographs (Han), where CJK means Chinese-Japanese-Korean) from the Unicode charts, and check whether each character falls in that range.
- Use Google Chrome’s language detector, since Chrome is open source.
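The character-range option above can be sketched in a few lines. This is a minimal sketch, not a complete detector: the class name HanRangeDetector and its members are hypothetical, and it assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) matters.

```csharp
using System.Linq;

public static class HanRangeDetector
{
    // Assumption: only the basic CJK Unified Ideographs block is checked.
    // Extension blocks (for example Extension A at U+3400..U+4DBF, or the
    // supplementary-plane extensions that need surrogate pairs) are ignored.
    private const char BlockStart = '\u4E00';
    private const char BlockEnd = '\u9FFF';

    // True if the single UTF-16 code unit lies inside the block.
    public static bool IsHan(char value) => value >= BlockStart && value <= BlockEnd;

    // True if any char of the string lies inside the block; null and empty give false.
    public static bool ContainsHan(string value) =>
        !string.IsNullOrEmpty(value) && value.Any(IsHan);
}
```

With this sketch, HanRangeDetector.ContainsHan("Hello 世界") is true and HanRangeDetector.ContainsHan("Hello") is false. Because the block is shared by Chinese, Japanese kanji, and Korean hanja, a match means "contains Han ideographs", not strictly "is Chinese".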
These are all practical, but a simple, if crude, solution would be nice. .NET actually has a lesser-known enum, System.Globalization.UnicodeCategory, with 30 members, including:
- UppercaseLetter
- LowercaseLetter
- OpenPunctuation
- ClosePunctuation
- MathSymbol
- OtherLetter
- …
There are two APIs that accept a char and return that char's UnicodeCategory:
- char.GetUnicodeCategory
- CharUnicodeInfo.GetUnicodeCategory
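To make the two lookups concrete, here is a small sketch that wraps each call. The class name CategoryLookup and its method names are hypothetical and exist only for illustration.

```csharp
using System.Globalization;

public static class CategoryLookup
{
    // Category as reported by the static method on System.Char.
    public static UnicodeCategory ViaChar(char value) =>
        char.GetUnicodeCategory(value);

    // Category as reported by System.Globalization.CharUnicodeInfo.
    public static UnicodeCategory ViaCharUnicodeInfo(char value) =>
        CharUnicodeInfo.GetUnicodeCategory(value);
}
```

For 'A', '(', '+' and '中', both wrappers return UppercaseLetter, OpenPunctuation, MathSymbol and OtherLetter respectively. The two APIs are not guaranteed to agree for every character on every runtime, so it is safest to pick one and use it consistently.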
So, in general, the following extension method detects whether a string contains any char in the specified UnicodeCategory:
// Requires: using System.Globalization; using System.Linq;
public static bool Any(this string value, UnicodeCategory category) => !string.IsNullOrEmpty(value) && value.Any(@char => char.GetUnicodeCategory(@char) == category);
Chinese characters are categorized as OtherLetter, so the Chinese detection problem becomes an OtherLetter detection problem.
public static bool HasOtherLetter(this string value) => value.Any(UnicodeCategory.OtherLetter);
The detection is easy:
bool hasOtherLetter = text.HasOtherLetter();
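It is not fully accurate for Chinese, because OtherLetter also covers other caseless scripts such as Japanese kana, Korean Hangul, Arabic, and Hebrew, but it works very well for distinguishing English strings from Chinese strings.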
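To see where the OtherLetter check stops being specific to Chinese, it helps to try it on a few other scripts. The sketch below repeats the check as a plain static method so that it compiles on its own; the class name OtherLetterCheck is hypothetical.

```csharp
using System.Globalization;
using System.Linq;

public static class OtherLetterCheck
{
    // Same idea as HasOtherLetter in the text, written as a plain static method:
    // true if any char of the string is in UnicodeCategory.OtherLetter.
    public static bool HasOtherLetter(string value) =>
        !string.IsNullOrEmpty(value) &&
        value.Any(c => char.GetUnicodeCategory(c) == UnicodeCategory.OtherLetter);
}
```

OtherLetterCheck.HasOtherLetter returns false for "Hello" and for "Привет" (Cyrillic letters have case, so they are UppercaseLetter or LowercaseLetter), and true for "你好". It also returns true for "こんにちは", "한국어" and "שלום", which are the false positives to expect when the input is not limited to English and Chinese.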