Detect Chinese Character in Unicode String

Recently, while converting some directory and file names between Chinese and English, I needed to detect whether a Unicode string contains Chinese characters. Unfortunately, Chinese language detection, or language detection in general, is not easy. There are several options:

These are all practical, but it would be nice to have a simple, stupid solution. Actually, .NET has a lesser-known enum, System.Globalization.UnicodeCategory, with 30 members, including:

  • UppercaseLetter
  • LowercaseLetter
  • OpenPunctuation
  • ClosePunctuation
  • MathSymbol
  • OtherLetter

And there are 2 APIs accepting a char and returning the char’s UnicodeCategory:

  • char.GetUnicodeCategory
  • CharUnicodeInfo.GetUnicodeCategory
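To make the categories concrete, here is a minimal sketch that looks up a few sample characters through both APIs. The class name UnicodeCategoryDemo and its members are hypothetical names used only for illustration.

```csharp
using System;
using System.Globalization;

public static class UnicodeCategoryDemo
{
    // Returns the category name of a single UTF-16 char via CharUnicodeInfo.
    public static string CategoryOf(char value) =>
        CharUnicodeInfo.GetUnicodeCategory(value).ToString();

    // Prints each sample char with its category via char.GetUnicodeCategory.
    public static void PrintSamples()
    {
        foreach (char sample in new[] { 'A', 'a', '(', ')', '+', '中' })
        {
            Console.WriteLine($"{sample}: {char.GetUnicodeCategory(sample)}");
        }
    }
}
```

For these samples, both APIs should report UppercaseLetter, LowercaseLetter, OpenPunctuation, ClosePunctuation, MathSymbol, and OtherLetter respectively.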

So, generally, the following extension method detects whether a string contains any char in the specified UnicodeCategory:

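// Requires: using System.Globalization; using System.Linq;
public static bool Any(this string value, UnicodeCategory category) =>
    // IsNullOrEmpty, not IsNullOrWhiteSpace, so whitespace-only strings can still match SpaceSeparator.
    !string.IsNullOrEmpty(value)
    && value.Any(@char => char.GetUnicodeCategory(@char) == category);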

Chinese characters are categorized as OtherLetter, so the Chinese detection problem becomes an OtherLetter detection problem:

public static bool HasOtherLetter(this string value) => value.Any(UnicodeCategory.OtherLetter);

The detection is easy:

bool hasOtherLetter = text.HasOtherLetter();

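It is not totally accurate for the Chinese language, because OtherLetter also covers other caseless scripts such as Japanese kana, Korean Hangul, Arabic, and Hebrew, but it works very well to distinguish an English string from a Chinese string.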
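The accuracy caveat can be illustrated with a small sketch. It restates the OtherLetter check as a plain static method and adds a hypothetical narrower check, HasCjkUnifiedIdeograph, which is not part of the original approach and assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) matters. The class and method names are made up for this example.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class ChineseDetectionSketch
{
    // Same idea as HasOtherLetter above: true if any char is in OtherLetter.
    public static bool HasOtherLetter(string value) =>
        !string.IsNullOrEmpty(value)
        && value.Any(c => char.GetUnicodeCategory(c) == UnicodeCategory.OtherLetter);

    // Hypothetical narrower check (an assumption, not the original method):
    // true if any char falls in the CJK Unified Ideographs block U+4E00..U+9FFF.
    // Characters outside the Basic Multilingual Plane are stored as surrogate
    // pairs and are not matched by this per-char test.
    public static bool HasCjkUnifiedIdeograph(string value) =>
        !string.IsNullOrEmpty(value)
        && value.Any(c => c >= '\u4E00' && c <= '\u9FFF');
}
```

With this sketch, "readme.txt" fails both checks and "文件.txt" passes both, while the hiragana string "ひらがな" passes HasOtherLetter but not HasCjkUnifiedIdeograph. The range check still cannot separate Chinese from Japanese kanji, since both languages share the same Han code points.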
