| Regex for matching ALL Japanese common & uncommon Kanji (U+4E00 – U+9FAF) ~ The Big Kahuna! | |
| ([一-龯]) | |
| Regex for matching Hiragana or Katakana | |
| ([ぁ-んァ-ン]) | |
| Regex for matching anything that is neither Hiragana nor Katakana | |
| ([^ぁ-んァ-ン]) | |
| Regex for matching Hiragana or Katakana or word characters (\w: letters, digits, underscore) | |
| ([ぁ-んァ-ン\w]) | |
| Regex for matching Hiragana or Katakana and random other characters | |
| ([ぁ-んァ-ン!:/]) | |
| Regex for matching Hiragana | |
| ([ぁ-ん]) | |
| Regex for matching full-width Katakana (zenkaku 全角) | |
| ([ァ-ン]) | |
| Regex for matching half-width Katakana (hankaku 半角) | |
| ([ｧ-ﾝﾞﾟ]) | |
| Regex for matching full-width Numbers (zenkaku 全角) | |
| ([０-９]) | |
| Regex for matching full-width Letters (zenkaku 全角) | |
| ([Ａ-Ｚａ-ｚ]) | |
| Regex for matching Hiragana codespace characters (includes non phonetic characters) | |
| ([ぁ-ゞ]) | |
| Regex for matching full-width (zenkaku) Katakana codespace characters (includes non phonetic characters) | |
| ([ァ-ヶ]) | |
| Regex for matching half-width (hankaku) Katakana codespace characters (this is an old character set so the order is inconsistent with the hiragana) | |
| ([ｦ-ﾟ]) | |
| Regex for matching Japanese Post Codes | |
| /^\d{3}-\d{4}$/ | |
| /^\d{3}-\d{4}$|^\d{3}-\d{2}$|^\d{3}$/ | |
| Regex for matching Japanese mobile phone numbers (keitai bangou) | |
| /^\d{3}-\d{4}-\d{4}$|^\d{11}$/ | |
| /^0\d0-\d{4}-\d{4}$/ | |
| Regex for matching Japanese fixed line phone numbers | |
| /^[0-9-]{6,9}$|^[0-9-]{12}$/ | |
| /^\d{1,4}-\d{4}$|^\d{2,5}-\d{1,4}-\d{4}$/ |
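As a quick check of the post code and mobile number patterns, here is a minimal Python sketch. It assumes the ¥ in those patterns stands for a backslash, it only covers the simplest form of each pattern (NNN-NNNN and 0N0-NNNN-NNNN), and the function names are made up for illustration.

```python
import re

# Assumption: "¥d" in the gist is meant as "\d".
# re.ASCII limits \d to 0-9; without it Python's \d accepts any Unicode decimal digit.
POST_CODE = re.compile(r"\d{3}-\d{4}", re.ASCII)
MOBILE = re.compile(r"0\d0-\d{4}-\d{4}", re.ASCII)


def is_post_code(text):
    """True for a 7-digit post code written as NNN-NNNN."""
    return POST_CODE.fullmatch(text) is not None


def is_mobile_number(text):
    """True for a hyphenated mobile number such as 090-1234-5678."""
    return MOBILE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("100-0001", "1000001", "090-1234-5678", "03-1234-5678"):
        print(sample, is_post_code(sample), is_mobile_number(sample))
```

Using fullmatch instead of the ^ and $ anchors avoids the Python quirk where $ also matches just before a trailing newline.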
This doesn't cover all kanji. Simple example: 𧓈
To be fair, those kanji are extremely rare and not in practical use (they do not show up in dictionaries or in Rikaichan-like extensions), and 99.99% of Japanese people would not know them:
https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_Extension_B
You can match them with: [𠀀-𪛟]
and to match everything, simply combine the two: [𠀀-𪛟]|[一-龯]
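A rough Python sketch of the combined match, with the ranges spelled out as code points so they survive copy and paste. It assumes "everything" means the gist's basic range (U+4E00 to U+9FAF) plus Extension B (U+20000 to U+2A6DF); find_kanji is a hypothetical helper name.

```python
import re

# Assumption: basic range from the gist plus CJK Unified Ideographs Extension B.
# Other extension blocks (A, C, D, ...) are deliberately not included here.
KANJI = re.compile("[\u4e00-\u9faf\U00020000-\U0002a6df]")


def find_kanji(text):
    """Return every character in text that falls in either range."""
    return KANJI.findall(text)


if __name__ == "__main__":
    print(find_kanji("漢字とかな𧓈"))
```

Python 3 strings are sequences of code points, so characters outside the Basic Multilingual Plane can be used in a character class range directly. In JavaScript the same class needs the u flag to behave this way.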
@cb372, your list comes close to covering all the kana, but a few characters are still missing. You got 「ゞ」 but missed 「ゝ」 and 「ゟ」, and a few others. I believe this would cover all Hiragana and Katakana separately:
Hiragana = [ぁ-ゖ゛-ゟー]
Katakana = [゠-ヿ]
Combined Hiragana & Katakana would be:
Hiragana+Katakana = [ぁ-ゖ゛-ゟ゠-ヿ]
I used the above hiragana+katakana regex to validate the kana portions of the downloadable version of JMDICT and can confirm that apart from a few errors in the JMDICT data, the kana validation works.
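The kind of validation described above could look roughly like this in Python. This is only a sketch: the ranges are the combined Hiragana+Katakana class written as escapes, and is_kana is a made-up name, not something from JMDICT or the gist.

```python
import re

# Same ranges as the combined class, written as escapes:
#   ぁ-ゖ  U+3041-U+3096
#   ゛-ゟ  U+309B-U+309F
#   ゠-ヿ  U+30A0-U+30FF
KANA_ONLY = re.compile(r"[\u3041-\u3096\u309B-\u309F\u30A0-\u30FF]+")


def is_kana(reading):
    """True if reading is non-empty and consists only of kana from the class."""
    return KANA_ONLY.fullmatch(reading) is not None


if __name__ == "__main__":
    for word in ("ひらがな", "カタカナー", "漢字", "ﾊﾝｶｸ"):
        print(word, is_kana(word))
```

Half-width katakana is outside these ranges, so it is rejected unless the input is normalized first.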
There is a much easier way to do this:
/\p{Script=Han}|\p{Script=Katakana}|\p{Script=Hiragana}/u
See https://www.regular-expressions.info/unicode.html (section "Unicode Scripts").
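One caveat for Python users: as far as I know the built-in re module does not accept \p{Script=...} (the third-party regex package does). As a stand-in that uses only the standard library, the sketch below guesses the script from the Unicode character name. This is a different technique from the script property and only an approximation of it; rough_script is a hypothetical helper.

```python
import unicodedata


def rough_script(ch):
    """Guess Han / Hiragana / Katakana from the character's Unicode name.

    Approximation only: it looks at name prefixes, not at the real
    Script property that \\p{Script=...} uses.
    """
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith(("KATAKANA", "HALFWIDTH KATAKANA")):
        return "Katakana"
    return None


if __name__ == "__main__":
    for ch in "漢あアA":
        print(ch, rough_script(ch))
```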
That is not enough: it misses part of the Katakana range.
So far I'm using these blocks:
// CJK Symbols and Punctuation - 3000-303F
// Hiragana - 3040-309F
// Katakana - 30A0-30FF
// CJK Unified Ideographs - 4E00-9FFF
// CJK Unified Ideographs Extension A - 3400-4DBF
// Halfwidth and Fullwidth Forms - FF00-FFEF
// CJK Radicals Supplement - 2E80-2EFF
// Kangxi Radicals - 2F00-2FDF
// CJK Compatibility Ideographs - F900-FAFF
I use it to tokenize text from Japanese games for translation. xD Good enough. If a character gets missed, I just go to
https://apps.timwhitlock.info/unicode/inspect?s=%E3%83%BC
and read off the range shown below it. xD
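The block list above can be turned into a tokenizer along these lines. This is a sketch in Python rather than whatever the commenter actually uses; BLOCKS copies the ranges from the list, and japanese_runs is a made-up name.

```python
import re

# Inclusive code point ranges copied from the block list.
BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
    (0x2E80, 0x2EFF),  # CJK Radicals Supplement
    (0x2F00, 0x2FDF),  # Kangxi Radicals
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
]

# Build one character class such as [\u3000-\u303F\u3040-\u309F...]+
_char_class = "".join(f"\\u{lo:04X}-\\u{hi:04X}" for lo, hi in BLOCKS)
JAPANESE_RUN = re.compile(f"[{_char_class}]+")


def japanese_runs(text):
    """Return each maximal run of characters that falls in the blocks."""
    return JAPANESE_RUN.findall(text)


if __name__ == "__main__":
    print(japanese_runs("HP: 100 こんにちは、世界！ OK"))
```

When a character is missed, adding its block to BLOCKS is enough; the pattern is rebuilt from the list.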
I'm working on Android, and \d matches ０ (U+FF10), too.
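Python behaves in a similar way, which makes it easy to reproduce: for str patterns \d matches any Unicode decimal digit unless re.ASCII is set. The sketch below shows the difference and one possible workaround (NFKC normalization before matching). It illustrates Python only and says nothing about how Android implements \d.

```python
import re
import unicodedata

# Full-width digits １２３-４５６７ with an ASCII hyphen, written as escapes.
FULLWIDTH_CODE = "\uFF11\uFF12\uFF13-\uFF14\uFF15\uFF16\uFF17"

UNICODE_DIGITS = re.compile(r"\d{3}-\d{4}")          # default: Unicode-aware \d
ASCII_DIGITS = re.compile(r"\d{3}-\d{4}", re.ASCII)  # \d restricted to 0-9


def to_ascii_digits(text):
    """Fold full-width forms to their ASCII equivalents via NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(bool(UNICODE_DIGITS.fullmatch(FULLWIDTH_CODE)))
    print(bool(ASCII_DIGITS.fullmatch(FULLWIDTH_CODE)))
    print(to_ascii_digits(FULLWIDTH_CODE))
```

Note that NFKC also folds other compatibility characters, such as half-width katakana to full-width, so it should be applied before any of the kana checks rather than after.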