NAME
    Lingua::JA::NormalizeText - text normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

      # or

      use Lingua::JA::NormalizeText qw/nfkc decode_entities/;
      use utf8;

      my $text = '鳥が㌧㌦でありんす&hearts;';
      print dearinsu_to_desu( decode_entities( nfkc($text) ) );
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

DESCRIPTION
    Lingua::JA::NormalizeText normalizes Japanese text by applying a
    configurable sequence of normalization functions.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available. Rows with no sample show the
    option name only.

    OPTION                 SAMPLE INPUT        OUTPUT FOR SAMPLE INPUT
    ---------------------  ------------------  -----------------------
    lc                     DdD                 ddd
    uc                     DdD                 DDD
    nfkc                   ㌦                  ドル (length: 2)
    nfkd                   ㌦                  ドル (length: 3)
    nfc
    nfd
    decode_entities        &hearts;            ♥
    strip_html             (あ in HTML tags)   あ
    alnum_z2h              ＡＢＣ１２３        ABC123
    alnum_h2z              ABC123              ＡＢＣ１２３
    space_z2h
    space_h2z
    katakana_z2h           ハァハァ            ﾊｧﾊｧ
    katakana_h2z           ｽｰﾊｰｽｰﾊｰ            スーハースーハー
    katakana2hiragana      パンツ              ぱんつ
    hiragana2katakana      ぱんつ              パンツ
    unify_3dots            はぁ。。。          はぁ…
    wave2tilde             〜                  ~
    tilde2wave             ~                   〜
    wavetilde2long         〜, ~               ー
    wave2long              〜                  ー
    tilde2long             ~                   ー
    fullminus2long         −                   ー
    dashes2long            —                   ー
    drawing_lines2long     ─                   ー
    unify_long_repeats     ヴァーーー          ヴァー
    nl2space               (new line)          (space)
    unify_long_spaces      (space)(space)      (space)
    remove_head_space      (space)あ(space)あ  あ(space)あ
    remove_tail_space      ああ(space)(space)  ああ
    modernize_kana_usage   ゐヰゑヱ            いイえエ

    Options are applied in the order in which they appear in @options:
    the first element is applied first and the last element is applied
    last.

    You can also pass references to your own functions (see the
    dearinsu_to_desu function in the SYNOPSIS section).

  normalize($text)
    Normalizes $text by applying the configured options in order and
    returns the result.

AUTHOR
    pawa

SEE ALSO

LICENSE
    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.
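
EXAMPLE: WHY OPTION ORDER MATTERS
    The rule that options are applied first to last can be illustrated
    with a small standalone sketch. It does not use or reproduce the
    internals of Lingua::JA::NormalizeText; apply_in_order, nfkc_step and
    doru_to_dollar are hypothetical names introduced only for this
    illustration, and NFKC comes from the core Unicode::Normalize module.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC);    # core module, used here in place of the 'nfkc' option

# Hypothetical helper (not part of Lingua::JA::NormalizeText):
# applies each code reference to the text, from first to last.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Hypothetical step: compatibility-normalize the text.
sub nfkc_step { return NFKC($_[0]) }

# Hypothetical user-defined step: replace the two katakana
# DO (U+30C9) + RU (U+30EB) with the word "dollar".
sub doru_to_dollar {
    my $text = shift;
    $text =~ s/\x{30C9}\x{30EB}/dollar/g;
    return $text;
}

my $input = "\x{3326}";    # the single character SQUARE DORU

binmode STDOUT, ':encoding(UTF-8)';

# NFKC first: the square character is expanded to DO + RU, so the
# substitution finds a match.
print apply_in_order($input, \&nfkc_step, \&doru_to_dollar), "\n";    # dollar

# Substitution first: nothing matches yet, so only the NFKC
# expansion takes effect.
print apply_in_order($input, \&doru_to_dollar, \&nfkc_step), "\n";    # DO + RU as katakana
```

    The same consideration applies when mixing built-in options with
    your own code references in @options: a function that matches on
    normalized text should come after the option that produces it.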