NAME
    Lingua::JA::TermExtractor - Term Extractor

SYNOPSIS
      use Lingua::JA::TermExtractor;
      use utf8;
      use feature qw/say/;
      use Data::Printer;

      my $extractor = Lingua::JA::TermExtractor->new(
          df_file     => './df.tch', # Please download from http://misc.pawafuru.com/webidf/.
          pos1_filter => [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能 サ変接続/],
          ng_word     => [qw/編集 本人 自身 自分 たち さん/],
      );

      p $extractor->extract($document)->dump;
      p $extractor->extract(\@documents)->dump;

      for my $result (@{ $extractor->extract(\@documents)->list(50) })
      {
          my ($word, $score) = each %{$result};

          say "$word: $score";
      }

DESCRIPTION
    Lingua::JA::TermExtractor is a term extractor. It extracts terms from
    one or more documents and sorts them by their TF*WebIDF or BM25 scores.

METHODS
  new( %config || \%config )
    Creates a new Lingua::JA::TermExtractor instance.

    The following configuration is used if you don't set %config.

      KEY                 DEFAULT VALUE
      -----------         ---------------
      k1                  2.0
      b                   0.75
      pos1_filter         [qw/非自立 代名詞 ナイ形容詞語幹 副詞可能/]
      pos2_filter         []
      pos3_filter         []
      ng_word             []
      term_length_min     2
      term_length_max     30
      concat_max          30
      tf_min              1
      df_min              0
      df_max              250_0000_0000
      fetch_unk_word_df   0
      db_auto             1
      guess_df            1

      idf_type            3
      api                 'YahooPremium'
      appid               undef
      driver              'TokyoCabinet'
      df_file             './df.tch'
      fetch_df            0
      expires_in          365
      documents           250_0000_0000
      Furl_HTTP           undef
      verbose             0

    k1 => $weight
        The weight of term frequency (TF).

    b => $weight
        The weight of document length normalization.

    pos(1|2|3)_filter, ng_word, term_length_(min|max), concat_max, tf_min,
    df_(min|max), fetch_unk_word_df, db_auto, guess_df
        See Lingua::JA::TFWebIDF.

    idf_type, api, appid, driver, df_file, fetch_df, expires_in, documents,
    Furl_HTTP, verbose
        See Lingua::JA::WebIDF.

  extract( $document || \@documents )
    Extracts terms from $document or \@documents and sorts them by their
    TF*WebIDF or BM25 scores.

    If a single $document is given, TF*WebIDF is used. If a reference to
    an array of documents (\@documents) is given, BM25 is used.

    Word segmentation and POS tagging are done via MeCab.

  tfidf, tf
    See Lingua::JA::TFWebIDF.

  idf, df, purge, db_open, db_close
    See Lingua::JA::WebIDF.

AUTHOR
    pawa

SEE ALSO
    Lingua::JA::TFWebIDF

    Lingua::JA::WebIDF

    Lingua::JA::WebIDF::Driver::TokyoTyrant

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.
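
EXAMPLE
    The k1 and b options described under METHODS may be easier to
    interpret with a small worked sketch. The code below assumes the
    standard Okapi BM25 term weight; this document does not state the exact
    formula the module uses, so treat it as an illustration only. The
    function bm25_weight and its argument names are hypothetical and are not
    part of the module's API.

```perl
use strict;
use warnings;

# Standard Okapi BM25 weight for one term in one document.
# ASSUMPTION: the textbook formula; the module's internals may differ.
sub bm25_weight {
    my (%arg) = @_;

    my $tf      = $arg{tf};              # term frequency in the document
    my $idf     = $arg{idf};             # inverse document frequency of the term
    my $len     = $arg{doc_length};      # length of this document
    my $avg_len = $arg{avg_doc_length};  # average length over all documents

    my $k1     = defined $arg{k1} ? $arg{k1} : 2.0;   # documented default
    my $b_norm = defined $arg{b}  ? $arg{b}  : 0.75;  # documented default

    # Length normalization: 1 when the document has average length,
    # greater than 1 for longer documents (if b > 0).
    my $norm = 1 - $b_norm + $b_norm * $len / $avg_len;

    # TF saturates as it grows; k1 controls how quickly.
    return $idf * $tf * ($k1 + 1) / ($tf + $k1 * $norm);
}

# A document of average length, then one twice as long.
printf "%.4f\n", bm25_weight(tf => 3, idf => 2.0, doc_length => 100, avg_doc_length => 100);
printf "%.4f\n", bm25_weight(tf => 3, idf => 2.0, doc_length => 200, avg_doc_length => 100);
```

    Under this assumed formula, a larger k1 lets repeated occurrences of a
    term keep raising its score for longer, while b = 0 turns off length
    normalization entirely and b = 1 applies it in full.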