NAME
    Lingua::JA::WebIDF - WebIDF calculator

SYNOPSIS
    use Lingua::JA::WebIDF;

    my $webidf = Lingua::JA::WebIDF->new(%config);

    print $webidf->idf("東京"); # low
    print $webidf->idf("スリジャヤワルダナプラコッテ"); # high

DESCRIPTION
    Lingua::JA::WebIDF calculates WebIDF weight.

    WebIDF (Inverse Document Frequency) weight represents the rarity of a
    word on the Web. The WebIDF weight of a rare word is high. Conversely,
    the WebIDF weight of a common word is low.

    IDF is based on the intuition that a query term which occurs in many
    documents is not a good discriminator and should be given less weight
    than one which occurs in few documents.

METHODS
  new( %config || \%config )
    Creates a new Lingua::JA::WebIDF instance.

    The following configuration is used if you don't set %config.

        KEY          DEFAULT VALUE
        -----------  ---------------
        idf_type     1
        api          'YahooPremium'
        appid        undef
        driver       'TokyoCabinet'
        df_file      './df.tch'
        fetch_df     0
        expires_in   365
        documents    250_0000_0000
        Furl_HTTP    undef
        verbose      1

    idf_type => 1 || 2 || 3
        Type 1 is the most commonly cited form of IDF.

                             N
            idf(t_i) = log -----  (1)
                            n_i

            N  : the number of documents
            n_i: the number of documents which contain term t_i
            t_i: term

        Type 2 is a simple version of the RSJ weight.

                            N - n_i + 0.5
            idf(t_i) = log ---------------  (2)
                              n_i + 0.5

        Type 3 is a modification of (2).

                             N + 0.5
            idf(t_i) = log -----------  (3)
                            n_i + 0.5

    api => 'Yahoo' || 'YahooPremium'
        Uses the specified Web API when fetching WebDF (Document
        Frequency).

    driver => 'Storable' || 'TokyoCabinet'
        Fetches and saves WebDF with the specified driver.

    df_file => $path
        Saves WebDF to the specified path.

        In order to reduce access to the Web API, please download a big df
        file from .

        I recommend that you use a different file for each type of Web API
        you specify, because the WebDF values may differ between APIs.

    fetch_df => 0
        Never fetches WebDF from the Web if 0 is specified.

        If the WebDF you want has already been saved, it is used.
        Otherwise, undef is returned.

    expires_in => $days
        If 365 is specified, a WebDF expires 365 days after it is fetched.

    Furl_HTTP => \%option
        Sets the options passed to Furl::HTTP->new.

        If you want to use a proxy server, you have to use this option.

    verbose => 1 || 0
        If 1 is specified, shows verbose error messages.

  idf($word)
    Calculates the WebIDF weight of $word via the df($word) method.

  df($word)
    Fetches the WebDF of $word.

    If the WebDF of $word has not been saved yet or has expired, fetches
    it by using the Web API you specified and saves it.

    If the WebDF of $word has expired and fetch_df is 0, the expired WebDF
    is used.

  db_open($mode)
    Opens the database file located at $path (the df_file setting).

    If you use TokyoCabinet, you have to open the database file via this
    method before calling idf, df, db_close or purge.

    $mode is 'read' or 'write'.

  db_close
    Closes the database file located at $path (the df_file setting).

    This method is called automatically when the object is destroyed, so
    you might not need to call it explicitly.

  purge($expires_in)
    Purges old data in df_file.

    If 365 is specified, data older than 365 days is purged.

AUTHOR
    pawa

SEE ALSO
    Lingua::JA::TFWebIDF

    Lingua::JA::WebIDF::Driver::TokyoTyrant

    Yahoo API:

    Tokyo Cabinet:

    S. Robertson, Understanding inverse document frequency: on theoretical
    arguments for IDF. Journal of Documentation 60, 503-520, 2004.

LICENSE
    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.
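
EXAMPLE
    The three idf_type formulas described under new() can be tried out
    without any Web API access. The following is a minimal standalone
    sketch, not the module's own implementation: idf_sketch is a
    hypothetical helper, the document frequencies are made-up values, and
    the use of the natural logarithm (Perl's built-in log) is an
    assumption, since the formulas above do not state the base.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Lingua::JA::WebIDF.
#   $type: 1, 2 or 3 (mirrors the idf_type option)
#   $n   : N, the total number of documents
#   $df  : n_i, the number of documents containing the term
# Assumption: natural logarithm, because the base is not documented.
sub idf_sketch {
    my ($type, $n, $df) = @_;

    return log($n / $df)                       if $type == 1;  # formula (1)
    return log(($n - $df + 0.5) / ($df + 0.5)) if $type == 2;  # formula (2)
    return log(($n + 0.5) / ($df + 0.5))       if $type == 3;  # formula (3)

    die "unknown idf type: $type";
}

my $N = 250_0000_0000;    # default value of the 'documents' option

# Made-up document frequencies: a rare word and a common word.
for my $df (1_000, 1_000_000_000) {
    for my $type (1 .. 3) {
        printf "df=%d type=%d idf=%.4f\n", $df, $type, idf_sketch($type, $N, $df);
    }
}
```

    Whatever the type, the rare word (small n_i) receives the higher
    weight. Note that formula (1) divides by n_i, so this sketch fails for
    a document frequency of 0, while (2) and (3) stay defined thanks to
    the 0.5 terms.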