NAME
    Text::SpeedyFx - tokenize/hash large amounts of strings efficiently

VERSION
    version 0.004

SYNOPSIS
        use Data::Dumper;
        use Text::SpeedyFx;

        my $sfx = Text::SpeedyFx->new;

        my $words_bag = $sfx->hash('To be or not to be?');
        print Dumper $words_bag;
        #$VAR1 = {
        #          '1422534433' => '1',
        #          '4120516737' => '2',
        #          '1439817409' => '2',
        #          '3087870273' => '1'
        #        };

        my $feature_vector = $sfx->hash_fv("thats the question", 5);
        print Dumper $feature_vector;
        #$VAR1 = [
        #          '0',
        #          '1',
        #          '0',
        #          '1',
        #          '0'
        #        ];

DESCRIPTION
    XS implementation of a very fast combined parser/hasher which works well
    on a variety of *bag-of-words* problems.

    The original implementation is in Java and was adapted for better
    Unicode compliance.

METHODS
  new([$seed])
    Initializes the parser/hasher, optionally using the specified $seed
    (default: 1).

  hash($string)
    Parses $string and returns a hash reference where the keys are hashed
    tokens and the values are their respective counts.

  hash_fv($string, $n)
    Parses $string and returns a feature vector with $n elements.

  hash_min($string)
    Parses $string and returns the token hash with the lowest value.

REFERENCES
    *   Extremely Fast Text Feature Extraction for Classification and
        Indexing by George Forman and Evan Kirshenbaum

AUTHOR
    Stanislaw Pusep

COPYRIGHT AND LICENSE
    This software is copyright (c) 2012 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.
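
EXAMPLE
    The bag returned by hash() maps hashed tokens to counts, so two bags
    can be compared key by key. The sketch below is one possible use, not
    part of the module: jaccard() is a hypothetical helper written for
    this illustration, and the two input strings are arbitrary. It relies
    only on new() and hash() as documented above, and it skips the
    module-dependent part when Text::SpeedyFx is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Text::SpeedyFx): Jaccard similarity
# between two bags shaped like the return value of hash(), that is,
# hash references mapping a hashed token to its count.
# Counts are ignored; only the presence of a key matters.
sub jaccard {
    my ($bag_a, $bag_b) = @_;

    # Number of hashed tokens that occur in both bags.
    my $shared = grep { exists $bag_b->{$_} } keys %$bag_a;

    # Size of the union of both key sets.
    my $total = scalar(keys %$bag_a) + scalar(keys %$bag_b) - $shared;

    # Two empty bags are treated as having similarity 0.
    return $total ? $shared / $total : 0;
}

# Run the module-dependent part only when Text::SpeedyFx is available.
if (eval { require Text::SpeedyFx; 1 }) {
    my $sfx = Text::SpeedyFx->new;

    my $first  = $sfx->hash('To be or not to be?');
    my $second = $sfx->hash('To be, or not to be, that is the question');

    printf "similarity: %.2f\n", jaccard($first, $second);
}
```

    The helper works on any pair of hash references, so it can be checked
    without the module by passing hand-written bags such as
    { 1 => 1, 2 => 2 } and { 2 => 1, 3 => 1 }, which share one key out of
    three.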