NAME Text::SpeedyFx - tokenize/hash large amount of strings efficiently VERSION version 0.008 SYNOPSIS use Data::Dumper; use Text::SpeedyFx; my $sfx = Text::SpeedyFx->new; my $words_bag = $sfx->hash('To be or not to be?'); print Dumper $words_bag; #$VAR1 = { # '1422534433' => '1', # '4120516737' => '2', # '1439817409' => '2', # '3087870273' => '1' # }; my $feature_vector = $sfx->hash_fv("thats the question", 8); print unpack('b*', $feature_vector); # 01001000 DESCRIPTION XS implementation of a very fast combined parser/hasher which works well on a variety of *bag-of-word* problems. Original implementation is in Java and was adapted for a better Unicode compliance. METHODS new([$seed, $bits]) Initialize parser/hasher, can be customized with the options: $seed Hash seed (default: 1). $bits How many bits do represent one character. The default value, 8, sacrifices Unicode handling but is fast and low on memory footprint. The value of 18 encompasses *Basic Multilingual*, *Supplementary Multilingual* and *Supplementary Ideographic* planes. See also "UNICODE SUPPORT" hash($octets) Parses $octets and returns a hash reference where keys are the hashed tokens and values are their respective count. $octets are assumed to represent UTF-8 string unless Text::SpeedyFx is instantiated with "$bits" == 8 (which forces Latin-1 mode, see "UNICODE SUPPORT"). Note that this is the slowest form due to the (computational) complexity of the Perl hash structure itself: "hash_fv()"/"hash_min()" variants are up to 1000% faster! hash_fv($octets, $n) Parses $octets and returns a feature vector (string of bits) with length $n. $n is supposed to be a multiplier of 8, as the length of the resulting feature vector is "ceil($n / 8)". See the included utilities cosine_cmp and uniq_wc. hash_min($octets) Parses $octets and returns the hash with the lowest value. Useful in MinHash implementation. See also the included minhash_cmp utility. UNICODE SUPPORT Due to the nature of Perl, Unicode support is handled differently from the original implementation. By default, Text::SpeedyFx recognizes UTF-8 encoded code points in the range *00000-2FFFF*: * Plane 0, the Basic Multilingual Plane (BMP, *0000–FFFF*) * Plane 1, the Supplementary Multilingual Plane (SMP, *10000–1FFFF*) * Plane 2, the Supplementary Ideographic Plane (SIP, *20000–2FFFF*) * There are planes up to 16; however, as in Perl v5.16.2, there are no code points matching "isALNUM_utf8()" there (so it's irrelevant for proper algorithm operation). Although, there is a major drawback: in this mode, each instance allocates up to 1 MB of memory. If the application doesn't need to support code points beyond the Plane 0 (like the original SpeedyFx implementation) it is possible to constraint the address space to 16 bits, which lowers memory allocation to up to 256 KB. In fact, Text::SpeedyFx constructor accepts bit range between 8 and 18 to address code points. LATIN-1 SUPPORT 8 bit address space has one special meaning: it completely disables multibyte support. In 8 bit mode, each instance will only allocate 256 bytes and hashing will run up to 300% faster! Tokenization will fallback to *ISO 8859-1 West European languages (Latin-1)* character definitions. BENCHMARK The test platform configuration: * Intel® Xeon® E5620 CPU @ 2.40GHz (similar to the one cited in the reference paper); * Debian 6.0.6 (Squeeze, 64-bit); * Perl v5.16.1 (installed via perlbrew); * enwik8 from the Large Text Compression Benchmark . Rate murmur_utf8 hash_utf8 hash hash_fv_utf8 hash_min hash_fv murmur_utf8 6 MB/s -- -53% -62% -87% -97% -97% hash_utf8 13 MB/s 111% -- -19% -73% -93% -93% hash 16 MB/s 161% 23% -- -67% -91% -92% hash_fv_utf8 49 MB/s 693% 275% 204% -- -73% -75% hash_min 179 MB/s 2815% 1278% 1018% 267% -- -7% hash_fv 193 MB/s 3045% 1387% 1106% 296% 8% -- All the tests except the ones with "_utf8" suffix were made in Latin-1 mode. For comparison, "murmur_utf8" was implemented using Digest::MurmurHash hasher and native regular expression tokenizer: ++$fv->{murmur_hash(lc $1)} while $data =~ /(\w+)/gx; See also the eg/benchmark.pl script. REFERENCES * Extremely Fast Text Feature Extraction for Classification and Indexing by George Forman and Evan Kirshenbaum * MinHash — выявляем похожие множества * Фильтр Блума * cosine_cmp, minhash_cmp and uniq_wc utilities from this distribution AUTHOR Stanislaw Pusep COPYRIGHT AND LICENSE This software is copyright (c) 2012 by Stanislaw Pusep. This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.