NAME AI::Categorize - Automatically categorize documents based on content SYNOPSIS ### This is one of the categorizers available (see below for more) use AI::Categorize::NaiveBayes; my $c = new AI::Categorize::NaiveBayes(); ### Supply some training documents so it can learn how to categorize $c->stopwords('the','a','and','but','I'); # Ignore these words $c->add_document($name, \@categories, $content); ... repeat for many documents, then: $c->crunch(); $c->save_state('filename'); # Save machine for later use ### Categorize a new unknown document my $c = new AI::Categorize::NaiveBayes(); $c->restore_state('filename'); my $results = $c->categorize($content); if ($results->in_category('sports')) { ... } my @cats = $results->categories; my @scores = $results->scores(@cats); DESCRIPTION This module implements several algorithms for automatically guessing category information of documents based on the category information of existing documents. For example, one might categorize incoming email messages in order to place them into existing mailboxes, or one might categorize newspaper articles by general topic (business, sports, etc.). All of the categorizers learn their categorization rules from a body of existing pre-categorized documents. Disclaimer: the results of any of these algorithms are far from infallible (close to fallible?). Categorization of documents is often a difficult task even for humans well-trained in the particular domain of knowledge, and there are many things a human would consider that none of these algorithms consider. These are only statistical tests - at best they are neat tricks or helpful assistants, and at worst they are totally unreliable. If you plan to use this module for anything important, human supervision is essential. But this voodoo can be quite fun. =) ALGORITHMS Currently two different algorithms are implemented in this bundle: AI::Categorize::NaiveBayes AI::Categorize::kNN These are all subclasses of `AI::Categorize'. Please see the documentation of these individual modules for more details on their guts and quirks. The common interface for all the algorithms is described here. All these classes are designed to be subclassible so you can modify their behavior to suit your needs. AI::Categorize Methods * new() Creates a new categorizer object (hereafter referred to as `$c'). The arguments to `new()' will depend on which subclass of `AI::Categorize' you happen to be using. See the subclasses' individual documentation for more info. * $c->stopwords() * $c->stopwords(@words) Gets (and optionally sets) the list of stopwords. Stopwords are words that should be ignored by the categorizer, and typically they are the most common non-informative words in the documents. The most common reason to use stopwords is to reduce processing time. The stoplist should be set before processing any documents. * $c->stopwords_hash() Returns the stopwords as the keys of a hash reference. The corresponding values are all 1. Can be useful for quick checking of whether a word is a stopword. * $c->add_stopword($word) Adds a single entry to the stopword list. * $c->add_document($name, $categories, $content) Adds a new training document to the database. `$name' should be a unique string identifying this document. `$categories' may be either the name of a single category to which this document belongs, or a reference to an array containing the names of several categories. `$content' is the text content of the document. To ease syntax, in the future `$content' may be allowed to be given as a path to the document, which will be opened and parsed. * $c->crunch() After all documents have been added, call `crunch()' so that the categorizer can compute some statistics on the training data and get ready to categorize new documents. * $c->categorize($content) Processes the text in `$content' and returns an object blessed into the `AI::Categorize::Result' class (hereafter abbreviated as `$r'). To ease memory requirements, in the future `$content' may be allowed to be passed as a filehandle. * $c->cat_map() Returns the 'category map' object, which is an object of the class `AI::Categorize::Map'. See the documentation for this class below. * $c->save_state($filename) At any time you may save the state of the categorizer to a file, so that you can reload it later using the `restore_state()' method. * $c->restore_state($filename) Reads in the categorizer data from $filename, which should have previously been saved using the `save_state()' method. * $c->error(\@assigned_categories, \@correct_categories) Returns the fraction of the binary (yes/no) categorization decisions that were incorrect. For instance, if there are 4 entries in `@assigned_categories', 5 entries in `@correct_categories', 2 of those catagories overlap, and there are a total of 20 categories to choose from, then the error is `(2+3)/20 = 5/20 = 0.25'. In this case, 2 errors of false assignment were made, and 3 errors of omission were made. The accuracy and error will always have a sum of 1. * $c->accuracy(\@assigned_categories, \@correct_categories) Returns the fraction of the binary (yes/no) categorization decisions that were correct. For instance, if there are 4 entries in `@assigned_categories', 5 entries in `@correct_categories', 2 of those catagories overlap, and there are a total of 20 categories to choose from, then the accuracy is `(2+13)/20 = 15/20 = 0.75'. In this case, 2 correct assignments were made, and 13 categories were correctly omitted. The accuracy and error will always have a sum of 1. * $c->F1(\@assigned_categories, \@correct_categories) This method computes the F1 measure, which is helpful for evaluating how well the categorizer did when it assigned categories. The F1 measure is defined to be 2 times the number of correctly assigned categories divided by the sum of the number of assigned categories and correct categories. In other words, if A is the set of categories that were assigned by the system, C is the set of categories that should have been assigned by the system, and I is the intersection of A and C, then 2*I F1 = ------- A + C (Other sources may define F1 as `2*recall*precision/(recall+precision)', which is equivalent to the above formula but forces division by zero if either A or C is empty.) A perfect job categorizing (all correct categories were assigned and no extras were assigned) will have an F1 score of 1. A terrible job categorizing (no overlap between correct & assigned categories) will have an F1 score of 0. Medium jobs will be somewhere in between. * $r->extract_words($text) Returns a reference to a hash whose keys are the words contained in `$text' and whose values are the number of times each word appears. Stopwords are omitted and words are put into canonical form (lower-cased, leading & trailing non-word characters stripped). Don't call this method directly, as it is used internally by the various categorization modules. However, you may be interested in subclassing one of the modules and overriding `extract_words()' to behave differently. For instance, you may want to "lemmatize" your words to remove affixes so that "abominable", "abominableness", "abominably", "abominate", "abomination", and "abominator" all share a single entry in the categorizer. `AI::Categorize::Result' Methods An `AI::Categorize::Result' object is returned by the `$c->categorize' method, described above. * $r->in_category($category) Returns true or false depending on whether the document was placed in the given category. * $r->categories() Returns an ordered list of the categories the document was placed in, with best matches first. * $r->scores(@categories) Returns a list of result scores for the given categories. Since the interface is still changing, not very much can officially be said about the scores, except that a good score is higher than a bad score. This may change to something like a probability scale, with all numbers between 0 and 1, and a threshold for membership somewhere in between. Please consider the scoring feature somewhat unstable for now. `AI::Categorize::Map' Methods The `AI::Categorize::Map' class manages the relationships between documents and categories. It is designed to support fast lookups either by category or by document. An `AI::Categorize::Map' object is returned by the `$c->cat_map' method (described above) and will be called `$m' for convenience in this documentation. In general, feel free to make any queries of the map, but don't make any changes to its data. Changes should usually only be made through the categorizer object. * $m->add_document($doc_name, \@cat_names) Adds the given document to the map with membership in the given categories. The `AI::Categorize' `add_document()' method calls this internally. * $m->categories() Returns a list of all known categories in a list context, or the number of known categories in a scalar context. * $m->documents() Returns a list of all known categories in a list context, or the number of known categories in a scalar context. * $m->categories_of($doc_name) Returns a list of categories that `$doc_name' belongs to in a list context, or the number of such categories in a scalar context. * $m->documents_of($cat_name) Returns a list of documents that `$cat_name' contains in a list context, or the number of such documents in a scalar context. * $m->is_in_category($doc_name, $cat_name) Returns true if `$doc_name' belongs to the `$cat_name' category, or false otherwise. * $m->contains_document($cat_name, $doc_name) Returns true if `$cat_name' contains the document `$doc_name', or false otherwise. Note that this is just a synonym for the `is_in_category()' method with the names turned around. CAVEATS Don't depend on the specific scores given by `$r->scores'. They may change in future releases. The entire categorizer is currently created in memory, which can get pretty demanding if you have a lot of data. If this turns out to be a problem, future versions may try to cache large chunks on disk. This would come with a speed penalty. Finally, I am not an expert in document categorization. I have thought about it some, and I have written these modules largely as a way to concretize my thinking and learn more about the processes. If you know of ways to improve accuracy, please let me know. TO DO Idea from obvy: try tying into infobot (purl) to identify IRC moods: @moods = qw(indifferent flame_mode pissed inebriated happy sad) AUTHOR Ken Williams, ken@forum.swarthmore.edu COPYRIGHT Copyright 2000-2001 Ken Williams. All rights reserved. This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. SEE ALSO perl(1), DBI(3). "A re-examination of text categorization methods" by Yiming Yang the section on "http://www.cs.cmu.edu/~yiming/publications.html" Other links from Na'im Tyson: www.ruf.rice.edu/~barlow/corpus.html (corp. lx.) ciir.cs.umass.edu (info. ret) www.georgetown.edu/wilson/IR/IR.html (class in IR @ Georgetown University) www.research.att.com/~lewis (professional homepage of David Lews, one of the leaders in document categorization. you may want to visit his site sooner than the others since he has left AT&T research.)