NAME
    Set::Similarity - similarity measures for sets

SYNOPSIS
     use Set::Similarity::Dice;
 
     # object method
     my $dice = Set::Similarity::Dice->new;
     my $similarity = $dice->similarity('Photographer','Fotograf');
 
     # class method
     my $dice = 'Set::Similarity::Dice';
     my $similarity = $dice->similarity('Photographer','Fotograf');
 
     # from 2-grams
     my $width = 2;
     my $similarity = $dice->similarity('Photographer','Fotograf',$width);
 
     # from arrayref of tokens
     my $similarity = $dice->similarity(['a','b'],['b']);

     # from hashref of features (1 = feature present, 0 = absent)
     my $bird = {
       wings    => 1,
       eyes     => 1,
       feathers => 1,
       hairs    => 0,
       legs     => 1,
       arms     => 0,
     };
     my $mammal = {
       wings    => 0,
       eyes     => 1,
       feathers => 0,
       hairs    => 1,
       legs     => 1,
       arms     => 1,
     };
     my $similarity = $dice->similarity($bird,$mammal);
 
     # from arrayref sets
     my $bird = [qw(
       wings
       eyes
       feathers
       legs
     )];
     my $mammal = [qw(
       eyes
       hairs
       legs
       arms
     )];
     my $similarity = $dice->from_sets($bird,$mammal);

DESCRIPTION
    This is the base class. It mainly provides helper and convenience
    methods for the similarity measures below.

  Overlap coefficient
    ( A intersect B ) / min(A,B)
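
    As a rough illustration of the formula above, the following sketch
    computes the overlap coefficient for two arrayrefs treated as sets. The
    helper name overlap_coefficient is hypothetical and not part of this
    distribution, and returning 0 when either set is empty is an assumption
    of the sketch.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper, not part of Set::Similarity.
sub overlap_coefficient {
    my ($set1, $set2) = @_;

    # Hash keys remove duplicates, so each arrayref is treated as a set.
    my %in_a = map { $_ => 1 } @$set1;
    my %in_b = map { $_ => 1 } @$set2;

    my $common  = grep { $in_b{$_} } keys %in_a;                  # |A intersect B|
    my $smaller = min(scalar(keys %in_a), scalar(keys %in_b));   # min(|A|,|B|)

    return $smaller ? $common / $smaller : 0;   # assumption: empty set gives 0
}

my $overlap = overlap_coefficient(
    [qw(wings eyes feathers legs)],
    [qw(eyes hairs legs arms)],
);
print "$overlap\n";   # prints 0.5
```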

  Jaccard Index
    The Jaccard coefficient measures similarity between sample sets, and is
    defined as the size of the intersection divided by the size of the union
    of the sample sets.

    ( A intersect B ) / (A union B)

    The Tanimoto coefficient is the ratio of the number of features common
    to both sets to the total number of features, i.e.

    ( A intersect B ) / ( A + B - ( A intersect B ) ) # the same as Jaccard

    The range is 0 to 1 inclusive.
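
    The Tanimoto form of the formula can be sketched in a few lines. The
    helper name jaccard_index is hypothetical and not part of this
    distribution, and returning 0 for two empty sets is an assumption of
    the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Set::Similarity.
sub jaccard_index {
    my ($set1, $set2) = @_;

    # Hash keys remove duplicates, so each arrayref is treated as a set.
    my %in_a = map { $_ => 1 } @$set1;
    my %in_b = map { $_ => 1 } @$set2;

    my $common = grep { $in_b{$_} } keys %in_a;   # |A intersect B|

    # |A union B| computed as |A| + |B| - |A intersect B|
    my $union = scalar(keys %in_a) + scalar(keys %in_b) - $common;

    return $union ? $common / $union : 0;   # assumption: two empty sets give 0
}

my $jaccard = jaccard_index(
    [qw(wings eyes feathers legs)],
    [qw(eyes hairs legs arms)],
);
printf "%.3f\n", $jaccard;   # prints 0.333
```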

  Dice coefficient
    The Dice coefficient is the number of features common to both sets
    relative to the average size of the two sets, i.e.

    ( A intersect B ) / ( 0.5 * ( A + B ) ) # the same as Sørensen

    The weighting factor comes from the 0.5 in the denominator. The range is
    0 to 1 inclusive.
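
    A minimal sketch of this formula, written as 2 * common / ( A + B ),
    which is the same value. The helper name dice_coefficient is
    hypothetical and not part of this distribution, and returning 0 for two
    empty sets is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Set::Similarity.
sub dice_coefficient {
    my ($set1, $set2) = @_;

    # Hash keys remove duplicates, so each arrayref is treated as a set.
    my %in_a = map { $_ => 1 } @$set1;
    my %in_b = map { $_ => 1 } @$set2;

    my $common   = grep { $in_b{$_} } keys %in_a;              # |A intersect B|
    my $combined = scalar(keys %in_a) + scalar(keys %in_b);   # |A| + |B|

    # common / (0.5 * combined) rewritten as 2 * common / combined
    return $combined ? 2 * $common / $combined : 0;   # assumption: empty sets give 0
}

printf "%.3f\n", dice_coefficient(['a', 'b'], ['b']);   # prints 0.667
```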

METHODS
    All methods can be used as class or object methods.

  new
      $object = Set::Similarity->new();

  similarity
      my $similarity = $object->similarity($any1,$any2,$width);

    $any1 and $any2 can each be an arrayref, a hashref or a string. Strings
    are tokenized into n-grams of width $width.

    $width must be an integer; otherwise it defaults to 1.

  from_tokens
      my $similarity = $object->from_tokens(['a','b'],['b']);

  from_sets
      my $similarity = $object->from_sets(['a'],['b']);

    Croaks if called directly. This method should be implemented in a child
    module.
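
    The pattern of overriding from_sets in a child module might look like
    the sketch below. To keep it runnable without the distribution
    installed, it uses a stand-in base class; the package names
    My::Similarity::Base and My::Similarity::Jaccard and the error message
    are hypothetical. With the real module, the child would inherit from
    Set::Similarity instead.

```perl
use strict;
use warnings;

# Stand-in base class so the sketch runs on its own.
# With the real module one would write: use parent 'Set::Similarity';
package My::Similarity::Base;
use Carp qw(croak);

sub new       { my ($class) = @_; return bless {}, ref($class) || $class }
sub from_sets { croak 'from_sets() must be implemented in a child module' }

# Hypothetical child module that supplies its own from_sets().
package My::Similarity::Jaccard;
our @ISA = ('My::Similarity::Base');

sub from_sets {
    my ($self, $set1, $set2) = @_;
    my %in_b   = map { $_ => 1 } @$set2;
    my $common = grep { $in_b{$_} } @$set1;   # assumes $set1 has no duplicates
    my $union  = @$set1 + @$set2 - $common;   # assumes both are already sets
    return $union ? $common / $union : 0;
}

package main;

my $measure = My::Similarity::Jaccard->new;
print $measure->from_sets(['a', 'b'], ['b']), "\n";   # prints 0.5

# Calling the stand-in base class directly dies.
my $ok = eval { My::Similarity::Base->from_sets(['a'], ['b']); 1 };
print "base class croaked\n" unless $ok;
```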

  intersection
      my $intersection_size = $object->intersection(['a'],['b']);

  uniq
      my @uniq = $object->uniq(['a','b']);

    Transforms an arrayref of strings into an array of unique elements.
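
    One common way to get unique elements in Perl is a "seen" hash, as in
    this sketch. The helper name unique_elements is hypothetical, and
    keeping first-occurrence order is a choice made here, not something the
    method above promises.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Set::Similarity.
sub unique_elements {
    my ($list) = @_;
    my %seen;
    # Keep an element only the first time it is encountered.
    return grep { !$seen{$_}++ } @$list;
}

my @unique = unique_elements(['a', 'b', 'a', 'c', 'b']);
print "@unique\n";   # prints a b c
```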

  combined_length
      my $set_size_sum = $object->combined_length(['a'],['b']);

  min
      my $min_set_size = $object->min(['a'],['b']);

  ngrams
      my @monograms = $object->ngrams('abc');
      my @bigrams = $object->ngrams('abc',2);
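
    Tokenizing a string into overlapping n-grams can be sketched with a
    sliding window. The helper name char_ngrams is hypothetical, and the
    handling of invalid widths and of strings shorter than the width are
    assumptions of the sketch, not documented behaviour of the method
    above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Set::Similarity.
sub char_ngrams {
    my ($string, $width) = @_;

    # Assumption: fall back to width 1 unless given a positive integer.
    $width = 1 unless defined $width && $width =~ /^[1-9][0-9]*$/;

    # Assumption: a string shorter than the width is returned as one token.
    return ($string) if length($string) < $width;

    # Slide a window of $width characters over the string, one step at a time.
    return map { substr($string, $_, $width) } 0 .. length($string) - $width;
}

print join(' ', char_ngrams('abc')),    "\n";   # prints a b c
print join(' ', char_ngrams('abc', 2)), "\n";   # prints ab bc
```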

  _any
      my $arrayref = $object->_any($any,$width);

SEE ALSO
    Set::Similarity::Cosine

    Set::Similarity::Dice

    Set::Similarity::Jaccard

    Set::Similarity::Overlap

    Bag::Similarity does the same for bags or multisets.

    Text::Levenshtein for distance measures of strings, and an overview of
    similar modules.

    <http://en.wikipedia.org/wiki/String_metric> for an overview of
    similarity measures.

    Cluster::Similarity for clusters.

SOURCE REPOSITORY
    <http://github.com/wollmers/Set-Similarity>

AUTHOR
    Helmut Wollmersdorfer, <helmut.wollmersdorfer@gmail.com>

COPYRIGHT AND LICENSE
    Copyright (C) 2013-2014 by Helmut Wollmersdorfer

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.