UMLS::Similarity

SYNOPSIS

This package consists of Perl modules, along with supporting Perl programs, that implement the semantic similarity and relatedness measures described by Leacock & Chodorow (1998), Wu & Palmer (1994), Nguyen & Al-Mubaid (2006), Rada et al. (1989), Patwardhan (2003), Jiang & Conrath (1997), Resnik (1995), Lin (1998) and Patwardhan & Pedersen (2006), as well as a simple path-based measure.

UMLS::Similarity requires the UMLS::Interface module to access the Unified Medical Language System (UMLS) in order to determine the similarity between two UMLS concepts.

The Perl modules are designed as objects with methods that take two UMLS concepts as input and return the semantic relatedness of those concepts.

A quantitative measure of the degree to which two concepts are related has applications in many areas, such as word sense disambiguation and information retrieval. For example, to determine which sense of a word is being used in a particular context, the sense with the highest relatedness to the senses of its context words is the most likely candidate. Similarly, in information retrieval, retrieving documents that contain highly related concepts is more likely to yield higher precision and recall.

The following sections describe the organization of this software package and how to use it. A few typical examples are given to illustrate the usage of the modules and the supporting utilities.

SEMANTIC RELATEDNESS

Humans find it extremely easy to say whether two words are related, and whether one word is more related to a given word than another. For example, if we come across the two words 'car' and 'bicycle', we know they are related because both are means of transport. We also easily observe that 'bicycle' is more related to 'car' than 'fork' is. But is there some way to assign a quantitative value to this relatedness?
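
One simple way to answer that question is to count steps in an is-a hierarchy. The sketch below is only an illustration and does not use this package or the UMLS: the five-concept taxonomy in %parent and both subroutine names are made up for the example. It scores two concepts as the reciprocal of the number of nodes on the shortest path between them, which is a common formulation of a path-based measure, so that closer concepts receive higher scores.

```perl
use strict;
use warnings;

# Hypothetical toy taxonomy: each concept maps to its single is-a parent.
# 'entity' is the root and therefore has no entry.
my %parent = (
    vehicle => 'entity',
    utensil => 'entity',
    car     => 'vehicle',
    bicycle => 'vehicle',
    fork    => 'utensil',
);

# Return the nodes from a concept up to the root, both inclusive.
sub path_to_root {
    my ($concept) = @_;
    my @path = ($concept);
    while (exists $parent{ $path[-1] }) {
        push @path, $parent{ $path[-1] };
    }
    return @path;
}

# Path similarity: 1 / (number of nodes on the shortest path).
sub path_similarity {
    my ($c1, $c2) = @_;

    # Map every ancestor of $c1 (and $c1 itself) to its distance in edges.
    my @p1 = path_to_root($c1);
    my %edges_from_c1;
    @edges_from_c1{@p1} = (0 .. $#p1);

    # Walk up from $c2; the first shared node is the lowest common ancestor.
    my @p2 = path_to_root($c2);
    for my $i (0 .. $#p2) {
        my $node = $p2[$i];
        next unless exists $edges_from_c1{$node};
        # Edges on both sides of the ancestor, plus one, gives the node count.
        return 1 / ($edges_from_c1{$node} + $i + 1);
    }
    return 0;    # no common ancestor in this taxonomy
}

printf "%-7s %-7s %.3f\n", 'car', 'bicycle', path_similarity('car', 'bicycle');
printf "%-7s %-7s %.3f\n", 'car', 'fork',    path_similarity('car', 'fork');
```

In this toy hierarchy 'car' and 'bicycle' meet at 'vehicle' (three nodes on the path), while 'car' and 'fork' only meet at the root (five nodes), so the first pair scores higher, matching the intuition described above.
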
Some ideas have been put forth by researchers to quantify the relatedness of words, with encouraging results. This software package implements a number of these measures of relatedness, including a simple edge-counting approach. The measures require UMLS-Interface, which defines the UMLS concepts and some basic relationships between them.

CONTENTS

All the modules that will be installed in the Perl system directory are in the lib/ directory tree of the package. These are the semantic relatedness modules:

    UMLS/Similarity/lch.pm
    UMLS/Similarity/path.pm
    UMLS/Similarity/wup.pm
    UMLS/Similarity/nam.pm
    UMLS/Similarity/cdist.pm
    UMLS/Similarity/res.pm
    UMLS/Similarity/lin.pm
    UMLS/Similarity/jcn.pm
    UMLS/Similarity/random.pm
    UMLS/Similarity/vector.pm (beta)

Once installed in the Perl system directory, these modules can be used directly by Perl programs.

The package also contains a utils/ directory of Perl utility programs. These utilities use the modules or provide some supporting functionality:

    umls-similarity.pl -- returns the semantic similarity of two terms or UMLS CUIs given a specified measure (and view of the UMLS).

INSTALL

To install these modules, run:

    perl Makefile.PL
    make
    make test
    make install

This installs the modules in the standard locations. You will most likely need root privileges to install in the standard system directories. To install in a non-standard directory, specify a prefix at the 'perl Makefile.PL' stage:

    perl Makefile.PL PREFIX=/home

Other parameters can also be modified during installation; the details can be found in the ExtUtils::MakeMaker documentation. However, we strongly recommend leaving the other parameters alone unless you know what you are doing.

SOFTWARE COPYRIGHT AND LICENSE

Copyright (C) 2004-2009 Bridget T McInnes, Siddharth Patwardhan, Serguei Pakhomov and Ted Pedersen

This suite of programs is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file 'GPL.txt' that you should have received with this distribution.

REFERENCES

1. Wu Z. and Palmer M. 1994. Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics. Las Cruces, New Mexico.
2. Resnik P. 1995. Using information content to evaluate semantic similarity. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448-453, Montreal.
3. Jiang J. and Conrath D. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics, Taiwan.
4. Fellbaum C., editor. WordNet: An electronic lexical database. MIT Press, 1998.
5. Leacock C. and Chodorow M. 1998. Combining local context and WordNet similarity for word sense identification. In Fellbaum 1998, pp. 265-283.
6. Lin D. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, Madison, WI.
7. Hirst G. and St-Onge D. 1998.
   Lexical Chains as representations of context for the detection and correction of malapropisms. In Fellbaum 1998, pp. 305-332.
8. Schütze H. 1998. Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97-123.
9. Resnik P. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Applications to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11, 95-130.
10. Budanitsky A. and Hirst G. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics. Pittsburgh, PA.
11. Banerjee S. and Pedersen T. 2002. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Fourth International Conference on Computational Linguistics and Intelligent Text Processing (CICLING-02). Mexico City.
12. Patwardhan S., Banerjee S. and Pedersen T. 2002. Using Semantic Relatedness for Word Sense Disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City.
13. Banerjee S. 2002. Adapting the Lesk algorithm for word sense disambiguation to WordNet. Master's thesis, University of Minnesota, Duluth.
14. Patwardhan S. 2003. Incorporating dictionary and corpus information into a vector measure of semantic relatedness. Master's thesis, University of Minnesota, Duluth.
15. Patwardhan S. and Pedersen T. 2006. Using WordNet Based Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL 2006 Workshop Making Sense of Sense - Bringing Computational Linguistics and Psycholinguistics Together, pp. 1-8, April 4, 2006, Trento, Italy.
16. Rada R., Mili H., Bicknell E. and Blettner M. Development and application of a metric on semantic nets.
    IEEE Transactions on Systems, Man, and Cybernetics, volume 19, pages 17-30, 1989.
17. Nguyen H.A. and Al-Mubaid H. 2006. New ontology-based semantic similarity measure for the biomedical domain. In Proceedings of the IEEE International Conference on Granular Computing, pages 623-628.

SEE ALSO

CONTACT US

If you have any trouble installing and using UMLS-Similarity, please contact us via the users mailing list:

    umls-similarity@yahoogroups.com

You can join this group by going to:

You may also contact us directly if you prefer:

    Bridget T. McInnes: bthomson at cs.umn.edu
    Ted Pedersen: tpederse at d.umn.edu

AUTHORS

    Bridget T McInnes, University of Minnesota Twin Cities
    bthomson at cs.umn.edu

    Siddharth Patwardhan, University of Utah
    sidd at cs.utah.edu

    Serguei Pakhomov, University of Minnesota Twin Cities
    pakh002 at umn.edu

    Ted Pedersen, University of Minnesota Duluth
    tpederse at d.umn.edu

    Ying Liu, University of Minnesota
    liux0395 at umn.edu

DOCUMENTATION COPYRIGHT AND LICENSE

Copyright (C) 2003-2009 Bridget T. McInnes, Siddharth Patwardhan, Serguei Pakhomov and Ted Pedersen.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at: and is included in this distribution as FDL.txt.