Share on Facebook Share on Twitter Email
Answers.com

Cosine similarity

 
Wikipedia: Cosine similarity

Cosine similarity [1][2] is a measure of similarity between two vectors of n dimensions by finding the cosine of the angle between them, often used to compare documents in text mining. Given two vectors of attributes, A and B, the cosine similarity, θ, is represented using a dot product and magnitude as

 \text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|}.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating independence, and in-between values indicating intermediate similarity or dissimilarity.

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

This cosine similarity metric may be extended such that it yields the Jaccard coefficient in the case of binary attributes. This is the Tanimoto coefficient, T(AB), represented as

 T(A,B) = {A \cdot B \over \|A\|^2 +\|B\|^2 - A \cdot B}.

See also

References


Search unanswered questions...
Enter a question here...
Search: All sources Community Q&A Reference topics
 
 

 

Copyrights:

Wikipedia. This article is licensed under the Creative Commons Attribution/Share-Alike License. It uses material from the Wikipedia article "Cosine similarity" Read more