Cosine Similarity is a metric used to measure how similar two vectors (or text documents) are, by computing the cosine of the angle between them. It is widely used in text mining and natural language processing (NLP) to compare documents or sentences, based on the assumption that similar documents will have similar word distributions.
Mathematically, Cosine Similarity is defined as:
Cosine Similarity = (A · B) / (‖A‖ × ‖B‖)

where A · B is the dot product of the two vectors and ‖A‖, ‖B‖ are their magnitudes (Euclidean norms).
Steps for Calculating Cosine Similarity:
- Tokenization: Break down the text into words.
- Vectorization: Convert the text into a vector representation (e.g., TF-IDF, Bag of Words).
- Compute the Dot Product: Find the dot product of the two vectors.
- Find the Magnitudes: Calculate the magnitude (norm) of each vector.
- Compute Cosine Similarity: Apply the formula above.
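The steps above can be sketched as a small, self-contained function using standard-library tools only. The tokenizer here is a deliberately simple lowercase-and-split, and the vectorization is a plain bag-of-words count; a real pipeline would typically use TF-IDF and proper tokenization.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    # Tokenization: lowercase and split on whitespace (simplified)
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()

    # Vectorization: bag-of-words counts over a shared vocabulary
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)

    # Dot product of the two count vectors
    dot = sum(counts_a[w] * counts_b[w] for w in vocab)

    # Magnitude (norm) of each vector
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # Apply the formula; guard against empty input
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score 1.0, and texts with no shared words score 0.0.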
Example:
Let’s say you have two vectors A = [1, 2, 3] and B = [4, 5, 6].
- Dot product: 1×4+2×5+3×6=32
- Magnitude of A: √(1² + 2² + 3²) = √14
- Magnitude of B: √(4² + 5² + 6²) = √77
- Cosine Similarity: 32 / (√14 × √77) ≈ 0.9746
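The worked example can be checked directly in a few lines of Python:

```python
import math

A = [1, 2, 3]
B = [4, 5, 6]

dot = sum(a * b for a, b in zip(A, B))     # 1*4 + 2*5 + 3*6 = 32
norm_a = math.sqrt(sum(a * a for a in A))  # sqrt(14)
norm_b = math.sqrt(sum(b * b for b in B))  # sqrt(77)

similarity = dot / (norm_a * norm_b)
print(round(similarity, 4))  # 0.9746
```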
A cosine similarity value closer to 1 indicates that the vectors are very similar, while a value closer to 0 indicates they are dissimilar.
Use Cases:
- Document similarity: Comparing two text documents.
- Recommendation systems: Finding similar items.
- Clustering: Grouping similar text documents.
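For the document-similarity use case, scikit-learn combines TF-IDF vectorization and pairwise cosine similarity in a few lines. This sketch assumes scikit-learn is installed; the example sentences are illustrative only.

```python
# Assumes scikit-learn is installed (pip install scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Cosine similarity measures the angle between vectors",
    "The angle between vectors gives cosine similarity",
    "A completely unrelated sentence about cooking pasta",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse TF-IDF matrix
sims = cosine_similarity(tfidf)                # pairwise similarity matrix

# The first two documents share most words, so they score higher
print(sims[0, 1] > sims[0, 2])
```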