Comparison

  • Cosine Similarity: This measures the cosine of the angle between two text vectors. A score close to 1 (a small angle) indicates high similarity, while a score close to 0 (an angle near 90°) indicates dissimilarity. Because it depends on the direction of the vectors rather than their magnitude, it is useful when text lengths vary significantly.
    • **Metric Calculation**: \[ \frac{A \cdot B}{\|A\| \times \|B\|} \] Here, \( A \cdot B \) represents the dot product of two vectors, and \( \|A\| \) and \( \|B\| \) are their magnitudes (Euclidean norms).

  • Jaccard Similarity: This compares two texts as sets of words, dividing the number of shared words (intersection) by the total number of unique words (union). It works well for short text comparisons, such as document deduplication.
    • **Metric Calculation**: \[ \frac{|A \cap B|}{|A \cup B|} \] Here, \( |A \cap B| \) is the count of words common to both texts, and \( |A \cup B| \) is the total number of unique words across both texts.

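  • Levenshtein Distance: This measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to convert one text into another. A lower distance indicates greater similarity. To make it comparable with the other metrics, the distance is converted into a similarity score between 0 and 1.
    • **Metric Calculation**: \[ 1 - \frac{\text{Levenshtein Distance}}{\max(|A|, |B|)} \] Here, \( |A| \) and \( |B| \) are the lengths of the two texts. Dividing by the longer length keeps the score between 0 and 1 regardless of text length, and subtracting from 1 turns the distance into a similarity, so a score of 1 means the texts are identical.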
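
The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation, and it rests on several assumptions that the text does not specify: texts are lowercased and split on whitespace, cosine similarity uses raw word counts as the vectors, Levenshtein distance is computed over characters, and empty inputs are handled by convention (noted in the comments). The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two word-count vectors."""
    # Assumption: vectors are raw word counts after lowercasing and splitting.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Size of the word-set intersection divided by the size of the union."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty texts count as identical
    return len(a & b) / len(a | b)


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, or substitutions."""
    # Dynamic programming over one row at a time: previous[j] holds the
    # distance between the first i-1 characters of a and the first j of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance rescaled to a 0..1 similarity using the longer text's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty texts count as identical
    return 1 - levenshtein_distance(a, b) / longest
```

For example, "the cat sat" and "the cat ran" share two of four unique words, so their Jaccard similarity is 0.5, while their cosine similarity on word counts is 2/3. Converting "kitten" to "sitting" takes three edits, giving a normalized Levenshtein similarity of 1 - 3/7.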