Available Comparison Functions

Pre-Processing Functions

Function	Description	Example Input	Processed Output
Lowercase	Converts all text to lowercase so that comparisons are case-insensitive.	"Great Product! Works Perfectly."	"great product! works perfectly."
Remove Punctuation	Removes punctuation and other special characters so that only words remain.	"Great product! Works perfectly."	"Great product Works perfectly"
Remove Stop Words	Filters out commonly used words that add little meaning.	"This is an amazing product with great value."	"amazing product great value."
Remove Extra Whitespaces	Trims leading and trailing spaces and collapses repeated spaces between words into one.	"  The   product is excellent!  "	"The product is excellent!"
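
The four pre-processing functions above can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: the function names, the small STOP_WORDS set, and the order in which preprocess chains the steps are all assumptions made for the example.

```python
import re
import string

# Hypothetical stop-word list for illustration; a real list would be much longer.
STOP_WORDS = {"this", "is", "an", "a", "the", "with", "and", "of", "to"}


def lowercase(text: str) -> str:
    """Convert all characters to lowercase."""
    return text.lower()


def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    """Drop words that appear in STOP_WORDS (compared case-insensitively)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def remove_extra_whitespace(text: str) -> str:
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text: str) -> str:
    """Apply all four steps in one assumed order."""
    for step in (lowercase, remove_punctuation, remove_stop_words, remove_extra_whitespace):
        text = step(text)
    return text


print(preprocess("This is an amazing product with great value."))
# amazing product great value
```

Punctuation is removed before stop words in this sketch so that a word followed by a period or comma still matches the stop-word list.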

Similarity Functions

Levenshtein Distance

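This measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to convert one text into another. A lower distance indicates greater similarity. To make results comparable across texts, the distance is converted into a similarity score between 0 and 1, where 1 means the texts are identical.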

Metric Calculation: 1 - (Levenshtein Distance / max(|A|, |B|))

Dividing by the length of the longer text keeps the score between 0 and 1 regardless of text length, so scores for short and long texts can be compared directly.
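
As a rough sketch of the calculation above, the following Python computes the edit distance with the standard dynamic-programming approach and then normalizes it. The function names are assumptions, and returning 1.0 for two empty strings is a choice made here to avoid dividing by zero.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - (distance / max(|A|, |B|)); two empty strings count as identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
```

For "kitten" and "sitting" the distance is 3 and the longer text has 7 characters, so the similarity is 1 - 3/7, roughly 0.57.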

Jaccard Similarity

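This compares two sets of words by dividing the number of shared words (intersection) by the total number of unique words across both texts (union). Because it operates on sets, word order and repeated words do not affect the score. It works well for short text comparisons, such as document deduplication.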

Metric Calculation: |A ∩ B| / |A ∪ B|

Where |A ∩ B| is the count of common words, and |A ∪ B| is the total number of unique words from both texts.
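
The formula above maps directly onto Python sets. In this sketch the function name and the whitespace-based tokenization are assumptions, and treating two empty texts as identical (score 1.0) is a choice made here to avoid dividing by zero.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the sets of whitespace-separated words."""
    set_a = set(text_a.split())
    set_b = set(text_b.split())
    union = set_a | set_b
    if not union:
        return 1.0  # both texts are empty
    return len(set_a & set_b) / len(union)


print(jaccard_similarity("great product great value", "great value product quality"))
# 0.75
```

Here the two texts share 3 words (great, product, value) out of 4 unique words overall, giving 3/4 = 0.75. The repeated "great" in the first text has no effect.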

Cosine Similarity

This measures the cosine of the angle between two text vectors. A smaller angle (cosine closer to 1) indicates higher similarity, while a larger angle (cosine closer to 0) suggests dissimilarity. Because the score depends on the direction of the vectors rather than their magnitude, it is useful when text length varies significantly.

Metric Calculation: (A ⋅ B) / (||A|| × ||B||)

Here, A ⋅ B represents the dot product of the two vectors, and ||A|| and ||B|| are their magnitudes (Euclidean norms).
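
A minimal Python sketch of this calculation follows. The text above does not say how texts are turned into vectors, so this example assumes simple word-count vectors. The function name and the choice to return 0.0 when either text is empty are also assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """(A . B) / (||A|| * ||B||) using word-count vectors."""
    vec_a = Counter(text_a.split())
    vec_b = Counter(text_b.split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


short_text = "great product"
long_text = "great product great product"
print(round(cosine_similarity(short_text, long_text), 6))  # 1.0
```

The second text is twice as long as the first but uses the same words in the same proportions, so the vectors point in the same direction and the score is 1 (up to floating-point rounding).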