Available Functions

Pre-Processing Functions

Function | Description | Example Input | Processed Output
-------- | ----------- | ------------- | ----------------
Lowercase | Converts all text to lowercase so that casing is consistent. | "Great Product! Works Perfectly." | "great product! works perfectly."
Remove Punctuation | Removes punctuation and special characters so that only words remain. | "Great product! Works perfectly." | "great product works perfectly"
Remove Stop Words | Filters out commonly used words that add little meaning. | "This is an amazing product with great value." | "amazing product great value"
Remove Extra Whitespaces | Standardizes spacing between words. | " The product is excellent! " | "the product is excellent!"
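
The four pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names and the small STOP_WORDS set are hypothetical, and each function here performs only its own step (so, for example, remove_punctuation alone does not lowercase the text).

```python
import re
import string

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"this", "is", "an", "a", "the", "with", "and", "of", "to", "in"}


def lowercase(text: str) -> str:
    """Convert all characters to lowercase."""
    return text.lower()


def remove_punctuation(text: str) -> str:
    """Drop every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    """Keep only the words that are not in STOP_WORDS (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def remove_extra_whitespaces(text: str) -> str:
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    raw = "This is an amazing product with great value."
    # Chain the steps: lowercase, strip punctuation, then drop stop words.
    print(remove_stop_words(remove_punctuation(lowercase(raw))))
```

Chaining the steps in this order is one plausible reading of the example outputs in the table, where later rows also appear lowercased.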



Similarity Functions


Levenshtein Distance

This measures the minimum number of edits (insertions, deletions, or substitutions) required to convert one text into another. A lower distance indicates greater similarity. The distance is then normalized into a similarity score between 0 and 1, where 1 means the texts are identical.

  • Metric Calculation: 1 - (Levenshtein Distance / max(|A|, |B|))

    Here, |A| and |B| are the lengths of the two texts. Dividing by the longer length accounts for differing text lengths and keeps the score between 0 and 1, making comparisons more meaningful.
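
As a rough sketch of the calculation, the following Python computes the edit distance with the standard dynamic-programming approach and then applies the normalization. The function names are hypothetical, and it assumes edits are counted per character and that two empty texts score 1.0; the tool itself may tokenize or handle empty input differently.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - (distance / max(|A|, |B|)); assumed to be 1.0 when both texts are empty."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    print(levenshtein_distance("kitten", "sitting"))    # 3
    print(levenshtein_similarity("kitten", "sitting"))  # 1 - 3/7
```

For "kitten" and "sitting" the distance is 3 and the longer text has 7 characters, so the score is 1 - 3/7, roughly 0.57.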


Jaccard Similarity

This compares two texts as sets of words, dividing the number of shared words (intersection) by the total number of unique words across both texts (union). Word order and repetition are ignored. It works well for short text comparisons, such as document deduplication.

  • Metric Calculation: |A ∩ B| / |A ∪ B|

    Where |A ∩ B| is the count of common words, and |A ∪ B| is the total number of unique words from both texts.
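
A minimal Python sketch of this ratio follows. The function name is hypothetical, and it assumes words are obtained by lowercasing and splitting on whitespace, and that two empty texts score 1.0; the tool's own tokenization may differ.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the sets of words in each text."""
    # Assumption: words are whitespace-separated and compared case-insensitively.
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts are treated as identical
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # A = {great, product, value}, B = {great, value, for, money}
    print(jaccard_similarity("great product great value", "great value for money"))  # 0.4
```

In the example, 2 words are shared out of 5 unique words overall, giving 2 / 5 = 0.4. The repeated "great" in the first text counts once because sets discard duplicates.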


Cosine Similarity

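This measures the cosine of the angle between two text vectors. A smaller angle gives a cosine closer to 1 and indicates higher similarity, while a larger angle gives a cosine closer to 0 and suggests dissimilarity. Because the score depends on the direction of the vectors rather than their magnitude, it is useful when text length varies significantly.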

  • Metric Calculation: (A ⋅ B) / (||A|| × ||B||)

    Here, A ⋅ B represents the dot product of the two vectors, and ||A|| and ||B|| are their magnitudes (Euclidean norms).
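
The formula can be sketched in Python as below. How the tool turns text into vectors is not stated here, so this example assumes simple word-count vectors built by lowercasing and splitting on whitespace; the function name is hypothetical, and returning 0.0 for an empty text is likewise an assumption.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """(A ⋅ B) / (||A|| × ||B||) using word-count vectors (an assumed representation)."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text has no direction, so score it 0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # The second text repeats the first, so the vectors point the same way.
    print(cosine_similarity("great product", "great product great product"))
```

The second text in the example is twice as long but has the same word proportions, so the score is 1 up to floating-point rounding. This is the length-insensitivity described above.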