Available Functions
Pre-Processing Functions
Function | Description | Example Input | Processed Output |
---|---|---|---|
Lowercase | Converts all characters to lowercase so that case differences do not affect matching. | "Great Product! Works Perfectly." | "great product! works perfectly." |
Remove Punctuation | Removes punctuation and other special characters so that only words remain. | "Great product! Works perfectly." | "Great product Works perfectly" |
Remove Stop Words | Filters out commonly used words that add little meaning. | "This is an amazing product with great value." | "amazing product great value." |
Remove Extra Whitespaces | Trims leading and trailing whitespace and standardizes spacing between words. | " The product is excellent! " | "The product is excellent!" |
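The four pre-processing steps can be sketched in Python as below. This is a minimal illustration, not the product's actual implementation: the function names, the tiny `STOP_WORDS` set, and the order used in `preprocess` are all assumptions made for the example.

```python
import re
import string

# Hypothetical stop-word list for illustration only; real lists are much longer.
STOP_WORDS = {"this", "is", "an", "a", "the", "with", "and", "of", "to"}


def lowercase(text: str) -> str:
    """Convert every character to lowercase."""
    return text.lower()


def remove_punctuation(text: str) -> str:
    """Drop ASCII punctuation characters, keeping letters, digits, and spaces."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text: str) -> str:
    """Remove whitespace-separated tokens that appear in STOP_WORDS."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def remove_extra_whitespace(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text: str) -> str:
    """Apply all four steps in one assumed order."""
    text = lowercase(text)
    text = remove_punctuation(text)
    text = remove_stop_words(text)
    return remove_extra_whitespace(text)


print(preprocess("This is an amazing product with great value."))
# prints: amazing product great value
```

Each function here changes only what its name says, so a step applied on its own leaves case, punctuation, or spacing untouched unless another step is chained after it.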
Similarity Functions
Levenshtein Distance
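This measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to convert one text into another. A lower distance indicates greater similarity. To make scores comparable across texts, the distance is normalized into a similarity score between 0 and 1, where 1 means the texts are identical.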
- Metric Calculation: 1 - (Levenshtein Distance / max(|A|, |B|))
Here, |A| and |B| are the lengths of the two texts in characters. Dividing by the longer length keeps the score between 0 and 1 regardless of text length, making comparisons more meaningful.
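As a worked illustration of the formula, the sketch below computes the edit distance with the standard dynamic-programming approach and then normalizes it. The function names are hypothetical, and returning 1.0 for two empty strings is an assumption, since the formula is undefined when both lengths are zero.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max(len(a), len(b)), as in the metric above."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty texts are treated as identical
    return 1 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # prints: 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))  # prints: 0.571
```

For "kitten" and "sitting" the distance is 3 and the longer text has 7 characters, so the score is 1 - 3/7, or about 0.571.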
Jaccard Similarity
This compares two sets of words by dividing the number of shared words (intersection) by the total unique words (union). It works well for short text comparisons, such as document deduplication.
- Metric Calculation: |A ∩ B| / |A ∪ B|
Here, |A ∩ B| is the count of words common to both texts, and |A ∪ B| is the total number of unique words across both texts.
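The set-based calculation can be sketched as follows. Splitting on whitespace to obtain words and returning 1.0 when both texts are empty are assumptions of this example, and the function name is hypothetical.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Shared unique words divided by total unique words."""
    set_a = set(text_a.split())
    set_b = set(text_b.split())
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts are treated as identical
    return len(set_a & set_b) / len(union)


print(jaccard_similarity("great product great value", "great product poor value"))
# prints: 0.75
```

In this example the texts share three unique words (great, product, value) out of four in total, giving 3/4. Because sets are used, repeated words and word order have no effect on the score.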
Cosine Similarity
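This measures the cosine of the angle between two text vectors. A smaller angle gives a cosine closer to 1 and indicates higher similarity, while a larger angle gives a cosine closer to 0 and suggests dissimilarity. It is useful when text length varies significantly, because the score depends on the direction of the vectors rather than their magnitude.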
- Metric Calculation: (A ⋅ B) / (||A|| × ||B||)
Here, A ⋅ B represents the dot product of the two vectors, and ||A|| and ||B|| are their magnitudes (Euclidean norms).
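The sketch below applies the formula to simple word-count vectors. How the texts are turned into vectors is not specified here, so the bag-of-words representation is an assumption (TF-IDF weights or embeddings are common alternatives), as are the function name and the choice to return 0.0 when either text is empty.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two word-count vectors."""
    vec_a = Counter(text_a.split())
    vec_b = Counter(text_b.split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


# Same words in the same proportions, but one text is twice as long.
print(cosine_similarity("great product", "great product great product"))
# No words in common.
print(cosine_similarity("great product", "poor value"))
```

The first pair scores approximately 1 even though the texts differ in length, since their vectors point in the same direction. The second pair shares no words, so the dot product and the score are 0.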