Statistical profiling
Statistical profiling collects quantitative metrics about your dataset:

| Metric | Description |
|---|---|
| Row count | Total number of rows in the dataset, tracked over time to detect unexpected changes |
| Column count | Number of columns and their data types |
| Null counts | Per-column null value counts used to power completeness validations |
| Distinct values | Cardinality metrics used to power uniqueness validations |
| Distributions | Value distribution metrics surfaced in the health score |
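
To make the metrics above concrete, here is a minimal sketch of how they could be computed over an in-memory sample. It is not the profiler's actual implementation: the function name `profile_rows`, the output keys, and the use of a top-values count as a stand-in for a distribution are all assumptions made for illustration.

```python
from collections import Counter


def profile_rows(rows):
    """Compute simple statistical metrics for a list of dict rows.

    Hypothetical helper: the name and output shape are illustrative only.
    """
    # Collect every column name that appears in any row.
    columns = sorted({key for row in rows for key in row})
    profile = {
        "row_count": len(rows),
        "column_count": len(columns),
        "columns": {},
    }
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        profile["columns"][col] = {
            # Null count: basis for a completeness check.
            "null_count": len(values) - len(non_null),
            # Distinct count: basis for a uniqueness check.
            "distinct_count": len(set(non_null)),
            # Most frequent values: a crude view of the distribution.
            "top_values": Counter(non_null).most_common(3),
        }
    return profile


sample = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": None},
    {"id": 3, "city": "Pune"},
]
result = profile_rows(sample)
```

In this sample, `id` has as many distinct values as there are rows, so it would pass a uniqueness check, while `city` has one null and would fail a strict completeness check.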
Semantic profiling
Semantic profiling uses AI to analyze a sample of values from each column and determine what kind of data it contains, beyond the raw data type. For each column, the profiler produces:

| Field | Description |
|---|---|
| Data type | The storage type of the column (string, numeric, date, etc.) |
| Semantic type (general) | Broad category — for example, identifier, financial, contact, location |
| Semantic type (specific) | Precise type — for example, email, phone, aadhar, pan, api_key, address |
| Sensitivity | How sensitive the column is — used to flag columns for masking |
| Confidence | How confident the model is in its classification |
Sensitive column identification
Columns with a high sensitivity rating are automatically flagged as potentially containing PII or confidential data. These columns appear visually marked in the dataset view and are candidates for masking. Sensitivity is assessed based on the detected semantic type. For example, columns identified as email, phone, aadhar, pan, bank_account, or api_key are treated as sensitive.
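
The flagging rule described above can be sketched as a lookup on the specific semantic type. This is an illustration under stated assumptions, not the product's logic: the `ColumnProfile` class, the `is_sensitive` function, the 0 to 1 confidence scale, and the `min_confidence` threshold are all hypothetical. Only the list of sensitive types comes from the text.

```python
from dataclasses import dataclass

# Taken from the examples in the text; the real list may be longer.
SENSITIVE_TYPES = {"email", "phone", "aadhar", "pan", "bank_account", "api_key"}


@dataclass
class ColumnProfile:
    """Hypothetical record mirroring the semantic profiling fields."""
    name: str
    data_type: str
    semantic_general: str
    semantic_specific: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale


def is_sensitive(profile, min_confidence=0.8):
    """Flag a column when its specific type is sensitive.

    The confidence threshold is an assumption added for illustration.
    """
    return (
        profile.semantic_specific in SENSITIVE_TYPES
        and profile.confidence >= min_confidence
    )


profiles = [
    ColumnProfile("contact_email", "string", "contact", "email", 0.97),
    ColumnProfile("order_total", "numeric", "financial", "amount", 0.91),
    ColumnProfile("notes", "string", "contact", "phone", 0.40),
]
flagged = [p.name for p in profiles if is_sensitive(p)]
```

Here only `contact_email` is flagged: `order_total` has a non-sensitive type, and `notes` falls below the assumed confidence threshold.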
Masking strategies
Once a sensitive column is identified, the profiler recommends a masking strategy. The following strategies are supported:

| Strategy | Description | Example |
|---|---|---|
| keep_last_n | Mask all but the last N characters | ****9012 |
| keep_prefix | Keep the first N characters, mask the rest | AB******* |
| mask_username_in_email | Mask the email username, preserve the domain | r***@example.com |
| numeric_random_same_length | Replace digits with random digits, preserve separators | +91-8234167890 |
| date_mask_parts | Mask year, month, or day portions of a date | ****-08-15 |
| hash_sha256 | Replace with a SHA-256 hash (optionally salted) | 3a5b9f2c... |
| tokenize_uuid | Replace with a random UUID | f47ac10b-... |
| none | No masking applied | n/a |
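
As a rough illustration of what several of these strategies do to a value, the sketch below implements a subset of them in plain Python. The function signatures, default arguments, and the fixed three-character mask for email usernames are assumptions chosen to reproduce the examples in the table. The product's actual behavior may differ.

```python
import hashlib
import random


def keep_last_n(value, n=4, mask_char="*"):
    """Mask everything except the last n characters."""
    n = max(0, min(n, len(value)))
    return mask_char * (len(value) - n) + value[len(value) - n:]


def keep_prefix(value, n=2, mask_char="*"):
    """Keep the first n characters and mask the rest."""
    n = max(0, min(n, len(value)))
    return value[:n] + mask_char * (len(value) - n)


def mask_username_in_email(value, mask_char="*"):
    """Keep the first character of the username and the full domain."""
    local, sep, domain = value.partition("@")
    if not sep:
        # Not an email address: mask the whole value.
        return mask_char * len(value)
    return local[:1] + mask_char * 3 + "@" + domain


def numeric_random_same_length(value, rng=random):
    """Replace each digit with a random digit; leave separators alone."""
    return "".join(
        rng.choice("0123456789") if ch.isdigit() else ch for ch in value
    )


def hash_sha256(value, salt=""):
    """Return the hex SHA-256 digest of the (optionally salted) value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical dispatch table keyed by the strategy names from the table.
STRATEGIES = {
    "keep_last_n": keep_last_n,
    "keep_prefix": keep_prefix,
    "mask_username_in_email": mask_username_in_email,
    "numeric_random_same_length": numeric_random_same_length,
    "hash_sha256": hash_sha256,
    "none": lambda value: value,
}


def apply_mask(strategy, value):
    """Apply the named strategy to a single string value."""
    return STRATEGIES[strategy](value)


masked_card = apply_mask("keep_last_n", "56789012")
masked_code = apply_mask("keep_prefix", "AB1234567")
masked_email = apply_mask("mask_username_in_email", "ravi@example.com")
masked_phone = apply_mask("numeric_random_same_length", "+91-8234167890")
```

Note that hashing is deterministic, so equal inputs produce equal outputs and joins on the masked column still work, whereas the random-digit strategy gives a different result on every call.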
When profiling runs
Profiling runs automatically when:

- A dataset is first discovered after connecting a datasource
- You manually trigger a Rescan from the Assets view