Profiling runs automatically when a dataset is scanned. It combines statistical metrics with AI-powered semantic analysis to characterize your data, identify sensitive columns, and recommend masking strategies.

Statistical profiling

Statistical profiling collects quantitative metrics about your dataset:
| Metric | Description |
|---|---|
| Row count | Total number of rows in the dataset, tracked over time to detect unexpected changes |
| Column count | Number of columns and their data types |
| Null counts | Per-column null value counts used to power completeness validations |
| Distinct values | Cardinality metrics used to power uniqueness validations |
| Distributions | Value distribution metrics surfaced in the health score |
These metrics feed directly into the dataset’s health score and row trend charts visible in the Assets overview tab.
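
As a rough illustration of the metrics in the table above, the following sketch computes row count, column count, null counts, distinct values, and a simple value distribution over in-memory rows. The function name `profile_rows` and the output layout are hypothetical and do not describe the profiler's actual implementation.

```python
from collections import Counter

def profile_rows(rows):
    """Compute basic statistical metrics for a list of dict rows (illustrative only)."""
    # Collect every column name that appears in any row.
    columns = sorted({key for row in rows for key in row})
    stats = {"row_count": len(rows), "column_count": len(columns), "columns": {}}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        stats["columns"][col] = {
            # Null count: input to completeness checks.
            "null_count": len(values) - len(non_null),
            # Distinct count: input to uniqueness checks.
            "distinct_count": len(set(non_null)),
            # A small value distribution: the most frequent values.
            "top_values": Counter(non_null).most_common(3),
        }
    return stats

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
]
stats = profile_rows(rows)
```

Here `stats["columns"]["email"]` reports one null and one distinct value, which is the kind of signal that a completeness or uniqueness validation could build on.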

Semantic profiling

Semantic profiling uses AI to analyze a sample of values from each column and determine what kind of data it contains, beyond the raw data type. For each column, the profiler produces:
| Field | Description |
|---|---|
| Data type | The storage type of the column (string, numeric, date, etc.) |
| Semantic type (general) | Broad category, for example identifier, financial, contact, location |
| Semantic type (specific) | Precise type, for example email, phone, aadhar, pan, api_key, address |
| Sensitivity | How sensitive the column is; used to flag columns for masking |
| Confidence | How confident the model is in its classification |
Semantic profiling examines up to five sample values per column and is powered by an LLM prompt that evaluates naming conventions, value patterns, and domain context together.
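
To make the shape of this output concrete, here is a minimal sketch of a per-column result record and a helper that picks up to five sample values. The class name `SemanticProfile`, its field names, the string and float value types, and the choice of distinct non-null samples are all assumptions made for illustration. The LLM call itself is not shown.

```python
from dataclasses import dataclass

MAX_SAMPLES = 5  # the documentation states up to five sample values per column

@dataclass
class SemanticProfile:
    """Hypothetical record mirroring the fields in the table above."""
    column: str
    data_type: str          # storage type, e.g. "string"
    semantic_general: str   # broad category, e.g. "contact"
    semantic_specific: str  # precise type, e.g. "email"
    sensitivity: str        # assumed to be a label such as "high"
    confidence: float       # assumed to be a score between 0 and 1

def sample_values(values, limit=MAX_SAMPLES):
    """Return up to `limit` distinct non-null values in first-seen order."""
    seen = []
    for v in values:
        if v is not None and v not in seen:
            seen.append(v)
        if len(seen) == limit:
            break
    return seen

samples = sample_values(["a@x.com", None, "a@x.com", "b@x.com", "c@x.com"])
profile = SemanticProfile("contact_email", "string", "contact", "email", "high", 0.97)
```

The samples, together with the column name, are the kind of input a classification prompt could evaluate. The example `profile` values are made up.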

Sensitive column identification

Columns with a high sensitivity rating are automatically flagged as potentially containing PII or confidential data. These columns are visually marked in the dataset view and are candidates for masking. Sensitivity is assessed from the detected semantic type. For example, columns identified as email, phone, aadhar, pan, bank_account, or api_key are treated as sensitive.
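
The flagging rule described above can be sketched as a lookup against a set of sensitive semantic types. The set below contains only the examples named in this section and is not an exhaustive list. The function names are hypothetical.

```python
# Only the example types named in the text; the real list may be longer.
SENSITIVE_TYPES = {"email", "phone", "aadhar", "pan", "bank_account", "api_key"}

def is_sensitive(semantic_specific):
    """Return True if the specific semantic type is treated as sensitive."""
    return semantic_specific.lower() in SENSITIVE_TYPES

def flag_columns(profiles):
    """Given a mapping of column name to specific semantic type, list flagged columns."""
    return sorted(col for col, stype in profiles.items() if is_sensitive(stype))

flagged = flag_columns({
    "user_email": "email",
    "order_total": "amount",   # made-up non-sensitive type for illustration
    "service_key": "api_key",
})
```

In this example, `flagged` contains `service_key` and `user_email`, while `order_total` is left unflagged.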

Masking strategies

Once a sensitive column is identified, the profiler recommends a masking strategy. The following strategies are supported:
| Strategy | Description | Example |
|---|---|---|
| keep_last_n | Mask all but the last N characters | ****9012 |
| keep_prefix | Keep the first N characters, mask the rest | AB******* |
| mask_username_in_email | Mask the email username, preserve the domain | r***@example.com |
| numeric_random_same_length | Replace digits with random digits, preserve separators | +91-8234167890 |
| date_mask_parts | Mask year, month, or day portions of a date | ****-08-15 |
| hash_sha256 | Replace with a SHA-256 hash (optionally salted) | 3a5b9f2c... |
| tokenize_uuid | Replace with a random UUID | f47ac10b-... |
| none | No masking applied | |
Masking is applied in the UI so that sensitive values are not displayed to users in shared workspaces. The underlying data in your database is not modified.
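
The strategies above can be approximated with a few small string functions. This is a sketch under stated assumptions, not the product's implementation. The function signatures, default values, the choice to keep the first character of the email username with a fixed three-character mask, and the `YYYY-MM-DD` date format are all assumptions.

```python
import hashlib
import random
import uuid

def keep_last_n(value, n=4, mask="*"):
    """Mask all but the last n characters."""
    if n <= 0:
        return mask * len(value)
    return mask * max(len(value) - n, 0) + value[-n:]

def keep_prefix(value, n=2, mask="*"):
    """Keep the first n characters and mask the rest."""
    return value[:n] + mask * max(len(value) - n, 0)

def mask_username_in_email(value, mask="*"):
    """Keep the first username character and the domain (assumed behavior)."""
    username, sep, domain = value.partition("@")
    if not sep or not username:
        return mask * len(value)  # not a well-formed email: mask everything
    return username[0] + mask * 3 + sep + domain

def numeric_random_same_length(value, rng=random):
    """Replace each digit with a random digit; leave separators untouched."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)

def date_mask_parts(value, parts=("year",), mask="*"):
    """Mask selected parts of a date string, assumed to be YYYY-MM-DD."""
    fields = dict(zip(("year", "month", "day"), value.split("-")))
    return "-".join(mask * len(v) if k in parts else v for k, v in fields.items())

def hash_sha256(value, salt=""):
    """Replace the value with the hex SHA-256 digest of salt + value."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def tokenize_uuid(_value):
    """Replace the value with a random UUID that is unrelated to the input."""
    return str(uuid.uuid4())

masked_card = keep_last_n("123456789012")                 # "********9012"
masked_code = keep_prefix("AB1234567")                    # "AB*******"
masked_mail = mask_username_in_email("rahul@example.com") # "r***@example.com"
masked_date = date_mask_parts("1990-08-15")               # "****-08-15"
masked_phone = numeric_random_same_length("+91-8234167890")
```

Because these functions return new strings and never write back to the source, they are consistent with masking that happens only at display time. The sample inputs are made up.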

When profiling runs

Profiling runs automatically when:
  • A dataset is first discovered after connecting a datasource
  • You manually trigger a Rescan from the Assets view
You can view the profiling results for any column from the dataset detail view.