DataProfiler

What's in your data? Extract schema, statistics and entities from datasets

<p text-align="left"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://github.com/capitalone/DataProfiler/raw/gh-pages/docs/source/_static/images/DataProfilerDarkLogoLong.png"> <source media="(prefers-color-scheme: light)" srcset="https://github.com/capitalone/DataProfiler/raw/gh-pages/docs/source/_static/images/DataProfilerLogoLightThemeLong.png"> <img alt="Shows a black logo in light color mode and a white one in dark color mode." src="https://user-images.githubusercontent.com/25423296/163456779-a8556205-d0a5-45e2-ac17-42d089e3c3f8.png"> </picture> </p>

Data Profiler | What's in your data?

The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

Loading data: with a single command, the library automatically formats and loads files into a DataFrame.

Profiling the data: the library identifies the schema, statistics, entities (PII / NPI), and more.

Data profiles can then be used in downstream applications or reports.

Getting started only takes a few lines of code (example csv):

import json
from dataprofiler import Data, Profiler

data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc

readable_report = profile.report(report_options={"output_format": "compact"})

print(json.dumps(readable_report, indent=4))

Note: The Data Profiler comes with a pre-trained deep learning model, used to efficiently identify sensitive data (PII / NPI). If desired, it's easy to add new entities to the existing pre-trained model or insert an entirely new pipeline for entity recognition.

For API documentation, visit the documentation page.

If you have suggestions or find a bug, please open an issue.

If you want to contribute, visit the contributing page.


Install

To install the full package from PyPI: pip install DataProfiler[full]

If you want to install the ML dependencies without generating reports, use DataProfiler[ml].

If the ML requirements are too strict (say, you don't want to install tensorflow), you can install a slimmer package with DataProfiler[reports]. The slimmer package disables the default sensitive data detection / entity recognition (labeler).

Install from PyPI: pip install DataProfiler


What is a Data Profile?

In this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. It has two parts: "global statistics" (global_stats), which contain dataset-level data, and "column/row level statistics" (data_stats), where each column is a new key-value entry.

The format for a structured profile is below:

"global_stats": {
    "samples_used": int,
    "column_count": int,
    "row_count": int,
    "row_has_null_ratio": float,
    "row_is_null_ratio": float,
    "unique_row_ratio": float,
    "duplicate_row_count": int,
    "file_type": string,
    "encoding": string,
    "correlation_matrix": list[list[int]], (*)
    "chi2_matrix": list[list[float]],
    "profile_schema": {
        string: list[int]
    },
    "times": dict[string, float],
},
"data_stats": [
    {
        "column_name": string,
        "data_type": string,
        "data_label": string,
        "categorical": bool,
        "order": string,
        "samples": list[str],
        "statistics": {
            "sample_size": int,
            "null_count": int,
            "null_types": list[string],
            "null_types_index": {
                string: list[int]
            },
            "data_type_representation": dict[string, float],
            "min": [null, float, str],
            "max": [null, float, str],
            "mode": float,
            "median": float,
            "median_absolute_deviation": float,
            "sum": float,
            "mean": float,
            "variance": float,
            "stddev": float,
            "skewness": float,
            "kurtosis": float,
            "num_zeros": int,
            "num_negatives": int,
            "histogram": {
                "bin_counts": list[int],
                "bin_edges": list[float],
            },
            "quantiles": {
                int: float
            },
            "vocab": list[char],
            "avg_predictions": dict[string, float],
            "data_label_representation": dict[string, float],
            "categories": list[str],
            "unique_count": int,
            "unique_ratio": float,
            "categorical_count": dict[string, int],
            "gini_impurity": float,
            "unalikeability": float,
            "precision": {
                'min': int,
                'max': int,
                'mean': float,
                'var': float,
                'std': float,
                'sample_size': int,
                'margin_of_error': float,
                'confidence_level': float
            },
            "times": dict[string, float],
            "format": string
        },
        "null_replication_metrics": {
            "class_prior": list[int],
            "class_sum": list[list[int]],
            "class_mean": list[list[int]]
        }
    }
]

(*) Currently the correlation matrix update is toggled off. It will be reset in a later update. Users can still use it as desired with the is_enabled option set to True.
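
As a rough illustration of how the structured format above can be navigated, the sketch below walks a hand-written dictionary that mimics a small part of the layout. The report contents and the helper name null_ratio_by_column are assumptions made for this example and are not part of the library.

```python
# Hand-written stand-in for a structured report; the values are illustrative only.
report = {
    "global_stats": {"samples_used": 4, "column_count": 2, "row_count": 4},
    "data_stats": [
        {
            "column_name": "age",
            "data_type": "int",
            "statistics": {"sample_size": 4, "null_count": 1},
        },
        {
            "column_name": "city",
            "data_type": "string",
            "statistics": {"sample_size": 4, "null_count": 0},
        },
    ],
}


def null_ratio_by_column(report):
    """Map each column name to null_count / sample_size (hypothetical helper)."""
    ratios = {}
    for column in report["data_stats"]:  # one entry per column
        stats = column["statistics"]
        size = stats["sample_size"]
        ratios[column["column_name"]] = stats["null_count"] / size if size else 0.0
    return ratios


column_null_ratios = null_ratio_by_column(report)
print(column_null_ratios)
```

The same traversal should apply to a real report dictionary, provided the keys it reads are present in the chosen output format.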

The format for an unstructured profile is below:

"global_stats": {
    "samples_used": int,
    "empty_line_count": int,
    "file_type": string,
    "encoding": string,
    "memory_size": float, # in MB
    "times": dict[string, float],
},
"data_stats": {
    "data_label": {
        "entity_counts": {
            "word_level": dict[string, int],
            "true_char_level": dict[string, int],
            "postprocess_char_level": dict[string, int]
        },
        "entity_percentages": {
            "word_level": dict[string, float],
            "true_char_level": dict[string, float],
            "postprocess_char_level": dict[string, float]
        },
        "times": dict[string, float]
    },
    "statistics": {
        "vocab": list[char],
        "vocab_count": dict[string, int],
        "words": list[string],
        "word_count": dict[string, int],
        "times": dict[string, float]
    }
}
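
To show how the entity counts in the unstructured format might be consumed, here is a minimal sketch over a hand-written data_stats dictionary. The entity names, the counts, and the helper top_entities are assumptions for illustration only.

```python
# Hand-written stand-in for the "data_stats" part of an unstructured report.
# Entity names and counts are made up for this example.
data_stats = {
    "data_label": {
        "entity_counts": {
            "word_level": {"UNKNOWN": 10, "ADDRESS": 3, "PHONE_NUMBER": 1},
        }
    }
}


def top_entities(data_stats, level="word_level", n=2):
    """Return the n most frequent entities at the given level (hypothetical helper)."""
    counts = data_stats["data_label"]["entity_counts"][level]
    # Sort by count, largest first, and keep the first n (label, count) pairs.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


top = top_entities(data_stats)
print(top)
```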

The format for a graph profile is below:

"num_nodes": int,
"num_edges": int,
"categorical_attributes": list[string],
"continuous_attributes": list[string],
"avg_node_degree": float,
"global_max_component_size": int,
"continuous_distribution": {
    "<attribute_1>": {
        "name": string,
        "scale": float,
        "properties": list[float, np.array]
    },
    "<attribute_2>": None,
    ...
},
"categorical_distribution": {
    "<attribute_1>": None,
    "<attribute_2>": {
        "bin_counts": list[int],
        "bin_edges": list[float]
    },
    ...
},
"times": dict[string, float]

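In the graph format above, each attribute appears in both distribution sections, with None where that kind of distribution does not apply. The sketch below lists the attributes that carry a distribution in a given section. The profile contents (attribute names, "norm", the numbers) and the helper fitted_attributes are assumptions for illustration.

```python
# Hand-written stand-in for a graph profile; all values are illustrative only.
graph_profile = {
    "num_nodes": 4,
    "num_edges": 3,
    "continuous_attributes": ["weight"],
    "categorical_attributes": ["kind"],
    "continuous_distribution": {
        "weight": {"name": "norm", "scale": 1.5, "properties": [0.0, 1.0]},
        "kind": None,
    },
    "categorical_distribution": {
        "weight": None,
        "kind": {"bin_counts": [2, 1], "bin_edges": [0.0, 0.5, 1.0]},
    },
}


def fitted_attributes(profile, section):
    """Names of attributes whose entry in the section is not None (hypothetical helper)."""
    return [name for name, dist in profile[section].items() if dist is not None]


continuous = fitted_attributes(graph_profile, "continuous_distribution")
categorical = fitted_attributes(graph_profile, "categorical_distribution")
print(continuous, categorical)
```
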
Profile Statistic Descriptions

Structured Profile

global_stats:

  • samples_used - number of input data samples used to generate this profile
  • column_count - the number of columns contained in the input dataset
  • row_count - the number of rows contained in the input dataset
  • row_has_null_ratio - the proportion of rows that contain at least one null value to the total number of rows
  • row_is_null_ratio - the proportion of rows that are fully comprised of null values (null rows) to the total number of rows
  • unique_row_ratio - the proportion of distinct rows in the input dataset to the total number of rows
  • duplicate_row_count - the number of rows that occur more than once in the input dataset
  • file_type - the format of the file containing the input dataset (ex: .csv)
  • encoding - the encoding of the file containing the input dataset (ex: UTF-8)
  • correlation_matrix - matrix of shape column_count x column_count containing the correlation coefficients between each column in the dataset
  • chi2_matrix - matrix of shape column_count x column_count containing the chi-square statistics between each column in the dataset
  • profile_schema - a description of the format of the input dataset labeling each column and its index in the dataset
    • string - the label of the column in question and its index in the profile schema
  • times - the duration of time it took to generate the global statistics for this dataset in milliseconds
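
The row-level ratios described above can be worked through by hand on a tiny dataset. The sketch below follows the written definitions only and treats None as the sole null value, which is a simplifying assumption. It is not the library's implementation, and the helper name row_ratios is made up for this example.

```python
# Four rows, two columns; None stands in for a null value (simplifying assumption).
rows = [
    (1, "a"),
    (1, "a"),
    (2, None),
    (None, None),
]


def row_ratios(rows):
    """Compute the three row-level ratios from their written definitions."""
    total = len(rows)
    if total == 0:
        return {}
    # Rows with at least one null value.
    has_null = sum(1 for row in rows if any(value is None for value in row))
    # Rows made up entirely of null values.
    is_null = sum(1 for row in rows if all(value is None for value in row))
    # Distinct rows.
    distinct = len(set(rows))
    return {
        "row_has_null_ratio": has_null / total,
        "row_is_null_ratio": is_null / total,
        "unique_row_ratio": distinct / total,
    }


ratios = row_ratios(rows)
print(ratios)
```

Here two of four rows contain a null, one of four is entirely null, and three of four rows are distinct.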

data_stats:

  • column_name - the label/title of this column in the input dataset
  • data_type - the primitive Python data type that is contained within this column
  • data_label - the label/entity of the data in this column as determined by the Labeler component
  • categorical - ‘true’ if this column contains categorical data
  • order - the way in which the data in this column is ordered, if any, otherwise “random”
  • samples - a small subset of data entries from this column
  • statistics - statistical information on the column
    • sample_size - number of input data samples used to generate this profile
    • null_count - the number of null entries in the sample
    • null_types - a list of the different null types present within this sample
    • `nu