DataProfiler
What's in your data? Extract schema, statistics and entities from datasets
The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.
Loading data with a single command, the library automatically formats and loads files into a DataFrame. Profiling the data, it identifies the schema, statistics, entities (PII / NPI), and more. Data profiles can then be used in downstream applications or reports.
Getting started only takes a few lines of code (example csv):
import json
from dataprofiler import Data, Profiler
data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL
print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame
profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc
readable_report = profile.report(report_options={"output_format": "compact"})
print(json.dumps(readable_report, indent=4))
Note: The Data Profiler comes with a pre-trained deep learning model used to efficiently identify sensitive data (PII / NPI). If desired, it's easy to add new entities to the existing pre-trained model or to insert an entirely new pipeline for entity recognition.
For API documentation, visit the documentation page.
If you have suggestions or find a bug, please open an issue.
If you want to contribute, visit the contributing page.
Install
To install the full package from pypi: pip install DataProfiler[full]
If you want to install the ML dependencies without generating reports, use DataProfiler[ml]
If the ML requirements are too strict (say, you don't want to install TensorFlow), you can install a slimmer package with DataProfiler[reports]. The slimmer package disables the default sensitive data detection / entity recognition (labeler)
To install the base package from pypi: pip install DataProfiler
What is a Data Profile?
In the case of this library, a data profile is a dictionary containing statistics and predictions about the underlying dataset. There are "global statistics" (global_stats), which contain dataset-level data, and "column/row level statistics" (data_stats), where each column is a new key-value entry.
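To make the ratio-style global statistics concrete, here is a pure-Python sketch (not DataProfiler's implementation) computing a few of them by hand for a toy row-oriented dataset:

```python
# Illustrative sketch (not DataProfiler's implementation): computing a few
# of the global_stats fields by hand for a toy row-oriented dataset.
rows = [
    ["alice", "30", "nyc"],
    ["bob", None, "la"],      # row with a null value
    [None, None, None],       # fully null row
    ["alice", "30", "nyc"],   # duplicate of the first row
]

row_count = len(rows)
row_has_null_ratio = sum(any(v is None for v in r) for r in rows) / row_count
row_is_null_ratio = sum(all(v is None for v in r) for r in rows) / row_count
unique_rows = {tuple(r) for r in rows}
unique_row_ratio = len(unique_rows) / row_count
duplicate_row_count = row_count - len(unique_rows)

print(row_has_null_ratio, row_is_null_ratio)  # 0.5 0.25
print(unique_row_ratio, duplicate_row_count)  # 0.75 1
```

The library computes these over the sampled rows it actually profiles, so on large datasets the ratios are estimates based on samples_used.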
The format for a structured profile is below:
"global_stats": {
"samples_used": int,
"column_count": int,
"row_count": int,
"row_has_null_ratio": float,
"row_is_null_ratio": float,
"unique_row_ratio": float,
"duplicate_row_count": int,
"file_type": string,
"encoding": string,
"correlation_matrix": list[list[int]], (*)
"chi2_matrix": list[list[float]],
"profile_schema": {
string: list[int]
},
"times": dict[string, float],
},
"data_stats": [
{
"column_name": string,
"data_type": string,
"data_label": string,
"categorical": bool,
"order": string,
"samples": list[str],
"statistics": {
"sample_size": int,
"null_count": int,
"null_types": list[string],
"null_types_index": {
string: list[int]
},
"data_type_representation": dict[string, float],
"min": [null, float, str],
"max": [null, float, str],
"mode": float,
"median": float,
"median_absolute_deviation": float,
"sum": float,
"mean": float,
"variance": float,
"stddev": float,
"skewness": float,
"kurtosis": float,
"num_zeros": int,
"num_negatives": int,
"histogram": {
"bin_counts": list[int],
"bin_edges": list[float],
},
"quantiles": {
int: float
},
"vocab": list[char],
"avg_predictions": dict[string, float],
"data_label_representation": dict[string, float],
"categories": list[str],
"unique_count": int,
"unique_ratio": float,
"categorical_count": dict[string, int],
"gini_impurity": float,
"unalikeability": float,
"precision": {
'min': int,
'max': int,
'mean': float,
'var': float,
'std': float,
'sample_size': int,
'margin_of_error': float,
'confidence_level': float
},
"times": dict[string, float],
"format": string
},
"null_replication_metrics": {
"class_prior": list[int],
"class_sum": list[list[int]],
"class_mean": list[list[int]]
}
}
]
(*) The correlation matrix update is currently toggled off. It will be re-enabled in a later update. Users can still compute it as desired by setting the is_enable option to True.
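Once a structured report is generated, downstream applications can walk data_stats to act on the predictions. The sketch below uses a hand-built dictionary in the shape shown above (the label names are illustrative, not the library's full label set) to flag columns whose data_label looks sensitive:

```python
# Hedged sketch: walking a structured profile dict of the shape shown above
# to flag sensitive columns. The dict is hand-built for illustration and is
# not real DataProfiler output; label names here are assumptions.
SENSITIVE_LABELS = {"SSN", "CREDIT_CARD", "EMAIL_ADDRESS", "PHONE_NUMBER"}

profile_report = {
    "global_stats": {"column_count": 3, "row_count": 100},
    "data_stats": [
        {"column_name": "name", "data_type": "string", "data_label": "PERSON"},
        {"column_name": "ssn", "data_type": "string", "data_label": "SSN"},
        {"column_name": "age", "data_type": "int", "data_label": "INTEGER"},
    ],
}

flagged = [
    col["column_name"]
    for col in profile_report["data_stats"]
    if col["data_label"] in SENSITIVE_LABELS
]
print(flagged)  # ['ssn']
```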
The format for an unstructured profile is below:
"global_stats": {
"samples_used": int,
"empty_line_count": int,
"file_type": string,
"encoding": string,
"memory_size": float, # in MB
"times": dict[string, float],
},
"data_stats": {
"data_label": {
"entity_counts": {
"word_level": dict[string, int],
"true_char_level": dict[string, int],
"postprocess_char_level": dict[string, int]
},
"entity_percentages": {
"word_level": dict[string, float],
"true_char_level": dict[string, float],
"postprocess_char_level": dict[string, float]
},
"times": dict[string, float]
},
"statistics": {
"vocab": list[char],
"vocab_count": dict[string, int],
"words": list[string],
"word_count": dict[string, int],
"times": dict[string, float]
}
}
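The unstructured statistics fields can be approximated in plain Python; the sketch below mirrors vocab, vocab_count, words, and word_count for a short text (illustrative only, not the library's implementation):

```python
# Plain-Python approximation of the unstructured "statistics" fields above
# (illustrative; DataProfiler computes these internally during profiling).
from collections import Counter

text = "the quick brown fox jumps over the lazy dog"

chars = text.replace(" ", "")
vocab = sorted(set(chars))            # "vocab": distinct characters
vocab_count = dict(Counter(chars))    # "vocab_count": character frequencies
words = text.split()                  # "words": tokenized text
word_count = dict(Counter(words))     # "word_count": word frequencies

print(word_count["the"])  # 2
```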
The format for a graph profile is below:
"num_nodes": int,
"num_edges": int,
"categorical_attributes": list[string],
"continuous_attributes": list[string],
"avg_node_degree": float,
"global_max_component_size": int,
"continuous_distribution": {
"<attribute_1>": {
"name": string,
"scale": float,
"properties": list[float, np.array]
},
"<attribute_2>": None,
...
},
"categorical_distribution": {
"<attribute_1>": None,
"<attribute_2>": {
"bin_counts": list[int],
"bin_edges": list[float]
},
...
},
"times": dict[string, float]
Profile Statistic Descriptions
Structured Profile
global_stats:
samples_used - number of input data samples used to generate this profile
column_count - the number of columns contained in the input dataset
row_count - the number of rows contained in the input dataset
row_has_null_ratio - the proportion of rows that contain at least one null value to the total number of rows
row_is_null_ratio - the proportion of rows that are fully comprised of null values (null rows) to the total number of rows
unique_row_ratio - the proportion of distinct rows in the input dataset to the total number of rows
duplicate_row_count - the number of rows that occur more than once in the input dataset
file_type - the format of the file containing the input dataset (ex: .csv)
encoding - the encoding of the file containing the input dataset (ex: UTF-8)
correlation_matrix - matrix of shape column_count x column_count containing the correlation coefficients between each column in the dataset
chi2_matrix - matrix of shape column_count x column_count containing the chi-square statistics between each column in the dataset
profile_schema - a description of the format of the input dataset, labeling each column and its index in the dataset
    string - the label of the column in question and its index in the profile schema
times - the duration of time it took to generate the global statistics for this dataset in milliseconds
data_stats:
column_name - the label/title of this column in the input dataset
data_type - the primitive Python data type contained within this column
data_label - the label/entity of the data in this column as determined by the Labeler component
categorical - 'true' if this column contains categorical data
order - the way in which the data in this column is ordered, if any, otherwise "random"
samples - a small subset of data entries from this column
statistics - statistical information on the column
    sample_size - number of input data samples used to generate this profile
    null_count - the number of null entries in the sample
    null_types - a list of the different null types present within this sample
    `nu
