DataGene - Identify How Similar TS Datasets Are to One Another (by @firmai)

DataGene - Data Transformations and Similarity Statistics


DataGene is developed to detect and compare dataset similarity between real and synthetic datasets, as well as between train, test, and validation datasets. You can read the report on SSRN for additional details. Datasets can largely be compared using quantitative and visual methods. Generated data can take on many formats and can consist of multiple dimensions of various widths and heights. Original and generated datasets have to be transformed into an acceptable format before they can be compared, and these transformations sometimes lead to a reduction in array dimensions. There are two reasons why we might want to reduce array dimensions: the first is to establish an acceptable format for distance calculations; the second is the preference for comparing like with like. You can use MTSS-GAN to generate diverse multivariate time series data using stacked generative adversarial networks in combination with embedding and recurrent neural network models.

https://ssrn.com/abstract=3619626


Installation and import modules:

pip install datagene

For now, you also have to install the following package until we find an alternative:

pip install git+https://github.com/FirmAI-Research/ecopy.git

from datagene import distance as dist          # Distance Functions
from datagene import transform as tran         # Transformation Functions
from datagene import mod_utilities as mod      # Model Development Utilities
from datagene import dist_utilities as distu   # Distance Utilities
from datagene import vis_utilities as visu     # Visualisation Utility Functions

(A) Transformations (Colab):


  1. From Tesseract

    1. To Tensor & Matrix
      • Matrix Product State
  2. From Tensor

    1. To Tesseract

      • Multivariate Gramian Angular Encoding
      • Multivariate Recurrence Plot
      • Multivariate Markov Transition Fields
    2. To Tensor

      • Matrix Product State
      • Recurrence Plot
    3. To Matrix

      • Aggregates
      • Tucker
      • CANDECOMP
      • Sample PCA
  3. From Matrix

    1. To Tensor

      • Recurrence Plot
      • Gramian Angular Field
      • Markov Transition Field
    2. To Matrix

      • PCA
      • SVD
      • QR
      • Feature Kernels
      • Covariance
      • Correlation Matrix
      • 2D Histogram
      • Pairwise Distance
      • Pairwise Recurrence Plot
    3. To Vector

      • PCA Single Component
      • Histogram Filter
  4. From Vector

    1. To Matrix
      • Signatures Method
    2. To Vector
      • Extraction
      • Autocorrelation

(B) Visualisations (Colab):


  1. Convert Arrays to Images
  2. Histogram
  3. Signature
  4. Gramian
  5. Recurrence
  6. Markov Transition Fields
  7. Correlation Matrix
  8. Pairplot
  9. Chord Length

(C) Distance Measures (Colab):


  1. Tensor/Matrix
    1. Contribution Values
      1. Predictions
      2. Feature Ordering
      3. Direction Divergence
      4. Effect Size
  2. Matrix
    1. Structural Similarity
    2. Similarity Histogram
    3. Hash Similarity
    4. Distance Matrix Hypothesis Test
    5. Dissimilarity Measures
    6. Statistical and Geometric Measures
  3. Vectors
    1. PCA Extracted Variance Explained
    2. Statistical and Geometric Distances
    3. Geometric Distance Feature Map
    4. Curve Metrics
    5. Curve Metrics Feature Map
    6. Hypotheses Distance

In this example, the first step is to generate various datasets. See this notebook for an example of generating synthetic datasets, by Turing Fellow Mihaela van der Schaar and researchers Jinsung Yoon and Daniel Jarrett. As soon as we have these datasets, we load them into a list, starting with the original data.

As of now, this package caters to time-series regression tasks, and more specifically to input arrays with a three-dimensional structure. The hope is that it will be extended to time-series classification and to cross-sectional regression and classification tasks. The package can still be used for other tasks, but some functions won't apply. To run the package interactively, use this notebook.

datasets = [org, gen_1, gen_2]
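
As a minimal sketch of the list above, the placeholder arrays below stand in for real and generated data. The names org, gen_1 and gen_2 come from the line above; the shapes, the random values and the use of NumPy are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 100 samples, 24 time steps, 5 features.
n_samples, n_steps, n_features = 100, 24, 5

# Placeholder for the original data; in practice, load your own array here.
org = rng.normal(size=(n_samples, n_steps, n_features))

# Placeholders for two generated datasets: noisy copies of the original.
gen_1 = org + rng.normal(scale=0.1, size=org.shape)
gen_2 = org + rng.normal(scale=0.5, size=org.shape)

# The original data comes first, followed by the generated datasets.
datasets = [org, gen_1, gen_2]
```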

Citation:

@software{datagene,
  title = {{DataGene}: Data Transformation and Similarity Statistics},
  author = {Snow, Derek},
  url = {https://github.com/firmai/datagene},
  version = {0.0.4},
  date = {2020-05-11},
}

 

Transformation Recipes

You can work with 2D and 3D generated data. The notebook excerpted in this document uses a 3D time series array. Data has to be organised as samples, time steps, features: [i,s,f]. If you are working with a 2D array, the data has to be organised as samples, features: [i,f].
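
The layout described above can be checked before any transformation is run. The helper below is a hypothetical sketch in plain NumPy and is not part of the datagene package; describe_layout and the example array sizes are names and values chosen here for illustration.

```python
import numpy as np

def describe_layout(arr):
    """Return the axis sizes of a 2D [i, f] or 3D [i, s, f] array."""
    arr = np.asarray(arr)
    if arr.ndim == 3:
        samples, steps, features = arr.shape
        return {"samples": samples, "time_steps": steps, "features": features}
    if arr.ndim == 2:
        samples, features = arr.shape
        return {"samples": samples, "features": features}
    raise ValueError(f"expected a 2D or 3D array, got {arr.ndim}D")

# If the data is stored as [features, time steps, samples], np.transpose
# reorders the axes into [samples, time steps, features].
wrong_order = np.zeros((5, 24, 100))
fixed = np.transpose(wrong_order, (2, 1, 0))
print(describe_layout(fixed))
```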

This first recipe uses six arbitrary transformations to identify the similarity of datasets. As an analogy, imagine you're importing similar-looking oranges from two different countries, and you want to see whether the constitution of these oranges differs from the local variety your customers have gotten used to. To do that you might follow a six-step process: first you press the oranges for pulp, then you boil the pulp, then you sift the pulp out and drain the juice, you add apple juice to the pulp, and you add an orange concentrate back to the pulp. Finally, you dry the concoction on a translucent petri dish and shine light through it to identify differences in patterns between the oranges using various distance metrics. You might want to repeat the process multiple times to establish an average and possibly even a significance score. The transformation part is the process we put the data through so that it is ready for similarity calculations.
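
The idea of a recipe, a chain of transformations followed by a distance calculation, can be sketched as follows. This is an illustration in plain NumPy with two stand-in steps rather than six; mean_over_time, correlation_matrix, apply_recipe and frobenius_distance are hypothetical helpers defined here, not datagene functions, and the data is random.

```python
import numpy as np

def mean_over_time(arr3d):
    # [samples, time steps, features] -> [samples, features]
    return arr3d.mean(axis=1)

def correlation_matrix(arr2d):
    # [samples, features] -> [features, features]
    return np.corrcoef(arr2d, rowvar=False)

def apply_recipe(arr, steps):
    # Run the array through each transformation in order.
    for step in steps:
        arr = step(arr)
    return arr

def frobenius_distance(a, b):
    # Frobenius norm of the difference between two matrices.
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(0)
org = rng.normal(size=(200, 24, 5))
gen_close = org + rng.normal(scale=0.05, size=org.shape)  # near copy
gen_far = rng.normal(size=org.shape)                      # unrelated data

recipe = [mean_over_time, correlation_matrix]
ref = apply_recipe(org, recipe)
d_close = frobenius_distance(ref, apply_recipe(gen_close, recipe))
d_far = frobenius_distance(ref, apply_recipe(gen_far, recipe))
print(d_close, d_far)
```

Every dataset goes through the same recipe, so the final distance compares like with like.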

From Tesseract:

tran.mps_decomp_4_to_2() - Matrix product states are the de facto standard for the representation of one-dimensional quantum many-body states.

From Tensor:

tran.gaf_encode_3_to_4() - A Gramian Angular Field is an image obtained from a time series, representing some temporal correlation between each time point.
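
To make the encoding concrete, here is a minimal NumPy sketch of a Gramian Angular Summation Field for a single univariate series. It illustrates the general technique only; how tran.gaf_encode_3_to_4() applies it across samples and features is not shown here, and the function name below is hypothetical.

```python
import numpy as np

def gramian_angular_summation_field(series):
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined for every point.
    scaled = np.clip(2.0 * (x - lo) / (hi - lo) - 1.0, -1.0, 1.0)
    # Polar encoding: each value becomes an angle.
    phi = np.arccos(scaled)
    # Entry (i, j) is cos(phi_i + phi_j).
    return np.cos(phi[:, None] + phi[None, :])

series = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))
gaf = gramian_angular_summation_field(series)
print(gaf.shape)
```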

tran.mrp_encode_3_to_4() - Recurrence Plots are a way to visualize the behavior of a trajectory of a dynamical system in phase space.
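
A recurrence plot in its simplest form marks the pairs of time points whose values lie within a threshold of each other. The sketch below uses a one-dimensional series without time-delay embedding, which is a simplifying assumption; recurrence_plot and the threshold value are chosen for illustration and are not the datagene implementation.

```python
import numpy as np

def recurrence_plot(series, threshold):
    x = np.asarray(series, dtype=float)
    # Pairwise absolute distance between all time points.
    dist = np.abs(x[:, None] - x[None, :])
    # 1 where two points are closer than the threshold, else 0.
    return (dist <= threshold).astype(int)

series = np.array([0.0, 0.1, 1.0, 0.05, 0.9])
rp = recurrence_plot(series, threshold=0.15)
print(rp)
```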

tran.mtf_encode_3_to_4() - A Markov Transition Field is an image obtained from a time series, representing a field of transition probabilities for a discretized time series.
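
The construction can be sketched for a single series in plain NumPy: discretise the series into quantile bins, estimate the bin-to-bin transition probabilities, and spread those probabilities over all pairs of time points. The function name and the choice of four bins are assumptions for illustration; this is not the datagene implementation.

```python
import numpy as np

def markov_transition_field(series, n_bins=4):
    x = np.asarray(series, dtype=float)
    # Interior quantile edges, then a bin index (0 .. n_bins - 1) per point.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)
    # Count transitions between consecutive time points.
    counts = np.zeros((n_bins, n_bins))
    for a, b in zip(bins[:-1], bins[1:]):
        counts[a, b] += 1
    # Row-normalise into transition probabilities; empty rows stay zero.
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(
        counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0
    )
    # Entry (i, j): probability of moving from the bin of point i
    # to the bin of point j.
    return probs[bins[:, None], bins[None, :]]

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=50))
mtf = markov_transition_field(series)
print(mtf.shape)
```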

tran.jrp_encode_3_to_3() - A joint recurrence plot (JRP) is a graph which shows all those times at which a recurrence in one dynamical system occurs simultaneously with a recurrence in a second dynamical system.
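
For two series of equal length, a joint recurrence plot is the element-wise product of their individual recurrence plots. The sketch below assumes one-dimensional series without embedding and uses hypothetical helper names and thresholds; how the library combines features inside a 3D array is not shown here.

```python
import numpy as np

def recurrence_plot(series, threshold):
    x = np.asarray(series, dtype=float)
    dist = np.abs(x[:, None] - x[None, :])
    return (dist <= threshold).astype(int)

def joint_recurrence_plot(series_a, series_b, threshold_a, threshold_b):
    # A joint recurrence occurs only where both systems recur.
    return recurrence_plot(series_a, threshold_a) * recurrence_plot(
        series_b, threshold_b
    )

a = np.array([0.0, 0.1, 1.0])
b = np.array([0.0, 1.0, 0.9])
jrp = joint_recurrence_plot(a, b, threshold_a=0.2, threshold_b=0.2)
print(jrp)
```

In this example the first series recurs between points 0 and 1 and the second between points 1 and 2, so no off-diagonal recurrence is shared.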

tran.mean_3_to_2() - Mean aggregation at the sample level.

tran.sum_3_to_2() - Sum aggregation at the sample level.

tran.min_3_to_2() - Minimum aggregation at the sample level.

tran.var_3_to_2() - Variance aggregation at the sample level.
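
The four aggregations above can be illustrated with plain NumPy reductions. The sketch assumes that aggregating at the sample level means collapsing the time-step axis, so that [samples, time steps, features] becomes [samples, features]; that reading is an assumption, not something confirmed by the package.

```python
import numpy as np

# Toy array: 2 samples, 3 time steps, 4 features.
arr = np.arange(24, dtype=float).reshape(2, 3, 4)

# Assumption: reduce over axis 1 (time steps) for each sample.
agg_mean = arr.mean(axis=1)
agg_sum = arr.sum(axis=1)
agg_min = arr.min(axis=1)
agg_var = arr.var(axis=1)

print(agg_mean.shape)
```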

tran.mps_decomp_3_to_2() - Matrix product states are the de facto standard for the representation of one-dimensional quantum many-body states.

tran.tucker_decomp_3_to_2() - Tucker decomposition decomposes a tensor into a set of matrices and one small core tensor.
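
One way to compute a Tucker decomposition is the higher-order SVD (HOSVD), sketched below in plain NumPy. This is a stand-in chosen for illustration; the package may use a different algorithm, and how it reduces the result to a 2D matrix is not shown. All function names here are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    # Move the chosen axis to the front and flatten the rest.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    # Multiply `matrix` (rows x n_mode) into axis `mode` of `tensor`.
    moved = np.moveaxis(tensor, mode, -1)
    return np.moveaxis(moved @ matrix.T, -1, mode)

def hosvd(tensor, ranks):
    # One factor matrix per mode: leading left singular vectors
    # of the mode unfolding.
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    # Project the tensor onto the factors to obtain the small core.
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)
    return core, factors

def reconstruct(core, factors):
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out

rng = np.random.default_rng(0)
tensor = rng.normal(size=(4, 3, 2))
core, factors = hosvd(tensor, ranks=(4, 3, 2))
print(core.shape, [f.shape for f in factors])
```

With full ranks the reconstruction is exact; smaller ranks give a compressed core and an approximation of the original tensor.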

tran.parafac_decomp_3_to_2() - The PARAFAC decomposition may be regarded as a generalization of the matrix singular value decomposition, but for tensors.

tran.pca_decomp_3_to_2() - Long to wide array conversion with a PCA Decomposition.

From Matrix:

tran.rp_encode_2_to_3() - Recurrence Plots are a way to visualize the behavior of a trajectory of a dynamical system in phase space.

tran.gaf_encode_2_to_3() - A Gramian Angular Field is an image obtained from a time series, representing some temporal correlation between each time point.

tran.mtf_encode_2_to_3() - A Markov Transition Field is an image obtained from a time series, representing a field of transition probabilities for a discretized time series.

tran.pca_decomp_2_to_2() - Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set.
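
As a sketch of what PCA does to a [samples, features] matrix, the snippet below centres the data and projects it onto its leading principal axes using NumPy's SVD. pca_scores is a hypothetical helper written for this example and is not the datagene implementation; the toy data is random.

```python
import numpy as np

def pca_scores(matrix, n_components):
    x = np.asarray(matrix, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
noise = rng.normal(scale=0.01, size=(200, 2))
# Three features, but almost all variation lies along one direction.
data = np.column_stack([t, 2.0 * t + noise[:, 0], noise[:, 1]])

scores, explained = pca_scores(data, n_components=2)
print(scores.shape, explained)
```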

tran.svd_decomp_2_to_2() - Singular value decomposition (SVD) is a factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix.
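
The factorization can be tried directly with np.linalg.svd. The sketch below also keeps only the leading singular values to form a low-rank approximation; truncated_svd is a hypothetical helper, and which factors the package returns is not shown here.

```python
import numpy as np

def truncated_svd(matrix, rank):
    # Thin SVD, keeping only the `rank` largest singular values.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank]

a = np.array([[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]])

u, s, vt = truncated_svd(a, rank=2)
reconstructed = u @ np.diag(s) @ vt      # exact at full rank

u1, s1, vt1 = truncated_svd(a, rank=1)
rank_one = u1 @ np.diag(s1) @ vt1        # best rank-1 approximation
print(s)
```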

tran.qr_decomp_2_to_2() - QR decomposition (also called the QR factorization) of a matrix is a decomposition of the matrix into an orthogonal matrix and a triangular matrix.
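
A short NumPy check of the two properties named above, an orthogonal factor and a triangular factor. The example matrix is arbitrary, and which factor the package returns is not shown here.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Reduced mode (the default): q is 3x2 with orthonormal columns,
# r is 2x2 and upper triangular.
q, r = np.linalg.qr(a)

print(np.allclose(q @ r, a))
```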

tran.lik_kernel_2_to_2() - A special case of polynomial_kernel with degree=1 and coef0=0.

tran.cos_kernel_2_to_2() - The chi-squared kernel is a very popular choice for training non-linear SVMs in computer vision applications.

tran.pok_kernel_2_to_2() - The function polynomial_kernel computes the degree-d polynomial kernel between two vectors.
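
The relationship between the linear and polynomial kernels can be written out in plain NumPy. The helpers lin_kernel and poly_kernel are defined here for illustration and do not call scikit-learn or datagene; the default gamma of 1.0 is an assumption made for this sketch.

```python
import numpy as np

def poly_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    # (gamma * <x, y> + coef0) ** degree for every pair of rows.
    return (gamma * (x @ y.T) + coef0) ** degree

def lin_kernel(x, y):
    # Plain inner product between every pair of rows.
    return x @ y.T

x = np.array([[1.0, 2.0], [3.0, 4.0]])

k_lin = lin_kernel(x, x)
k_poly1 = poly_kernel(x, x, degree=1, gamma=1.0, coef0=0.0)
k_poly2 = poly_kernel(x, x, degree=2, gamma=1.0, coef0=1.0)
print(k_lin)
```

With degree=1, coef0=0 and gamma=1, the polynomial kernel reduces to the linear kernel.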

tran.lak_kernel_2_to_2() - The function laplacian_kernel
