Semantic Type Detection and Data Profiling

Metadata/data identification Java library. Identifies Base Type (e.g. Boolean, Double, Long, String, LocalDate, LocalTime, ...) and Semantic Type (e.g. Gender, Age, Color, Country, ...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.

Design objectives:

Large set of built-in Semantic Types (extensible via JSON-defined plugins). Details.
Extensive Profiling metrics (e.g. Min, Max, Distinct, signatures, …)
Sufficiently fast to be used inline. See Speed notes below.
Minimal false positives for Semantic type detection. See Performance notes below.
Usable in Streaming, Bulk, or Record mode.
Broad country/language support - including US, Canada, Mexico, Brazil, UK, Australia, India, much of Europe, Japan and China.
Support for sharded analysis (i.e. Analysis results can be merged)
Once a stream is profiled, subsequent samples can be validated and/or new samples can be generated

Note: Date detection supports ~750 locales (no support for locales using non-Gregorian calendars or non-Arabic numerals).

Usage

FTA is available in Maven Central. Include it in your project with:

<dependency>
    <groupId>com.cobber.fta</groupId>
    <artifactId>fta</artifactId>
    <version>17.5.6</version>
</dependency>

Streaming Mode Example

Used when the source is inherently continuous, e.g. an IoT device, a flat file, etc.

	String[] inputs = {
				"Anaïs Nin", "Gertrude Stein", "Paul Cézanne", "Pablo Picasso", "Theodore Roosevelt",
				"Henri Matisse", "Georges Braque", "Henri de Toulouse-Lautrec", "Ernest Hemingway",
				"Alice B. Toklas", "Eleanor Roosevelt", "Edgar Degas", "Pierre-Auguste Renoir",
				"Claude Monet", "Édouard Manet", "Mary Cassatt", "Alfred Sisley",
				"Camille Pissarro", "Franklin Delano Roosevelt", "Winston Churchill" };

	// Use simple constructor - for improved detection provide an AnalyzerContext (see Contextual example).
	TextAnalyzer analysis = new TextAnalyzer("Famous");

	for (String input : inputs)
		analysis.train(input);

	TextAnalysisResult result = analysis.getResult();

	System.err.printf("Semantic Type: %s (%s)%n",
			result.getSemanticType(), result.getType());

	System.err.println("Detail: " + result.asJSON(true, 1));

Result: Semantic Type: NAME.FIRST_LAST (String)

Bulk Mode Example

Used when the source offers the ability to group at source, e.g. a Database. Bulk mode has two advantages: because the data is pre-aggregated the analysis is significantly faster, and the Semantic Type detection is not biased by a set of outliers present early in the analysis.

	TextAnalyzer analysis = new TextAnalyzer("Gender");
	HashMap<String, Long> basic = new HashMap<>();

	basic.put("Male", 2_000_000L);
	basic.put("Female", 1_000_000L);
	basic.put("Unknown", 10_000L);

	analysis.trainBulk(basic);

	TextAnalysisResult result = analysis.getResult();

	System.err.printf("Semantic Type: %s (%s)%n", result.getSemanticType(), result.getType());

	System.err.println("Detail: " + result.asJSON(true, 1));

Result: Semantic Type: GENDER.TEXT_EN (String)
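
If the source cannot group for you, the value-to-count map used in the Bulk example can be built in plain Java first. The following is a minimal sketch using only the standard library; the class name `BulkCounts` and the helper `countValues` are hypothetical and not part of FTA, and the sketch assumes the samples contain no nulls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of FTA: pre-aggregates raw samples for Bulk mode.
class BulkCounts {
	// Collapse raw samples into a map of value -> number of occurrences.
	// Assumption: no null samples (groupingBy rejects null keys).
	static Map<String, Long> countValues(List<String> samples) {
		return samples.stream()
				.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
	}

	public static void main(String[] args) {
		List<String> raw = Arrays.asList("Male", "Female", "Male", "Unknown", "Male", "Female");

		Map<String, Long> counts = countValues(raw);

		// The resulting map has the same shape as 'basic' in the Bulk example above.
		System.err.println(counts);
	}
}
```

The resulting `Map<String, Long>` has the same shape as the `basic` map passed to `trainBulk` above.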

Record Mode Example

Used when the primary objective is Semantic Type information and not profiling, or when the focus is on a subset of the data (e.g. fewer than MAX_CARDINALITY records). The advantage of using Record mode is that the Semantic Type detection is stronger and there is support for cross-stream analysis.

	String[] headers = { "First", "Last", "MI" };
	String[][] names = {
			{ "Anaïs", "Nin", "" }, { "Gertrude", "Stein", "" }, { "Paul", "Campbell", "" },
			{ "Pablo", "Picasso", "" }, { "Theodore", "Camp", "" }, { "Henri", "Matisse", "" },
			{ "Georges", "Braque", "" }, { "Ernest", "Hemingway", "" }, { "Alice", "Toklas", "B." },
			{ "Eleanor", "Roosevelt", "" }, { "Edgar", "Degas", "" }, { "Pierre-Auguste", "Wren", "" },
			{ "Claude", "Monet", "" }, { "Édouard", "Sorenson", "" }, { "Mary", "Dunning", "" },
			{ "Alfred", "Jones", "" }, { "Joseph", "Smith", "" }, { "Camille", "Pissarro", "" },
			{ "Franklin", "Roosevelt", "Delano" }, { "Winston", "Churchill", "" }
	};

	AnalyzerContext context = new AnalyzerContext(null, DateResolutionMode.Auto, "customer", headers );
	TextAnalyzer template = new TextAnalyzer(context);

	RecordAnalyzer analysis = new RecordAnalyzer(template);
	for (String [] name : names)
		analysis.train(name);

	RecordAnalysisResult recordResult = analysis.getResult();

	for (TextAnalysisResult result : recordResult.getStreamResults()) {
		System.err.printf("Semantic Type: %s (%s)%n", result.getSemanticType(), result.getType());
	}

Result:

Semantic Type: NAME.FIRST (String)<br>
Semantic Type: NAME.LAST (String)<br>
Semantic Type: NAME.MIDDLE (String)

Additional Examples

Additional examples are in the examples directory.

Date Format determination

If you are solely interested in determining the format of a date from a single sample, then the following example is a good starting point:

	final DateTimeParser dtp = new DateTimeParser().withDateResolutionMode(DateResolutionMode.MonthFirst).withLocale(Locale.ENGLISH);

	// Determine the DateTimeFormatter format string for the following examples
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("26 July 2012"));
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("March 9 2012"));
	// Note: Detected as MM/dd/yyyy despite being ambiguous, as we indicated MonthFirst above for cases with insufficient data
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("07/04/2012"));
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("2012 March 20"));
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("2012/04/09 18:24:12"));

	// Determine format of the input below and then parse it
	String input = "Wed Mar 04 05:09:06 GMT-06:00 2009";

	String formatString = dtp.determineFormatString(input);

	// Grab the DateTimeFormatter from fta as this creates a case-insensitive parser and it supports a slightly wider set of formats
	// For example, "yyyy" does not work out of the box if you use DateTimeFormatter.ofPattern
	DateTimeFormatter formatter = DateTimeParser.ofPattern(formatString);

	OffsetDateTime parsedDate = OffsetDateTime.parse(input, formatter);

	System.err.printf("Format is: '%s', Date is: '%s'%n", formatString, parsedDate.toString());

If you are interested in determining the format based on a set of inputs, then the following example is a good starting point:

	final DateTimeParser dtp = new DateTimeParser().withLocale(Locale.ENGLISH);

	final List<String> inputs = Arrays.asList( "10/1/2008", "10/2/2008", "10/3/2008", "10/4/2008", "10/5/2008", "10/10/2008" );

	inputs.forEach(dtp::train);

	// At this stage we are not sure of the date format, since with 'DateResolutionMode == None' we make no
	// assumption whether it is MM/DD or DD/MM and the format String is unbound (??/?/yyyy)
	System.err.println(dtp.getResult().getFormatString());

	// Once we train with another value which indicates that the Day must be the second field then the new
	// result is correctly determined to be MM/d/yyyy
	dtp.train("10/15/2008");
	System.err.println(dtp.getResult().getFormatString());

	// Once we train with another value which indicates that the Month is expressed using one or two digits the
	// result is correctly determined to be M/d/yyyy
	dtp.train("3/15/2008");
	System.err.println(dtp.getResult().getFormatString());
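
To see why a sample such as "10/15/2008" settles the field order, it can help to try both interpretations with the standard `java.time` API. The following is a minimal sketch that does not use FTA at all; the class name `DateAmbiguity` and the helper `tryParse` are hypothetical, and the two patterns are chosen for illustration only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical illustration, not part of FTA: why some samples are ambiguous and others are not.
class DateAmbiguity {
	static final DateTimeFormatter MONTH_FIRST = DateTimeFormatter.ofPattern("M/d/yyyy", Locale.ENGLISH);
	static final DateTimeFormatter DAY_FIRST = DateTimeFormatter.ofPattern("d/M/yyyy", Locale.ENGLISH);

	// Returns the parsed date, or null if the input is not valid for the formatter.
	static LocalDate tryParse(String input, DateTimeFormatter formatter) {
		try {
			return LocalDate.parse(input, formatter);
		} catch (DateTimeParseException e) {
			return null;
		}
	}

	public static void main(String[] args) {
		// "10/1/2008" is valid under both orders, so it cannot decide between them.
		// "10/15/2008" is only valid month-first, since there is no month 15.
		for (String input : new String[] { "10/1/2008", "10/15/2008" }) {
			System.err.printf("%s -> month-first: %s, day-first: %s%n",
					input, tryParse(input, MONTH_FIRST), tryParse(input, DAY_FIRST));
		}
	}
}
```

Every sample in the initial list parses under both patterns, which is consistent with the format remaining unresolved until "10/15/2008" is seen.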

Note: For Date Format determination you only need fta-core.jar.

Metrics

In addition to the input/configuration attributes:

streamName - Name of the input stream
dateResolutionMode - Mode used to determine how to resolve dates in the absence of adequate data. One of None, DayFirst, MonthFirst, or Auto.
compositeName - Name of the Composite the Stream is a member of (e.g. Table Name)
compositeStreamNames - Ordered list of the Composite Stream names (including streamName)
detectionLocale - Locale used to run the analysis (e.g. "en-US")
ftaVersion - Version of FTA used to generate analysis

There are a large number of metrics detected, which vary based on the type of the input stream.

<details> <summary><b>Supported Metrics</b></summary>

sampleCount - Number of samples observed (read Merging Analyzes for subtleties associated with merging)
matchCount - Number of samples that match the detected Base (or Semantic) type
nullCount - Number of null samples
blankCount - Number of blank samples
distinctCount - Number of distinct (valid) samples, typically -1 if maxCardinality exceeded. See Note 1.
regExp - A Regular Expression (Java) that matches the detected Type
confidence - The confidence (0-1.0) in the determination of the Type. If no Semantic Type is detected then the confidence reflects the confidence in the Base Type; if a Semantic Type is detected then the confidence reflects the confidence in the Semantic Type.
type - The Base Type (one of Boolean, Double,
