Fta

Metadata/data identification Java library. Identifies Semantic Type information (e.g. Gender, Age, Color, Country,...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.

Semantic Type Detection and Data Profiling

Metadata/data identification Java library. Identifies Base Type (e.g. Boolean, Double, Long, String, LocalDate, LocalTime, ...) and Semantic Type (e.g. Gender, Age, Color, Country, ...). Extensive country/language support. Extensible via user-defined plugins. Comprehensive Profiling support.

Design objectives:

  • Large set of built-in Semantic Types (extensible via JSON-defined plugins).
  • Extensive Profiling metrics (e.g. Min, Max, Distinct, signatures, …)
  • Sufficiently fast to be used inline. See Speed notes below.
  • Minimal false positives for Semantic Type detection. See Performance notes below.
  • Usable in Streaming, Bulk, or Record mode.
  • Broad country/language support, including US, Canada, Mexico, Brazil, UK, Australia, India, much of Europe, Japan and China.
  • Support for sharded analysis (i.e. analysis results can be merged)
  • Once a stream is profiled, subsequent samples can be validated and/or new samples can be generated

Note: Date detection supports ~750 locales (no support for locales using non-Gregorian calendars or non-Arabic numerals).

Usage

FTA is available in Maven Central. Include it in your project with:

<dependency>
    <groupId>com.cobber.fta</groupId>
    <artifactId>fta</artifactId>
    <version>17.5.6</version>
</dependency>

Streaming Mode Example

Used when the source is inherently continuous, e.g. an IoT device or a flat file.

	String[] inputs = {
				"Anaïs Nin", "Gertrude Stein", "Paul Cézanne", "Pablo Picasso", "Theodore Roosevelt",
				"Henri Matisse", "Georges Braque", "Henri de Toulouse-Lautrec", "Ernest Hemingway",
				"Alice B. Toklas", "Eleanor Roosevelt", "Edgar Degas", "Pierre-Auguste Renoir",
				"Claude Monet", "Édouard Manet", "Mary Cassatt", "Alfred Sisley",
				"Camille Pissarro", "Franklin Delano Roosevelt", "Winston Churchill" };

	// Use simple constructor - for improved detection provide an AnalyzerContext (see Contextual example).
	TextAnalyzer analysis = new TextAnalyzer("Famous");

	for (String input : inputs)
		analysis.train(input);

	TextAnalysisResult result = analysis.getResult();

	System.err.printf("Semantic Type: %s (%s)%n",
			result.getSemanticType(), result.getType());

	System.err.println("Detail: " + result.asJSON(true, 1));

Result: Semantic Type: NAME.FIRST_LAST (String)

Bulk Mode Example

Used when the source can group values itself, e.g. a database. Bulk mode has two advantages: because the data is pre-aggregated the analysis is significantly faster, and Semantic Type detection is not biased by a set of outliers that happens to appear early in the analysis.

	TextAnalyzer analysis = new TextAnalyzer("Gender");
	HashMap<String, Long> basic = new HashMap<>();

	basic.put("Male", 2_000_000L);
	basic.put("Female", 1_000_000L);
	basic.put("Unknown", 10_000L);

	analysis.trainBulk(basic);

	TextAnalysisResult result = analysis.getResult();

	System.err.printf("Semantic Type: %s (%s)%n", result.getSemanticType(), result.getType());

	System.err.println("Detail: " + result.asJSON(true, 1));

Result: Semantic Type: GENDER.TEXT_EN (String)
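
When the source cannot group for you, the same value-to-count map can be built in memory before it is handed to trainBulk. The following is a minimal sketch using only the JDK; the class name BulkCounts and the helper countValues are hypothetical and are not part of FTA.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of FTA: pre-aggregates raw samples in memory.
class BulkCounts {
    // Collapse raw samples into value -> occurrence count, the Map<String, Long>
    // shape passed to trainBulk in the example above.
    // Note: groupingBy rejects null keys, so null samples must be filtered out
    // (or counted separately) before calling this method.
    static Map<String, Long> countValues(List<String> samples) {
        return samples.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Male", "Female", "Male", "Unknown", "Male", "Female");
        Map<String, Long> counts = countValues(raw);
        // Prints: 3 2 1
        System.out.println(counts.get("Male") + " " + counts.get("Female") + " " + counts.get("Unknown"));
    }
}
```

The resulting map could then be passed to analysis.trainBulk(...) exactly as the hand-built map is in the example above.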

Record Mode Example

Used when the primary objective is Semantic Type information and not profiling, or when the focus is on a subset of the data (e.g. fewer than MAX_CARDINALITY records). The advantage of using Record mode is that the Semantic Type detection is stronger and there is support for cross-stream analysis.

	String[] headers = { "First", "Last", "MI" };
	String[][] names = {
			{ "Anaïs", "Nin", "" }, { "Gertrude", "Stein", "" }, { "Paul", "Campbell", "" },
			{ "Pablo", "Picasso", "" }, { "Theodore", "Camp", "" }, { "Henri", "Matisse", "" },
			{ "Georges", "Braque", "" }, { "Ernest", "Hemingway", "" }, { "Alice", "Toklas", "B." },
			{ "Eleanor", "Roosevelt", "" }, { "Edgar", "Degas", "" }, { "Pierre-Auguste", "Wren", "" },
			{ "Claude", "Monet", "" }, { "Édouard", "Sorenson", "" }, { "Mary", "Dunning", "" },
			{ "Alfred", "Jones", "" }, { "Joseph", "Smith", "" }, { "Camille", "Pissarro", "" },
			{ "Franklin", "Roosevelt", "Delano" }, { "Winston", "Churchill", "" }
	};

	AnalyzerContext context = new AnalyzerContext(null, DateResolutionMode.Auto, "customer", headers );
	TextAnalyzer template = new TextAnalyzer(context);

	RecordAnalyzer analysis = new RecordAnalyzer(template);
	for (String [] name : names)
		analysis.train(name);

	RecordAnalysisResult recordResult = analysis.getResult();

	for (TextAnalysisResult result : recordResult.getStreamResults()) {
		System.err.printf("Semantic Type: %s (%s)%n", result.getSemanticType(), result.getType());
	}

Result:

	Semantic Type: NAME.FIRST (String)
	Semantic Type: NAME.LAST (String)
	Semantic Type: NAME.MIDDLE (String)

Additional Examples

Additional examples are in the examples directory.

Date Format determination

If you are solely interested in determining the format of a date from a single sample, then the following example is a good starting point:

	final DateTimeParser dtp = new DateTimeParser().withDateResolutionMode(DateResolutionMode.MonthFirst).withLocale(Locale.ENGLISH);

	// Determine the DateTimeFormatter pattern for the following examples
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("26 July 2012"));
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("March 9 2012"));
	// Note: Detected as MM/dd/yyyy despite being ambiguous, because MonthFirst (set above) is applied when the data is insufficient
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("07/04/2012"));
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("2012 March 20"));
	System.err.printf("Format is: '%s'%n", dtp.determineFormatString("2012/04/09 18:24:12"));

	// Determine format of the input below and then parse it
	String input = "Wed Mar 04 05:09:06 GMT-06:00 2009";

	String formatString = dtp.determineFormatString(input);

	// Grab the DateTimeFormatter from FTA as this creates a case-insensitive parser and it supports a slightly wider set of formats
	// For example, "yyyy" does not work out of the box if you use DateTimeFormatter.ofPattern
	DateTimeFormatter formatter = DateTimeParser.ofPattern(formatString);

	OffsetDateTime parsedDate = OffsetDateTime.parse(input, formatter);

	System.err.printf("Format is: '%s', Date is: '%s'%n", formatString, parsedDate.toString());
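
To see why the resolution mode matters for an input such as "07/04/2012", the sketch below parses that string with both candidate patterns using only java.time. It does not call FTA; the class name AmbiguousDate and its parse helper are hypothetical names chosen for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical illustration, not part of FTA.
class AmbiguousDate {
    // Parse the input with an explicit pattern so the field order is unambiguous.
    static LocalDate parse(String input, String pattern) {
        return LocalDate.parse(input, DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
    }

    public static void main(String[] args) {
        String input = "07/04/2012";
        // Month-first reading: 4 July 2012
        System.out.println(parse(input, "MM/dd/yyyy")); // 2012-07-04
        // Day-first reading: 7 April 2012
        System.out.println(parse(input, "dd/MM/yyyy")); // 2012-04-07
    }
}
```

Both patterns accept the string, so a single sample cannot settle the question; a resolution mode or further samples are needed to choose between the two readings.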

If you are interested in determining the format based on a set of inputs, then the following example is a good starting point:

	final DateTimeParser dtp = new DateTimeParser().withLocale(Locale.ENGLISH);

	final List<String> inputs = Arrays.asList( "10/1/2008", "10/2/2008", "10/3/2008", "10/4/2008", "10/5/2008", "10/10/2008" );

	inputs.forEach(dtp::train);

	// At this stage we are not sure of the date format, since with 'DateResolutionMode == None' we make no
	// assumption whether it is MM/DD or DD/MM and the format String is unbound (??/?/yyyy)
	System.err.println(dtp.getResult().getFormatString());

	// Once we train with another value which indicates that the Day must be the second field then the new
	// result is correctly determined to be MM/d/yyyy
	dtp.train("10/15/2008");
	System.err.println(dtp.getResult().getFormatString());

	// Once we train with another value which indicates that the Month is expressed using one or two digits the
	// result is correctly determined to be M/d/yyyy
	dtp.train("3/15/2008");
	System.err.println(dtp.getResult().getFormatString());

Note: For Date Format determination you only need fta-core.jar.
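
The practical difference between the MM/d/yyyy and M/d/yyyy results above can be checked with plain java.time: a two-letter MM requires exactly two digits, whereas a single M accepts one or two. This is a minimal sketch that does not call FTA; the class name PatternWidth and the helper parses are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical illustration, not part of FTA.
class PatternWidth {
    // Returns true if the input can be parsed as a LocalDate with the given pattern.
    static boolean parses(String input, String pattern) {
        try {
            LocalDate.parse(input, DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH));
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("10/15/2008", "MM/d/yyyy")); // true
        System.out.println(parses("3/15/2008", "MM/d/yyyy"));  // false: MM needs two digits
        System.out.println(parses("3/15/2008", "M/d/yyyy"));   // true
        System.out.println(parses("10/15/2008", "M/d/yyyy"));  // true: M also accepts two digits
    }
}
```

This is why training on "3/15/2008" has to widen the result: the narrower pattern would reject that sample.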

Metrics

In addition to the input/configuration attributes:

  • streamName - Name of the input stream
  • dateResolutionMode - Mode used to determine how to resolve dates in the absence of adequate data. One of None, DayFirst, MonthFirst, or Auto.
  • compositeName - Name of the Composite the Stream is a member of (e.g. Table Name)
  • compositeStreamNames - Ordered list of the Composite Stream names (including streamName)
  • detectionLocale - Locale used to run the analysis (e.g. "en-US")
  • ftaVersion - Version of FTA used to generate analysis

There are a large number of metrics detected, which vary based on the type of the input stream.

<details> <summary><b>Supported Metrics</b></summary>
  • sampleCount - Number of samples observed (read Merging Analyzes for subtleties associated with merging)
  • matchCount - Number of samples that match the detected Base (or Semantic) type
  • nullCount - Number of null samples
  • blankCount - Number of blank samples
  • distinctCount - Number of distinct (valid) samples, typically -1 if maxCardinality exceeded. See Note 1.
  • regExp - A Regular Expression (Java) that matches the detected Type
  • confidence - The confidence (0-1.0) in the determination of the Type. If no Semantic Type is detected, the value reflects the confidence in the Base Type; if a Semantic Type is detected, it reflects the confidence in the Semantic Type.
  • type - The Base Type (one of Boolean, Double,