SLOGERT v1.0.0-SNAPSHOT
-- Semantic LOG ExtRaction Templating (SLOGERT) --
General Introduction
SLOGERT aims to automatically extract and enrich low-level log data into an RDF Knowledge Graph that conforms to our LOG Ontology. It integrates:
- LogPai for event pattern detection and parameter extraction from log lines,
- Stanford NLP for parameter type detection and keyword extraction,
- OTTR Engine for RDF generation, and
- Apache Jena for RDF data manipulation.
We have tested our approach on text-based logs produced by Unix OSs, in particular:
- Apache,
- Kernel,
- Syslog,
- Auth, and
- FTP logs.
In our latest evaluation, we are testing our approach with the AIT log dataset, which contains additional logs from non-standard applications, such as suricata and exim4. In this repository, we include a small excerpt of the AIT log dataset in the input folder as example log sources.
Workflow

The SLOGERT pipeline consists of several steps; its main parts are shown in Figure 1 above and described in the following:
Initialization
- Load `config-io` and `config.yaml`
- Collect target `log files` from the `input folder` as defined in `config-io`. We assume that each top-level folder within the input folder represents a single log source.
- Aggregate collected log files into a single file.
- Add log-source information to each log line.
- If log lines exceed the configuration limit (e.g., 100k), split the aggregated log file into a set of `log-files`.

Example results of this step are available in the `output/auth.log/1-init/` folder.
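The aggregate-and-split logic above can be sketched in a few lines of Python. This is a minimal illustration of the described steps, not SLOGERT's actual Java implementation; the tab separator and the default limit are assumptions:

```python
def initialize(sources, limit=100_000):
    """Aggregate log lines from several sources, tag each line with its
    log-source name, and split the result into chunks of at most `limit`
    lines (one chunk per output log-file)."""
    tagged = []
    for source_name, lines in sources.items():
        for line in lines:
            # Add log-source information to each log line
            # (tab separator is an illustrative choice)
            tagged.append(f"{source_name}\t{line}")
    # Split the aggregated log into a set of log-files
    return [tagged[i:i + limit] for i in range(0, len(tagged), limit)]
```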
A1 - Extraction Template Generation
- Initialize `extraction_template_generator` with `config-io` to register extraction patterns
- For each `log-file` from `log-files`:
    - Generate a list of `<extraction-template, raw-result>` pairs using `extraction_template_generator`

NOTE: We use LogPai as `extraction_template_generator`

Example results of this step are available in the `output/auth.log/2-logpai/` folder.
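For illustration, an `<extraction-template, raw-result>` pair produced by a LogPai-style template miner might look like the following. The concrete template, the `<*>` placeholder syntax, and the data structure are hypothetical, not actual LogPai output:

```python
# A hypothetical <extraction-template, raw-result> pair: the template
# abstracts variable tokens of a log line into <*> placeholders, and the
# raw result records the concrete parameter values extracted from one line.
pair = (
    "Failed password for <*> from <*> port <*> ssh2",   # extraction-template
    {"params": ["root", "192.0.2.7", "53412"]},         # raw-result
)
```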
A2 - Template Enrichment
- Load existing `RDF_templates` list
- Load `regex_patterns` from `config` list for parameter recognition
- Initialize `NLP_engine`
- For each `extraction-template` from the list of `<extraction-template, raw-result>` pairs:
    - Transform `extraction-template` into an `RDF_template_candidate`
    - If `RDF_templates` does not contain `RDF_template_candidate`:
        - [A2.1 - RDF template generation]
        - For each `parameter` from `RDF_template_candidate`:
            - If `parameter` is `unknown`:
                - [A2.2 - Template parameter recognition]
                - Load `sample-raw-results` from `raw-results`
                - Recognize `parameter` from `sample-raw-results` using `NLP_engine` and `regex_patterns` as `parameter_type`
                - Save `parameter_type` in `RDF_template_candidate`
                - [A2.2 - end]
        - [A2.3 - Keyword extraction]
        - Extract `template_pattern` from `RDF_template_candidate`
        - Execute `NLP_engine` on the `template_pattern` to retrieve `template_keywords`
        - Add `template_keywords` as keywords in `RDF_template_candidate`
        - [A2.3 - end]
        - [A2.4 - Concept annotation]
        - Load `concept_model` containing relevant concepts in the domain
        - For each `keyword` from `template_keywords`:
            - For each `concept` in `concept_model`:
                - If `keyword` contains `concept`:
                    - Add `concept` as concept annotation in `RDF_template_candidate`
        - [A2.4 - end]
        - Add `RDF_template_candidate` to `RDF_templates` list
        - [A2.1 - end]

NOTE: We use Stanford NLP as our `NLP_engine`

Example results (i.e., `RDF_templates`) of this step are available as `output/auth.log/auth.log-template.ttl`
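The enrichment loop above can be sketched in Python as follows. This is a minimal sketch: the candidate data structure, the sample size, and the `nlp_engine` interface are stand-ins, not the actual SLOGERT or Stanford NLP API:

```python
def to_rdf_template(pattern):
    # Hypothetical candidate structure: the pattern plus one "unknown"
    # slot per <*> parameter placeholder
    return {"pattern": pattern,
            "params": ["unknown"] * pattern.count("<*>"),
            "keywords": [], "concepts": []}

def enrich_templates(pairs, rdf_templates, regex_patterns, nlp_engine, concept_model):
    """A2: turn extraction templates into enriched RDF template candidates."""
    for extraction_template, raw_results in pairs:
        # A2.1 - transform the extraction template into an RDF template candidate
        candidate = to_rdf_template(extraction_template)
        if candidate["pattern"] in {t["pattern"] for t in rdf_templates}:
            continue  # template already known: skip enrichment
        # A2.2 - Template parameter recognition on a sample of raw results
        for i, parameter in enumerate(candidate["params"]):
            if parameter == "unknown":
                samples = raw_results[:10]  # sample size is illustrative
                candidate["params"][i] = nlp_engine.recognize(samples, regex_patterns)
        # A2.3 - Keyword extraction from the template pattern
        candidate["keywords"] = nlp_engine.keywords(candidate["pattern"])
        # A2.4 - Concept annotation: keep concepts contained in a keyword
        candidate["concepts"] = [c for c in concept_model
                                 for k in candidate["keywords"] if c in k]
        rdf_templates.append(candidate)
    return rdf_templates
```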
A3 - RDFization
- Initialize `RDFizer_engine`
- Generate `RDF_generation_template` from `RDF_templates` list
- For each `raw_result` from `raw_results` list:
    - Generate `RDF_generation_instances` from `RDF_generation_template` and `raw_result`
    - Generate `RDF_graph` from `RDF_generation_instances` and `RDF_generation_template` using `RDFizer_engine`

NOTE: We use LUTRA as our `RDFizer_engine`

Example `RDF_generation_template` and `RDF_generation_instances` are available in the `output/auth.log/3-ottr/` folder.
Example results of this step are available in the `output/auth.log/4-ttl/` folder.
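Schematically, the RDFization loop boils down to the following sketch; the `rdfizer_engine` interface here is a stand-in, not Lutra's actual API:

```python
def rdfize(rdf_templates, raw_results, rdfizer_engine):
    """A3: build one RDF_generation_template, then one RDF graph per raw result."""
    # Generate RDF_generation_template from the RDF_templates list
    generation_template = rdfizer_engine.build_template(rdf_templates)
    graphs = []
    for raw_result in raw_results:
        # Generate RDF_generation_instances for this raw result
        instances = rdfizer_engine.instantiate(generation_template, raw_result)
        # Expand template + instances into an RDF graph
        graphs.append(rdfizer_engine.expand(generation_template, instances))
    return graphs
```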
KG Generation Algorithm
<p align="center"> <img width="460" src="https://raw.githubusercontent.com/sepses/slogert/master/figures/algorithm.png"> </p> <p align="center"><b>Figure 2</b>. SLOGERT KG generation algorithms.</p>

For those who are interested, we also provide an explanation of the KG generation in the form of an algorithm, as shown in Figure 2 above.
How to run
Prerequisites for running SLOGERT
- Java 11 (for Lutra)
- Apache Maven
- Python 2 with `pandas` and `python-scipy` installed (for LogPai)
    - the default setting is to use the `python` command to invoke Python 2
    - if this is not the case, modification of `LogIntializer.java` is needed
We have tried and tested SLOGERT on Mac OSX and Ubuntu with the following steps:
- Compile this project (`mvn clean install`, or `mvn clean install -DskipTests` if you want to skip the tests)
- You can set properties for extraction in the config file (e.g., the number of log lines produced per file). Examples of config and template files are available in the `src/test/resources` folder (e.g., `auth-config.yaml` for auth log data).
- Transform the CSVs into OTTR format using the config file. By default, the following script should work on the example file: `java -jar target/slogert-<SLOGERT-VERSION>-jar-with-dependencies.jar -c src/test/resources/auth-config.yaml`
- The results will be produced in the `output/` folder
SLOGERT configurations
SLOGERT's configuration is divided into two parts: the main configuration `config.yaml` and the I/O configuration `config-io.yaml`.
Main Configuration
There are several configuration options that can be adapted in the main configuration file `src/main/resources/config.yaml`. We briefly describe the most important options here.
- logFormats to describe the information that you want to extract from a log source. This is important due to the various existing log line formats and variants. Each logFormat contains references to the ottrTemplate used to build the `RDF_generation_template` for the RDFization step.
- nerParameters to register patterns that will be used by StanfordNLP for recognizing log template parameter types.
- nonNerParameters to register standard regex patterns for template parameter types that can't be easily detected using StanfordNLP. Both nerParameters and nonNerParameters contain references for OTTR template generation.
- ottrTemplates to register the `RDF_generation_template` building blocks necessary for the RDFization process.
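As an illustration only: the top-level keys below are the ones described above, but the nested fields and values are hypothetical; consult `src/main/resources/config.yaml` for the real schema.

```yaml
# Hypothetical sketch of the config.yaml structure; the fields inside
# each entry are illustrative, not SLOGERT's actual schema.
logFormats:
  - name: syslog            # a log line format variant
    ottrTemplate: syslogTemplate
nerParameters:
  - type: URL               # parameter type recognized by StanfordNLP
    ottrTemplate: urlTemplate
nonNerParameters:
  - type: IPv4              # parameter type matched by a plain regex
    regex: "(\\d{1,3}\\.){3}\\d{1,3}"
    ottrTemplate: ipTemplate
ottrTemplates:
  - name: syslogTemplate    # building block for RDF_generation_template
```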
I/O Configuration
The I/O configuration aims to describe log-source-specific information that is not suitable to be added to `config.yaml`. An example of this I/O configuration is `src/test/resources/auth-config.yaml` for the auth log. We describe the most important configuration options in the following:
- source: the name of the source file to be searched for in the input folder.
- format: the basic format of the log file, which will be used by `extraction_template_generator` in process A1.
- logFormat: the type of the log file. The value of this property should be registered in `logFormats` within `config.yaml` for SLOGERT to work.
- isOverrideExisting: whether SLOGERT should load existing `RDF_templates` or override them.
- paramExtractAttempt: how many log lines should be processed to determine the `parameter_type` of an `RDF_template_candidate`.
- logEventsPerExtraction: how many log lines should be processed in a single batch of execution.
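Putting these options together, an I/O configuration might look like the following sketch. All values are illustrative assumptions; see `src/test/resources/auth-config.yaml` for a real example.

```yaml
# Hypothetical I/O configuration sketch; keys are the ones listed above,
# values are illustrative only.
source: auth.log                # file name searched for in the input folder
format: "<timestamp> <host> <content>"  # illustrative format string for A1
logFormat: syslog               # must be registered under logFormats in config.yaml
isOverrideExisting: false       # reuse existing RDF_templates instead of overriding
paramExtractAttempt: 10         # log lines sampled to determine a parameter_type
logEventsPerExtraction: 100000  # log lines per execution batch
```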
