LUDA: Large URLs Dataset Analyzer for security
Presented at BlackHat USA 2021 Arsenal
Table of Contents
- Download and getting started
- The 5 modules
- Deployment with docker to a remote machine
- Support and contributing to Luda
Malicious actors often reuse code to deploy their malware, phishing websites, or C&C servers. As a result, similarities can be found in URL paths by inspecting internet traffic. Moreover, deep learning models, and even regular ML models, are not a good fit for inline deployment in terms of runtime performance. Regexes (or YARA rules), however, can be deployed on a proxy and work in real time on all the traffic. LUDA can take a set of malicious and benign URLs and return a list of regexes ready to be deployed inline!
Download and getting started
First of all, clone the repo :)
Now copy test/config.json to the main directory.
To make sure it works for everyone, we will run everything inside Docker. Assuming you have docker and docker-compose on your machine, just run from the project directory
docker-compose up # building the image for the first time can take a few minutes
It will create a container named luda and start a Jupyter notebook that you can access at localhost:5555 (token: luda). You will also notice that it created a "data" folder at the project level that is mapped to the same folder inside the container.
So now copy (on the host) test/data_demo.csv to data/data_demo.csv. The file config.json is already set the way we need it.
Now go into your container with
docker exec -it luda bash
and run
python main.py # should take less than 1 min with 8 CPUs and 16 GB of RAM
It will preprocess the data and cluster the URLs. Now let's look at the clusters! Go to localhost:5555 to access the Jupyter notebook hosted in the container and open analysis/luda_analysis.ipynb
Run all cells and then go to the last part, "Cluster analysis". The last output cells should show you the clusters. You should see something like this:
Name: cluster, dtype: int64
#####Cluster 0 - 27 samples: ####
['/neat/serverphp/config.bin',
'/serverphp/config.bin',
...
'/pus1/serverphp/config.bin',
'/lg/server.php/config.bin',
'/ekene/Severphp/config.bin',
'/server[php]/config.bin',
'/versy/serverphp/config.bin']
#####Cluster 4 - 17 samples: ####
['/mupanel/post.php',
'/jiz/kbpanel/post.php',
...
'/low/kbpanel/post.php',
'/1/kbpanel/post.php',
'/new/kbpanel/post.php']
Here you can choose which clusters to run the regex generation on. This last part is CPU- and RAM-expensive, so you should run it only on the clusters that look "good". Here you can also identify paths that could generate FPs (false positives), like "/index.php" for example; check use_case_clustering.py to see how you can fix FPs at this step. Let's say you choose only those two clusters (0 and 4). Change config.json (in the container; you can access it directly via the notebook) to be:
{
"main_file": "data_demo.csv",
"data": {
"run": false,
"additional_files": [
{
"path": "my_data/benign_data.csv",
"label": "benign"
},
{
"path": "my_data/malicious_traffic.csv",
"label": "malicious"}
]
},
"feeder": {
"run": false,
"sources": [
"urlhaus",
"openfish",
"alexa"
]
},
"preprocessing": {
"run": false,
"name": "basic"
},
"clustering": {
"run": false,
"preprocessed_file": null,
"skip_distance_computation": false,
"clusterer": {
"dbscan": {
"eps": 20,
"min_samples": 8
}
},
"metric": "sw",
"features_folder": "luda_output/mymatrix",
"filter_similarity": 30,
"phishing_mode": false
},
"regex": {
"run": true,
"benign_for_retrain": 30,
"round_max": 10,
"regex_folder": "myregexes",
"take_existing_result": true,
"min_path_for_run": 200,
"cluster_list": [0,4]
}
}
We just turned off all steps except the regex generation step that we want to run. We also specified that we want to run on clusters 0 and 4 only.
Now again (from the container):
/!\ This step can take a few hours (~2h on a 48-CPU machine with 378 GB of RAM, without using all its resources)
python main.py
Check the log at luda_output/logs/luda.log; at the end you will see a small report (showing how each signature evolved at each round):
N cluster : 2
N paths: 44
N benign in final test: 9486
Benign number for retraining : 30
N round: 10
Cluster sig paths:
cluster_27_0 : (\.*+[^_])++ ---> [^bin]*+[^\.]*+\.bin
cluster_17_4 : ([^_]\w++)++ ---> [^\.]++\.php ---> (\w*+/)++post\.php ---> [^php]++\w\w\w/?+\w++/post\.php
After final testing:
Cluster with 0 FP: {'cluster_17_4', 'cluster_27_0'}
Number of paths covered with 0 FP: 44
Percentage of paths covered with 0 FP: 100.0 %
### FP Report ###
With FP :
Without:
['cluster_27_0', 'cluster_17_4']
You also get a report showing basic info on the run. It's a CSV stored in the "regex_folder" (following the above config, it is luda_output/myregexes/report_myregexes.csv):
| id | name | regex_js | regex_java | malicious | benign | round | example_malicious | results_file | input_file |
|----|--------------|--------------------------------------------------------|----------------------------------|-----------|--------|-------|-----------------------------|---------------------------|-------------------------|
| 0 | cluster_17_4 | `(?=([^php]+))\1\w\w\w(?=(/?))\2(?=(\w+))\3/post.php` | `[^php]++\w\w\w/?+\w++/post.php` | 17 | 61 | 3 | /mupanel/post.php | results_cluster_17_4.json | input_cluster_17_4.json |
| 1 | cluster_27_0 | `(?=([^bin]))\1(?=([^.]))\2.bin` | `[^bin]+[^.]+.bin` | 27 | 30 | 1 | /neat/serverphp/config.bin | results_cluster_27_0.json | input_cluster_27_0.json |
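As a quick sanity check, the JS-flavor regex from the report can be exercised with Python's re module, whose lookaheads, like JavaScript's, are atomic; that atomicity is what the `(?=(...))\1` trick uses to emulate the possessive quantifiers of the Java flavor:

```python
import re

# regex_js copied verbatim from the report table above
regex_js = r"(?=([^php]+))\1\w\w\w(?=(/?))\2(?=(\w+))\3/post.php"

assert re.search(regex_js, "/mupanel/post.php")      # cluster 4 member
assert re.search(regex_js, "/jiz/kbpanel/post.php")  # another cluster 4 member
assert not re.search(regex_js, "/index.php")         # common benign path
```

The same pattern can be dropped into a JS-capable proxy unchanged; for Java deployments, use the regex_java column instead.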
Congrats on your first LUDA run! You now have 2 regexes (Java or JS flavor) that can be used to catch malicious URLs belonging to the clusters you found :)
In the next part, we will dive into LUDA's architecture to understand each of its components, see what else you can do, and maybe convince you to contribute to the project!
LUDA is composed of 5 modules: data, feeder, preprocessing, clustering and regex generation.
To run LUDA, we first need to configure config.json.
The 5 modules
Every part is independent and can be run separately with the config file.
Data
To provide LUDA with URLs, you can pass it some files. The only requirement is that each file has a column named "url". However, if you provide the main file (here data_demo.csv), it should have url, source, label and family as columns. So the easiest way to add your files is to list them in the additional_files array.
LUDA will then load them and store them in its own format, joined with the data coming from the feeders. By default, it looks for the files in the data folder; otherwise you can write an absolute path. The main file does not have to exist: you can add your own files in additional_files and LUDA will combine them.
"main_file": "data_demo.csv",
"data": {
"run": false,
"additional_files": [
{
"path": "my_data/benign_data.csv",
"label": "benign"
},
{
"path": "my_data/malicious_traffic.csv",
"label": "malicious"}
]
},
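Any CSV with a "url" column works as an additional file. A purely illustrative helper (the file name and URLs below are made up; the "label" comes from the config entry, not the file):

```python
import csv

# Hypothetical helper, not part of LUDA: write a list of URLs into a CSV
# that LUDA can ingest via "additional_files". The only required column
# is "url".
def write_url_file(path, urls):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])            # header row
        writer.writerows([u] for u in urls) # one URL per row

write_url_file("malicious_traffic.csv",
               ["http://a.example/kbpanel/post.php",
                "http://b.example/serverphp/config.bin"])
```

Drop the resulting file into the data folder (or reference it by absolute path) and list it under additional_files.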
Feeders
We implemented several feeders for malicious sources that bring you the most recent data, among them feeders for URLhaus, OpenPhish, Alexa, Majestic, VT, etc. If your feeder brings domains (not URLs), a crawler is available that can convert those domains into URLs. We invite you to create your own feeder and share it with this project.
"feeder": {
"run": false,
"sources": [
"urlhaus",
"openfish",
"alexa"
]
}
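The feeder interface itself is project-specific (see the repo's feeder classes), but conceptually a feeder turns a raw feed dump into rows matching the main-file schema (url, source, label, family). A hedged sketch, assuming the common one-URL-per-line dump format with '#'-prefixed comment lines:

```python
# Illustrative only -- not LUDA's actual feeder API.
def feed_to_rows(feed_text, source, label="malicious"):
    rows = []
    for line in feed_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        rows.append({"url": line, "source": source,
                     "label": label, "family": ""})
    return rows
```

A feeder for a benign source would do the same with label="benign"; a domain-only feed would additionally go through the crawler to expand domains into URLs.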
Preprocessing
To get better results and save computation, it is mandatory to preprocess the data. You need to filter your URLs smartly, keeping only the ones that "have a chance to create a cluster".
We provide a class that implements the "basic" preprocessing techniques we are currently using.
"preprocessing": {
"run": false,
"name": "basic"
}
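For intuition only (the real logic lives in the repo's "basic" preprocessing class), such filtering might keep just the URL path, then drop near-empty paths and exact duplicates so the distance matrix stays small:

```python
from urllib.parse import urlsplit

# Illustrative sketch, not LUDA's actual preprocessing.
def basic_preprocess(urls, min_path_len=2):
    seen, paths = set(), []
    for url in urls:
        path = urlsplit(url).path       # strip scheme, host, query, fragment
        if len(path.strip("/")) < min_path_len:
            continue                    # '/' alone would glue everything together
        if path not in seen:            # exact duplicates add no information
            seen.add(path)
            paths.append(path)
    return paths
```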
Clustering
"clustering": {
"run": false,
"preprocessed_file": null,
"skip_distance_computation": false,
"clusterer": {
"dbscan": {
"eps": 20,
"min_samples": 8
}
},
"metric": "sw",
"features_folder": "luda_output/mymatrix",
"filter_similarity": 30,
"phishing_mode": false
}
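For intuition on how eps and min_samples from the config act on the precomputed path distances, here is a minimal, self-contained DBSCAN over a distance matrix (the project itself presumably relies on a library implementation):

```python
# Educational sketch of DBSCAN with a precomputed distance matrix D
# (list of lists). Returns one label per point; -1 means noise.
def dbscan_precomputed(D, eps, min_samples):
    n = len(D)
    labels = [None] * n  # None = unvisited, -1 = noise
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        neighbors = [q for q in range(n) if D[p][q] <= eps]
        if len(neighbors) < min_samples:   # the point itself counts
            labels[p] = -1
            continue
        cluster += 1                       # p is a core point: new cluster
        labels[p] = cluster
        seeds = [q for q in neighbors if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster        # noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = [r for r in range(n) if D[q][r] <= eps]
            if len(q_neighbors) >= min_samples:   # q is core: keep expanding
                seeds.extend(r for r in q_neighbors if labels[r] is None)
    return labels
```

With eps=20 and min_samples=8 as in the config, a URL path only seeds a cluster if at least 8 paths (itself included) lie within distance 20 of it.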
Distance matrix computation
This is a CPU- and RAM-expensive step. It will use (by default) all your CPUs and can consume 300 GB of RAM for a list of more than 35k URLs... That's why the preprocessing step is so important. At the end of the task, it saves the results in a folder (specified in the config file) that you can reuse several times to test different clustering parameters.
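The "sw" metric in the config suggests a Smith-Waterman-style local alignment score between paths. A minimal sketch with illustrative weights (match=2, mismatch=-1, gap=-1; the project's actual parameters may differ):

```python
# Smith-Waterman local alignment score between two strings.
# Weights here are illustrative, not necessarily LUDA's.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # H[i][j] = best local alignment score ending at a[:i], b[:j]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```

Paths from the same cluster (e.g. the serverphp/config.bin variants above) share long common substrings and score high, while unrelated strings fall to 0; this quadratic per-pair cost over all pairs is what makes the step so expensive.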
If you already have a CSV file with your data, you need to write its absolute path in the confi
