Mathilda

A library for writing fast and scalable web exploits and scanners in C++11

Disclaimer: I had to strip some of Mathilda's important functionality in order to open source it. You may need to make some modifications and tweaks for it to be useful to you. Feel free to reach out to me with your ideas and pull requests.

What

Mathilda is a multi-process libcurl/libuv wrapper written (mostly) in C++11. Mathilda allows you to write fast web testing tools that can operate at scale. But what does scale mean in this context? It means horizontal scaling and distribution of work across worker processes wherever possible.

Using these classes you can write tools that work quickly against a very large set of hosts. Some good examples are command injection crawlers, XXE exploits, SSRF, or HTTP header discovery tools.

Mathilda is not a singleton; you can have multiple Mathilda class instances in your program. Each class instance manages its own internal state with regard to child processes, libuv, and libcurl. The point is to abstract a lot of that away but still give you access to the bits that you want to customize.

Why

The libcurl API is good, but I needed something that wrapped all the plumbing code you normally rewrite with each new command line web testing tool. Mathilda is a library that handles all of that for you. All you have to do is focus on writing the code that defines the requests and does something with the responses. All of this will be automatically distributed for you across a pool of worker processes that invoke your callbacks. I have benchmarked it doing simple tasks at roughly 10,000 HTTP GET requests per second on a 24-core system.

How

Mathilda takes a list of Instruction class instances. Each Instruction instance represents a specific HTTP request to be made. These instances tell Mathilda how to make a request and define pre and post callback function pointers to handle the request and response appropriately.

Depending on the number of cores your system has, Mathilda manages a pool of worker processes for efficiently distributing work with an event model implemented by libuv. This design was chosen to minimize the impact that slower servers have on performance. More on that below.
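To make the idea of sizing a worker pool by core count concrete, here is a minimal sketch. It is not Mathilda's code: worker_count and partition are hypothetical helpers, and the round-robin split is an assumption about one reasonable way to divide a list of hosts among workers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unistd.h>
#include <vector>

// Hypothetical helper: one worker per online core, never fewer than one.
// sysconf() returns -1 when the value cannot be determined.
unsigned worker_count() {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1u : static_cast<unsigned>(n);
}

// Hypothetical helper: deal the work items out round-robin into one
// bucket per worker. Bucket i would be handed to worker process i.
std::vector<std::vector<std::string> >
partition(const std::vector<std::string> &items, unsigned workers) {
    assert(workers > 0);
    std::vector<std::vector<std::string> > buckets(workers);
    for (std::size_t i = 0; i < items.size(); ++i) {
        buckets[i % workers].push_back(items[i]);
    }
    return buckets;
}
```

Calling partition(hosts, worker_count()) gives each worker a roughly equal share, so one slow server only delays the bucket it belongs to.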

It is important to note that Mathilda calls fork to create the child processes. This is normally a bad choice for library designs but suits our needs perfectly. Mathilda manages these child processes with a class named MathildaFork. You can use this class directly from your own code to manage child processes if you want, but it's purely optional.
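The fork-and-wait pattern described above can be sketched with plain POSIX calls. This is not the MathildaFork class or its interface; run_in_children and even_succeeds are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper (not MathildaFork): fork n children, run work(i) in
// child i, and return how many children exited with status 0.
int run_in_children(int n, int (*work)(int)) {
    assert(n >= 0 && work != nullptr);
    std::vector<pid_t> pids;
    for (int i = 0; i < n; ++i) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: do the work, then exit without running parent cleanup.
            _exit(work(i));
        }
        if (pid > 0) {
            pids.push_back(pid);
        }
        // pid < 0 means fork failed; this sketch simply skips that slot.
    }
    int succeeded = 0;
    for (std::size_t i = 0; i < pids.size(); ++i) {
        int status = 0;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            ++succeeded;
        }
    }
    return succeeded;
}

// Example work function: even-numbered children succeed, odd ones fail.
int even_succeeds(int index) { return index % 2 == 0 ? 0 : 1; }
```

Because each child is a separate process, nothing it does to its own memory is visible to the parent. Only the exit status comes back here, which is why a real tool needs something like shared memory or pipes to return results.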

A bit of history for you. This code was originally a threadpool, but I ran into limitations with using evented IO from threads because nothing is thread safe. You end up having to message a worker thread to safely update timers to avoid race conditions and deadlocks. This quickly becomes very inefficient and complex code that isn't worth the small performance gains (http://stackoverflow.com/a/809049). This is true of libuv, libevent, and libev in one form or another. The complexity began to outweigh the performance hit of forking. If you don't care about using evented IO with libcurl then you can safely use a threadpool, and it's fast enough. However, you will need to use select with libcurl's multi interface, which is older and slower than epoll, which libuv manages for us.

In fact there are probably more asynchronous operations we could be using libuv for but aren't yet. If you're interested in how some of the Mathilda bits are reusable, see Mathilda::name_to_addr_a and how it uses MathildaFork to distribute DNS lookups across worker processes.

API

All source code is documented using Doxygen, and the documentation is automatically generated in the project. Everything below is a quick start guide to familiarize you with how to use the Mathilda and Instruction classes. I recommend starting here and referring to the Doxygen-generated documentation when writing code. Mathilda also contains a Ruby binding, Rathilda.

Mathilda class functions

* add_instruction(Instruction *) - Adds an Instruction class to an internally managed vector
* execute_instructions() - Instructs Mathilda to start scanning hosts
* clear_instructions() - Clears the instructions vector Mathilda holds, rarely used

Mathilda class members

* safe_to_fork (bool) - A flag indicating whether it is OK to fork (default: true)
* use_shm (bool) - A flag indicating whether shared memory segments should be allocated (default: false)
* set_cpu (bool) - A flag that tells Mathilda to try and bind to a specific CPU with sched_setaffinity (default: true)
* slow_parallel (bool) - Forks a child process for each Instruction if true (default: false)
* process_timeout (uint32_t) - The number of seconds a child process should be given before a SIGALRM is sent
* finish(ProcessInfo *) - Callback function pointer, executed after child exits. Passed a pointer to a ProcessInfo structure
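To illustrate what process_timeout implies, the sketch below arms a timer in a forked child so that a stuck worker is terminated by SIGALRM. It assumes the alarm is set inside the child with alarm(); the source does not say how Mathilda actually delivers the signal. child_timed_out, stuck_worker, and quick_worker are hypothetical names.

```cpp
#include <cassert>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: fork a child, arm alarm(timeout_seconds) in it, and
// run work(). Returns true if the child was terminated by SIGALRM.
bool child_timed_out(unsigned timeout_seconds, void (*work)()) {
    assert(timeout_seconds > 0 && work != nullptr);
    pid_t pid = fork();
    if (pid < 0) {
        return false;  // fork failed; nothing ran
    }
    if (pid == 0) {
        signal(SIGALRM, SIG_DFL);  // default action terminates the process
        alarm(timeout_seconds);    // assumption: the timer lives in the child
        work();
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return false;
    }
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM;
}

// A worker that never finishes, and one that returns immediately.
void stuck_worker() { for (;;) { pause(); } }
void quick_worker() {}
```

A parent can inspect the wait status this way to tell a worker that hit its timeout from one that exited normally.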

Mathilda class misc

* MATHILDA_FORK - Environment variable that when set tells Mathilda it is OK to fork

Instruction class members

This class needs to be created with the new operator, filled in, and passed to the Mathilda::add_instruction function.

* host (std::string) - The hostname to scan
* path (std::string) - The URI to request
* http_method (std::string) - The HTTP method to use (GET/POST)
* post_body (std::string) - The POST body to send to the server
* cookie_file (std::string) - Location on disk of a Curl cookie file
* user_agent (std::string) - User agent string to use (default: Chrome)
* proxy (std::string) - Hostname of the proxy you want to use
* port (short) - Server port to connect to
* proxy_port (short) - Proxy port to connect to
* response_code (int) - Invoke the after callback only if the HTTP response code matches this (0 to always invoke)
* connect_timeout (int) - Maximum TCP connection timeout in seconds (default: 5)
* http_timeout (int) - Maximum HTTP transaction timeout in seconds (default: 10)
* ssl (bool) - TLS/SSL support, true to enable, false to disable
* include_headers (bool) - Include HTTP headers with the after callback (Response->text will include them)
* use_proxy (bool) - Use the proxy configured with proxy/proxy_port
* follow_redirects (bool) - Follow HTTP redirects
* verbose (bool) - Enables verbose mode in Curl (this generates a lot of output)
* curl_code (CURLcode) - The Curl result code returned after the request is finished
* before(Instruction *, CURL *) - A callback function pointer, executed before curl_perform
* after(Instruction *, CURL *, Response *) - Callback function pointer, executed after curl_perform
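The interaction between response_code and the after callback can be sketched without libcurl. The types below are simplified stand-ins, not the library's Instruction or Response classes. SketchInstruction, SketchResponse, dispatch_after, and count_call are hypothetical, and the real callbacks also receive a CURL handle.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for this sketch only.
struct SketchResponse {
    int http_code;
    std::string text;
};

struct SketchInstruction {
    std::string host;
    std::string path;
    int response_code;  // 0 means "always invoke the after callback"
    void (*after)(SketchInstruction *, SketchResponse *);
};

// Invoke the after callback only when the filter allows it.
// Returns true if the callback ran.
bool dispatch_after(SketchInstruction *i, SketchResponse *r) {
    assert(i != nullptr && r != nullptr);
    if (i->after == nullptr) {
        return false;
    }
    if (i->response_code != 0 && i->response_code != r->http_code) {
        return false;
    }
    i->after(i, r);
    return true;
}

// Example callback: count how many responses passed the filter.
int after_calls = 0;
void count_call(SketchInstruction *, SketchResponse *) { ++after_calls; }
```

Setting response_code to 200 keeps a callback from running on error pages, while 0 lets a tool inspect every response, including failures.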

MathildaUtils static class functions

These utility functions were written to make writing libmathilda tools easier. When working with HTTP and URIs you often need to work with a collection of strings in some way. Sometimes it's tokenizing them, or modifying them for fuzzing. These functions will help you do those kinds of operations using simple STL containers. The documentation for each function is auto-generated using Doxygen. Please see the 'docs' directory for more information.

* shm_store_string(uint8_t *, const char *, size_t) - Stores a string in shared memory in [length,string] format
* get_http_headers(const char *, std::map<std::string, std::string> &) - Returns a std::map of HTTP headers from a raw HTTP response
* read_file(char *, std::vector<std::string> &) - Reads a file line by line into a std::vector
* unique_string_vector(vector<std::string> &) - Removes duplicate entries from a std::vector of std::string
* split(const std::string &, char, std::vector<std::string> &) - Splits a std::string according to a delimiter
* replaceAll(std::string &, const std::string &, const std::string &) - Replaces all occurrences of X within a std::string with Y
* link_blacklist(std::string const &) - Checks a std::string against a blacklist of URIs
* page_blacklist(std::string const &) - Checks a std::string against a blacklist of content
* is_http_uri(std::string const &) - Returns true if a URI is an HTTP URI
* is_https_uri(std::string const &) - Returns true if a URI is an HTTPS URI
* is_subdomain(std::string const &) - Returns true if a URI is a subdomain
* is_domain_host(std::string const &, std::string const &) - Returns true if a URI matches a domain
* extract_host_from_uri(std::string const &) - Extracts the hostname from a URI
* extract_path_from_uri(std::string const &) - Extracts the path from a URI
* normalize_uri(std::string const &) - Attempts to normalize a URI
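As a rough illustration of the kind of string handling listed above, here are independent re-implementations of a splitter, a replace-all, and a header parser. They are not the MathildaUtils source: the sketch_ names are hypothetical, and behaviour on edge cases (trailing delimiters, folded headers) is an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split s on delim, appending each token to out.
void sketch_split(const std::string &s, char delim,
                  std::vector<std::string> &out) {
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        out.push_back(item);
    }
}

// Replace every occurrence of from with to, in place.
void sketch_replace_all(std::string &s, const std::string &from,
                        const std::string &to) {
    assert(!from.empty());
    std::size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();  // skip past the text just inserted
    }
}

// Collect "Name: value" lines from a raw response into a map, stopping at
// the blank line that ends the header block.
void sketch_headers(const std::string &raw,
                    std::map<std::string, std::string> &out) {
    std::vector<std::string> lines;
    sketch_split(raw, '\n', lines);
    for (std::size_t i = 0; i < lines.size(); ++i) {
        std::string line = lines[i];
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        if (line.empty()) {
            break;  // end of headers
        }
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) {
            continue;  // status line has no colon
        }
        std::size_t start = line.find_first_not_of(' ', colon + 1);
        out[line.substr(0, colon)] =
            (start == std::string::npos) ? "" : line.substr(start);
    }
}
```

A fuzzing tool could, for example, keep a URI template containing a marker string and call the replace-all helper once per payload to produce each request path.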

MathildaDNS class functions

One of the slowest parts of writing networked tools is DNS lookups. These class methods should make that easier. The functions with the _a suffix denote an asynchronous call. These functions resolve multiple hostnames or IP addresses and block until they complete or until 30 seconds have passed. The MathildaDNS class only contains a very simple DNS cache in the form of a std::map. The key to this map is a string containing an IP address and the data is the hostname it maps to. IP addresses in the cache are unique but hostnames may not be (i.e. multiple IP addresses may map to a single hostname). This cache is very simple and is only helpful if you are repeatedly scanning the same set of hosts.

* name_to_addr(std::string const &, std::vector<std::string> &, bool) - Synchronous DNS name to address translation
* name_to_addr_a(std::vector<std::string> const &, std::vector<std::string> &) - Asynchronous DNS name to address translation
* addr_to_name(std::string const &, std::vector<std::string> &, bool) - Synchronous DNS address to name translation
* addr_to_name_a(std::vector<std::string> const &, std::vector<std::string> &) - Asynchronous DNS address to name translation
* flush_cache() - Flushes the DNS cache within the class
* disable_cache() - Disables the caching mechanism
* enab
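The cache described in the paragraph above, a std::map keyed by IP address, can be sketched as a small class. This is not the MathildaDNS implementation: SketchDnsCache and its member names are hypothetical, and the choice that a disabled cache ignores both stores and lookups is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of an IP address -> hostname cache.
class SketchDnsCache {
public:
    SketchDnsCache() : enabled_(true) {}

    // Remember that ip resolves to host (no-op while disabled).
    void store(const std::string &ip, const std::string &host) {
        assert(!ip.empty());
        if (enabled_) {
            cache_[ip] = host;
        }
    }

    // On a hit, copy the cached hostname into host and return true.
    bool lookup(const std::string &ip, std::string &host) const {
        if (!enabled_) {
            return false;
        }
        std::map<std::string, std::string>::const_iterator it = cache_.find(ip);
        if (it == cache_.end()) {
            return false;
        }
        host = it->second;
        return true;
    }

    void flush() { cache_.clear(); }        // drop every entry
    void disable() { enabled_ = false; }    // stop caching
    std::size_t size() const { return cache_.size(); }

private:
    bool enabled_;
    std::map<std::string, std::string> cache_;
};
```

Keys are unique because they are map keys, while two different IP addresses can hold the same hostname as their value, which matches the uniqueness rule stated above.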
