<p align="center"> <a href="https://github.com/mohammadraziei/liburlparser"> <img src="https://github.com/MohammadRaziei/liburlparser/raw/master/docs/images/logo/liburlparser-logo-1.svg" alt="Logo"> </a> <h3 align="center"> Fastest domain extractor library written in C++ with Python bindings. </h3> <h4 align="center"> First complete library for parsing URLs in C++, in Python, and on the command line </h4> </p>

About The Project

liburlparser is a powerful domain extractor library written in C++ with Python bindings. It provides efficient URL parsing capabilities for both C++ and Python, making it a valuable tool for projects that involve working with web addresses.

Features

Here are some key features of liburlparser:

Multiple Language Support:
- liburlparser can be used in multiple programming languages, including Python, C++, and Shell.
- It offers an intuitive interface that remains consistent across both C++ and Python.
Clean Code Design:
- The library provides two separate classes: Url and Host.
- This separation allows for cleaner and more organized code when dealing with URLs.
Public Suffix List Support:
- liburlparser supports known combinatorial suffixes (e.g., "ac.ir") using the public_suffix_list.
- It can also handle unknown suffixes (e.g., "comm" in "google.comm").
Automatic Public Suffix List Updates:
- Before each build and deployment, liburlparser updates the public_suffix_list automatically.
Host Properties:
- The Host class includes properties such as subdomain, domain, domain name, and suffix.
URL Properties:
- The Url class provides properties like protocol, userinfo, host (and all host properties), port, path, query parameters, and fragment.
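The suffix handling and host properties listed above can be sketched in a few lines of plain Python. This is an illustration only, not liburlparser's implementation: `KNOWN_SUFFIXES` is a tiny hand-written stand-in for the public suffix list, and `split_host` and its field names are hypothetical helpers chosen for this sketch.

```python
# Tiny hard-coded stand-in for the public suffix list (illustration only).
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host: str) -> dict:
    """Split a host into subdomain, domain and suffix (hypothetical helper)."""
    labels = host.lower().strip(".").split(".")
    if len(labels) == 1:
        # A bare label such as "localhost" has no suffix at all.
        return {"subdomain": "", "domain": labels[0], "suffix": ""}

    # Try the longest candidate suffix first, always leaving one label
    # for the domain itself.
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in KNOWN_SUFFIXES:
            suffix_start = start
            break
    else:
        # Unknown suffix: fall back to treating the last label as the suffix.
        suffix_start = len(labels) - 1

    return {
        "subdomain": ".".join(labels[: suffix_start - 1]),
        "domain": labels[suffix_start - 1],
        "suffix": ".".join(labels[suffix_start:]),
    }


print(split_host("ee.aut.ac.ir"))  # {'subdomain': 'ee', 'domain': 'aut', 'suffix': 'ac.ir'}
print(split_host("google.comm"))   # {'subdomain': '', 'domain': 'google', 'suffix': 'comm'}
```

A listed multi-label suffix such as "ac.ir" is matched as a whole, while an unlisted one such as "comm" falls back to the last label.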

Usage

Command Line

python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as json

Python

liburlparser is designed to be intuitive to use.

Every class has a help section:

import liburlparser
help(liburlparser)
print(liburlparser.__version__)

from liburlparser import Url, Host
help(Url)
help(Host)

Parse a URL and a host:

from liburlparser import Url, Host
## parse url:
url = Url("https://ee.aut.ac.ir/#id") # parse all parts of the url
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
## parse host
host = url.host # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id") # the fastest way to parse the host from a url
# each of these returns a Host object with the host part of the url already parsed
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())
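To cross-check the URL-level parts (protocol, userinfo, host, port, path, query, fragment), the same breakdown can be approximated with Python's standard library. The sketch below uses only `urllib.parse` and does not call liburlparser; the sample URL and the dictionary keys are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Sample URL invented for this sketch; it exercises every component.
parts = urlsplit("https://user:secret@ee.aut.ac.ir:8080/about?lang=en#id")

components = {
    "protocol": parts.scheme,
    # Everything before the last "@" in the authority is the userinfo.
    "userinfo": parts.netloc.rpartition("@")[0],
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}
print(components)
```

The standard library stops at the host string; splitting that host into subdomain, domain and suffix needs the public suffix list, which is the part liburlparser adds.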

There are also some helper APIs that give better performance for small tasks:

# if you need to extract the host of a url as a string without any parsing
host_str = Url.extract_host("https://ee.aut.ac.ir/about") # very fast
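To see why pulling out the host as a string can be cheap, here is a minimal pure-Python sketch that relies on string slicing alone. It is not how liburlparser implements `Url.extract_host`; the name `extract_host_sketch` is hypothetical, and IPv6 literals such as `[::1]` are deliberately not handled.

```python
def extract_host_sketch(url: str) -> str:
    """Return the host substring of a URL using plain string operations."""
    # Drop the scheme ("https://") if there is one.
    _, sep, rest = url.partition("://")
    if not sep:
        rest = url

    # The authority ends at the first "/", "?" or "#".
    end = len(rest)
    for ch in "/?#":
        idx = rest.find(ch)
        if idx != -1:
            end = min(end, idx)
    authority = rest[:end]

    # Strip userinfo ("user:pass@") and a trailing port (":8080").
    authority = authority.rpartition("@")[2]
    host, _, _port = authority.partition(":")
    return host


print(extract_host_sketch("https://ee.aut.ac.ir/about"))  # ee.aut.ac.ir
```

No suffix lookup happens here, which is why this kind of extraction is much lighter than building a full Host object.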

If you are a fan of pydomainextractor, liburlparser offers a similar interface:

import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url

# alternatively you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# as you can see, the api is the same

C++

There are some examples in the examples folder.

#include "urlparser.h"
...
/// for parsing a url
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // also for subdomain, port, params, ...
/// for parsing a host (three alternatives, named differently so they can share one scope)
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host hostFromUrlObject = url.host();
// or
TLD::Host hostFromUrlString = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");

As you can see, every method available in Python can be used just as easily in C++.

Installation

C++:

build steps:

# 1. Clone the repository with submodules (recursive)
# --recursive is essential to download third-party libs like googletest
git clone --recursive https://github.com/mohammadraziei/liburlparser
cd liburlparser

# 2. Configure the project
# -B build tells CMake to create a 'build' directory and generate files there
cmake -B build

# 3. Build the project
# --build abstracts the underlying build tool (Make, Ninja, MSBuild, etc.)
cmake --build build --config Debug

# 4. Run examples
# Note: On Windows, the path might be ./build/Debug/example.exe
./build/example

# 5. Make install
# --install handles the installation process
sudo cmake --install build

Python and Command Line:

Note that Python >= 3.8 is required.

Installation

from source

To install from source, you must ensure all submodules are cloned:

# 1. Clone recursively to get third-party dependencies
git clone --recursive https://github.com/mohammadraziei/liburlparser
cd liburlparser

# 2. Install using pip
pip install .

# Optional: If you want to see build logs
pip install . -v

pip by pypi

pip install liburlparser

If you want to use psl.update to update the public suffix list, you must install the online version:

pip install "liburlparser[online]"

pip by git

pip install git+https://github.com/mohammadraziei/liburlparser

manually

git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser

Performance

Extract From Host

Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th 2022).

| Library | Function | Time |
| ----------------- | ------------------------- | ------ |
| liburlparser | liburlparser.Host | 1.12s |
| PyDomainExtractor | pydomainextractor.extract | 1.50s |
| publicsuffix2 | publicsuffix2.get_sld | 9.92s |
| tldextract | __call__
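The harness behind these numbers is not described here, so the following is only a hypothetical sketch of how such a per-host timing could be measured. `time_extractor`, the three sample hosts, and the stand-in lambda are assumptions; any extractor callable can be passed in its place.

```python
import time


def time_extractor(extract, hosts, repeat=3):
    """Return the best wall-clock time in seconds for calling extract on every host."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for host in hosts:
            extract(host)
        best = min(best, time.perf_counter() - start)
    return best


# Small synthetic input; the benchmark above used 10 million domains.
hosts = ["ee.aut.ac.ir", "mail.google.com", "example.co.uk"] * 1000

# Stand-in extractor so the sketch runs without any third-party package.
elapsed = time_extractor(lambda h: h.rsplit(".", 1), hosts)
print(f"{elapsed:.4f}s for {len(hosts)} hosts")
```

Taking the best of several runs reduces noise from other processes; absolute times depend on the machine and the input file.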
