.. _readme:
onETL
=====
|Repo Status| |PyPI Latest Release| |PyPI License| |PyPI Python Version| |PyPI Downloads| |Documentation| |CI Status| |Test Coverage| |pre-commit.ci Status|
.. |Repo Status| image:: https://www.repostatus.org/badges/latest/active.svg
    :alt: Repo status - Active
    :target: https://github.com/MTSWebServices/onetl
.. |PyPI Latest Release| image:: https://img.shields.io/pypi/v/onetl
    :alt: PyPI - Latest Release
    :target: https://pypi.org/project/onetl/
.. |PyPI License| image:: https://img.shields.io/pypi/l/onetl.svg
    :alt: PyPI - License
    :target: https://github.com/MTSWebServices/onetl/blob/develop/LICENSE.txt
.. |PyPI Python Version| image:: https://img.shields.io/pypi/pyversions/onetl.svg
    :alt: PyPI - Python Version
    :target: https://pypi.org/project/onetl/
.. |PyPI Downloads| image:: https://img.shields.io/pypi/dm/onetl
    :alt: PyPI - Downloads
    :target: https://pypi.org/project/onetl/
.. |Documentation| image:: https://readthedocs.org/projects/onetl/badge/?version=stable
    :alt: Documentation - ReadTheDocs
    :target: https://onetl.readthedocs.io/
.. |CI Status| image:: https://github.com/MTSWebServices/onetl/workflows/Tests/badge.svg
    :alt: Github Actions - latest CI build status
    :target: https://github.com/MTSWebServices/onetl/actions
.. |Test Coverage| image:: https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/MTSOnGithub/03e73a82ecc4709934540ce8201cc3b4/raw/onetl_badge.json
    :target: https://github.com/MTSWebServices/onetl/actions
.. |pre-commit.ci Status| image:: https://results.pre-commit.ci/badge/github/MTSWebServices/onetl/develop.svg
    :alt: pre-commit.ci - status
    :target: https://results.pre-commit.ci/latest/github/MTSWebServices/onetl/develop
|Logo|
.. |Logo| image:: docs/_static/logo_wide.svg
    :alt: onETL logo
    :target: https://github.com/MTSWebServices/onetl
What is onETL?
--------------

Python ETL/ELT library powered by `Apache Spark <https://spark.apache.org/>`_ & other open-source tools.
Goals
-----

- Provide unified classes to extract data from (E) & load data to (L) various stores.
- Provide `Spark DataFrame API <https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html>`_ for performing transformations (T) in terms of ETL.
- Provide direct access to databases, allowing you to execute SQL queries as well as DDL and DML statements, and to call functions/procedures. This can be used for building up ELT pipelines.
- Support different `read strategies <https://onetl.readthedocs.io/en/stable/strategy/index.html>`_, e.g. incremental reads.
- Provide a `hooks <https://onetl.readthedocs.io/en/stable/hooks/index.html>`_ & `plugins <https://onetl.readthedocs.io/en/stable/plugins.html>`_ mechanism for altering the behavior of internal classes.
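The read strategies mentioned above boil down to remembering a "high water mark" (HWM) between runs and reading only rows past it. Below is a minimal plain-Python sketch of that idea; it is conceptual only, not onETL's actual strategy API, and the names (``read_increment``, ``rows``, ``hwm_column``) are invented for illustration:

```python
# Conceptual sketch of an incremental read: keep the highest value of a
# monotonic column (the "high water mark") seen so far, and on the next
# run read only rows above it. Invented names, not onETL's API.

def read_increment(rows, hwm_column, last_hwm):
    """Return rows newer than last_hwm, plus the new mark to persist."""
    fresh = [row for row in rows if row[hwm_column] > last_hwm]
    new_hwm = max((row[hwm_column] for row in fresh), default=last_hwm)
    return fresh, new_hwm

table = [{"id": 1}, {"id": 2}, {"id": 3}]
batch, hwm = read_increment(table, "id", last_hwm=0)      # first run: all 3 rows
table.append({"id": 4})
batch2, hwm2 = read_increment(table, "id", last_hwm=hwm)  # second run: only id=4
```

In onETL itself the mark is persisted by an HWM store between pipeline runs, so each run continues where the previous one stopped.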
Non-goals
---------

- onETL is not a Spark replacement. It just provides additional functionality that Spark does not have, and improves UX for end users.
- onETL is not a framework: it imposes no requirements on project structure, naming, the way ETL/ELT processes are run, configuration, etc. All of that should be implemented in some other tool.
- onETL is deliberately developed without any integration with scheduling software like Apache Airflow. All such integrations should be implemented as separate tools.
- No Spark streaming support of any kind; only batch operations are supported. For streaming, prefer `Apache Flink <https://flink.apache.org/>`_.
Requirements
------------
- Python 3.7 - 3.14
- PySpark 3.2.x - 4.1.x (depends on used connector)
- Java 8+ (required by Spark, see below)
- Kerberos libs & GCC (required by ``Hive``, ``HDFS`` and ``SparkHDFS`` connectors)
Supported storages
------------------
+--------------------+--------------+-------------------------------------------------------------------------------------------------------------------------+
| Type               | Storage      | Powered by                                                                                                              |
+====================+==============+=========================================================================================================================+
| Database           | Clickhouse   | `Apache Spark JDBC Data Source <https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html>`_                      |
|                    +--------------+                                                                                                                         |
|                    | MSSQL        |                                                                                                                         |
|                    +--------------+                                                                                                                         |
|                    | MySQL        |                                                                                                                         |
|                    +--------------+                                                                                                                         |
|                    | Postgres     |                                                                                                                         |
|                    +--------------+                                                                                                                         |
|                    | Oracle       |                                                                                                                         |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | Hive         | `Apache Spark Hive integration <https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html>`_               |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | Iceberg      | `Apache Iceberg Spark integration <https://iceberg.apache.org/spark-quickstart/>`_                                      |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | Kafka        | `Apache Spark Kafka integration <https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html>`_    |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | Greenplum    | `VMware Greenplum Spark connector <https://docs.vmware.com/en/VMware-Greenplum-Connector-for-Apache-Spark/index.html>`_ |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | MongoDB      | `MongoDB Spark connector <https://www.mongodb.com/docs/spark-connector/current>`_                                       |
+--------------------+--------------+-------------------------------------------------------------------------------------------------------------------------+
| File               | HDFS         | `HDFS Python client <https://pypi.org/project/hdfs/>`_                                                                  |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | S3           | `minio-py client <https://pypi.org/project/minio/>`_                                                                    |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | SFTP         | `Paramiko library <https://pypi.org/project/paramiko/>`_                                                                |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | FTP          | `FTPUtil library <https://pypi.org/project/ftputil/>`_                                                                  |
|                    +--------------+                                                                                                                         |
|                    | FTPS         |                                                                                                                         |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | WebDAV       | `WebdavClient3 library <https://pypi.org/project/webdavclient3/>`_                                                      |
|                    +--------------+-------------------------------------------------------------------------------------------------------------------------+
|                    | Samba        | `pysmb library <https://pypi.org/project/pysmb/>`_                                                                      |
+--------------------+--------------+-------------------------------------------------------------------------------------------------------------------------+
| Files as DataFrame | SparkLocalFS | `Apache Spark File Data Source <https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html>`_           |
+--------------------+--------------+-------------------------------------------------------------------------------------------------------------------------+
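Whichever storage from the table above is involved, the extract (E) & load (L) pattern keeps the same shape. The sketch below illustrates that "unified classes" idea in plain Python; it is illustrative only, and every class and method name here is invented, not onETL's real API:

```python
from typing import Iterable, Protocol


class Connection(Protocol):
    """Invented stand-in for a unified storage interface."""
    def read(self, source: str) -> Iterable[dict]: ...
    def write(self, target: str, rows: Iterable[dict]) -> None: ...


class InMemoryStore:
    """Toy storage standing in for any backend from the table (JDBC, S3, ...)."""
    def __init__(self) -> None:
        self.tables: dict = {}

    def read(self, source: str) -> Iterable[dict]:
        return list(self.tables.get(source, []))

    def write(self, target: str, rows: Iterable[dict]) -> None:
        self.tables.setdefault(target, []).extend(rows)


def etl(extract_from: Connection, load_to: Connection, source: str, target: str) -> None:
    rows = extract_from.read(source)              # E: extract from any store
    rows = [{**r, "copied": True} for r in rows]  # T: any row/DataFrame transform
    load_to.write(target, rows)                   # L: load into any store


src, dst = InMemoryStore(), InMemoryStore()
src.write("public.users", [{"id": 1}, {"id": 2}])
etl(src, dst, "public.users", "stg.users")
```

Because every connection exposes the same read/write surface, swapping Postgres for S3 or Kafka changes only the connection object, not the pipeline code.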
