Free Open Source ETL Tools

Open source (license-cost-free) data integration tools are one of the newest segments of community-driven software. They are an alternative to commercial packaged solutions as well as a productive and efficient alternative to writing custom code.

The most advanced ETL packages on the market include enterprise-level offerings from IBM, Informatica, and Ab Initio. They are designed to meet high performance and scalability requirements, and their entry prices range from $45,000 to around $150,000 per CPU for an enterprise package.

There are, however, less expensive options, but they are often limited in their support for heterogeneous environments (Microsoft, Oracle) and sometimes charge extra for additional facilities and services (for instance, metadata management, data quality, and data federation modules, or connectors for SAP or IFS).

It is a challenge for data integration architects to find an ETL tool that can integrate data across a variety of sources and targets and is still reasonably priced. Where such a solution is needed, the open source model should be considered.

Open Source ETL Tools

Open source implementations play a significant role both in bringing community power into ETL and in promoting the development of standards. A large pool of testers is available, which keeps free ETL tools widely discussed and constantly evolving. But the most important feature of open source ETL products is that they are significantly less expensive than commercially licensed tools.

There are four basic constituencies that adopt free ETL tools:

A number of open source projects are capable of performing more than one ETL function. The technical features of these projects are more similar than different.

Free ETL tools

Apache Airflow

Apache Airflow is a project that provides a platform for programmatically authoring, scheduling, and monitoring workflows. Workflows are authored as directed acyclic graphs (DAGs) of tasks. The scheduler executes tasks on an array of workers while following the specified dependencies. Command line utilities make it possible to perform complex surgeries on DAGs, and the user interface makes it easy to visualize production pipelines, monitor progress, and troubleshoot issues.

The principles of this project are that pipelines should be dynamic (defined as code, so they can be generated programmatically), extensible (new operators and executors are easy to add), elegant (lean and explicit), and scalable (a modular architecture using a message queue to orchestrate an arbitrary number of workers).
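As an illustration of the DAG-as-code approach, below is a minimal sketch of an Airflow DAG with three dependent tasks. The DAG id, schedule, and bash commands are assumptions made for the example, and import paths vary between Airflow versions (this sketch follows the Airflow 2.x layout).

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator  # airflow.operators.bash_operator in 1.x

    # A workflow is authored as a directed acyclic graph (DAG) of tasks.
    dag = DAG(
        dag_id="daily_etl",               # hypothetical pipeline name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",       # the scheduler triggers one run per day
    )

    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

    # Dependencies: the scheduler will not run a task until its upstream tasks succeed.
    extract >> transform >> load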

Apache Kafka™

Apache Kafka™ is a distributed streaming platform that lets applications publish and subscribe to streams of records (similar to a message queue), store streams of records in a fault-tolerant way, and process streams of records as they occur.

Kafka is typically used for building real-time streaming data pipelines that either move data between systems or applications, or transform or react to streams of data. The core concepts of this project include running as a cluster on one or more servers, storing streams of records in categories (or topics), and working with records, where each record consists of a key, a value, and a timestamp. Kafka has four core APIs: the Producer API, the Consumer API, the Streams API, and the Connector API.
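To make the Producer and Consumer APIs concrete, here is a minimal sketch using the third-party kafka-python client (Kafka's native client library is Java; the broker address and topic name are assumptions).

    from kafka import KafkaProducer, KafkaConsumer

    # Publish a record to a topic; each record carries a key, a value,
    # and a timestamp (assigned when the record is produced).
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("page-views", key=b"user-42", value=b'{"url": "/home"}')
    producer.flush()

    # Subscribe to the same topic and read records as they arrive.
    consumer = KafkaConsumer(
        "page-views",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    for record in consumer:
        print(record.key, record.value, record.timestamp)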

clover.ETL

This project is directed by OpenSys, a company based in the Czech Republic. It is a Java-based, dual-licensed open source product whose commercially licensed version offers warranty and support. Its small footprint makes it easy for system integrators and ISVs to embed. It aims at providing a basic library of functions, including mapping and transformations. Its enterprise server edition is a commercial offering.

Apache NiFi

The Apache NiFi project is used to automate and manage the flow of information between systems, and its design model makes NiFi a very effective platform for building powerful and scalable dataflows. NiFi’s fundamental design concepts are closely related to the main ideas of flow-based programming. The main features of this project include a highly configurable web-based user interface (supporting, for example, dynamic prioritization and back pressure), data provenance, extensibility, and security (including options for SSL, SSH, HTTPS, and so on).

KETL

This project is sponsored by Kinetic Networks, a professional services company. It started as a tool for customer engagements, as commercial tools were too expensive. Kinetic employees currently develop the code, but outside contributions are expected in the future. Additional modules, such as a data quality and profiling component, were also developed by Kinetic and are not released under the open source license. KETL was initially designed as a utility to replace custom PL/SQL code that moved large data volumes. It is a Java-based, XML-driven development environment that is of great use to skilled Java developers, but it remains limited for users who expect a visual development GUI, since it does not provide one.

Pentaho Data Integration (PDI, Kettle)

According to Pentaho itself, it is a BI provider that offers ETL tools as a data integration capability. These ETL capabilities are based on the Kettle project. Pentaho is known for selling subscriptions such as support services and management tools. Focusing primarily on connectivity and transformation, Pentaho's Kettle project has been able to incorporate a significant number of contributions from its community. Community-driven enhancements include a Web services lookup, a SAP connector, and the development of an Oracle bulk loader.
The SAP connector, although integrated with Kettle, is not a free product: it is a commercially offered plug-in, but it is around ten times cheaper than SAP connectivity for InfoSphere DataStage.
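As a hedged sketch of how Kettle fits into a pipeline: transformations designed in the graphical tool are saved as .ktr files and can be run headlessly from scripts via PDI's Pan command-line runner. The install path and transformation file below are assumptions.

    import subprocess

    # Run a Kettle transformation through Pan, the command-line runner that
    # ships with the PDI distribution. Paths and the .ktr file are hypothetical.
    result = subprocess.run(
        ["/opt/pentaho/data-integration/pan.sh",
         "-file=/etl/load_orders.ktr",
         "-level=Basic"],          # logging verbosity
        capture_output=True, text=True,
    )
    print(result.stdout)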

Talend

Talend is a startup of French origin that has positioned itself as a pure-play open source data integration vendor; its product is called Open Studio. For vendors wishing to embed Open Studio capabilities in their products, Talend offers an OEM license agreement. That is what JasperSoft has done, creating an open source BI stack to compete with Pentaho's Kettle. Talend is a commercial open source vendor that generates revenue from support, training, and consulting services. Open Studio offers a user-friendly graphical modeling environment and supports both the traditional engine-based approach to performance management and pushdown optimization as an architectural approach. The latter lets users avoid the cost of dedicated hardware for an ETL engine by leveraging spare server capacity within the source and target environments to power the transformations.
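To illustrate the pushdown idea (this is not Talend's generated code, which is Java), the sketch below expresses the transformation as SQL that runs inside the target database rather than in a separate ETL engine. The table names and the use of SQLite as a stand-in warehouse are assumptions.

    import sqlite3  # stands in for the target data warehouse connection

    conn = sqlite3.connect("warehouse.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS stg_orders (id INTEGER, amount REAL, status TEXT);
        CREATE TABLE IF NOT EXISTS fact_orders (id INTEGER, amount REAL);

        -- The 'transform' step is pushed down to where the data already lives,
        -- so no dedicated ETL server has to process the rows.
        INSERT INTO fact_orders (id, amount)
        SELECT id, amount FROM stg_orders WHERE status = 'complete';
    """)
    conn.commit()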

Limitations of license-cost-free ETL tools

When used within their limits, today's free ETL tools are quite capable and do their job well.

Those limits are expected to be widened in the future; for now, the limitations include the following:

Open source ETL does not yet provide management capabilities that could serve as a cross-enterprise standard for data integration. These tools lack advanced connectivity, real-time data integration techniques such as enterprise information integration (EII) and change data capture (CDC), enterprise-level collaboration, and integrated data quality management and profiling. Yet many enterprises are not looking for a large and expensive data integration suite. Where an efficient and reliable alternative to custom-coding data integration requirements is needed, free ETL technologies should be taken into consideration.

The most popular open source vendors are still not truly community-driven projects. Increased investment is expected from a wider community to build out and encourage development, especially of connectivity modules for the enormous number of evolving source systems.

