AWS ETL

Amazon Web Services (AWS) is a cloud-based computing service offering from Amazon. AWS offers over 90 services and products on its platform, from storage to game development, and among them are several ETL services and tools. AWS Glue is a managed ETL service, AWS Data Pipeline is an automated ETL service, and AWS Elastic MapReduce (EMR) and Amazon Athena/Redshift Spectrum are data offerings that assist in the ETL process.

Data analysis is vital to businesses. By providing several extract, transform, and load options, Amazon can offer ETL functionality to a wide spectrum of businesses, and each option has its own benefits that suit a particular audience.

Different AWS ETL Methods

Method 1 - Use AWS and an EMR Cluster on an S3 Bucket

Use AWS and an EMR cluster on an S3 bucket. To create and run this ETL method, administrators can use the AWS Management Console, the AWS Command Line Interface, an SDK, or a web service API. The execution path depends on your data needs and programming resources.

This method sets up a data pipeline between your source and destination data stores. If there are log files available in your Amazon S3 bucket, then you can begin. EMR is a tool to process and analyze big data, an expandable computing service alternative to running on-premises cluster computing for processing power. You create an Amazon EMR cluster and update the EMRFS metadata with the contents of the S3 bucket to sync the source and destination stores. EMR processes the data across a Hadoop cluster in AWS. Using Apache Pig, which is integrated with EMR, you can submit a transformation script to clean the data, and then generate reports with Apache Hive's SQL-like query language. When run with a scheduler, the older EMRFS metadata will need to be cleaned out of the cluster on a regular basis.

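The same pipeline can also be scripted with an SDK instead of being built in the console. The sketch below uses Python and boto3 to launch an EMR cluster with Pig and Hive installed and to submit a Pig cleaning step; the bucket names, script path, instance types, and release label are illustrative assumptions rather than values from this article, and the exact step arguments can vary by EMR release.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # All S3 paths and instance sizes below are placeholders for this sketch.
    response = emr.run_job_flow(
        Name="etl-log-processing",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Hadoop"}, {"Name": "Pig"}, {"Name": "Hive"}],
        LogUri="s3://example-bucket/emr-logs/",
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the steps finish
        },
        Steps=[
            {
                "Name": "clean-raw-logs",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # Run a Pig transformation script stored in S3.
                    "Args": ["pig-script", "--run-pig-script", "--args",
                             "-f", "s3://example-bucket/scripts/clean_logs.pig"],
                },
            },
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster started:", response["JobFlowId"])

Hive reporting queries can be submitted the same way as additional entries in the Steps list, or interactively once the cluster is running.
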
Method 2 - Use Athena or Redshift Spectrum to Analyze Data in Amazon S3

Use Athena or Redshift Spectrum to analyze data in an Amazon S3 data lake. Amazon Athena is a serverless query service that runs standard SQL directly against files in S3, while Redshift Spectrum extends an Amazon Redshift cluster so it can query data in S3 without first loading it into the warehouse. Both query the data where it sits in S3, and both can share table definitions through the AWS Glue Data Catalog.

A prerequisite for both is that the S3 files for the largest table need to be in three different formats: CSV, non-partitioned Parquet, and partitioned Parquet (Apache Parquet is a columnar storage format that makes data available to any project in the Hadoop ecosystem). The columnar format allows Athena and Redshift Spectrum to scan only the columns a query needs, which reduces charges because both services bill by the amount of data scanned.

The basic ETL steps, sketched in the example after this list, are:

  1. Create external schema
  2. Define external tables
  3. Query data

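As a rough illustration of those three steps, the sketch below drives Athena from Python with boto3; the database, table, column, and bucket names are placeholders rather than values from this article.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    def run(sql):
        """Submit a statement to Athena; in practice, poll get_query_execution
        for completion before issuing the next statement."""
        return athena.start_query_execution(
            QueryString=sql,
            ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
        )["QueryExecutionId"]

    # 1. Create the external schema (an Athena database).
    run("CREATE DATABASE IF NOT EXISTS logs_db")

    # 2. Define an external table over the partitioned Parquet files in S3.
    run("""
    CREATE EXTERNAL TABLE IF NOT EXISTS logs_db.page_views (
        user_id   string,
        url       string,
        view_time timestamp
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/page_views/parquet/'
    """)
    run("MSCK REPAIR TABLE logs_db.page_views")  # register the partitions

    # 3. Query the data in place; only the referenced columns are scanned.
    query_id = run("""
    SELECT url, COUNT(*) AS views
    FROM logs_db.page_views
    WHERE dt = '2024-01-01'
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
    """)
    print("Query submitted:", query_id)

With Redshift Spectrum the SQL is nearly identical, except that the schema is registered with CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG and the statements run against the Redshift cluster instead of Athena.
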
Method 3 - Use AWS Glue - Fully Managed ETL

AWS Glue is a fully managed ETL service run from the AWS Management Console. In addition to enabling user-friendly ETL, it also allows you to catalog, clean, and move data between data stores. Since Glue is on a pay-per-resource-used model, it is cost efficient for companies without adequate programming resources.

The user creates a data catalog, generates transformations, and schedules and runs ETL jobs from the console. Glue generates Python ETL code that runs in its managed Apache Spark environment, and schema discovery is automated as well. All of this is accomplished without provisioning infrastructure or purchasing software licenses: you work with your existing AWS account and data, and Amazon charges Glue customers only for the compute time used while their ETL jobs run.

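A generated Glue job script typically follows the pattern sketched below: read a table that a crawler registered in the Glue Data Catalog, remap columns, and write Parquet back to S3. The catalog database, table, column, and bucket names here are illustrative assumptions.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Extract: read the source table registered in the Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="logs_db",           # assumed catalog database
        table_name="raw_page_views",  # assumed catalog table
    )

    # Transform: rename and retype columns.
    cleaned = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("userId", "string", "user_id", "string"),
            ("url", "string", "url", "string"),
            ("ts", "string", "view_time", "timestamp"),
        ],
    )

    # Load: write the result to S3 as Parquet for downstream querying.
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/page_views/parquet/"},
        format="parquet",
    )
    job.commit()

Once defined, the job can be run or scheduled from the console, or started programmatically, for example with boto3's glue.start_job_run(JobName=...).
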
Plus, Glue is fully integrated with other AWS data services. Data held in MySQL or PostgreSQL databases inside an Amazon VPC can also be queried. Developers can customize the Python code Glue generates and port it anywhere, and logs and notifications are pushed to Amazon CloudWatch for monitoring and alerts.

Pros and Cons

With such different approaches, these methods offer businesses the opportunity to get the most out of their ETL process. The data pipeline method built on EMR (Method 1) requires skilled programming resources; the trade-off is that your ETL processes and queries can be customized in any way you need, so for businesses with highly individualized data needs this method is likely worth the extra cost. Method 2 is advantageous for businesses with strong database administrators or SQL programmers, and for those with multiple datasets already on AWS; the flexibility of ad-hoc Athena queries is great for quickly changing business conditions or for what-if analysis. For those with limited programming resources and data analysis time, using Glue (Method 3) can provide the benefits of ETL without the overhead. However, as with any managed solution, if you do not or cannot customize the code, some analysis nuances may be lost. Unless you have highly unique data analysis needs, this should not present a significant problem.
