A look at AWS Glue and Version 2.0: Handle ETL process swiftly for data transformation

Big data has been one of the trending fields of the past few years, and one cloud service that is quite popular for tackling it is AWS Glue. By the end of this blog, you will know why AWS Glue exists, what ETL is, what AWS Glue is, what its different use cases are, and finally, what users who have put AWS Glue to work in real-life scenarios have to say about it.

Big data has proven to be crucial for organizations that want to extract insights, serve their customers an enriched experience, and stay ahead of their competitors. But unfortunately, most of them are unable to put the available data to efficient use.

Many businesses have chosen to leverage a data warehouse to simplify enterprise data analytics and reporting. A data warehouse is a data storage system that collects information from many different sources within the organization. But this still leaves the challenge of how to get data from scattered databases into the centralized data warehouse.

This is where the ETL process comes into the picture: it exists specifically to transfer data from a source database to a data warehouse. ETL comes with its own implementation complexities and challenges. To solve such issues, Amazon introduced AWS Glue.

What is ETL?

Extract, transform, load (ETL) is a data integration process for loading information from one or more source databases into a data warehouse.

The process consists of the following three stages:

Extract:

The data is read from the source database and extracted to a staging area.

Transform:

The extracted data is then validated, evaluated for any data integrity issues, and finally transformed so that it matches the target database schema.

Load:

The transformed data is then loaded into the target data warehouse.
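To make the three stages concrete, here is a minimal sketch in plain Python. It does not use AWS Glue at all: two in-memory SQLite databases stand in for the source database and the target data warehouse, and the table names, columns, and sample rows are all made up for illustration.

```python
import sqlite3


def extract(source):
    """Extract: read raw rows from the source database into a staging list."""
    return source.execute("SELECT id, name, amount FROM orders").fetchall()


def transform(rows):
    """Transform: validate rows and reshape them to match the target schema."""
    cleaned = []
    for order_id, name, amount in rows:
        # Integrity check: skip rows with a missing or negative amount.
        if amount is None or amount < 0:
            continue
        # The hypothetical target schema wants a tidy name and integer cents.
        cleaned.append((order_id, name.strip().title(), round(amount * 100)))
    return cleaned


def load(target, rows):
    """Load: write the transformed rows into the target warehouse table."""
    target.executemany(
        "INSERT INTO fact_orders (order_id, customer, amount_cents) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()


def run_etl():
    # Hypothetical source database with one invalid row (missing amount).
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, " alice ", 19.99), (2, "BOB", None), (3, "carol", 5.0)],
    )

    # Hypothetical target warehouse with a different schema.
    target = sqlite3.connect(":memory:")
    target.execute(
        "CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)"
    )

    load(target, transform(extract(source)))
    return target.execute(
        "SELECT order_id, customer, amount_cents FROM fact_orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl())
```

The row with the missing amount is dropped during the transform stage, so only two rows reach the target table. A real pipeline would add logging, error handling, and incremental loads on top of this skeleton.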

ETL tools must be able to transform the data correctly between source and target, deal with a wide variety of data sources, and scale to handle large volumes of data. As I mentioned earlier, many organizations aren’t proactive enough to implement the ETL process successfully.

So now the question arises: “Why the hell am I still not talking about AWS Glue?” Well, I just wanted to address why AWS Glue exists to help you realize its significance. Now, let’s catch up with AWS Glue.

What is AWS Glue?

As per AWS’s official website, “AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.” The service was initially released in August 2017. Since then, AWS has been putting constant effort into enhancing AWS Glue’s capabilities. Here are the most recent significant updates for AWS Glue:

  1. Added new transforms (Purge, Transition and Merge) for Apache Spark applications to work with datasets in Amazon S3 (January 2020)
  2. Supports the ability to run ETL jobs on Apache Spark 2.4.3 (with Python 3) (July 2019)
  3. Supports scripts that are compatible with Python 3.6 in Python shell jobs (June 2019)

AWS Glue comprises the following components:

  1. AWS Glue Data Catalog – a repository of metadata that contains references to data sources and the targets involved in the ETL process.
  2. An ETL engine – automatically generates scripts in Python and Scala to be used for the entire ETL process.
  3. A scheduler – runs jobs and triggers events as per the defined time and other criteria.

AWS Glue has always been seen as the piece that completed AWS’s data processing puzzle. Before AWS Glue, the AWS service portfolio lacked a solution for data transformation. Earlier, customers could use AWS services for data acquisition, storage, and analysis, but for enterprise customers, that was not enough.

AWS Glue is a dedicated service to facilitate the construction of an enterprise-level data warehouse. Information can be transferred into the data warehouse from different sources, including transactional databases, as well as the Amazon cloud.

Recently Updated AWS Glue Version 2.0

AWS Glue version 2.0, with 10x faster Spark ETL job start times, is now generally available. With Glue version 2.0, job start-up delay is more predictable and carries less overhead. Additionally, AWS Glue version 2.0 Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration: the minimum drops from 10 minutes to 1 minute. As a result, customers can now run micro-batch, time-sensitive, and interactive workloads more cost-effectively. Customers can run micro-batch jobs to quickly load data lakes and databases and run real-time analytics. With fast job start times, customers can run SLA-bound data pipelines more reliably. Fast start-up times also enable interactive data analysis and testing. Glue version 2.0 also offers a new ability to install Python modules from a wheel file or from a repository.
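To see what the lower billing minimum means for a short job, here is a rough worked calculation in Python. The billing rule (per-second billing with a 1-minute minimum instead of a 10-minute minimum) comes from the announcement; the price per DPU-hour, the job size, and the runtime are assumptions picked only for illustration, so check the current AWS Glue pricing page for real numbers.

```python
def billed_seconds(runtime_seconds, minimum_seconds):
    """Per-second billing with a floor: bill the runtime or the minimum, whichever is larger."""
    return max(runtime_seconds, minimum_seconds)


def job_cost(runtime_seconds, dpus, minimum_seconds, price_per_dpu_hour):
    """Cost of one job run, rounded to four decimal places."""
    dpu_hours = dpus * billed_seconds(runtime_seconds, minimum_seconds) / 3600
    return round(dpu_hours * price_per_dpu_hour, 4)


# Assumed values for illustration only (not taken from the article).
PRICE_PER_DPU_HOUR = 0.44   # assumed price in USD
DPUS = 10                   # assumed job size
RUNTIME_SECONDS = 90        # a short micro-batch run

# Same 90-second job under a 10-minute minimum and under a 1-minute minimum.
cost_with_10_min_minimum = job_cost(RUNTIME_SECONDS, DPUS, 600, PRICE_PER_DPU_HOUR)
cost_with_1_min_minimum = job_cost(RUNTIME_SECONDS, DPUS, 60, PRICE_PER_DPU_HOUR)

if __name__ == "__main__":
    print(cost_with_10_min_minimum, cost_with_1_min_minimum)
```

Under these assumptions, the 90-second job is billed as a full 10 minutes (roughly $0.73) with the old minimum, but only for its actual 90 seconds (about $0.11) with the 1-minute minimum. That gap is why frequent micro-batch runs become more affordable.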

Use Cases for AWS Glue

  1. Launch ETL tasks based on a specific trigger, schedule, or event.
  2. Prevent stalling during the ETL process by handling errors and retrying.
  3. Automatically detect changes in your database schema and adjust the service to match them.
  4. Create ETL scripts to transform, denormalize, and enrich the data while transferring data from the source to target.
  5. Keep track of metadata about your various databases and data stores, and archive it in the AWS Glue Data Catalog.
  6. Collect logs, metrics, and KPIs for the ETL process to streamline monitoring and reporting.
  7. Get a unified view of your data across multiple data stores.
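The first two use cases (launching jobs on a schedule and retrying on failure) mostly come down to configuration. The sketch below only builds the request parameters as plain Python dictionaries and never contacts AWS. The keys follow the parameters that boto3's Glue client accepts for `create_job` and `create_trigger`, as far as I know them; the job name, role ARN, S3 path, worker settings, and schedule are all hypothetical placeholders.

```python
def build_job_definition(name, role_arn, script_location, max_retries=2):
    """Parameters for a Spark ETL job that Glue retries automatically on failure."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",              # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "2.0",
        "MaxRetries": max_retries,          # retry attempts after a failed run
        "WorkerType": "G.1X",               # assumed worker size
        "NumberOfWorkers": 2,               # assumed worker count
    }


def build_scheduled_trigger(name, job_name, cron_expression):
    """Parameters for a trigger that starts the job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical names and locations, for illustration only.
job = build_job_definition(
    name="nightly-orders-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/orders_etl.py",
)
trigger = build_scheduled_trigger(
    name="nightly-orders-trigger",
    job_name=job["Name"],
    cron_expression="cron(0 2 * * ? *)",    # every day at 02:00 UTC
)

# With AWS credentials and boto3 installed, these could be submitted as:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_job(**job)
#   glue.create_trigger(**trigger)
```

Keeping the definitions as plain dictionaries makes them easy to review and unit-test before anything is created in an AWS account.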

What you have read so far is enough to give you a jump start on AWS Glue. But before taking a hands-on approach to the service, hear what other users have to say about it and take away critical details from their real-life experiences:

1. Alkesh G

(Cloud Architect, Information Technology and Services)

What do you like best?

“I have been working with AWS Glue for 2-3 years. It allows you to locate, move, and transform all your data set across your business. The most interesting thing about AWS Glue is that it is serverless. You can run all your ETL jobs by just pointing Glue to them; you don’t need to configure, provision, or spin-up servers, and you don’t need to manage their life cycle. It customizes your task by 80-85%!!”

What do you dislike?

“It’s not that easy to learn and implement AWS Glue because it contains concepts like Crawlers, ETL scripts, etc.”

2. Anudeep M

(SAP BODS S/4 Hana MDG solutions migration developer, Mechanical or Industrial Engineering)

What do you like best?

“The most useful thing about AWS glue is to convert the data into a parquet format from the raw data format, which is not present with other ETL tools. It can convert a huge amount of data into a parquet format and retrieve it as required.”

What do you dislike?

“It’s not user-friendly with Graphical user interface like other ETL tools and expects developers to work on coding and debugging is tough. AWS glue source to target mapping schema creation is also not user-friendly.”

Reviews collected by and hosted on G2.com.

Look out for our article on “How to get started with AWS Glue?” in the upcoming week.

If you are a cloud professional and want to contribute to the article on AWS Glue or any other topic around cloud technology, CMI will be happy to have you as a contributor. Drop a message via Contact Us.

Cloud Evangelist
Cloud Evangelists are CMI's in-house ambassadors for the entire Cloud ecosystem. They are responsible for propagating the doctrine of cloud computing and helping community members make informed decisions.