Introduction to AWS Data Catalog
There has always been a no-man's-land between IT and business. IT knows how to work with data, but business knows what the data represents. Neither side knows enough about the data to use it strategically, so a tribal culture emerges in which each group guards its own pocket of expertise. Over the years, almost every company suffers from this condition. Solutions have emerged that can ease the tendency, and one of the modern solutions is the AWS Data Catalog.
Let’s understand the terms related to AWS Data Catalog first!
What is a Data Catalog?
A data catalog can be compared to a retailer’s catalog, but instead of giving information about products, it gives information about the organization’s data elements. Consumers of data are present at every level of the hierarchy, and they want to use data to its full potential. Data catalog vendors therefore apply a heavy dose of automation to collect meaningful information about each data element imported into the solution. The catalog acts as a bridge between the producers and consumers of data, dissolving the tribal behavior described above.
Data catalogs have been reignited by data lakes and by more advanced automation from data catalog vendors.
It is important to understand that the data catalog does not conform the data; rather, it helps identify the data’s uses. Conforming the data is the job of the data warehouse. In conjunction with the warehouse, the catalog points the right audience to the conformed data objects.
Curious to know about AWS Data Catalog now? Scroll ahead!
AWS Data Catalog
AWS Data Catalog is a thriving metadata management service. It offers two forms of data catalog: the comprehensive data catalog and the Hive metastore catalog (HCatalog), and the two can exist in parallel. With a comprehensive data catalog, we can search for all our assets in the data lake; HCatalog comes into play to discover and query those assets. Let’s look at these key services of AWS Data Catalog in detail.
1. Comprehensive AWS Data Catalog
Standard AWS services such as AWS Lambda, Amazon OpenSearch Service (formerly Amazon Elasticsearch Service, or Amazon ES), and Amazon DynamoDB can be combined to create a comprehensive data catalog. At a glance: Lambda functions triggered by new objects landing in Amazon S3 populate a DynamoDB table with object names and metadata, and Amazon OpenSearch Service is then used to search for the desired assets, their metadata, and their data classifications. The catalog thus contains information about every asset ingested into the S3 data lake.
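The flow above can be sketched as a small Lambda handler. This is a minimal illustration, not AWS’s reference implementation: the table name is a placeholder, and the handler assumes an `s3:ObjectCreated:*` event trigger. The boto3 import is kept inside the handler so the pure helper can be exercised without the AWS SDK.

```python
import urllib.parse

# Hypothetical table name -- substitute your own DynamoDB table.
METADATA_TABLE = "data-lake-catalog"

def build_catalog_item(record):
    """Extract bucket, key, size, and timestamp from one S3 event record."""
    s3 = record["s3"]
    # S3 event keys are URL-encoded; decode them before storing.
    key = urllib.parse.unquote_plus(s3["object"]["key"])
    return {
        "bucket": s3["bucket"]["name"],
        "key": key,
        "size": s3["object"].get("size", 0),
        "eventTime": record["eventTime"],
    }

def lambda_handler(event, context):
    """Triggered on object creation; writes one catalog item per new object."""
    import boto3  # lazy import: provided by the Lambda runtime
    table = boto3.resource("dynamodb").Table(METADATA_TABLE)
    for record in event["Records"]:
        table.put_item(Item=build_catalog_item(record))
```

From here, a DynamoDB stream (or a second Lambda) would forward each item to Amazon OpenSearch Service for free-text search over the catalog.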
2. HCatalog with AWS Data Catalog
A Hive-compatible metastore catalog can be created using AWS Glue for data stored in an Amazon S3-based data lake. Building your own data catalog is a piece of cake with the help of AWS Glue: to begin, go to the AWS Management Console and register your data source with AWS Glue. The AWS Data Catalog crawler then crawls the S3 bucket, discovers your data sources, and builds a data catalog using classifiers.
You can choose from many built-in classifiers such as CSV, JSON, and Parquet, add your own classifiers, or pick classifiers from the AWS Glue community, and attach them to the crawler so it can recognize and catalog different types of data. AWS Glue then produces a data catalog that can be used by AWS services such as Amazon Athena, Amazon EMR, Amazon Redshift, and Amazon Redshift Spectrum, as well as by third-party analytics tools that use a standard Hive metastore catalog.
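The same crawler setup can be done programmatically. The sketch below, using boto3, assumes placeholder names for the crawler, IAM role, S3 path, and database; the schema-change policy shown is one reasonable choice, not the only one.

```python
def build_crawler_config(name, role_arn, s3_path, database):
    """Assemble the arguments for glue.create_crawler (placeholder names)."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update changed tables in place; only log objects that disappear.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }

def create_and_run_crawler(name, role_arn, s3_path, database):
    import boto3  # lazy import: only needed when actually calling AWS
    glue = boto3.client("glue")
    glue.create_crawler(**build_crawler_config(name, role_arn, s3_path, database))
    glue.start_crawler(Name=name)
```

Custom classifiers, once defined, can be attached by passing a `Classifiers` list of classifier names to `create_crawler` as well.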
Connections in AWS Glue Data Catalog
A connection in AWS Glue is a Data Catalog object that stores connection information for a particular data store. Creating a connection spares you from specifying connection details every time you create a crawler or job.
Different connection types are available with the AWS Data Catalog, such as JDBC, Amazon RDS, Amazon Redshift, MongoDB, and Amazon DocumentDB. When creating a crawler or ETL job for any of these data sources, you specify the connection to use.
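As a concrete example, a JDBC connection can be registered with boto3 as sketched below. All names and credentials are placeholders; in practice you would pull the password from AWS Secrets Manager rather than hard-coding it.

```python
def build_jdbc_connection(name, jdbc_url, username, password):
    """ConnectionInput for glue.create_connection (placeholder values)."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,  # prefer a Secrets Manager reference in real use
        },
    }

def register_connection(name, jdbc_url, username, password):
    import boto3  # lazy import: only needed when actually calling AWS
    boto3.client("glue").create_connection(
        ConnectionInput=build_jdbc_connection(name, jdbc_url, username, password)
    )
```

Once created, the connection is referenced by name from crawlers and ETL jobs instead of repeating these properties.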
Populating the AWS Glue Data Catalog
The AWS Glue Data Catalog contains references to data used as sources and targets of your ETL (extract, transform, and load) jobs. This data must be cataloged before you can build your data warehouse or data lake, and the information in the catalog is also used to monitor ETL jobs. Each metadata table points at a single data store. There are different ways to add metadata tables to — that is, to populate — your data catalog.
A table in the AWS Data Catalog is a metadata definition representing the data in a data store, which may hold objects in Amazon S3 or relational tables in Amazon RDS (Relational Database Service). Tables can be created manually or by using AWS Data Catalog crawlers.
When defining a table in AWS Glue, you have the option of specifying partitioning keys, which partition the table to improve query performance. For AWS Glue to treat a set of files as partitions of one table, two conditions must be true: the schemas of the files must be similar, and, as determined by AWS Glue, the data format and compression format of the files must be the same.
After adding a table in AWS Glue, we add it to a database. A database in the AWS Data Catalog organizes tables into categories; you can create one with a crawler or from the AWS Glue console. The database acts as a container for the tables that are defined when you run a crawler or add a table manually.
The data catalog is the pointer to the schema and location of your data.
You can visualize the whole thing as an interlinked chain: the data, which may live as objects in Amazon S3 or as relational tables in Amazon RDS (Relational Database Service), is represented by a table, which is contained in a database, which lives in the data catalog.
Steps to Work with Database
Various functions can be performed from the Databases tab in the AWS Glue console:
- To create a new database, select Add database, and provide the description.
- Edit database action allows you to edit the database.
- Delete database action allows you to delete the database.
- View tables action allows you to view the list of tables.
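The same console actions can be scripted with boto3, as in the sketch below; the database name and description are placeholders.

```python
def build_database_input(name, description=""):
    """DatabaseInput for glue.create_database (placeholder values)."""
    return {"Name": name, "Description": description}

def create_database_and_list_tables(name, description=""):
    import boto3  # lazy import: only needed when actually calling AWS
    glue = boto3.client("glue")
    # "Add database" from the console, done programmatically:
    glue.create_database(DatabaseInput=build_database_input(name, description))
    # "View tables": page through every table in the database.
    # ("Delete database" would be glue.delete_database(Name=name).)
    tables = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=name):
        tables.extend(t["Name"] for t in page["TableList"])
    return tables
```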
AWS Data Catalog Crawler
A crawler is the most basic method, and the one most AWS Glue users rely on, to populate the data catalog. A crawler crawls one or more data stores and, on completion, creates or updates the tables that are then used by ETL jobs. The following workflow shows how a crawler populates the AWS Data Catalog:
- The crawler runs any custom classifiers that you choose, in the order you specify, to infer the format and schema of your data. You provide the code for custom classifiers.
- The first custom classifier that successfully recognizes the structure of your data is used to create a schema; custom classifiers lower in the list are skipped. If no custom classifier matches your data’s schema, built-in classifiers try to recognize it. An example of a built-in classifier is the one that recognizes JSON.
- The crawler connects to the data store. Some data stores require connection properties for crawler access.
- The inferred schema is created for your data.
- The crawler writes metadata to the Data Catalog. A table holds this metadata and is written to a database, which is a container of tables in the Data Catalog. A table’s attributes include its classification, a label created by the classifier that inferred the table schema.
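Crawler runs are asynchronous: after `start_crawler`, the crawler moves through the `RUNNING` and `STOPPING` states and returns to `READY` when the workflow above is complete. A minimal polling sketch (crawler name is a placeholder):

```python
import time

def crawler_finished(state):
    """Pure decision helper: a crawler is done once it is back to READY."""
    return state == "READY"

def wait_for_crawler(name, poll_seconds=15):
    """Block until the named crawler returns to the READY state."""
    import boto3  # lazy import: only needed when actually calling AWS
    glue = boto3.client("glue")
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if crawler_finished(state):
            return
        time.sleep(poll_seconds)
```

Once `wait_for_crawler` returns, the new or updated tables are visible to Athena, EMR, and Redshift Spectrum through the shared catalog.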
AWS Cloudformation: A Way to Populate AWS Data Catalog
The AWS CloudFormation service can create many kinds of AWS resources. CloudFormation automates resource creation, making it convenient to define and create AWS Glue objects and other related AWS resources together.
Demand for data catalogs is soaring as organizations continue to struggle with finding, inventorying and analyzing vastly distributed and diverse data assets.
Source: Gartner, Inc
AWS CloudFormation provides a simplified JSON/YAML syntax for creating AWS resources. Templates can define Data Catalog objects such as databases, tables, partitions, crawlers, classifiers, and connections, and CloudFormation provisions and configures the resources a template describes.
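As a minimal sketch, the YAML template below declares a Glue database and a crawler over an S3 path; every name, ARN, and path is a placeholder.

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  CatalogDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: sales_db            # placeholder database name
  CatalogCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: sales-crawler         # placeholder crawler name
      Role: arn:aws:iam::123456789012:role/GlueCrawlerRole  # placeholder role
      DatabaseName: !Ref CatalogDatabase
      Targets:
        S3Targets:
          - Path: s3://example-bucket/sales/   # placeholder data location
```

Deploying this stack creates both objects in one step, and deleting the stack removes them together.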
It should now be clear how the AWS Data Catalog helps both business and IT use data strategically, with a serverless environment that makes the catalog easy to populate. You now know the two forms of AWS Data Catalog service (the comprehensive data catalog and HCatalog) and the different ways we can populate them.