Lyft uses Amundsen internally to search its data. This metadata-driven application makes engineers and analysts more efficient when they work with data. Think of it as Google search for data.
Fun Fact: "The project is named after the Norwegian explorer Roald Amundsen, the first person to reach the South Pole."
Architecture of Amundsen Lyft
The architecture of Amundsen Lyft is described below in terms of the services it provides to the DataOps engineer.
1. Metadata Service
It handles metadata requests from the front end and other related microservices, and it is responsible for updating metadata. By default, the persistence layer is Neo4j, which can be replaced with another database.
2. Search Service
It handles search requests from the front end service and is responsible for searching metadata. By default, the search engine is Elasticsearch, which can be replaced with another search backend.
3. Front end Service
It hosts Amundsen's web application. The front end service is composed of two distinct parts: a React application and a Flask server.
4. Amundsen Data Builder
It is the data ingestion framework, which extracts metadata from various data sources.
5. Amundsen Common
It is Amundsen's library repository, which holds the code shared among the microservices.
Terminology of Amundsen Lyft in Detail
Amundsen Lyft is a data discovery and metadata engine that aims to improve how people interact with any type of data. Listed below are some terms to keep in mind while operating Amundsen Lyft.
1. Front End Service
It is a React application for client-side rendering that uses a Flask server to serve requests. It also acts as an intermediary for requests to the metadata and search services.
2. Data Builder
Also known as the ETL framework, it is used to ingest metadata into Amundsen Lyft. A Task controls the ETL steps, and a Job controls the Task and the Publisher. The data builder ingestion library uses the pull approach to index metadata into Amundsen Lyft. Its building blocks are:
Extractor (E): It extracts records from different metadata sources. It can use either a pull or a push pattern to extract records.
Transformer (T): It transforms the extracted records by cleaning, validating, and organizing them and by applying functions to them.
Loader (L): It loads records into the staging area. It operates at record level and does not support atomicity.
Task: It orchestrates the extractor, transformer, and loader to perform record-level operations.
Publisher: It supports atomicity at job level, bulk loading into the sink, or both.
Job: It is used by the client to launch an ETL job. It orchestrates the Task and the Publisher.
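As a rough illustration of how these building blocks fit together, here is a minimal Python sketch. It is not the data builder library's API: every class name (ListExtractor, LowercaseNameTransformer, StagingLoader, BulkPublisher) and the record shape are hypothetical, chosen only to show record-level loading followed by a job-level publish.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Record = Dict[str, str]  # hypothetical record shape for this sketch


class ListExtractor:
    """Pull-style extractor: hands out one record at a time from a source."""

    def __init__(self, rows: Iterable[Record]) -> None:
        self._rows: Iterator[Record] = iter(rows)

    def extract(self) -> Optional[Record]:
        # Returns None once the source is exhausted.
        return next(self._rows, None)


class LowercaseNameTransformer:
    """Validates and cleans one record; returns None to drop it."""

    def transform(self, record: Record) -> Optional[Record]:
        if not record.get("name"):
            return None  # validation failed: no table name
        return {**record, "name": record["name"].strip().lower()}


class StagingLoader:
    """Record-level loader: appends to a staging area, no atomicity."""

    def __init__(self) -> None:
        self.staged: List[Record] = []

    def load(self, record: Record) -> None:
        self.staged.append(record)


class Task:
    """Orchestrates extract -> transform -> load, one record at a time."""

    def __init__(self, extractor, transformer, loader) -> None:
        self.extractor = extractor
        self.transformer = transformer
        self.loader = loader

    def run(self) -> None:
        record = self.extractor.extract()
        while record is not None:
            cleaned = self.transformer.transform(record)
            if cleaned is not None:
                self.loader.load(cleaned)
            record = self.extractor.extract()


class BulkPublisher:
    """Job-level publish: replaces the sink contents in a single step."""

    def __init__(self, sink: List[Record]) -> None:
        self.sink = sink

    def publish(self, staged: List[Record]) -> None:
        self.sink[:] = list(staged)


class Job:
    """Client entry point: runs the task, then the publisher."""

    def __init__(self, task: Task, publisher: BulkPublisher) -> None:
        self.task = task
        self.publisher = publisher

    def launch(self) -> None:
        self.task.run()
        self.publisher.publish(self.task.loader.staged)


def run_example() -> List[Record]:
    rows = [{"name": " Rides "}, {"name": ""}, {"name": "DRIVERS"}]
    sink: List[Record] = []
    task = Task(ListExtractor(rows), LowercaseNameTransformer(), StagingLoader())
    Job(task, BulkPublisher(sink)).launch()
    return sink
```

In this sketch the sink stays empty until the whole task has finished, which is the point of separating the record-level loader from the job-level publisher.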
3. Neo4j
It is a graph database platform built to store, query, and analyze highly connected data more efficiently than other kinds of databases. It uses the Cypher graph query language to store and retrieve data.
4. Elasticsearch
It is an open source, distributed search engine. It supports full-text search and is based on documents rather than tables and schemas. For full-text search it uses a data structure called an inverted index, which maps each term to the documents that contain it. It can store, search, and analyze large volumes of data quickly, in near real time. Queries can combine many types of search, such as structured, unstructured, geo, and metric. An Elasticsearch cluster can contain any number of nodes; each node stores data and takes part in the cluster's indexing and search.
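To make the inverted index idea concrete, here is a small Python sketch. It is a toy under simplifying assumptions (whitespace tokenization, lowercase matching, no scoring) and does not reflect how Elasticsearch implements its index internally; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


# Hypothetical table descriptions used as documents.
docs = {
    "rides": "daily ride events",
    "drivers": "driver profile table",
    "ride_stats": "daily ride statistics table",
}
index = build_inverted_index(docs)
```

With this index, a lookup touches only the sets for the query terms instead of scanning every document, which is why the structure suits full-text search.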
Elasticsearch can divide an index into several pieces known as shards. Every shard is itself a fully functional, independent index that can be hosted on any node in the cluster. Elasticsearch can also create replica shards automatically, so that a copy of the data remains available if a node in the cluster fails.
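The idea of shards and replicas can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's actual routing or allocation logic: the hash function (CRC32), the round-robin placement, and the names shard_for and place_shards are all assumptions made for illustration.

```python
import zlib
from typing import Dict, List, Tuple


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a shard for a document with a hash-modulo scheme (CRC32 stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def place_shards(nodes: List[str], num_primary_shards: int) -> Dict[int, Tuple[str, str]]:
    """Assign each shard a (primary node, replica node) pair on different nodes."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    placement: Dict[int, Tuple[str, str]] = {}
    for shard in range(num_primary_shards):
        primary = nodes[shard % len(nodes)]
        # The replica goes on the next node, so one node failure never loses both copies.
        replica = nodes[(shard + 1) % len(nodes)]
        placement[shard] = (primary, replica)
    return placement
```

Keeping a replica on a different node from its primary is what lets the cluster keep serving a shard after a single node failure.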
Why Amundsen Lyft?
Amundsen improves the productivity of data analysts and data engineers by making data discovery faster. Some of its features include:
Connecting different data resources with people using three types of relationships between users and resources: followed, owned, used
Solving the problem of data ownership between different users
Allowing users to request more information from owners of data resources
Working of Amundsen Lyft
When the single-page web app "Amundsen" opens, its landing page shows a search bar and a list of popular tables. The search bar lets you search for data in plain English, and the popular tables list shows the most used tables in the organization. Search ranking uses an algorithm that places frequently queried tables higher in the results. Keep the following points in mind while working with Amundsen Lyft:
Select a table of your choice
Once you have selected a table, you reach its detail page, which shows the table name, the column names with their data types, the owner, the frequent users of the table, and the tags attached to it.
The table detail page also contains a preview button, which shows the daily partition of the data, but only if you have access to that data.
Information such as tags and descriptions is entered manually by users. At the bottom of the detail page, there is a feedback widget for users.
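The popularity-based ranking mentioned at the start of this section can be sketched as follows. This is only a guess at the general idea, sorting by how often each table is queried; the source does not describe Amundsen's actual ranking algorithm, and the function names and query counts below are hypothetical.

```python
from typing import Dict, List


def popular_tables(query_counts: Dict[str, int], limit: int = 3) -> List[str]:
    """Return the most frequently queried tables, most popular first."""
    ranked = sorted(query_counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


def rank_search_results(matches: List[str], query_counts: Dict[str, int]) -> List[str]:
    """Order search matches so that highly queried tables appear first."""
    return sorted(matches, key=lambda name: (-query_counts.get(name, 0), name))


# Hypothetical usage statistics: table name -> number of queries.
counts = {"rides": 120, "drivers": 45, "payments": 300, "tmp_scratch": 2}
```

Ties are broken alphabetically here so that the ordering is deterministic; a real system would likely combine popularity with text relevance.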
Types of Metadata for DataOps
Metadata is data about other data. It can be classified as follows:
Application Context: Information that humans and applications need in order to work with the data.
Behavior: Information about how the data was created and how it has been used over time.
Change: Information about how the data has been altered over time.
Wondering what kinds of data metadata describes? Here are some examples:
Reports: Saved queries, reports, and dashboards in tools such as Tableau and Looker.
Schemas: Schemas and events stored in tools like Segment or in schema registries.
Processing: It includes ETL job scripts.
Approaches to Index Metadata
There are two approaches to index metadata into Amundsen Lyft:
Pull approach: Metadata is updated periodically by pulling it from the sources via crawlers.
Push approach: The databases push metadata into Apache Kafka, so that downstream consumers can pick up changes by subscribing to it.
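The difference between the two approaches can be shown with a short Python sketch. It is a simplified model: TopicBus is a tiny in-process stand-in for a Kafka topic, not a Kafka client, and MetadataIndex, pull_once, and the topic name are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Metadata = Dict[str, str]  # hypothetical record shape


class MetadataIndex:
    """Holds the latest metadata record per table."""

    def __init__(self) -> None:
        self.tables: Dict[str, Metadata] = {}

    def upsert(self, record: Metadata) -> None:
        self.tables[record["table"]] = record


def pull_once(crawl_source: Callable[[], List[Metadata]], index: MetadataIndex) -> int:
    """Pull approach: the indexer asks the source for its current metadata.

    A scheduler would call this periodically; returns the number of records seen.
    """
    records = crawl_source()
    for record in records:
        index.upsert(record)
    return len(records)


class TopicBus:
    """Push approach: sources publish changes and subscribers receive them."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Metadata], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Metadata], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: Metadata) -> None:
        for handler in self._subscribers[topic]:
            handler(record)


index = MetadataIndex()
# Pull: the crawler fetches a snapshot on the indexer's schedule.
pulled = pull_once(lambda: [{"table": "rides", "owner": "alice"}], index)
# Push: the source announces a change as soon as it happens.
bus = TopicBus()
bus.subscribe("metadata-changes", index.upsert)
bus.publish("metadata-changes", {"table": "rides", "owner": "bob"})
```

With pull, freshness depends on how often the crawler runs; with push, the index is updated as soon as the source publishes a change.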
Data Modelling for Metadata
Metadata is represented using a graph data model in Amundsen Lyft, which is an unusual choice for this kind of application. A graph data model represents entities as vertices (nodes) and the relationships between them as edges, which makes it easy to extend the model as more entities are introduced.
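A minimal Python sketch of such a graph model follows. It uses plain in-memory dictionaries rather than Neo4j, and the class MetadataGraph, the node keys, and the relation names are hypothetical; the relations mirror the followed, owned, and used relationships described earlier in this article.

```python
from typing import Dict, List, Set, Tuple


class MetadataGraph:
    """Vertices are entities; edges are (source, relation, target) triples."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {}
        self.edges: Set[Tuple[str, str, str]] = set()

    def add_node(self, key: str, label: str, **props: str) -> None:
        self.nodes[key] = {"label": label, **props}

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.add((source, relation, target))

    def neighbors(self, source: str, relation: str) -> List[str]:
        """Return the targets reachable from source over the given relation."""
        return sorted(t for s, r, t in self.edges if s == source and r == relation)


graph = MetadataGraph()
graph.add_node("user:alice", "User")
graph.add_node("table:rides", "Table")
graph.add_node("table:drivers", "Table")
graph.add_edge("user:alice", "OWNED", "table:rides")
graph.add_edge("user:alice", "FOLLOWED", "table:drivers")

# A new entity type needs no schema change: add a node label and an edge.
graph.add_node("dashboard:weekly", "Dashboard")
graph.add_edge("dashboard:weekly", "USED", "table:rides")
```

Adding the Dashboard entity required only a new node and edge, which is the extensibility benefit the graph model is chosen for.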