Introduction to Uber Marmaray Working, Best Practices and Tools

November 13, 2018 

What is Marmaray?

Marmaray is an open-source data ingestion and dispersal framework and library for Apache Hadoop, built on top of the Hadoop ecosystem. Users can ingest data from any source and disperse it to any sink by leveraging Apache Spark. Marmaray ingests raw data into a data lake with an appropriate source schema so that downstream analytics produce reliable results. Uber envisions Marmaray as a pipeline that connects raw data from many types of data sources to Hadoop or Hive, and that in turn connects both derived and raw datasets from Hive to a variety of sinks, depending on SLA, latency, and other customer requirements.


Uber Marmaray Features

  • Automated Schema Management.
  • Monitoring and Alerting Systems.
  • Fully integrated with a workflow orchestration tool.
  • Extensible architecture.
  • Open Source.

Why Does Marmaray Matter?

  • Marmaray produces quality, schematized data.
  • Marmaray ingestion brings data from multiple data sources into the Hadoop data lake.
  • The ingested data can then be processed, and business metrics can be calculated and stored based on the data in Hive.
  • Marmaray dispersal serves the processed data from Hive to any data store where users can query it and get results.

Why Marmaray Ingestion?

  • Raw data is needed in the Hadoop data lake.
  • Ingested raw data feeds derived datasets.
  • Reliable and correct schematized data.
  • Maintenance of multiple data pipelines.

Why Marmaray Dispersal?

  • Derived datasets in Hive.
  • Duplicate and ad hoc dispersal pipelines.
  • Future dispersal needs.

How Does Marmaray Work?

Chain of Converters

Converters transform ingested data into the format the sink requires, and a chain of them can write the results to multiple sinks. If a converter encounters malformed data during transformation, such as a record with missing fields, that record is written to an error table.
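As a rough illustration of the idea, the sketch below runs each record through a list of converters and routes any record that fails to an error list standing in for an error table. It is plain Python, not Marmaray's Java API; the names `require_fields`, `rename_field`, `run_chain`, and `ErrorRecord` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class ErrorRecord:
    raw: dict      # the original, unconverted record
    reason: str    # why conversion failed


def require_fields(*names):
    """Converter that rejects records with missing fields."""
    def convert(record):
        missing = [n for n in names if n not in record]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return record
    return convert


def rename_field(old, new):
    """Converter that renames one field to match the sink schema."""
    def convert(record):
        out = dict(record)
        out[new] = out.pop(old)
        return out
    return convert


def run_chain(records, converters):
    """Apply every converter in order; failed records go to the error list."""
    converted, errors = [], []
    for raw in records:
        current = raw
        try:
            for convert in converters:
                current = convert(current)
        except (ValueError, KeyError) as exc:
            errors.append(ErrorRecord(raw=raw, reason=str(exc)))
        else:
            converted.append(current)
    return converted, errors


chain = [require_fields("id", "ts"), rename_field("ts", "event_time")]
good, bad = run_chain([{"id": 1, "ts": 10}, {"id": 2}], chain)
```

Here `good` holds the one record that passed both converters, and `bad` holds the record without a `ts` field together with the reason it was rejected.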

Work Unit Calculator

The Work Unit Calculator creates the batches of data to process. It determines how much data to read in each run, for example the number of messages to fetch from Kafka. It ensures that work units are appropriately sized and do not overwhelm source or sink systems.
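One way to picture this for a Kafka-like source is a function that turns committed and latest offsets into bounded offset ranges. The sketch below is an assumption about how such a calculation could look, not Marmaray's actual algorithm; `OffsetRange`, `calculate_work_units`, and the `max_messages` budget are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    partition: int
    start: int  # inclusive
    end: int    # exclusive


def calculate_work_units(committed, latest, max_messages):
    """Split a message budget across partitions that have unread data."""
    backlog = {p: max(latest[p] - committed.get(p, 0), 0) for p in latest}
    # Visit small backlogs first so their unused share flows to larger ones.
    pending = sorted((p for p in backlog if backlog[p] > 0),
                     key=lambda p: backlog[p])
    units = []
    budget = max_messages
    for i, p in enumerate(pending):
        share = budget // (len(pending) - i)
        take = min(backlog[p], share)
        budget -= take
        if take > 0:
            start = committed.get(p, 0)
            units.append(OffsetRange(p, start, start + take))
    return sorted(units, key=lambda u: u.partition)


units = calculate_work_units(
    committed={0: 100, 1: 50, 2: 0},
    latest={0: 110, 1: 1050, 2: 0},
    max_messages=500,
)
```

With a budget of 500 messages, partition 0 contributes its full backlog of 10, partition 1 is capped at the remaining 490, and the empty partition 2 produces no work unit.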

Metadata Manager

The Metadata Manager stores the relevant metadata for a running job. This metadata typically serves as checkpoint information, for example partition offsets in the case of Kafka.
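The checkpoint idea can be sketched with a small key-value store that is loaded at job start and saved once the job finishes. The example below persists to a local JSON file purely for illustration; the class name `FileMetadataManager`, the key `kafka.offsets`, and the file-based storage are all assumptions and do not reflect where Marmaray actually keeps its metadata.

```python
import json
import os
import tempfile
from pathlib import Path


class FileMetadataManager:
    """Hypothetical checkpoint store backed by a single JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self._state = json.loads(self.path.read_text())
        else:
            self._state = {}

    def get(self, key, default=None):
        return self._state.get(key, default)

    def set(self, key, value):
        self._state[key] = value

    def save(self):
        # Write to a temporary file first, then swap it into place, so a
        # crash mid-write does not leave a half-written checkpoint behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._state, sort_keys=True))
        os.replace(tmp, self.path)


with tempfile.TemporaryDirectory() as directory:
    checkpoint_path = Path(directory) / "checkpoint.json"

    first_run = FileMetadataManager(checkpoint_path)
    first_run.set("kafka.offsets", {"0": 110, "1": 540})
    first_run.save()  # called only after the sink write has succeeded

    # A later run starts from the offsets saved by the previous one.
    resumed = FileMetadataManager(checkpoint_path).get("kafka.offsets")
```

Saving only after the data has reached the sink means a failed run is simply retried from the previous checkpoint.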

Fork Operator and Fork Function

Why is the Fork Operator needed?

  • Avoid reprocessing input records.
  • Avoid re-reading input records (in the case of Spark, re-executing input transformations).
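The two goals above can be illustrated by tagging every record once, materializing the tagged result, and then selecting each branch from that result. The sketch below is plain Python in which a list stands in for a persisted Spark dataset; `fork`, `select`, and `CountingSource` are hypothetical names for this example only.

```python
VALID, ERROR = 0, 1


class CountingSource:
    """Iterable that counts how many records have been read from it."""

    def __init__(self, records):
        self.records = records
        self.reads = 0

    def __iter__(self):
        for record in self.records:
            self.reads += 1
            yield record


def fork(records, fork_function):
    """Tag each record exactly once and keep the tagged result in memory."""
    return [(fork_function(record), record) for record in records]


def select(tagged, key):
    """Pick one branch from the already tagged records."""
    return [record for tag, record in tagged if tag == key]


source = CountingSource([{"id": 1}, {"id": None}, {"id": 3}])
tagged = fork(source, lambda r: VALID if r["id"] is not None else ERROR)

valid = select(tagged, VALID)
errors = select(tagged, ERROR)
```

Both branches are derived from `tagged`, so the source is consumed a single time even though two outputs are produced.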

How to Adopt Marmaray?

  • Marmaray can be used for Data Ingestion and Data dispersal.
  • User submits an ingestion or dispersal job.
  • Create source and sink specific configuration.
  • Determine Work Unit to Process.
  • Read in raw data from Source.
  • Fork data to split into valid and error records.
  • Convert data to sink schema format.
  • Persist data to sink and update metadata.
  • Report metrics.
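The steps above can be sketched as one small function that runs a single batch end to end. This is a simplified illustration using in-memory lists, not Marmaray's job API; `run_job` and all of its parameters are hypothetical.

```python
def run_job(source, sink, error_sink, checkpoint, batch_size, is_valid, convert):
    """Run one batch: size the work unit, read, fork, convert, persist, report."""
    # Determine the work unit from the last checkpoint.
    start = checkpoint.get("offset", 0)
    end = min(start + batch_size, len(source))

    # Read raw data from the source.
    raw = source[start:end]

    written = 0
    for record in raw:
        if is_valid(record):
            # Convert to the sink schema and persist.
            sink.append(convert(record))
            written += 1
        else:
            # Error records are kept separately.
            error_sink.append(record)

    # Update metadata only after the batch has been written.
    checkpoint["offset"] = end

    # Report metrics.
    return {"read": len(raw), "written": written, "errors": len(raw) - written}


source = [{"id": 1, "v": "a"}, {"id": None, "v": "b"}, {"id": 3, "v": "c"}]
sink, error_sink, checkpoint = [], [], {}


def has_id(record):
    return record["id"] is not None


def to_sink_schema(record):
    return {"key": record["id"], "value": record["v"]}


first = run_job(source, sink, error_sink, checkpoint, 2, has_id, to_sink_schema)
second = run_job(source, sink, error_sink, checkpoint, 2, has_id, to_sink_schema)
```

The second call picks up at the offset stored by the first, so no record is processed twice.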

Marmaray Vs. Gobblin

Gobblin is similar to Marmaray, but one significant difference is that Gobblin only ingests data from different types of data sources, such as databases, FTP/SFTP servers, and REST APIs, into Hadoop. Marmaray ingests data from any source into Hadoop and can also disperse the ingested data from Hadoop to various sinks by leveraging Apache Spark.

Gobblin uses the Hadoop MapReduce framework to transform data, whereas Marmaray does not provide any transformation capabilities.

Both Marmaray and Gobblin handle job and task scheduling and metadata management.

Gobblin uses Hadoop MapReduce, whereas Marmaray uses Apache Spark as its primary data processing engine. Spark has several advantages over MapReduce:

  • Spark is much faster because it processes data in memory.
  • Spark provides many transformations by default, such as grouping, mapping, and filtering.
  • Spark can apply multiple transformations to data without storing the intermediate results in HDFS.


Best Practices of Marmaray

Use a new type of HoodieRecordPayload and keep the previously persisted record as the output of combineAndGetUpdateValue(...).

However, the commit time of the previously persisted record is then updated to the latest commit, which makes downstream incremental ETL count this record twice.

Left join the data frame with all the persisted data by key and insert only the records whose persisted_data.key is null. The concern is that this approach may not take full advantage of the bloom index and metadata.
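The left-join approach can be illustrated without Spark. The sketch below uses lists of dictionaries in place of data frames and assumes keys are unique in the persisted data; `left_join` and `new_records_only` are hypothetical helper names for this example.

```python
def left_join(incoming, persisted, key):
    """Pair every incoming row with its persisted match, or None if absent."""
    persisted_by_key = {row[key]: row for row in persisted}
    return [(row, persisted_by_key.get(row[key])) for row in incoming]


def new_records_only(incoming, persisted, key):
    """Keep incoming rows whose persisted side of the join is null."""
    return [left for left, right in left_join(incoming, persisted, key)
            if right is None]


persisted = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
incoming = [{"key": "b", "value": 20}, {"key": "c", "value": 30}]

to_insert = new_records_only(incoming, persisted, "key")
```

Only the row with key "c" is inserted. The row with key "b" already exists in the persisted data, so it is skipped and the stored record keeps its original commit time.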

Add a new flag field to the HoodieRecord, read from the HoodieRecordPayload metadata, to indicate whether a copyOldRecord is needed during writing.

Pass a flag through the data frame options to enforce that the entire job uses copyOldRecord.


Benefits of Marmaray

  • Automated Schema Management.
  • Integrated monitoring and alerting system.
  • Fully integrated with a workflow orchestration tool.
  • Extensible architecture.
  • Data deletion and updates.

Uber Marmaray Tools