Best Practices of Hadoop Infrastructure Migration


Migration involves moving data and business applications from an organization's on-premises infrastructure to the cloud for -


  • Disaster Recovery
  • Backup Creation
  • Storage of Large Data Volumes
  • High Security
  • Reliability
  • Fault Tolerance

Challenges of Building an On-Premises Hadoop Infrastructure


  • Limited tooling
  • Latency issues
  • Architecture modifications required before migration
  • Lack of skilled professionals
  • Integration complexity
  • Cost
  • Loss of transparency

Service Offerings for Building Data Pipeline and Migration Platform


The first step is to understand the requirements, covering data sources, data pipelines, and related services, for migrating the platform from on-premises infrastructure to Google Cloud Platform.


  • Data Collection Services on Google Compute Engine. Migrate all Data Collection Services, the REST API, and other background services to Google Compute Engine (VMs).
  • Update the Data Collection Jobs to write data to Google Cloud Storage Buckets. The Data Collection Jobs are developed in Node.js and originally wrote data to Ceph Object Storage, which served as the Data Lake. Update the existing code to write to Google Cloud Storage Buckets, which then take over as the Data Lake (see the upload sketch after this list).
  • Use Apache Airflow to build the Data Pipelines, and Hive and Spark to build the Data Warehouse. Develop a set of Spark Jobs that run every three hours, check for new files in the Data Lake (Google Cloud Storage Buckets), apply the transformations, and store the results in the Hive Data Warehouse (see the Airflow sketch after this list).
  • Migrate the Airflow Data Pipelines to Google Compute Engine, and Hive on HDFS to a Cloud Dataproc cluster for Spark and Hadoop.
  • Migrate the REST API, which serves prediction results to dashboards and acts as the Data Access Layer for Data Scientists, to Google Compute Engine instances (VMs).
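
To illustrate the Data Lake change, here is a minimal sketch of a collection job writing a file into a Google Cloud Storage Bucket with the official Python client. The bucket and object names are hypothetical, and Python is used here for consistency with the other sketches even though the production collection services are written in Node.js.

```python
# Minimal sketch: upload a collected data file into the GCS bucket used as the Data Lake.
# Assumes Application Default Credentials are available (e.g. on a Compute Engine VM).
from google.cloud import storage

def write_to_data_lake(local_path: str, object_name: str) -> None:
    client = storage.Client()
    bucket = client.bucket("raw-data-lake")  # hypothetical Data Lake bucket name
    blob = bucket.blob(object_name)          # e.g. "collections/2024-01-01/events.json"
    blob.upload_from_filename(local_path)

write_to_data_lake("events.json", "collections/2024-01-01/events.json")
```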
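The Airflow side of the pipeline could look like the following sketch: a DAG scheduled every three hours that submits the Spark transformation job to the Cloud Dataproc cluster. The project ID, region, cluster name, job class, and jar path are all hypothetical placeholders.

```python
# Minimal sketch: Airflow DAG that submits a Spark job to Cloud Dataproc every three hours.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

SPARK_JOB = {
    "reference": {"project_id": "my-project"},         # hypothetical GCP project
    "placement": {"cluster_name": "etl-cluster"},      # hypothetical Dataproc cluster
    "spark_job": {
        "main_class": "com.example.TransformJob",      # hypothetical transformation job
        "jar_file_uris": ["gs://raw-data-lake/jobs/transform.jar"],
    },
}

with DAG(
    dag_id="gcs_to_hive_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=3),  # run every three hours
    catchup=False,
) as dag:
    run_transform = DataprocSubmitJobOperator(
        task_id="run_spark_transform",
        job=SPARK_JOB,
        region="us-central1",
        project_id="my-project",
    )
```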

Technology Stack -


  • Node.js-based Data Collection Services (running on Google Compute Engine)
  • Google Cloud Storage as the Data Lake (stores raw data coming from the Data Collection Services)
  • Apache Airflow (configuration and scheduling of the Data Pipelines that run the Spark Transformation Jobs)
  • Apache Spark on Cloud Dataproc (transforms raw data into structured data; see the sketch after this list)
  • Hive Data Warehouse on Cloud Dataproc
  • Play Framework in Scala (REST API)
  • Python-based SDKs
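
As a sketch of the transformation step, the following PySpark job reads raw JSON from the Data Lake bucket and appends structured rows to a Hive table. The bucket path, column names, and table name are hypothetical.

```python
# Minimal sketch: Spark job on Dataproc that turns raw Data Lake JSON into a Hive table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# enableHiveSupport() lets Spark write managed tables into the Hive warehouse.
spark = (
    SparkSession.builder
    .appName("raw-to-warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

# Read raw JSON dropped by the collection services (hypothetical bucket and layout).
raw = spark.read.json("gs://raw-data-lake/collections/*/*.json")

# Example transformation: parse timestamps and drop malformed rows
# (event_time and user_id are hypothetical column names).
structured = (
    raw.withColumn("event_ts", F.to_timestamp("event_time"))
       .dropna(subset=["event_ts", "user_id"])
)

# Append the structured data into the Hive Data Warehouse.
structured.write.mode("append").format("hive").saveAsTable("warehouse.events")
```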
