Building a Data Lake on Google Cloud Using BigQuery

Overview of Data Lake Architecture


    • Build a data lake that collects data from servers in different countries, IoT devices, social media, clickstreams, and logs, and supports analytics for product discovery, product recommendations, and new product requirements.


    • Build a data lake and data warehouse for real-time and batch data processing, covering social media analytics, IoT analytics, image analytics, a clickstream-based recommendation system, and data warehouse ETL operations.


Challenges in Building the Data Pipeline


  • Server data has to be pulled in a specified format: a DCD file, which consists of 16 XMLs.
  • Stores and refrigerators are monitored with IoT devices, so the data pipeline has to collect data from the devices, run analytics, and detect anomalies in the sensor readings.
  • Set up a data pipeline that collects real-time social media data by hashtag for sentiment and intent analytics.
  • Collect clickstream data from the web and mobile applications for the recommendation system.
  • Scrape data for product search and discovery.
  • Ingest data from the ERP solution for their vendors.


Solution Offered for Building the Analytics Platform


Real-Time Social Media Analytics


Data is collected from real-time tweets through the Twitter API and by scraping APIs, filtered by specific keyword, hashtag, language, and location.


A Python application collects data through the Twitter APIs and publishes it to Google Cloud Pub/Sub. The application is deployed on Google App Engine. Cloud Dataflow consumes the data from Pub/Sub, cleans and transforms it, and sends it to the BigQuery data lake.
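The filtering and cleaning steps described above can be sketched in plain Python. This is a minimal illustration, not the deployed code: the tweet field names (`id`, `text`, `lang`) and the helper names are assumptions, and the actual publish call to Pub/Sub is left out so the sketch runs without any cloud credentials.

```python
import json
import re

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")
HASHTAG_RE = re.compile(r"#(\w+)")


def matches_filter(tweet, keywords, language):
    """Keep a tweet only if it is in the wanted language and mentions a keyword."""
    text = tweet.get("text", "").lower()
    return tweet.get("lang") == language and any(k.lower() in text for k in keywords)


def clean_tweet(tweet):
    """Strip URLs, collapse whitespace, and extract lowercase hashtags."""
    text = URL_RE.sub("", tweet.get("text", ""))
    text = WHITESPACE_RE.sub(" ", text).strip()
    return {
        "id": str(tweet["id"]),
        "text": text,
        "lang": tweet.get("lang"),
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(text)],
    }


def to_pubsub_payload(row):
    """Pub/Sub message bodies are bytes, so serialize the row as UTF-8 JSON."""
    return json.dumps(row, sort_keys=True).encode("utf-8")


# Hypothetical sample tweet, for illustration only.
sample = {"id": 1, "text": "Love the new #Fridge  https://t.co/x", "lang": "en"}
if matches_filter(sample, ["fridge"], "en"):
    payload = to_pubsub_payload(clean_tweet(sample))
```

In a real deployment the collector would hand `payload` to a Pub/Sub publisher client, and the cleaning logic would more likely live in the Dataflow step than in the collector.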


Real-Time IoT Analytics Platform


Sensor data is collected from IoT devices at different warehouses, where refrigerators are installed at various places. The IoT devices are configured with Google Cloud IoT Core using its MQTT bridge. Google Cloud Pub/Sub serves as the messaging queue, and Cloud Dataflow handles transformation and cleaning. The cleaned data is sent to the BigQuery data lake for further analytics.
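The text does not say how anomalies in the sensor data are detected. As one possible approach, the sketch below flags a reading when it deviates from the mean of the preceding readings by more than a set number of standard deviations. The window size, threshold, and minimum-deviation floor are assumed values, and the function name is hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def find_anomalies(readings, window=5, threshold=3.0, min_sigma=0.1):
    """Return the indices of readings that lie more than `threshold`
    standard deviations from the mean of the previous `window` readings.

    `min_sigma` is a floor on the standard deviation so that a perfectly
    flat window does not cause a division by zero.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = max(pstdev(history), min_sigma)
            if abs(value - mu) / sigma > threshold:
                anomalies.append(index)
        history.append(value)
    return anomalies


# Hypothetical refrigerator temperatures in degrees Celsius.
temperatures = [4.0, 4.1, 3.9, 4.0, 4.1, 9.5, 4.0]
suspect_indices = find_anomalies(temperatures)
```

A trailing window keeps the check cheap enough to run per message in a streaming step, at the cost of missing slow drifts that a longer baseline would catch.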


Real-Time Clickstream Analytics


Clickstream analytics feeds the product recommendation system. Real-time clickstream data is captured by a Google Cloud Function triggered by an HTTP request, and the collected data is sent to Google Cloud Pub/Sub. Cloud Dataflow cleans and transforms the data before analytics are run in BigQuery.
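The capture step could look roughly like the sketch below: validate the incoming JSON body, normalize it, and hand it to a publisher. The event fields (`user_id`, `product_id`, `event_type`) are assumptions, and `publish` is a stand-in callable rather than a real Pub/Sub client, so the logic can be exercised locally.

```python
import json
from datetime import datetime, timezone

# Assumed minimum fields for a click event; not taken from the source.
REQUIRED_FIELDS = ("user_id", "product_id", "event_type")


def build_click_event(payload, received_at=None):
    """Validate a raw click payload and return a normalized event dict."""
    missing = [f for f in REQUIRED_FIELDS if payload.get(f) in (None, "")]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    received_at = received_at or datetime.now(timezone.utc)
    return {
        "user_id": str(payload["user_id"]),
        "product_id": str(payload["product_id"]),
        "event_type": str(payload["event_type"]).lower(),
        "received_at": received_at.isoformat(),
    }


def handle_click(payload, publish):
    """Core of the HTTP handler: returns a (body, status_code) tuple.

    `publish` is any callable that accepts the message bytes.
    """
    try:
        event = build_click_event(payload)
    except ValueError as exc:
        return (str(exc), 400)
    publish(json.dumps(event).encode("utf-8"))
    return ("", 204)


# Local usage with a list standing in for the message queue.
sent = []
response = handle_click(
    {"user_id": 7, "product_id": "p1", "event_type": "VIEW"}, sent.append
)
```

Keeping validation separate from the HTTP entry point means malformed events are rejected before they reach Pub/Sub, which leaves less cleaning for the Dataflow step.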


Sales Analytics Platform


A portal lets the store manager upload a data file in DCD format. The backend converts the file to CSV and publishes the data to Cloud Pub/Sub for further processing. Cloud Dataflow performs the data cleaning and necessary transformations, after which the data is sent to BigQuery and to Bigtable (as a cache).
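The internal layout of the DCD files is not described here, so the sketch below only illustrates the general XML-to-CSV step on a made-up document. The element names (`sales`, `sale`, `store`, `sku`, `qty`) and the function name are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_records_to_csv(xml_text, record_tag):
    """Flatten every <record_tag> element into one CSV row.

    Each child element becomes a column; columns appear in first-seen
    order, and fields missing from a record are left empty.
    """
    root = ET.fromstring(xml_text)
    records = [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample document, not the real DCD structure.
sample_xml = (
    "<sales>"
    "<sale><store>S1</store><sku>A</sku><qty>2</qty></sale>"
    "<sale><store>S2</store><sku>B</sku></sale>"
    "</sales>"
)
csv_text = xml_records_to_csv(sample_xml, "sale")
```

Each resulting CSV row could then be published to Pub/Sub as a separate message, so the downstream cleaning step handles one record at a time.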


Technology Stack


  • Cloud App Engine
  • Cloud Pub/Sub
  • Cloud IoT Core
  • Cloud Functions
  • Cloud Dataflow
  • BigQuery
  • Cloud Bigtable
  • Cloud Datalab
  • Data Studio