Ingestion and Processing of Data for Big Data and IoT Solutions
XenonStack - A Stack Innovator


by Navdeep | March 03, 2017 |  Categories -  Big Data, IoT, Big Data Framework, Data Integration

 

Introduction

 

In the era of the Internet of Things and mobility, huge volumes of data are becoming available at high velocity, which creates the need for an efficient analytics system.

 

The data also arrives in a variety of formats from a variety of sources, such as sensors, logs, and structured data from an RDBMS. In the past few years, the generation of new data has increased drastically: more applications are being built, and they are generating more data at a faster rate.

 

Earlier, data storage was costly, and there was no technology that could process the data efficiently. Now storage has become cheaper, and technology to process Big Data is available.

 

What is Big Data

 

According to Dr. Kirk Borne, Principal Data Scientist, the definition of Big Data is "Everything, Quantified, and Tracked". Let’s pick that apart -

 

  • Everything – Means every aspect of life, work, consumerism, entertainment, and play is now recognized as a source of digital information about you, your world, and anything else we may encounter.

 

  • Quantified – Means we are storing those "everything" somewhere, mostly in digital form, often as numbers, but not always in such formats. The quantification of features, characteristics, patterns, and trends in all things is enabling data mining, machine learning, statistics, and discovery at an unprecedented scale on an unprecedented number of things. The Internet of Things is just one example, but the Internet of Everything is even more awesome.

 

  • Tracked – Means we don’t simply quantify and measure everything just once, but we do so continuously. This includes - tracking your sentiment, your web clicks, your purchase logs, your geo-location, your social media history, etc. or tracking every car on the road, or every motor in a manufacturing plant or every moving part on an airplane, etc. Consequently, we are seeing the emergence of smart cities, smart highways, personalized medicine, personalized education, precision farming, and so much more.

 

All of these quantified and tracked data streams will enable

 

  • Smarter Decisions

  • Better Products

  • Deeper Insights

  • Greater Knowledge

  • Optimal Solutions

  • Customer-Centric Products

  • Increased Customer Loyalty

  • More Automated Processes, more accurate Predictive and Prescriptive Analytics

  • Better models of future behaviors and outcomes in Business, Government, Security, Science, Healthcare, Education, and more.

 

Big Data defines three D2Ds -

 

  • Data-to-Decisions

  • Data-to-Discovery

  • Data-to-Dollars 

 

The 10 V's of Big Data

 

[Figure: The 10 V's of Big Data]

 

Big Data Framework

 

The best way to approach a solution is to "split the problem". A Big Data solution is best understood through a layered architecture, in which each layer performs a particular function.

 

This architecture helps in designing a data pipeline that meets the requirements of either a batch processing system or a stream processing system. It consists of 6 layers, which ensure a secure flow of data.

 

[Figure: Big Data Framework]

 

  1. Data Ingestion Layer - This layer is the first step in the journey of data coming from a variety of sources. Data is prioritised and categorised here, which makes it flow smoothly through the later layers.

  2. Data Collector Layer - In this layer, the focus is on transporting data from the ingestion layer to the rest of the data pipeline. This is the layer where components are decoupled so that analytic capabilities may begin.

  3. Data Processing Layer - In this layer, the data collected in the previous layer is processed. Here we route the data to different destinations and classify the data flows, and it is the first point where analytics may take place.

  4. Data Storage Layer - Storage becomes a challenge when the size of the data you are dealing with becomes large, and several possible solutions can help with this problem. This layer focuses on where to store such large data efficiently.

  5. Data Query Layer - This is the layer where strong analytic processing takes place. The main focus here is to extract value from the data so that it is more useful to the next layer.

  6. Data Visualization Layer - The visualization, or presentation, tier is probably the most important one, because it is where the users of the data pipeline experience the value of the data. We need something that grabs people's attention, pulls them in, and makes the findings well understood.

 

1. Data Ingestion Layer 

 

[Figure: Data Ingestion Layer of Big Data Framework]

 

Data ingestion is the first step in building a data pipeline and also the toughest task in a Big Data system. In this layer we plan how to ingest data flows from hundreds or thousands of sources into the data center, given that the data arrives from multiple sources at variable speeds and in different formats.

 

That's why we should ingest the data properly to support successful business decision-making. It's rightly said that "if the start goes well, half of the work is already done".

1.1 What is Big Data Ingestion?

 

Big Data ingestion involves connecting to various data sources, extracting the data, and detecting changed data. It's about moving data, and especially unstructured data, from where it originates into a system where it can be stored and analyzed.

 

We can also say that data ingestion means taking data coming from multiple sources and putting it somewhere it can be accessed. It is the beginning of the data pipeline, where data is obtained or imported for immediate use.

 

Data can be streamed in real time or ingested in batches. When data is ingested in real time, each item is ingested as soon as it arrives. When data is ingested in batches, items are ingested in chunks at periodic intervals. Either way, ingestion is the process of bringing data into the data processing system.
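The difference between the two modes can be sketched in a few lines of Python. This is only an illustration: the function names and the sample sensor records are made up, and the batches are cut by record count rather than by a time interval to keep the example deterministic.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def ingest_streaming(source: Iterable[dict]) -> Iterator[dict]:
    """Hand each record on as soon as it arrives."""
    for record in source:
        yield record


def ingest_in_batches(source: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Collect records into chunks before handing them on."""
    iterator = iter(source)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical sensor readings standing in for a real data source.
readings = [{"sensor": "s1", "value": v} for v in range(5)]

streamed = list(ingest_streaming(readings))               # five single records
batched = list(ingest_in_batches(readings, batch_size=2)) # chunks of 2, 2 and 1
```

A time-based batcher would follow the same shape, closing the current chunk when a timer expires instead of when the chunk is full.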

 

An effective data ingestion process begins by prioritizing data sources, validating individual files, and routing data items to the correct destination.
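A minimal sketch of those three steps, assuming a made-up set of sources, priorities, and destinations (none of these names come from a real system):

```python
from typing import List, Tuple

# Hypothetical configuration: a lower number means the source is handled first.
SOURCE_PRIORITY = {"payments": 0, "sensors": 1, "weblogs": 2}
DESTINATIONS = {"payments": "warehouse", "sensors": "stream", "weblogs": "archive"}


def is_valid(item: dict) -> bool:
    """Deliberately simple validation: known source and a non-empty payload."""
    return item.get("source") in SOURCE_PRIORITY and bool(item.get("payload"))


def ingest(items: List[dict]) -> List[Tuple[str, str]]:
    """Validate items, order them by source priority, and pick a destination."""
    valid = [item for item in items if is_valid(item)]
    valid.sort(key=lambda item: SOURCE_PRIORITY[item["source"]])
    return [(DESTINATIONS[item["source"]], item["payload"]) for item in valid]


incoming = [
    {"source": "weblogs", "payload": "GET /index.html"},
    {"source": "payments", "payload": "order-1"},
    {"source": "unknown", "payload": "x"},  # rejected: unknown source
    {"source": "sensors", "payload": ""},   # rejected: empty payload
]
plan = ingest(incoming)
```

Here `plan` lists the payment record first, because its source has the highest priority, followed by the web log record; the two invalid items are dropped before routing.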

 

1.2 Challenges Faced with Data Ingestion

 

As the number of IoT devices increases, both the volume and the variety of data sources are expanding rapidly. Extracting the data so that it can be used by the destination system is therefore a significant challenge in terms of time and resources. Some of the other challenges faced in data ingestion are -

 

  • When numerous Big Data sources exist in different formats, the biggest challenge for the business is to ingest data at a reasonable speed and process it efficiently, so that data can be prioritized and business decisions improve.

  • Modern data sources and consuming applications evolve rapidly.

  • The data produced changes without notice, independently of the consuming application.

  • Data semantics change over time as the same data powers new use cases.

  • Detection and capture of changed data - This task is difficult, not only because of the semi-structured or unstructured nature of the data but also because of the low latency needed by certain business scenarios that require this determination.
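One simple way to detect changed data, among several, is to fingerprint each record and compare the fingerprints between two snapshots. The sketch below assumes that every record has a stable key and can be serialized as JSON; the record contents are invented, and real change data capture tools often read database logs instead.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def fingerprint(row: dict) -> str:
    """Hash a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(row, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(
    previous: Dict[str, str], current_rows: Dict[str, dict]
) -> Tuple[List[str], List[str], List[str], Dict[str, str]]:
    """Return inserted, updated and deleted keys plus the new fingerprints."""
    current = {key: fingerprint(row) for key, row in current_rows.items()}
    inserted = sorted(k for k in current if k not in previous)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    deleted = sorted(k for k in previous if k not in current)
    return inserted, updated, deleted, current


snapshot1 = {"1": {"name": "Ann", "city": "Pune"}, "2": {"name": "Bob", "city": "Oslo"}}
snapshot2 = {"1": {"name": "Ann", "city": "Delhi"}, "3": {"name": "Cy", "city": "Lima"}}

first = detect_changes({}, snapshot1)         # everything is new
second = detect_changes(first[3], snapshot2)  # one insert, one update, one delete
```

Only the fingerprints need to be kept between runs, which keeps the state small even when the records themselves are large.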

 

That's why the ingestion layer should be well designed, ensuring the following -

 

  • It can handle new data sources, technologies, and applications, and can be upgraded for them.

  • It assures that consuming applications are working with correct, consistent, and trustworthy data.

  • It allows rapid consumption of data.

  • Capacity and reliability - The system needs to scale according to the incoming input, and it should also be fault tolerant.

  • Data volume - Though storing all incoming data is preferable, there are some cases in which aggregated data is stored instead.

 

1.3 Data Ingestion Parameters

 

  • Data Velocity - Data velocity deals with the speed at which data flows in from different sources such as machines, networks, human interaction, media sites, and social media. The flow of data can be massive or continuous.

  • Data Size - Data size implies an enormous volume of data. Data is generated by different sources, and the volume may increase over time.

  • Data Frequency (Batch, Real-Time) - Data can be processed in real time or in batches. In real-time processing, data is passed on as soon as it is received; in batch processing, data is collected into batches at a fixed time interval and then moved on.

  • Data Format (Structured, Semi-Structured, Unstructured) - Data can arrive in different formats: structured (i.e. tabular), semi-structured (i.e. JSON or CSV files), or unstructured (i.e. images, audio, video).

 

1.4 Big Data Ingestion Key Principles

 

To complete the process of data ingestion, we should use the right tools, and, most importantly, those tools should be capable of supporting the key principles listed below -

 

  • Network Bandwidth - The data pipeline must be able to compete with business traffic. Traffic sometimes increases and sometimes decreases, so network bandwidth scalability is the biggest data pipeline challenge. Tools with bandwidth throttling and compression capabilities are required.

  • Unreliable Network - The data ingestion pipeline takes in data with multiple structures, i.e. images, audio, video, text files, tabular files, XML files, log files, etc., and because the data arrives at variable speed, it might travel through an unreliable network. The data pipeline should be capable of supporting this as well.

  • Heterogeneous Technologies and Systems - Tools for the data ingestion pipeline must be able to work with different data source technologies and different operating systems.

  • Choose the Right Data Format - Tools must provide a data serialization format. Because data arrives in variable formats, converting it into a single format makes it easier to understand and relate the data.

  • Streaming Data - Whether to process the data in batches, in streams, or in real time depends on business necessity. Sometimes both kinds of processing are required, so tools must be capable of supporting both.
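As a small illustration of the compression capability mentioned under network bandwidth, the sketch below compresses a batch of records with gzip from the Python standard library before it would be sent over the network. The record layout and function names are assumptions made for the example.

```python
import gzip
import json
from typing import List


def pack_batch(records: List[dict]) -> bytes:
    """Serialize a batch as JSON and compress it before sending."""
    return gzip.compress(json.dumps(records).encode("utf-8"))


def unpack_batch(blob: bytes) -> List[dict]:
    """Reverse of pack_batch, run on the receiving side."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical, highly repetitive sensor records.
records = [{"sensor": "s1", "status": "ok", "seq": n} for n in range(1000)]
raw_size = len(json.dumps(records).encode("utf-8"))
packed = pack_batch(records)
restored = unpack_batch(packed)
```

Repetitive data such as logs and sensor readings tends to compress well, so `len(packed)` is considerably smaller than `raw_size` here; the gain on real data depends on its content.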

 

1.5 Data Serialization

 

Different types of users have different data consumption needs. Since we want to share variable data, we must plan how users can access it in a meaningful way. That's why a single representation of the variable data is used, which optimizes the data for readability.

 

Approaches used for this are -

 

  • Apache Thrift - It's an RPC Framework containing Data Serialization Libraries.

  • Google Protocol Buffers - It can use the special generated source code to easily write and read structured data to and from a variety of data streams and using a variety of languages.

  • Apache Avro - A more recent data serialization format that combines some of the best features of the formats listed above. Avro data is self-describing and uses a JSON schema description. The schema is included with the data itself, and the format natively supports compression. It may well become a de facto standard for data serialization.

 

1.6 Data Ingestion Tools

 

1.6.1 Apache Flume - Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

 

It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

 

It uses a simple, extensible data model that allows for online analytic applications. Its functions are -

 

  • Stream Data - Ingest streaming data from multiple sources into Hadoop for storage and analysis.

  • Insulate System - Buffer storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination

  • Scale Horizontally - To ingest new data streams and additional volume as needed.
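The "insulate system" function above amounts to placing a bounded buffer between a bursty source and a slower destination. The simulation below is a sketch of that idea only: it does not use Flume, the tick-based model and all names are assumptions, and rejecting events when the buffer is full is a simplification of what a real agent would do.

```python
from collections import deque
from typing import Dict, List


def run_channel(arrivals: List[int], sink_rate: int, capacity: int) -> Dict[str, int]:
    """Simulate a bounded buffer between a source and a rate-limited sink.

    arrivals[i] is the number of events arriving in tick i; the sink writes
    at most sink_rate events per tick.
    """
    channel: deque = deque()
    stats = {"written": 0, "rejected": 0, "peak_backlog": 0}
    for tick, count in enumerate(arrivals):
        for n in range(count):
            if len(channel) < capacity:
                channel.append((tick, n))
            else:
                stats["rejected"] += 1  # buffer full: event not accepted
        stats["peak_backlog"] = max(stats["peak_backlog"], len(channel))
        for _ in range(min(sink_rate, len(channel))):
            channel.popleft()
            stats["written"] += 1
    stats["backlog"] = len(channel)
    return stats


# A spike of 10 events in one tick, with a sink that writes 3 events per tick.
result = run_channel([2, 10, 1, 0, 0], sink_rate=3, capacity=20)
```

With enough capacity the spike is absorbed and drained over the following ticks, so all 13 events are written and none are rejected; with a smaller capacity some events of the spike would be turned away.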

 

1.6.2 Apache NiFi - Apache NiFi provides an easy-to-use, powerful, and reliable system to process and distribute data. Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. Its functions are -

 

  • Track data flow from beginning to end

  • Seamless experience between design, control, feedback, and monitoring

  • Security through SSL, SSH, HTTPS, and encrypted content.

 

1.6.3 Elastic Logstash - Elastic Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your "stash", i.e. Elasticsearch.

 

It easily ingests from your logs, metrics, web applications, data stores, and various AWS services, all in a continuous, streaming fashion. It can ingest data of all shapes, sizes, and sources.
 

2. Data Collector Layer

 

[Figure: Data Collector Layer of Big Data Framework]

 

In this layer, the focus is on transporting data from the ingestion layer to the rest of the data pipeline. Here we use a messaging system that acts as a mediator between all the programs that send and receive messages.

 

Here the tool used is Apache Kafka. It's a new approach in message oriented middleware.

 

2.1 Apache Kafka

 

It is used for building real-time data pipelines and streaming apps. It can process streams of data in real-time and store streams of data safely in a distributed replicated cluster.

 

Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
 

2.2 What is Data Pipeline?

 

  • A data pipeline is the main component of data integration. All transformation of data happens in the data pipeline.

  • Data pipeline tools exist for different platforms: some are Python-based tools that stream and transform real-time data to the services that need it.

  • A data pipeline automates the movement and transformation of data. Some data pipeline tools are data processing engines that run inside your application.

  • It is used to transform all the incoming data into a common format so that we can prepare it for analysis and visualization. Some data pipeline engines are built on the Java Virtual Machine (JVM).

  • So, a data pipeline is a series of steps that your data moves through. The output of one step in the process becomes the input of the next. Data, typically raw data, goes in on one side and passes through a series of steps.

  • The steps of a data pipeline can include cleaning, transforming, merging, modeling and more, in any combination.
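The "output of one step becomes the input of the next" idea can be sketched as a chain of plain functions. The step names, the sample readings, and the site lookup table are made up for the illustration; any combination of steps could be passed in.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[List[dict]], List[dict]]

# Hypothetical lookup table used by the merge step.
SITES = {"a": "Boiler room", "c": "Roof"}


def clean(rows: List[dict]) -> List[dict]:
    """Drop rows that have no temperature reading."""
    return [r for r in rows if r.get("temp_c") is not None]


def transform(rows: List[dict]) -> List[dict]:
    """Add a Fahrenheit column derived from the Celsius one."""
    return [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in rows]


def merge(rows: List[dict]) -> List[dict]:
    """Join each row with the site name of its sensor."""
    return [{**r, "site": SITES.get(r["sensor"], "unknown")} for r in rows]


def run_pipeline(rows: List[dict], steps: List[Step]) -> List[dict]:
    """Feed the output of each step into the next one."""
    return reduce(lambda data, step: step(data), steps, rows)


raw = [
    {"sensor": "a", "temp_c": 20.0},
    {"sensor": "b", "temp_c": None},
    {"sensor": "c", "temp_c": 100.0},
]
result = run_pipeline(raw, [clean, transform, merge])
```

Raw rows go in on one side and come out cleaned, converted, and enriched on the other; reordering or swapping the functions in the list changes the pipeline without touching `run_pipeline`.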

 

2.2.1 Functions of Data Pipeline

 

  • Ingestion - The data pipeline helps in bringing data into your system. It means moving unstructured data from where it originates into a system where it can be stored and analyzed for making business decisions.

  • Data Integration - The data pipeline also helps in bringing different types of data together.

  • Organization - Organizing data means arranging it, and this arrangement is also done in the data pipeline.

  • Refining the data - This is the process in which we enhance, clean, and refine the raw data.

  • Analytics - After refining, the data pipeline provides processed data on which we can run analyses and make accurate business decisions.

 

2.2.2 Need Of Data Pipeline

 

A Data Pipeline is software that takes data from multiple sources and makes it available to be used strategically for making business decisions.

 

The primary reason a data pipeline is needed is that it is very hard to monitor data migration and manage data errors without one. Other reasons are below -

 

  • Business-critical analysis - Certain business-critical analysis is only possible when combining data from multiple sources. For making business decisions, we should have a single view of all the incoming data.

  • Connections - Data keeps growing all the time: new data arrives and old data is modified, so each new integration can take anywhere from a few days to a few months to complete.

  • Accuracy - The only way to build trust with data consumers is to make sure that your data is auditable. One best practice that’s easy to implement is to never discard inputs or intermediate forms when altering data.

  • Latency - The fresher your data, the more agile your company’s decision-making can be. Extracting data from APIs and databases in real time can be difficult, and many target data sources, including large object stores like Amazon S3 and analytics databases like Amazon Redshift, are optimized for receiving data in chunks rather than as a stream.

  • Scalability - Data volume can increase or decrease over time; we can't say, for example, that less data will arrive on Monday and much more on the other days. Usage is not uniform, so the pipeline should be scalable enough to handle any amount of data arriving at variable speed.

 

2.2.3 Use cases for Data Pipeline

 

Data Pipeline is useful to a number of roles, including CTOs, CIOs, Data Scientists, Data Engineers, BI Analysts, SQL Analysts, and anyone else who derives value from a unified real-time stream of user, web, and mobile engagement data. So, use cases for data pipeline are given below -

 

  • For Business Intelligence Teams

  • For SQL Experts

  • For Data Scientists

  • For Data Engineers

  • For Product Teams

 

2.3 Apache Kafka is Good for 2 Things 

 

  • Building Real-Time streaming Data Pipelines that reliably get data between systems or applications

  • Building Real-Time streaming applications that transform or react to the streams of data.

 

2.3.1 Common use cases of Apache Kafka -

 

  • Stream Processing

  • Website Activity Tracking

  • Metrics Collection and Monitoring

  • Log Aggregation

 

2.3.2 Features of Apache Kafka

 

  • One of the features of Kafka is durable Messaging.

  • Apache Kafka relies heavily on the filesystem for storing and caching messages: rather than maintain as much as possible in memory and flush it all out to the filesystem, all data is immediately written to a persistent log on the filesystem without necessarily flushing to disk.

  • Apache Kafka reliably handles the situation where the producer generates messages faster than the consumer can consume them.

 

2.3.3 How Apache Kafka Works

 

Kafka's system design acts as a distributed commit log, where incoming data is written sequentially to disk. There are four main components involved in moving data in and out of Apache Kafka -

 

  • Topics - Topic is a user-defined category to which messages are published.

  • Producers - Producers publish messages to one or more topics

  • Consumers - Consumers subscribe to topics and process the published messages.

  • Brokers - Brokers manage the persistence and replication of message data.
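How the four components fit together can be sketched with a toy, in-memory model. This is not the Kafka client API: the classes and method names below are invented for the illustration, there is a single broker, and there is no replication or partitioning.

```python
from collections import defaultdict
from typing import Dict, List


class Broker:
    """Keeps one append-only log per topic (in memory here, on disk in Kafka)."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[str]] = defaultdict(list)

    def append(self, topic: str, message: str) -> int:
        """Append a message and return its offset in the topic log."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic: str, offset: int, max_messages: int = 10) -> List[str]:
        """Return messages starting at the given offset; nothing is deleted."""
        return self._logs[topic][offset:offset + max_messages]


class Producer:
    def __init__(self, broker: Broker) -> None:
        self._broker = broker

    def send(self, topic: str, message: str) -> int:
        return self._broker.append(topic, message)


class Consumer:
    """Tracks its own read position (offset) in one topic."""

    def __init__(self, broker: Broker, topic: str) -> None:
        self._broker, self._topic, self._offset = broker, topic, 0

    def poll(self) -> List[str]:
        messages = self._broker.read(self._topic, self._offset)
        self._offset += len(messages)
        return messages


broker = Broker()
producer = Producer(broker)
offsets = [producer.send("clicks", m) for m in ("home", "cart", "checkout")]

dashboard = Consumer(broker, "clicks")
first_poll = dashboard.poll()   # all three messages
second_poll = dashboard.poll()  # nothing new yet

auditor = Consumer(broker, "clicks")
auditor_poll = auditor.poll()   # a new consumer reads the same log from the start
```

Because the log is only appended to and each consumer keeps its own offset, a slow or late consumer can still read everything that was published, which is what lets the messaging layer decouple producers from consumers.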

 

3. Data Processing Layer

 

[Figure: Data Processing Layer of Big Data Framework]

 

In the previous layer, we gathered the data from different sources and made it available to go through rest of pipeline.

 

In this layer, our task is to work on that data: now that it is ready, we route it to different destinations.

 

The main focus of this layer is the processing system of the data pipeline; in other words, the data collected in the previous layer is processed here.

 

Processing can be done in 3 ways: with a batch processing system, a near real-time processing system, or a hybrid processing system.

 

3.1 Batch Processing System  

 

This is a pure batch processing system for offline analytics. The tool used for this is Apache Sqoop.

 

3.2 Apache Sqoop

 

It efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores.

 

Apache Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

 

3.2.1 Functions of Apache Sqoop are -

  • Import sequential data sets from mainframe

  • Data imports

  • Parallel data Transfer

  • Fast data copies

  • Efficient data analysis

  • Load balancing

 

3.3 Near Real Time Processing System  

 

This is a pure online processing system for online analytics. The tool used for this type of processing is Apache Storm. The Apache Storm cluster makes decisions about the criticality of the event and sends the alerts to the alert system (dashboard, e-mail, other monitoring systems).

 

3.3.1 Apache Storm - It is a system for processing streaming data in real time. It adds reliable real-time data processing capabilities to Enterprise Hadoop. Storm on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

 

3.3.2 Features of Apache Storm

 

  • Fast – It can process one million 100 byte messages per second per node.

  • Scalable – It can do parallel calculations that run across a cluster of machines.

  • Fault-tolerant – When workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.

  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.

  • Easy to operate – It consists of standard configurations that are suitable for production on day one. Once deployed, Storm is easy to operate.

Hybrid Processing System - This combines batch and real-time processing capabilities. The tools used for this type of processing are Apache Spark and Apache Flink.
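The "at least once" guarantee in the Storm feature list above can be sketched as a replay loop: a tuple that is not acknowledged is processed again. The code does not use Storm; the handler, the simulated failure, and the attempt limit are assumptions made for the example.

```python
from typing import Callable, List, Tuple


def process_at_least_once(
    tuples: List[str], handler: Callable[[str], None], max_attempts: int = 3
) -> List[Tuple[str, int]]:
    """Replay each tuple until the handler finishes without an error."""
    delivered = []
    for item in tuples:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(item)
            except RuntimeError:
                continue  # no acknowledgement: replay the tuple
            delivered.append((item, attempt))
            break
        else:
            raise RuntimeError(f"giving up on {item!r} after {max_attempts} attempts")
    return delivered


seen: List[str] = []
failed_once = set()


def flaky_handler(item: str) -> None:
    """Hypothetical handler that fails once for tuple 'b' after doing its work."""
    seen.append(item)
    if item == "b" and item not in failed_once:
        failed_once.add(item)
        raise RuntimeError("simulated worker failure before acknowledgement")


delivered = process_at_least_once(["a", "b", "c"], flaky_handler)
```

Every tuple ends up processed, but `seen` contains "b" twice: replay only happens on failure, and when it does, side effects may be repeated. That is the difference between at-least-once and exactly-once processing.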

 

3.4 Apache Spark

 

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.

 

With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared data set in Hadoop.

 

3.5 Apache Flink  

 

Flink is an open-source framework for distributed stream processing that provides accurate results, even in the case of out-of-order or late-arriving data. Some of its features are -

 

  • It is stateful and fault-tolerant and can seamlessly recover from failures while maintaining exactly-once application state.

  • Performs at large scale, running on thousands of nodes with very good throughput and latency characteristics.

  • It provides a streaming dataflow execution engine, APIs, and domain-specific libraries for batch, streaming, machine learning, and graph processing.
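Accurate results with out-of-order data come from grouping events by the time they happened (event time) rather than the time they arrived. The sketch below shows that idea with tumbling windows over a finite list; it does not use Flink, the event tuples are invented, and a real stream processor additionally needs a mechanism such as watermarks to decide when a window is complete.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tumbling_window_counts(events: List[Tuple[int, str]], window_size: int) -> Dict[int, int]:
    """Count events per window, keyed by window start, using event time."""
    counts: Dict[int, int] = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# (event_time_in_seconds, payload), listed in arrival order.
# The events with timestamps 3 and 9 arrive late, after events from the next window.
arrivals = [(1, "a"), (12, "b"), (3, "c"), (15, "d"), (9, "e")]
counts = tumbling_window_counts(arrivals, window_size=10)
```

The late events still land in the window starting at 0, so the counts are the same as if everything had arrived in order.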


3.5.1 Apache Flink Use Cases

 

  • Optimization of e-commerce search results in real-time

  • Stream processing-as-a-service for data science teams

  • Network/Sensor monitoring and error detection

  • ETL for Business Intelligence Infrastructure

 

4. Data Storage Layer

 

[Figure: Data Storage Layer of Big Data Framework]

 

Next, the major issue is to keep data in the right place based on its usage. Relational databases have been a successful place to store our data over the years.

 

But with the new big data strategic enterprise applications, you should no longer be assuming that your persistence should be relational.

 

We need different databases to handle the different varieties of data, but using different databases creates overhead. That's why a new concept has been introduced in the database world: polyglot persistence.

 

4.1 Polyglot Persistence

 

Polyglot persistence is the idea of using multiple databases to power a single application. Polyglot persistence is the way to share or divide your data into multiple databases and leverage their power together.

 

It takes advantage of the strengths of different databases. Here, different types of data are arranged in different ways. In short, it means picking the right tool for the right use case.

 

It’s the same idea behind Polyglot Programming, which is the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems.

 

4.1.1 Advantages of Polyglot Persistence -

 

  • Faster response times - Here we leverage all the features of databases in one app, which makes the response times of your app very fast.
  • Helps your app to scale well - Your app scales exceptionally well with the data. All the NoSQL databases scale well when you model databases properly for the data that you want to store.
  • A rich experience - You have a very rich experience when you harness the power of multiple databases at the same time. For example, if you want to search on Products in an e-commerce app, then you use ElasticSearch, which returns the results based on relevance, which MongoDB cannot do.

 

4.2 Tools used for Data Storage

 

4.2.1 HDFS

 

  • HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.

  • HDFS holds a very large amount of data and provides easier access.

  • To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure.

  • HDFS also makes data available to applications for parallel processing. HDFS is built to support applications with large data sets, including individual files that reach into the terabytes.

  • It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.

  • When HDFS takes in data, it breaks the information down into separate pieces and distributes them to different nodes in a cluster, allowing for parallel processing.

  • The file system also copies each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack.

  • HDFS and YARN form the data management layer of Apache Hadoop.
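The splitting and replication described in the list above can be sketched as follows. This is a simplified model, not HDFS code: the block size, node names, and placement rule are assumptions, and the rule only guarantees that the copies of a block span two racks when at least two racks exist.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Break a file into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_index: int, nodes_by_rack: Dict[str, List[str]], replication: int = 3) -> List[str]:
    """Put the first copy on one rack and the remaining copies on another rack."""
    racks = sorted(nodes_by_rack)
    primary = racks[block_index % len(racks)]
    remote = racks[(block_index + 1) % len(racks)]
    primary_nodes = nodes_by_rack[primary]
    first = primary_nodes[block_index % len(primary_nodes)]
    return [first] + nodes_by_rack[remote][:replication - 1]


# Hypothetical cluster layout: two racks with two DataNodes each.
cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = [place_replicas(i, cluster) for i in range(len(blocks))]
```

Each block ends up on three nodes across both racks, so losing a single node or a whole rack still leaves at least one copy, and different blocks of the same file can be read and processed in parallel.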

 

4.2.1.1 Features of HDFS

 

  • It is suitable for the distributed storage and processing.

  • Hadoop provides a command interface to interact with HDFS.

  • The built-in servers of namenode and datanode help users to easily check the status of the cluster.

  • Streaming access to file system data.

  • HDFS provides file permissions and authentication.

 

4.2.2 Gluster File System (GlusterFS)

 

As we know, a good storage solution must provide elasticity in both storage and performance without affecting active operations.

 

GlusterFS is a scalable network filesystem. Scale-out storage systems based on GlusterFS are suitable for unstructured data such as documents, images, audio and video files, and log files.

 

Using this, we can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks.

 

  • It's Open Source.

  • You can deploy GlusterFS with the help of commodity hardware servers.

  • Linear scaling of performance and storage capacity.

  • Scale storage size up to several petabytes, which can be accessed by thousands of servers.

 

4.2.2.1 Use Cases For GlusterFS include

 

  • Cloud Computing

  • Streaming Media

  • Content Delivery

 


4.2.3 Amazon S3

 

  • Amazon Simple Storage Service (Amazon S3) is object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web.

  • It is designed to deliver 99.999999999% durability, and scale past trillions of objects worldwide.

  • Customers use S3 as primary storage for cloud-native applications; as a bulk repository, or "data lake," for analytics; as a target for backup & recovery and disaster recovery; and with serverless computing.

  • It's simple to move large volumes of data into or out of S3 with Amazon's cloud data migration options.

  • Once data is stored in Amazon S3, it can be automatically tiered into lower cost, longer-term cloud storage classes like S3 Standard - Infrequent Access and Amazon Glacier for archiving.

 

5. Data Query Layer

 

[Figure: Data Query Layer of Big Data Framework]

 

This is the layer where strong analytic processing takes place. This is a field where interactive queries are necessary, and it’s a zone traditionally dominated by expert SQL developers. Before Hadoop, storage was very limited, which made the analytics process slow.

 

Data first had to go through a lengthy process, i.e. ETL, to get a new data source ready to be stored, and only then was it put into a database or data warehouse. Now, data analytics has become an essential step that solves these problems when computing over such large amounts of data.

 

Companies from all industries use big data analytics to -

 

  • Increase revenue

  • Decrease costs

  • Increase productivity

 

5.1 Tools Used for Analytics Query -

 

5.1.1 Apache Hive - Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets.

 

Data analysts use Hive to query, summarize, explore and analyze that data, then turn it into actionable business insight.

 

It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).
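To give a feel for the kind of summarization query analysts write, the sketch below runs a plain SELECT with GROUP BY against an in-memory SQLite database from the Python standard library. SQLite only stands in for Hive here so that the example is runnable; the table and its contents are made up, and HiveQL has its own dialect details that this does not cover.

```python
import sqlite3

# In-memory database standing in for a table whose data would live in Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("IN", 120), ("US", 80), ("IN", 30), ("DE", 45)],
)

# Summarize views per country, largest first.
summary = conn.execute(
    "SELECT country, SUM(views) AS total_views "
    "FROM page_views GROUP BY country ORDER BY total_views DESC"
).fetchall()
conn.close()
```

The result is one row per country with its total, which is the "data summarization" use case in miniature; in Hive the same style of query would be executed as distributed jobs over much larger datasets.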

 

5.1.1.1 Features of Apache Hive

 

  • Query data with a SQL-based language.

  • Interactive response times, even over huge datasets.

  • It's scalable: as data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance.

  • Works with traditional data integration and data analytics tools.


 

5.1.2 Spark SQL - Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast.

 

At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance.

 

Spark SQL is a Spark module for structured data processing. Some of the Functions performed by Spark SQL are -

 

  • The interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.

  • Internally, Spark SQL uses this extra information to perform extra optimizations.

  • One use of Spark SQL is to execute SQL queries.

  • Spark SQL can also be used to read data from an existing Hive installation.

 

5.1.3 Amazon Redshift - Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. We use Amazon Redshift to load the data and run queries on it.

 

We can also create additional databases as needed by running a SQL command. Most importantly, we can scale it from a few hundred gigabytes of data to a petabyte or more.

 

This enables you to use your data to acquire new insights for your business and customers. The Amazon Redshift service manages all of the work of setting up, operating and scaling a data warehouse.

 

These tasks include provisioning capacity, monitoring and backing up the cluster, and applying patches and upgrades to the Amazon Redshift engine.

 

5.1.4 Presto - Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

 

Presto was designed and written for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

 

5.1.4.1 What can Presto do?

 

  • Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores.

  • A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

  • Presto is targeted at analysts who expect response times ranging from sub-second to minutes.

  • Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow "free" solution that requires excessive hardware.

 

5.1.4.2 Who uses Presto?

 

  • Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each day.

  • Leading internet companies including Airbnb and Dropbox are using Presto.

 

5.2 Data Warehouse

 

A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process.

 

So, a data warehouse is a centralized repository that stores data from multiple information sources and transforms it into a common, multidimensional data model for efficient querying and analysis.


5.3 Comparing Big Data Solutions to a Data Warehouse

 

Comparing the two, we find that a big data solution is a technology, whereas data warehousing is an architecture. They are two very different things.

 

A technology is just that – a means to store and manage large amounts of data. A data warehouse is a way of organizing data so that there is corporate credibility and integrity.

 

When someone takes data from a data warehouse, that person knows that other people are using the same data for other purposes. There is a basis for reconcilability of data when there is a data warehouse.


5.4 Data Lake

 

A data lake is a newer type of cloud-based enterprise architecture that structures data in a more scalable way, which makes it easier to experiment with the data.

 

With a data lake, incoming data goes into the lake in raw form, or in whatever form the data source provides, and it is selected and organized there without first being transformed. There are no assumptions about the schema of the data; each data source can use whatever schema it likes.

 

It's up to the consumers of that data to make sense of that data for their own purposes. The idea is to have a single store for all of the raw data that anyone in an organization might need to analyze.

 

Commonly people use Hadoop to work on the data in the lake, but the concept is broader than just Hadoop.

 

5.4.1 The data lake supports the following capabilities:

 

  • To capture and store raw data at scale for a low cost

  • To store many types of data in the same repository

  • To perform transformations on the data

  • To define the structure of the data at the time it is used, referred to as schema on read
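
As a minimal sketch of the last capability, the following Python snippet stores heterogeneous raw records exactly as they arrive and applies a structure only when a consumer reads them. The record fields (`device`, `temp_c`, `temperature`) and the helper name `read_temperatures` are hypothetical and chosen purely for illustration; a real lake would hold files in distributed storage rather than an in-memory list.

```python
import json

# Raw events land in the lake exactly as each source emitted them.
# Field names differ per source; nothing is validated at write time.
RAW_LAKE = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "temperature": "19.0", "unit": "C"}',
    '{"user": "alice", "action": "login"}',
]


def read_temperatures(raw_lines):
    """Apply a consumer-specific schema at read time."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Each consumer decides which fields matter and how to coerce them.
        value = record.get("temp_c", record.get("temperature"))
        if "device" in record and value is not None:
            rows.append({"device": record["device"], "temp_c": float(value)})
    return rows


temperatures = read_temperatures(RAW_LAKE)
print(temperatures)
```

A different consumer, for example one interested in login events, could read the same raw lines with a completely different structure, which is the point of deferring the schema.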

 

5.5 Data Lake VS Data Warehouse

 

  • With a Data Lake, incoming data goes into the lake in raw form, and we select and organize it only afterwards. In a Data Warehouse, data is cleaned and organized into a single consistent schema before it is put into the warehouse, and the analysis is then done on the warehouse data.

  • Data lakes retain all data: not only the data that is in use today but also data that might be used in the future. The data is kept in its raw form and is only transformed when it is ready to be used. When a data warehouse is being developed, on the other hand, considerable time is spent analyzing different data sources, understanding business processes and profiling the data.

  • In a Data Lake, all data is stored in its raw form and is always accessible to anyone who needs it. Data Warehouses are difficult to change. A lot of time is spent during development to get the structure of the warehouse right, although a good warehouse design is capable of adapting to change.
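
The warehouse side of this comparison can be sketched as follows: records are cleaned and forced into one consistent schema before loading, and analysis then runs on the loaded table. This is only an illustration under stated assumptions. SQLite stands in for a real warehouse, and the table, the column names and the `clean` helper are hypothetical.

```python
import sqlite3

# Incoming records from two hypothetical sources with inconsistent shapes.
incoming = [
    {"device": "sensor-1", "temp_c": 21.5},
    {"device": "sensor-2", "temperature": "19.0"},
    {"device": None, "temp_c": 30.0},  # invalid: no device id
]


def clean(record):
    """Return a (device, temp_c) tuple, or None if the record is unusable."""
    value = record.get("temp_c", record.get("temperature"))
    if not record.get("device") or value is None:
        return None
    return (record["device"], float(value))


conn = sqlite3.connect(":memory:")
# The schema is fixed up front, before any data is loaded.
conn.execute("CREATE TABLE readings (device TEXT NOT NULL, temp_c REAL NOT NULL)")
rows = [row for row in map(clean, incoming) if row is not None]
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)

# Analysis is done on the already-consistent warehouse table.
avg_temp = conn.execute("SELECT AVG(temp_c) FROM readings").fetchone()[0]
print(avg_temp)
```

Note that the invalid record is rejected at load time, so it never reaches the table. In a lake it would have been kept in raw form for a later consumer to interpret.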

 

6. Data Visualization Layer 

 

Data Visualization Layer of Big Data Framework

 

This layer focuses on Big Data Visualization. We need something that will grab people’s attention, pull them in, and make your findings well understood. That is why this layer provides full business infographics: your own findings from your own data need annotation and a bold canvas.

 

The data visualization layer is often the thermometer that measures the success of the project. This is where the value of the data is perceived by the user. While Hadoop and similar tools are designed for handling and storing large volumes of data, they have no built-in provisions for data visualization and information distribution, leaving no way to make that data easily consumable by business end users.

 

6.1 Tools used for Dashboards

 

6.1.1 Custom Dashboards: Custom dashboards are useful for creating unique overviews that present data differently. For example, you can -

 

  • Show the web and mobile application information, server information, custom metric data, and plugin metric data all on a single custom dashboard.

  • Create dashboards that present charts and tables with a uniform size and arrangement on a grid.

  • Select existing New Relic charts for your dashboard, or create your own charts and tables.

 

6.1.2 Real-Time Dashboards: Real-Time Dashboards save, share, and communicate insights. They help users generate questions by revealing the depth, range, and content of their data stores.

 

  • Data Visualization dashboards constantly change as new data arrives.

  • In Zoomdata, you have the flexibility to create a data analytics dashboard with just a single chart and then add to it as needed.

  • Dashboards can contain multiple visualizations from multiple connections side by side.

  • You can quickly build, edit, filter, and delete dashboards and move and resize them and then share them or integrate them into your web application.

  • You can export a dashboard as an image or as a configuration file in a format such as JSON.

  • You can also make multiple copies of your dashboard.

 

6.1.3 Tableau

 

  • Tableau is one of the richest data visualization tools available in the market, with drag-and-drop functionality.

  • Tableau allows users to design Charts, Maps, Tabular, Matrix reports, Stories and Dashboards without any technical knowledge.

  • Tableau helps anyone quickly analyze, visualize and share information. Whether it’s structured or unstructured, petabytes or terabytes, millions or billions of rows, you can turn big data into big ideas.

  • It connects directly to local and cloud data sources, or imports data for fast in-memory performance.

  • Make sense of big data with easy-to-understand visuals and interactive web dashboards.

 

6.1.4 Kibana

 

  • A Kibana dashboard displays a collection of saved visualizations. You can arrange and resize the visualizations as needed and save dashboards so they can be reloaded and shared.

  • Kibana acts as an analytics and visualization platform that builds on Elasticsearch to give you a better understanding of your data.

  • Application Performance Monitoring (APM) is one key area to implement in projects to ensure proper and smooth operations from day 1. APM solutions provide development and operations teams with near real-time insights into how applications and services are performing in production, allowing for proactive tuning of services as well as early detection of possible production issues.

  • It gives you the freedom to select the way you give shape to your data. And you don’t always have to know what you're looking for.

  • Kibana core ships with the classics: histograms, line graphs, pie charts, sunbursts, and more. They leverage the full aggregation capabilities of Elasticsearch.

  • The Kibana interface is divided into four main sections:

    • Discover

    • Visualize

    • Dashboard

    • Settings

 

6.1.5 Intelligent Agents

 

  • An intelligent agent is software that assists people and acts on their behalf. Intelligent agents work by allowing people to delegate work that they could have done themselves to the agent software.

  • Agents can perform repetitive tasks, remember things you forgot, intelligently summarize complex data, learn from you and even make recommendations to you.

  • An intelligent agent can help you find and filter information when you are looking at corporate data or surfing the Internet and don't know where the right information is.

  • It could also customize information to your preferences, saving you the time of handling it as more and more new information arrives each day on the Internet.

  • An agent could also sense changes in its environment and respond to these changes.

  • An agent continues to work even when the user is gone, which means that an agent could run on a server, although in some cases an agent runs on the user’s system.

 

6.1.6 Recommender Systems

 

  • Recommender systems provide personalized information by learning the user’s interests from traces of interaction with that user. In order for a recommender system to make predictions about a user’s interests, it has to learn a user model.

  • A user model contains data about the user and should be represented in such a way that the data can be matched to the items in the collection.

  • The question is: what kind of data can be used to construct a user profile?

  • Obviously, the items that users have seen in the past are important but other information such as the content of the items, the perception of users of the items or information about users themselves could also be used.

  • Most recommender systems focus on the task of information filtering, which deals with the delivery of items selected from a large collection that the user is likely to find interesting or useful.

  • Recommender systems are special types of information filtering systems that suggest items to users. Some of the largest e-commerce sites are using recommender systems and apply a marketing strategy that is referred to as mass customization.

  • A content-based filtering system often uses many of the same techniques as an information retrieval system (such as a search engine), because both systems require a content description of the items in their domain. A recommender system also requires the modeling of the user’s preferences for a longer period of time which is not needed in an information retrieval system.

  • There are several techniques that can be used to improve recommender systems in different ways.
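
To make the idea of a user model concrete, here is a minimal sketch of content-based filtering, assuming each item is described by a handful of tags. The user model is built from the items the user has already seen and is matched against the remaining items with cosine similarity. The item names, the tags and the function names are made up for illustration, and cosine similarity is just one of several possible matching techniques.

```python
import math
from collections import Counter

# Hypothetical item collection: each item is described by content tags.
ITEMS = {
    "spark-intro": ["spark", "big-data", "tutorial"],
    "kafka-streams": ["kafka", "streaming", "big-data"],
    "react-basics": ["react", "javascript", "tutorial"],
    "spark-streaming": ["spark", "streaming", "big-data"],
}


def build_user_model(seen_items):
    """User model: tag counts aggregated over items the user interacted with."""
    model = Counter()
    for name in seen_items:
        model.update(ITEMS[name])
    return model


def cosine(a, b):
    """Cosine similarity between two sparse tag-count vectors."""
    dot = sum(a[tag] * b[tag] for tag in a if tag in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(seen_items, top_n=2):
    """Rank unseen items by similarity between their content and the user model."""
    model = build_user_model(seen_items)
    scores = {
        name: cosine(model, Counter(tags))
        for name, tags in ITEMS.items()
        if name not in seen_items
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


recommendations = recommend(["spark-intro"])
print(recommendations)
```

Because the user model and the items share the same representation (tag counts), they can be compared directly, which is the matching requirement described in the list above.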

 

6.1.7 Angular.js

 

  • AngularJS is a very powerful JavaScript Framework. It is used in Single Page Application (SPA) projects.

  • It extends HTML DOM with additional attributes and makes it more responsive to user actions.

  • AngularJS is open source, completely free, and used by thousands of developers around the world.

 

6.1.8 React.js

 

  • React is a JavaScript library for building composable user interfaces. It focuses on the UI and is not a full framework.

  • It encourages the creation of reusable UI components which present data that changes over time.

  • React uses one-way reactive data flow (no two-way data binding) and a virtual DOM. It is a front-end library developed by Facebook.

  • It is used for handling the view layer of web and mobile apps.

  • It is currently one of the most popular JavaScript libraries and it has a strong foundation and large community behind it.

 

6.1.8.1 React Features

 

  • JSX − JSX is a JavaScript syntax extension. It isn't necessary to use JSX in React development, but it is recommended.

  • Components − React is all about components. You need to think of everything as a component. This will help you to maintain the code when working on larger scale projects.

  • Unidirectional data flow and Flux − React implements one-way data flow which makes it easy to reason about your app. Flux is a pattern that helps to keep your data unidirectional.
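
The one-way flow that Flux describes (action, dispatcher, store, view) can be sketched independently of React. The following is a conceptual illustration written in Python rather than JavaScript. The class names, the `INCREMENT` action and the `view` function are hypothetical, and none of this is the actual React or Flux API.

```python
class Store:
    """Holds state; state changes only in response to dispatched actions."""

    def __init__(self):
        self.state = {"count": 0}
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def handle(self, action):
        if action["type"] == "INCREMENT":
            # Replace the state rather than mutating it in place.
            self.state = {"count": self.state["count"] + action["by"]}
        for listener in self._listeners:
            listener(self.state)


class Dispatcher:
    """Single entry point that forwards every action to registered stores."""

    def __init__(self):
        self._stores = []

    def register(self, store):
        self._stores.append(store)

    def dispatch(self, action):
        for store in self._stores:
            store.handle(action)


rendered = []  # stands in for the view layer


def view(state):
    # The view only reads state; it never writes to the store directly.
    rendered.append(f"count = {state['count']}")


store = Store()
dispatcher = Dispatcher()
dispatcher.register(store)
store.subscribe(view)

# Data flows one way: action -> dispatcher -> store -> view.
dispatcher.dispatch({"type": "INCREMENT", "by": 2})
dispatcher.dispatch({"type": "INCREMENT", "by": 3})
print(rendered)
```

Since the view can change state only by dispatching an action, every state change passes through one place, which is what makes the application easier to reason about.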

 

6.2 How Data Is Transferred Securely

 

Data Security Layer of Big Data Framework

 

 

Security is a primary concern for any kind of work. It should be implemented at all layers of the lake, starting from ingestion, through storage, analytics and discovery, all the way to consumption. Securing the data pipeline involves the following steps -

 

  • Authentication - Authentication verifies a user’s identity and ensures they really are who they say they are. The Kerberos protocol provides a strong mechanism for authentication.

  • Access Control - Access control is the next step in securing data: it defines which datasets can be accessed by which users or services. It restricts users and services to only the data they have permission for, rather than letting them access all the data.

  • Encryption and data masking - Encryption and data masking are required to ensure secure access to sensitive data. Sensitive data in the cluster should be secured at rest as well as in motion. We need to use proper data protection techniques that protect data in the cluster from unauthorized visibility.

  • Auditing data access by users - Another data security requirement is auditing data access by users. Auditing can detect logon and access attempts as well as administrative changes.
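
Three of these steps (access control, data masking and auditing) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production design. The permission table, the dataset contents and the function names are hypothetical, and authentication (for example via Kerberos) and encryption are deliberately left out.

```python
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("audit")

# Hypothetical permissions: which datasets each user or service may read.
PERMISSIONS = {
    "analyst": {"sales"},
    "admin": {"sales", "customers"},
}

DATASETS = {
    "sales": [{"order_id": 1, "amount": 250}],
    "customers": [{"name": "Alice", "email": "alice@example.com"}],
}


def mask_email(email):
    """Data masking: replace the local part with a short one-way hash."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:8]
    return f"{digest}@{domain}"


def read_dataset(user, dataset):
    """Access control plus auditing: every attempt is logged, allowed or not."""
    allowed = dataset in PERMISSIONS.get(user, set())
    audit_log.info("user=%s dataset=%s allowed=%s", user, dataset, allowed)
    if not allowed:
        raise PermissionError(f"{user} may not read {dataset}")
    rows = DATASETS[dataset]
    # Mask sensitive fields before they leave the storage layer.
    return [
        {**row, "email": mask_email(row["email"])} if "email" in row else row
        for row in rows
    ]


customer_rows = read_dataset("admin", "customers")
print(customer_rows)
```

Calling `read_dataset("analyst", "customers")` would raise `PermissionError`, and the denied attempt would still appear in the audit log.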

 

6.3 Data Monitoring

 

Data Monitoring Layer of Big Data Framework

 

Data in enterprise systems is like food – it has to be kept fresh and it needs nourishment; otherwise it goes bad and doesn’t help you in making strategic and operational decisions. Just as consuming spoiled food could make you sick, using “spoiled” data may be bad for your organization’s health.

 

There may be plenty of data, but it has to be reliable and consumable in order to be valuable. While most of the focus in enterprises is often about how to store and analyze large amounts of data, it is also very important to keep this data fresh and flavorful.

 

So how can we do this? The solution is monitoring, auditing, testing, managing, and controlling the data. Continuous monitoring of data is an important part of the governance mechanisms.

 

Apache Flume is useful for processing log data, Apache Storm is well suited to operations monitoring, and Apache Spark to streaming data, graph processing, and machine learning. Monitoring can happen in the data storage layer. Data monitoring includes the following steps -

 

  • Data Profiling and Lineage - These are the techniques used to identify the quality of data and the lifecycle of the data through its various phases. In these systems, it is important to capture the metadata at every layer of the stack so it can be used for verification and profiling. Tools used here include Talend, Hive and Pig.

  • Data Quality - Data is considered to be of good quality if it meets business needs and it satisfies the intended use so that it's helpful in making business decisions successfully. So, understanding the dimension of greatest interest and implementing methods to achieve it is important.

  • Data Cleansing - It means implementing various solutions to correct the incorrect or corrupt data.

  • Data Loss and Prevention - Policies have to be in place to make sure the loopholes for data loss are taken care of. Identification of such data loss needs careful monitoring and quality assessment processes.
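
As a rough sketch of how profiling, cleansing and a simple data loss check might fit together, the following Python snippet works on a small in-memory list. The sample records, the valid temperature range and the function names are assumptions made for illustration; in practice these checks would run against the storage layer with tools such as those named above.

```python
# Hypothetical sensor records, including a missing value, an out-of-range
# value and a duplicate, to exercise the checks below.
records = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},
    {"id": 3, "temp_c": 999.0},
    {"id": 1, "temp_c": 21.5},
]


def profile(rows, field):
    """Data profiling: summarize completeness and range of one field."""
    values = [r[field] for r in rows if r[field] is not None]
    return {
        "rows": len(rows),
        "missing": len(rows) - len(values),
        "min": min(values),
        "max": max(values),
    }


def cleanse(rows, field, low, high):
    """Data cleansing: drop duplicates, missing values and out-of-range values."""
    seen, clean = set(), []
    for r in rows:
        value = r[field]
        if r["id"] in seen or value is None or not (low <= value <= high):
            continue
        seen.add(r["id"])
        clean.append(r)
    return clean


report = profile(records, "temp_c")
clean_records = cleanse(records, "temp_c", low=-50.0, high=60.0)
# Data loss check: how many rows were dropped during cleansing.
dropped = len(records) - len(clean_records)
print(report, clean_records, dropped)
```

Tracking the `dropped` count over time is one simple way to notice when an upstream source starts producing more bad data than usual.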

 

XenonStack Offerings

 

XenonStack is a leading Software Company in Product Development and Solution Provider for DevOps, Big Data Integration, Real Time Analytics & Data Science.

 

Product NexaStack - Unified DevOps Platform Provides monitoring of Kubernetes, Docker, OpenStack infrastructure, Big Data Infrastructure and uses advanced machine learning techniques for Log Mining and Log Analytics.

 

Product ElixirData - Modern Data Integration Platform Enables enterprises and Different agencies for Log Analytics and Log Mining. 

 

Product Akira.AI is an Automated & Knowledge Drive Artificial Intelligence Platform that enables you to automate the Infrastructure to train and deploy Deep Learning Models on Public Cloud as well as On-Premises. 

 



