Overview of Apache Drill
Apache Drill is an open-source SQL query engine from Apache for Big Data exploration. It is designed for high-performance analysis of semi-structured and rapidly evolving data, and for data exploration and analytics on non-relational datastores. Drill provides plug-and-play integration with existing Apache Hive and Apache HBase deployments. Drill is versatile: you can query a wide variety of data formats and data sources, including:
- CSV, TSV, PSV, or any other delimited data
- Hadoop Sequence Files
- Apache and Nginx server logs
- Log files
- Open Time Series Database
- Nearly all relational databases with a JDBC driver
- Hadoop Distributed File System
- Amazon Simple Storage Service
Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. Source: GitHub
Why do we need Apache Drill?
The following reasons explain the need for Apache Drill:
- Get started quickly: Drill takes a few minutes to get started. There is no need to set up any infrastructure or to define schemas. Just extract the Drill software on Linux, Mac, or Windows and run a local file query.
- Schema-free JSON model: Drill is described as the world's first and only distributed SQL engine that doesn't require schemas. You do not need to define and maintain schemas or transform data, because Drill automatically understands the structure of the data.
- Query data in situ: The schema-free JSON model lets you query semi-structured and complex data in place. You do not need to flatten the data before running a query.
- No "SQL-like" language: Drill supports standard SQL:2003 syntax, so there is no need to learn a new "SQL-like" language or to struggle with a semi-functional BI tool.
- Standard BI tools: Drill can be used with standard BI tools such as Tableau, Excel, QlikView, and MicroStrategy.
- Multiple data sources: Drill can connect to external file systems, HBase, and Hive. A storage plugin can be implemented to make Drill work with any other data source.
- User-defined functions: Users can embed their own logic in Drill by writing custom functions. Drill provides a high-performance Java API for building user-defined functions.
The architecture of Apache Drill
Apache Drill is a low-latency distributed query engine. It can handle large-scale datasets of structured or semi-structured data, and it can query nested data in formats such as JSON and Parquet.
High-level Architecture
Drill has a distributed execution environment. The Drillbit service is at the core of Apache Drill and is responsible for accepting requests from clients, processing queries, and returning results to the client. To build a distributed environment, a Drillbit can be installed and run on all of the required nodes in a Hadoop cluster. This lets Drill maximize data locality during query execution without moving data between nodes. Drill can be accessed through the following interfaces:
- Drill Shell
- Drill Web UI
- C++ API
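As one illustration of programmatic access, the sketch below assumes the REST endpoint that a Drillbit serves on its web port (`/query.json` on port 8047 by default), which is not one of the interfaces listed above. The helper names `build_query_request` and `run_query` are hypothetical, and the host and port are assumptions about a local installation.

```python
import json
import urllib.request


def build_query_request(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, host="localhost", port=8047, timeout=30):
    """Send the query to a running Drillbit and return the decoded JSON response."""
    request = build_query_request(sql, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# cp.`employee.json` refers to sample data bundled on Drill's classpath.
SAMPLE_SQL = "SELECT full_name FROM cp.`employee.json` LIMIT 3"
```

With a Drillbit running locally, `run_query(SAMPLE_SQL)` should return a dictionary whose `rows` entry holds the result rows. Building the request separately from sending it keeps the payload easy to inspect without a live cluster.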
Query Execution
A client or application sends a query to the Drill cluster as a SQL statement. A Drillbit process runs on each active Drill node. Query execution involves the following steps:
- The Drillbit that receives the query acts as the Foreman and drives it. The parser in the Foreman parses the SQL, applying custom rules to convert SQL operators into a specific logical operator syntax.
- The parser's output is expressed in this logical operator syntax, which Drill understands. The collection of these logical operators forms a logical plan.
- The logical plan describes the work required to generate the query results: it specifies the data sources and the operations to apply.
- The Foreman sends the logical plan to a cost-based optimizer, which optimizes the order of the SQL operators in the statement.
- The optimizer applies a set of rules to convert the logical plan into a physical plan.
- The physical plan describes how to execute the query.
- A parallelizer transforms the physical plan into multiple phases, known as major and minor fragments.
- These fragments form a multilevel execution tree. The tree rewrites the query and executes it in parallel, and the generated result is sent back to the client.
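The ordering of these stages can be pictured with a toy model. The sketch below is not Drill's planner: every name in it (`PARTITIONS`, `logical_plan`, `to_physical`, `run_fragment`, `execute`) is hypothetical, the "table" is an in-memory list, and threads stand in for Drillbits. It only illustrates a logical plan (what to compute) being turned into a physical plan (how to compute it), split into per-partition fragments, run in parallel, and merged at the root.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table, split into partitions as if stored on different nodes.
PARTITIONS = [
    [{"name": "a", "age": 31}, {"name": "b", "age": 17}],
    [{"name": "c", "age": 45}],
    [{"name": "d", "age": 12}, {"name": "e", "age": 64}],
]


def logical_plan():
    """Logical plan for: SELECT name FROM people WHERE age >= 18."""
    return [("scan", "people"), ("filter", ("age", 18)), ("project", ["name"])]


def to_physical(plan):
    """Turn each logical operator into a concrete callable over a list of rows."""
    steps = []
    for op, arg in plan:
        if op == "scan":
            continue  # each fragment is handed its own partition to scan
        if op == "filter":
            column, minimum = arg
            steps.append(lambda rows, c=column, m=minimum: [r for r in rows if r[c] >= m])
        elif op == "project":
            steps.append(lambda rows, cols=arg: [{c: r[c] for c in cols} for r in rows])
        else:
            raise ValueError(f"unknown operator: {op}")
    return steps


def run_fragment(partition, steps):
    """One fragment: apply every physical step to a single partition."""
    rows = partition
    for step in steps:
        rows = step(rows)
    return rows


def execute(plan, partitions):
    """Run one fragment per partition in parallel, then merge at the root."""
    steps = to_physical(plan)
    with ThreadPoolExecutor() as pool:
        partial_results = pool.map(lambda part: run_fragment(part, steps), partitions)
        return [row for rows in partial_results for row in rows]


result = execute(logical_plan(), PARTITIONS)
```

Here `result` holds the names of the rows that pass the filter, in partition order, because `pool.map` yields results in the order of its inputs.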
Modules of Drillbit
The following image and list describe the key components of a Drillbit:
- RPC Endpoint: A low-overhead, protobuf-based RPC protocol handles communication with clients. Java and C++ API layers are also available for client applications to interact with Drill. Clients can communicate directly with a particular Drillbit or discover the available Drillbits through a ZooKeeper quorum before submitting queries. Going through ZooKeeper is the recommended approach.
- SQL Parser: Drill uses Calcite, an open-source SQL parser framework, to parse incoming queries. The parser's output is a language-agnostic, computer-friendly logical plan that represents the query.
- Optimizer: Drill applies several optimization rules to rewrite and split the query so that it executes quickly and efficiently across the nodes.
- Execution Engine: It allows distributed query processing across different nodes.
- Storage Plugin Interface: Drill uses a storage plugin interface to interact with data sources. A storage plugin provides the following information:
  - The metadata available from the source.
  - The interfaces Drill uses to read from and write to the data source.
  - The location of the data and a set of optimization rules for querying that particular data source.
- Distributed Cache: A distributed cache helps manage metadata and configuration information.
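To make the storage plugin idea more concrete, the sketch below writes out what a file-system plugin configuration might look like and how a query would refer to a file through it. The field names follow the general shape of Drill's file-based plugin configuration as an assumption, not a verified copy. The workspace name `logs`, its location, the file `events.json`, and the helper `table_reference` are all hypothetical.

```python
import json

# Assumed shape of a file-system storage plugin configuration.
# The "logs" workspace and its location are made up for illustration.
dfs_plugin = {
    "type": "file",
    "connection": "file:///",
    "workspaces": {
        "logs": {
            "location": "/var/log/app",
            "writable": False,
            "defaultInputFormat": None,
        },
    },
    "formats": {
        "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
        "json": {"type": "json", "extensions": ["json"]},
    },
}


def table_reference(plugin_name, workspace, file_name):
    """Compose a plugin.workspace.`file` reference for use in a FROM clause."""
    return f"{plugin_name}.{workspace}.`{file_name}`"


# JSON text as it would be stored for the plugin.
config_json = json.dumps(dfs_plugin, indent=2)

# A query that reads a file in place through the plugin and workspace.
query = f"SELECT * FROM {table_reference('dfs', 'logs', 'events.json')} LIMIT 10"
```

The configuration carries the location and format information, so the query itself only needs the plugin name, the workspace, and the file name.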
What are the Benefits of Apache Drill?
The benefits of Apache Drill are listed below.
- Drill scales from a single node to multiple nodes and can query petabytes of data within seconds.
- Drill supports user-defined functions, so users can create custom functions that implement their own logic.
- Its symmetrical, flexible architecture and simple installation make it easy to deploy and operate large clusters.
- It lets you get insights faster, without data loading, schema creation and maintenance, or transformations.
- Multi-structured and nested data in non-relational datastores can be analyzed directly, without transforming the data.
- Leverage existing SQL and BI tools.
What is the Difference between Drill vs. Traditional SQL-on-Hadoop Technologies?
In addition to Hadoop, Drill supports a variety of non-relational datastores. Drill takes a different approach from traditional SQL-on-Hadoop technologies (Hive, Impala, etc.). The comparison between Drill and traditional SQL-on-Hadoop technologies is below:
|Drill|Traditional SQL-on-Hadoop (Hive, Impala, etc.)|
|---|---|
|Supports a variety of non-relational datastores in addition to Hadoop, such as NoSQL and cloud storage (including multiple instances)|Single Hadoop cluster|
|Self-service, in-situ, SQL-based analytics|Data warehouse offloading|
|Standalone service, co-located with NoSQL or with Hadoop|Co-located with Hadoop|
|Self-service data exploration|IT-driven data management|
|Uses ANSI standard SQL|Use SQL-like languages|