Introduction – Geospatial Analytics using Presto and Hive
Geospatial analytics deals with data used for locating anything on the globe. From an Uber driver to a person finding their way around a new neighbourhood, everybody uses geospatial data in some way or the other. Geospatial technology involves GPS (global positioning systems), GIS (geographic information systems), and RS (remote sensing). This data is collected in humongous amounts daily, and it needs to be extracted efficiently and judiciously by any software that depends on it. In this blog we will explore geospatial analytics using Presto and Hive in depth, starting with the basics and then diving into the details.
Why is Geospatial Data Necessary?
It is necessary for many things and is used daily for various reasons. From the commute of an ordinary person to the guidance data in the missiles of a country's defence organization, everything requires geospatial data. It is extracted from various sources: every phone with an active internet connection contributes to geospatial data in some way, and satellites collect data daily. It is of great use in everyday life, and so it deserves a significant amount of attention. It can be used for many purposes: to help respond to natural hazards and disasters, and to track global climate change, wildlife, natural resources, etc. It is also used for satellite imagery, whether for tactical or weather-forecasting purposes. Many tech giants like Uber use it on a daily basis to ease everyday life. To stand out in the market, a company has to extract this data efficiently and put it to good use.
How to retrieve Geospatial Data?
Various methods can do this, but mainly Presto and Hive are used to extract and reshape data that is present in hundreds of petabytes and use it efficiently to make the lives of billions easier. This data is vital, as it touches the vast majority of people and is used every second. GIS is the part of geospatial technology that helps in the collection, storage, manipulation, analysis, and presentation of spatial data. Whatever the situation at the local, regional or national level, whenever the question "where?" is asked, geospatial analytics comes into play. Geospatial analytics would not be effective without visualization.
We create maps using abstract shapes and colors to reveal geographic patterns and tell stories about human existence.
Taken From Article, Geospatial Visualization Tools and Techniques
Geospatial Analytics Using Presto
Presto is an open-source distributed SQL query engine, used to run queries against data of any size or type. It can query data stored in Hadoop, and it supports many non-relational sources as well as Teradata. It queries data in its respective location without moving the actual data to any separate system. Query execution runs in parallel over a pure memory-based architecture, with most results returning within seconds. Many tech giants use it, and it is a popular choice for interactive queries on data ranging into hundreds of petabytes.
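As a sketch of what this looks like in practice, the following Presto query aggregates data in place on S3 or HDFS without moving it. The catalog, schema, table and column names here are hypothetical, purely for illustration:

```sql
-- Illustrative only: hive.rides.trips is a made-up catalog.schema.table.
-- Presto reads the data where it lives (e.g. ORC files on S3 or HDFS).
SELECT city,
       count(*) AS trip_count
FROM hive.rides.trips
WHERE ds = '2021-01-01'        -- partition column prunes the scan
GROUP BY city
ORDER BY trip_count DESC
LIMIT 10;
```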
Geospatial Analytics Using Hive
It is a data warehouse infrastructure tool used to process structured data, developed on top of the Hadoop Distributed File System (HDFS). It resides on top of Hadoop to summarize big data, and it makes querying and analyzing that data accessible.
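For illustration, a minimal HiveQL session over files already sitting in HDFS might look like the following; the table, columns and location are made up for the example:

```sql
-- Hypothetical external table over raw delimited files in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS trips (
  trip_id BIGINT,
  city    STRING,
  lng     DOUBLE,
  lat     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/trips';

-- SQL-like analysis directly over the files.
SELECT city, count(*) FROM trips GROUP BY city;
```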
Architecture of Hive
Apache Hive is an ETL and data warehousing tool built on top of Hadoop. It helps to perform many operations, such as:
- Analysis of large data sets
- Data encapsulation
- Ad-hoc queries
Components of Hive
- Hive Client
- Hive Services
- Processing & Resource Management
- Distributed Storage
Hive Client
Hive supports applications written in languages like Java, Python, C++, etc., using Thrift, JDBC and ODBC drivers, so it is easy to write a Hive client application in the desired language. Hive clients are categorized into three types:
- Thrift Clients: Apache Hive's server is based on Thrift, so it can serve requests from all languages that support Thrift
- JDBC Clients: Hive allows Java applications to connect to it by using its JDBC driver
- ODBC Clients: the ODBC driver enables applications that support the ODBC protocol to connect to Hive. It uses Thrift to communicate with the Hive server.
Hive Services
Hive provides various services, such as:
- CLI(Command Line Interface) – It is the default shell provided by Hive, which helps to execute Hive queries and command directly.
- Web Interface – Hive gives an option to execute queries and commands on a web-based GUI provided by it.
- Hive Server – It is built on Apache Thrift and is also known as Thrift Server. It allows different clients to submit requests and retrieve the final result from Hive.
- Hive Driver – It is responsible for receiving the queries submitted by clients. It compiles, optimizes and executes the queries.
Architecture of Presto
There are two central parts in Presto: the Coordinator and the Worker. Presto is an open-source distributed system that can be run on multiple machines; its distributed SQL query engine was built for fast analytic queries. A Presto deployment includes one Presto Coordinator and any number of Presto Workers.
- Presto Coordinator – Receives submitted queries and manages the parsing, planning, and scheduling of query processing.
- Presto Worker – Processes the queries; adding more workers gives faster query processing.
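As a rough sketch, the two roles are set in each node's `config.properties`. The values below are illustrative, and the discovery URI must point at your own coordinator:

```properties
# Coordinator node (illustrative values)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://coordinator.example.com:8080

# Worker nodes set coordinator=false and reuse the same discovery.uri
```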
Components of Presto
Coordinator
It is the brain of any Presto installation; it manages all the Presto Worker nodes for all work related to queries. It gets results from the workers and returns the final output to the client. It communicates with workers and clients via REST.
Worker
It helps to execute tasks and to process the data. Worker nodes share intermediate data amongst each other and receive tasks from the Coordinator.
Catalog
It contains information related to the data, such as where the data is located, where the schema is located, and the data source.
Tables and Schemas
These are similar to what they mean in a relational database: a table is a set of rows organized into named columns, and a schema is what you use to hold your tables.
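In Presto, a table is addressed through its catalog and schema. A quick illustrative session (the catalog, schema and table names are hypothetical):

```sql
SHOW CATALOGS;                 -- e.g. hive, system
SHOW SCHEMAS FROM hive;        -- schemas inside the hive catalog
SHOW TABLES FROM hive.rides;   -- tables inside a schema
SELECT * FROM hive.rides.trips LIMIT 5;  -- fully-qualified catalog.schema.table
```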
Connector
It is used to help Presto integrate with an external data source.
Stages
To execute a query, Presto breaks it up into stages. Stages are implemented as a series of tasks that might get distributed over the Presto Workers.
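You can see this breakdown yourself: `EXPLAIN (TYPE DISTRIBUTED)` prints the per-stage plan fragments Presto would schedule. The table name below is hypothetical:

```sql
-- Shows the distributed plan: one fragment per stage, e.g. a partial
-- aggregation on the workers followed by a final aggregation.
EXPLAIN (TYPE DISTRIBUTED)
SELECT city, count(*)
FROM hive.rides.trips
GROUP BY city;
```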
Drivers and Operators
Tasks contain one or more parallel drivers, and drivers are sequences of operators in memory. An operator consumes, transforms and produces data.
Deployment Strategies for Hive
Amazon EMR can be used to deploy a Hive metastore. Users can opt for one of three configurations that Amazon has to offer, namely: embedded, local or remote.
There are two options for creating an external Hive metastore for EMR:
- Using the AWS Glue Data Catalog
- Using Amazon RDS / Amazon Aurora
Apache Hive on Cloud Dataproc provides an efficient and flexible way of working by storing Hive data in Cloud Storage and hosting the Hive metastore in a MySQL database on Cloud SQL. It offers advantages like flexibility and agility, letting users tailor the cluster configuration to specific workloads and scale the cluster according to need. It also helps in saving cost.
Deployment Strategies for Presto
Amazon EMR allows you to quickly spin up a managed cluster with the Presto query engine and run interactive analysis on data stored in Amazon S3. Presto implementations can be built in the cloud on Amazon Web Services: both Amazon EMR and Amazon Athena provide ways to build and run Presto.
On Google Cloud Dataproc, a cluster that includes the Presto component can be prepared just as easily.
Ways to enhance optimization in Presto and Hive
- Tez Execution Engine in Hive – Tez is an application framework built on Hadoop YARN; using it instead of classic MapReduce speeds up Hive queries.
- Usage of a Suitable File Format in Hive – Using an appropriate file format on the basis of the data will drastically increase query performance. The ORC file format is best suited for this.
- Hive Partitioning – By partitioning the entries into different datasets, only the required data is read during query execution, making performance more efficient and optimized.
- Bucketing in Hive – Bucketing helps divide datasets into more manageable parts; users can set the size of these parts (buckets) too.
- Vectorization in Hive – Vectorized query execution is used for more optimized Hive performance. It works by performing operations such as aggregation over batches of 1024 rows at once instead of a single row at a time.
- Cost-Based Optimization (CBO) in Hive – It performs optimizations based on query cost. To use CBO, the relevant parameters must be set at the beginning of the session.
- Hive Indexing – Indexing helps increase optimization: it speeds up query execution by reducing the data that must be scanned.
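Several of the Hive optimizations above are just DDL options and session settings. A minimal sketch, with hypothetical table and column names and the standard Hive property names:

```sql
-- ORC storage, partitioning and bucketing combined in one table.
CREATE TABLE trips (
  trip_id BIGINT,
  city    STRING
)
PARTITIONED BY (ds STRING)              -- only matching partitions are scanned
CLUSTERED BY (trip_id) INTO 32 BUCKETS  -- bucketing into manageable parts
STORED AS ORC;

-- Session settings for vectorization and cost-based optimization.
SET hive.vectorized.execution.enabled = true;
SET hive.cbo.enable = true;
```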
- File format – Usage of the ORC file format is best suited for optimizing query execution in Presto as well.
- Presto can choose the join distribution strategy automatically if that feature is enabled.
- The dynamic filtering feature optimizes the use of JOIN queries.
- Presto has added a new Hive connector configuration to skip corrupt records in input formats other than ORC, Parquet and RCFile.
- The number of worker threads on a Presto worker node can be tuned by setting task.max-worker-threads in config.properties; by default it is the number of CPU cores times the hyper-threads per core.
- Splits can be used for efficient and optimized execution of queries in Presto.
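For example, the worker-thread setting mentioned above lives in each worker's `config.properties`; the value below is only an illustration and should be tuned to your hardware and workload:

```properties
# Worker-side tuning (illustrative value).
# Default is CPU cores x hyper-threads per core.
task.max-worker-threads=64
```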
Advantages of Hive and Presto
- Hive is a stable query engine and has a large and active community
- Hive queries are similar to SQL, which makes them easy to understand for RDBMS professionals
- Hive supports the ORC, TextFile, RCFile, Avro and Parquet file formats
- Presto supports the ORC, Parquet and RCFile formats, eliminating the need for data transformation
- Presto works well with Amazon S3 queries and storage; it can query data in mere seconds even if the data is of the size of petabytes
- Presto also has an active community
Geospatial Analytics Using Presto and Hive
Modelling geospatial data has quite a few complexities. Well-Known Text (WKT) is used to model different locations on the map, and geometry types such as points and polygons are used for these purposes. A Hive spatial library such as the Esri Spatial Framework for Hadoop is used for spatial processing in Hive, with user-defined functions (UDFs) and SerDes. By enabling this library in Hive, queries can be written using the Hive Query Language (HQL), which is quite close to SQL. You can therefore avoid complex MapReduce algorithms and stick to a more familiar workflow.
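As a sketch, assuming the Esri Spatial Framework for Hadoop jars are on the Hive classpath, the UDFs are registered by class name and then used like ordinary HQL functions. The table, columns and polygon coordinates here are hypothetical:

```sql
-- Register the Esri geometry UDFs (requires the library's jars).
CREATE TEMPORARY FUNCTION ST_Point        AS 'com.esri.hadoop.hive.ST_Point';
CREATE TEMPORARY FUNCTION ST_Contains     AS 'com.esri.hadoop.hive.ST_Contains';
CREATE TEMPORARY FUNCTION ST_GeomFromText AS 'com.esri.hadoop.hive.ST_GeomFromText';

-- Count trips that started inside a WKT polygon.
SELECT count(*)
FROM trips
WHERE ST_Contains(
        ST_GeomFromText('POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))'),
        ST_Point(lng, lat));
```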
The Presto geospatial plugin runs in production at Uber, where more than 90% of all geospatial traffic is completed within 5 minutes. Compared with brute-force Hive MapReduce execution, Uber's Presto geospatial plugin is more than 50x faster, leading to greater efficiency.
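In Presto the equivalent predicate is expressed with its built-in geospatial functions. A hedged example follows; the table, columns and bounding polygon are hypothetical, and note that Presto's `ST_Point` takes longitude first:

```sql
-- Count pickups inside a rough San Francisco bounding polygon.
SELECT count(*)
FROM hive.rides.trips
WHERE ST_Contains(
        ST_Polygon('POLYGON ((-122.52 37.70, -122.35 37.70,
                              -122.35 37.83, -122.52 37.83,
                              -122.52 37.70))'),
        ST_Point(lng, lat));
```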
Summing up – Geospatial Analytics Using Presto and Hive
Presto has an edge over Hive, as it can be used to process unstructured data too, and query processing in Presto is faster than in Hive.