layout: post title: “Setting up a Hadoop Dev Environment Part 2: Presto” summary: This article explains how to set up Presto date: 2016-02-14 categories: Big-data tags: Big-data published: true —
This is the second part of the Setting up a minimal Big Data development environment on your local machine series. The aim is to use as little resources as possible, hence we are not using a VM. The first part of this series is available here and is prerequisite for this blog post. Previously we had a look at how to configure HDFS and Hive and we also set up Pentaho Data Integrating to connect to HDFS as well as Hive. This is a nice basic setup, however, quite often we also want a database with a short response time. Currently it’s not that easy to install Impala standalone (or outside of the CDH stack), so that leaves us with two popular alternatives: Apache Drill, druid and Presto. In this blog post we will take a look at Facebook’s Presto offering:
Installing Presto
Download the files from here. Once you extracted the files at a convenient location, let’s start configuring Presto. The configuration is fairly straight forward and consists of creating a few properties files.
Navigate inside the extracted folder and issue the following commands:
$ mkdir etc
$ cd etc
$ vi node.properties
The important bit here is to set the location of your data directory (to store logs etc). You can set the other properties to any sensible value:
node.environment=dev
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/presto/data
Save file and exit. Next define the main configuration in config.properties
. Important settings are related to memory and ports:
$ vi config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
Save file and exit. This is a suitable configuration for a dev environment on a local machine. Next we define the properties for the JVM:
$ vi jvm.properties
-server
-Xmx6G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
Save file and exit. Finally set up the Log properties:
$ vi log.properties
com.facebook.presto=INFO
Save file and exit. Let’s also create a catalog directory so that we can soon store the configuration files in there to connect to various data stores:
$ mkdir catalog
Basics
To start Presto:
bin/launcher start
or alternatively, if you do not want to run a background task (and on first run it is best to start Presto this way so that you get an understanding if it is working straight away or not):
bin/launcher run
Important: Presto requires Java 8, so make sure you have this one set as your active java version!
Logs
Are located within the data directory that you defined in the config, so e.g. as I set it up in the tmp folder, I can read the log like so:
tail -fn 200 /tmp/presto/data/var/log/launcher.log
Connectors
Hive Metastore
We’d like to query data on our Hadoop (pseudo distributed) cluster. Presto requires a connection to the Hive Metastore to query data on HDFS.
In your Hive configuration file search for hive.metastore.uris
and copy the URI e.g.:
$ vi apache-hive-0.14.0-bin/conf/hive-site.xml
Within the Presto root folder create a new file:
$ vi etc/catalog/hive.properties
and paste the following content into it (adjust accordingly as described in the official docu):
connector.name=hive-hadoop2
hive.metastore.uri=thrift://127.0.0.1:9083
That’s it. Easy, right?!
Command Line Interface
Consult the Official Docu for download link for command line client.
In a nutshell: Just download the file, rename it and make it executable:
$ mkdir presto-client-0.136
$ cd presto-client-0.136/
$ mv ~/Downloads/presto-cli-0.136-executable.jar .
$ mv presto-cli-0.136-executable.jar presto
$ chmod +x presto
To start the client (adjust parameter values according to your setup):
./presto --server localhost:8080 --catalog hive --schema default
The official docu says: “Run the CLI with the --help
option to see the available options. By default, the results of queries are paginated using the less program which is configured with a carefully selected set of options. This behavior can be overridden by setting the environment variable PRESTO_PAGER
to the name of a different program such as more, or set it to an empty value to completely disable pagination.”
Now that we have the command line client running, let’s see what tables we have available (just an example):
$ ./presto --server localhost:8080 --catalog hive --schema default
presto:default> SHOW CATALOGS;
Catalog
---------
hive
system
jmx
(3 rows)
Query 20160213_142657_00002_cfqq7, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto:default> USE hive;
presto:hive> SHOW SCHEMAS;
Schema
--------------------
default
events_dma
events_staging
information_schema
(4 rows)
Query 20160213_142808_00005_cfqq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [4 rows, 73B] [30 rows/s, 558B/s]
presto:hive> USE events_dma;
presto:events_dma> SHOW TABLES;
Table
-------------
dim_date
fact_events
(2 rows)
Query 20160213_142838_00008_cfqq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [2 rows, 67B] [17 rows/s, 602B/s]
presto:events_dma> SELECT * FROM fact_events LIMIT 5;
date_tk | datetime | issue_id | count | age
----------+-------------------------+----------+-------+--------
20150301 | 2015-03-01 13:22:00.000 | 333 | -1 | 2940
20150301 | 2015-03-01 12:33:00.000 | 333 | 1 | NULL
20150301 | 2015-03-01 13:22:00.000 | 333 | 1 | NULL
20150304 | 2015-03-04 09:01:00.000 | 333 | 1 | NULL
20150304 | 2015-03-04 09:01:00.000 | 333 | -1 | 243540
(5 rows)
Query 20160213_142852_00009_cfqq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:04 [23 rows, 1.61KB] [6 rows/s, 461B/s]
JDBC Driver
While it’s fun for some time to write queries on the command line, more of then not we want to use a given database with some power tool like Pentaho Data Integration, in which case we need a JDBC Driver to connect the tool to the database. The Presto JDBC Driver can be downloaded from here.
The connection string looks like this:
jdbc:presto://host:port/catalog/schema
Catalog and Schema are optional.
If you require the class name as part of the tool configuration, here you go:
com.facebook.presto.jdbc.PrestoDriver
In regards to username and password: I haven’t found any information yet on how Presto manages users and role. In client tools it seems like you can just define any value and it will connect properly. You must define though a value - whatever it is.
Web Interface
If you want to see a list of running and executed queries, Presto comes with a simple web interface which you can find under the URL you speficied in the config.properties
. In my case this is http://localhost:8080
.
SQL Considerations
Presto supports the full SQL standard. Take a look at the official documentation for more details.
Some important points:
- You cannot create a table for Hive via Presto. You always have to create the table in Hive first.
- Hive views are currently not supported.
Not all functions are documented, but there is an extremely useful SHOW FUNCTIONS
command. E.g. I was trying to find a way to subtract two dates to find out the amount of days in between:
SELECT DATE_DIFF('DAY',CAST('2015-01-01' AS TIMESTAMP), CAST('2015-01-02' AS TIMESTAMP));
SELECT DATE_DIFF('DAY',CAST('2015-01-01 11:23:00' AS TIMESTAMP), CAST('2015-01-02 12:30:00' AS TIMESTAMP));
The UNIX_TIMESTAMP
function commonly found in Impala and Hive does not exist as such in Presto, but there is a similar function:
SELECT TO_UNIXTIME(DATE '2015-01-02') - TO_UNIXTIME(DATE '2015-01-01');
Once you run the SHOW FUNCTIONS
command you will realise that Presto has plenty of functions to get your work done!
Creating Views
Now this is something rather interesting and possibly confusing: You cannot query a view, which was originally created in Hive, but you can create a view from within the Presto SQL client which will be stored as blobs by the connector within the Hive Metastore, but these views cannot be queried by Hive, only by Presto. Source
Special Hive connector settings:
Many of the features are disabled by default. Enable them by adding them to the Hive catalog properties:
Property | Description |
---|---|
hive.allow-rename-table=true |
allow renaming of tables |
hive.max-initial-split-size |
configures the size of the initial split |
hive.max-initial-splits |
configures the number of initial splits. This can be useful for speeding up small queries, which would otherwise have low parallelism. Applies to the one above as well. |
hive.allow-drop-table=true |
allow dropping tables |
hive.s3.max-connections |
sets the maximum number of connections to S3 |
hive.s3.max-retry-time |
maximum retry time for the Hive S3 file system |
hive.force-local-scheduling and session property force_local_scheduling |
force local scheduling of splits. |