Setting up a Hadoop Dev Environment Part 2: Presto

  14 Feb 2016


This is the second part on setting up a minimal Big Data development environment on your local machine. The aim is to use as little resources as possible, hence we are not using a VM. Previously we had a look at how to configure HDFS and Hive and we also set up Pentaho Data Integrating to connect to HDFS as well as Hive. This is a nice basic setup, however, quite often we also want a database with a short response time. Currently it’s not that easy to install Impala standalone (or outside of the CDH stack), so that leaves us with two popular alternatives: Apache Drill, druid and Presto. In this blog post we will take a look at Facebook’s Presto offering:

Installing Presto

Download the files from here. Once you extracted the files at a convenient location, let’s start configuring Presto. The configuration is fairly straight forward and consists of creating a few properties files.

Navigate inside the extracted folder and issue the following commands:

$ mkdir etc
$ cd etc
$ vi node.properties

The important bit here is to set the location of your data directory (to store logs etc). You can set the other properties to any sensible value:

node.environment=dev
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/presto/data

Save file and exit. Next define the main configuration in config.properties. Important settings are related to memory and ports:

$ vi config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080

Save file and exit. This is a suitable configuration for a dev environment on a local machine. Next we define the properties for the JVM:

$ vi jvm.properties
-server
-Xmx6G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p

Save file and exit. Finally set up the Log properties:

$ vi log.properties
com.facebook.presto=INFO

Save file and exit. Let’s also create a catalog directory so that we can soon store the configuration files in there to connect to various data stores:

$ mkdir catalog

Basics

To start Presto:

bin/launcher start

or alternatively, if you do not want to run a background task (and on first run it is best to start Presto this way so that you get an understanding if it is working straight away or not):

bin/launcher run

Important: Presto requires Java 8, so make sure you have this one set as your active java version!

Logs

Are located within the data directory that you defined in the config, so e.g. as I set it up in the tmp folder, I can read the log like so:

tail -fn 200 /tmp/presto/data/var/log/launcher.log

Connectors

Hive Metastore

We’d like to query data on our Hadoop (pseudo distributed) cluster. Presto requires a connection to the Hive Metastore to query data on HDFS.

In your Hive configuration file search for hive.metastore.uris and copy the URI e.g.:

$ vi apache-hive-0.14.0-bin/conf/hive-site.xml

Within the Presto root folder create a new file:

$ vi etc/catalog/hive.properties

and paste the following content into it (adjust accordingly as described in the official docu):

connector.name=hive-hadoop2
hive.metastore.uri=thrift://127.0.0.1:9083

That’s it. Easy, right?!

Command Line Interface

Consult the Official Docu for download link for command line client.

In a nutshell: Just download the file, rename it and make it executable:

$ mkdir presto-client-0.136
$ cd presto-client-0.136/
$ mv ~/Downloads/presto-cli-0.136-executable.jar .
$ mv presto-cli-0.136-executable.jar presto
$ chmod +x presto 

To start the client (adjust parameter values according to your setup):

./presto --server localhost:8080 --catalog hive --schema default

The official docu says: “Run the CLI with the --help option to see the available options. By default, the results of queries are paginated using the less program which is configured with a carefully selected set of options. This behavior can be overridden by setting the environment variable PRESTO_PAGER to the name of a different program such as more, or set it to an empty value to completely disable pagination.”

Now that we have the command line client running, let’s see what tables we have available (just an example):

$ ./presto --server localhost:8080 --catalog hive --schema default
presto:default> SHOW CATALOGS;
 Catalog 
---------
 hive    
 system  
 jmx     
(3 rows)

Query 20160213_142657_00002_cfqq7, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto:default> USE hive;
presto:hive> SHOW SCHEMAS;
       Schema       
--------------------
 default            
 events_dma         
 events_staging     
 information_schema 
(4 rows)

Query 20160213_142808_00005_cfqq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [4 rows, 73B] [30 rows/s, 558B/s]

presto:hive> USE events_dma;
presto:events_dma> SHOW TABLES;
    Table    
-------------
 dim_date    
 fact_events 
(2 rows)

Query 20160213_142838_00008_cfqq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:00 [2 rows, 67B] [17 rows/s, 602B/s]

presto:events_dma> SELECT * FROM fact_events LIMIT 5;
 date_tk  |        datetime         | issue_id | count |  age   
----------+-------------------------+----------+-------+--------
 20150301 | 2015-03-01 13:22:00.000 |      333 |    -1 |   2940 
 20150301 | 2015-03-01 12:33:00.000 |      333 |     1 | NULL   
 20150301 | 2015-03-01 13:22:00.000 |      333 |     1 | NULL   
 20150304 | 2015-03-04 09:01:00.000 |      333 |     1 | NULL   
 20150304 | 2015-03-04 09:01:00.000 |      333 |    -1 | 243540 
(5 rows)

Query 20160213_142852_00009_cfqq7, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:04 [23 rows, 1.61KB] [6 rows/s, 461B/s]

JDBC Driver

While it’s fun for some time to write queries on the command line, more of then not we want to use a given database with some power tool like Pentaho Data Integration, in which case we need a JDBC Driver to connect the tool to the database. The Presto JDBC Driver can be downloaded from here.

The connection string looks like this:

jdbc:presto://host:port/catalog/schema

Catalog and Schema are optional.

If you require the class name as part of the tool configuration, here you go:

com.facebook.presto.jdbc.PrestoDriver

In regards to username and password: I haven’t found any information yet on how Presto manages users and role. In client tools it seems like you can just define any value and it will connect properly. You must define though a value - whatever it is.

Web Interface

If you want to see a list of running and executed queries, Presto comes with a simple web interface which you can find under the URL you speficied in the config.properties. In my case this is http://localhost:8080.

SQL Considerations

Presto supports the full SQL standard. Take a look at the official documentation for more details.

Some important points:

  • You cannot create a table for Hive via Presto. You always have to create the table in Hive first.
  • Hive views are currently not supported.

Not all functions are documented, but there is an extremely useful SHOW FUNCTIONS command. E.g. I was trying to find a way to subtract two dates to find out the amount of days in between:

SELECT DATE_DIFF('DAY',CAST('2015-01-01' AS TIMESTAMP), CAST('2015-01-02' AS TIMESTAMP));
SELECT DATE_DIFF('DAY',CAST('2015-01-01 11:23:00' AS TIMESTAMP), CAST('2015-01-02 12:30:00' AS TIMESTAMP));

The UNIX_TIMESTAMP function commonly found in Impala and Hive does not exist as such in Presto, but there is a similar function:

SELECT TO_UNIXTIME(DATE '2015-01-02') - TO_UNIXTIME(DATE '2015-01-01');

Once you run the SHOW FUNCTIONS command you will realise that Presto has plenty of functions to get your work done!

Creating Views

Now this is something rather interesting and possibly confusing: You cannot query a view, which was originally created in Hive, but you can create a view from within the Presto SQL client which will be stored as blobs by the connector within the Hive Metastore, but these views cannot be queried by Hive, only by Presto. Source

Special Hive connector settings:

Many of the features are disabled by default. Enable them by adding them to the Hive catalog properties:

Property Description
hive.allow-rename-table=true allow renaming of tables
hive.max-initial-split-size configures the size of the initial split
hive.max-initial-splits configures the number of initial splits. This can be useful for speeding up small queries, which would otherwise have low parallelism. Applies to the one above as well.
hive.allow-drop-table=true allow dropping tables
hive.s3.max-connections sets the maximum number of connections to S3
hive.s3.max-retry-time maximum retry time for the Hive S3 file system
hive.force-local-scheduling and session property force_local_scheduling force local scheduling of splits.
comments powered by Disqus