This is the second part of the Setting up a minimal Big Data development environment on your local machine series. The aim is to use as little resources as possible, hence we are not using a VM. The first part of this series is available here and is prerequisite for this blog post. Previously we had a look at how to configure HDFS and Hive and we also set up Pentaho Data Integrating to connect to HDFS as well as Hive. This is a nice basic setup, however, quite often we also want a database with a short response time. Currently it’s not that easy to install Impala standalone (or outside of the CDH stack), so that leaves us with two popular alternatives: Apache Drill, druid and Presto. In this blog post we will take a look at Facebook’s Presto offering:
Download the files from here. Once you extracted the files at a convenient location, let’s start configuring Presto. The configuration is fairly straight forward and consists of creating a few properties files.
Navigate inside the extracted folder and issue the following commands:
$ mkdir etc $ cd etc $ vi node.properties
The important bit here is to set the location of your data directory (to store logs etc). You can set the other properties to any sensible value:
node.environment=dev node.id=ffffffff-ffff-ffff-ffff-ffffffffffff node.data-dir=/tmp/presto/data
Save file and exit. Next define the main configuration in
config.properties. Important settings are related to memory and ports:
$ vi config.properties coordinator=true node-scheduler.include-coordinator=true http-server.http.port=8080 query.max-memory=5GB query.max-memory-per-node=1GB discovery-server.enabled=true discovery.uri=http://localhost:8080
Save file and exit. This is a suitable configuration for a dev environment on a local machine. Next we define the properties for the JVM:
$ vi jvm.properties -server -Xmx6G -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p
Save file and exit. Finally set up the Log properties:
$ vi log.properties com.facebook.presto=INFO
Save file and exit. Let’s also create a catalog directory so that we can soon store the configuration files in there to connect to various data stores:
$ mkdir catalog
To start Presto:
or alternatively, if you do not want to run a background task (and on first run it is best to start Presto this way so that you get an understanding if it is working straight away or not):
Important: Presto requires Java 8, so make sure you have this one set as your active java version!
Are located within the data directory that you defined in the config, so e.g. as I set it up in the tmp folder, I can read the log like so:
tail -fn 200 /tmp/presto/data/var/log/launcher.log
We’d like to query data on our Hadoop (pseudo distributed) cluster. Presto requires a connection to the Hive Metastore to query data on HDFS.
In your Hive configuration file search for
hive.metastore.uris and copy the URI e.g.:
$ vi apache-hive-0.14.0-bin/conf/hive-site.xml
Within the Presto root folder create a new file:
$ vi etc/catalog/hive.properties
and paste the following content into it (adjust accordingly as described in the official docu):
That’s it. Easy, right?!
Command Line Interface
Consult the Official Docu for download link for command line client.
In a nutshell: Just download the file, rename it and make it executable:
$ mkdir presto-client-0.136 $ cd presto-client-0.136/ $ mv ~/Downloads/presto-cli-0.136-executable.jar . $ mv presto-cli-0.136-executable.jar presto $ chmod +x presto
To start the client (adjust parameter values according to your setup):
./presto --server localhost:8080 --catalog hive --schema default
The official docu says: “Run the CLI with the
--help option to see the available options. By default, the results of queries are paginated using the less program which is configured with a carefully selected set of options. This behavior can be overridden by setting the environment variable
PRESTO_PAGER to the name of a different program such as more, or set it to an empty value to completely disable pagination.”
Now that we have the command line client running, let’s see what tables we have available (just an example):
$ ./presto --server localhost:8080 --catalog hive --schema default presto:default> SHOW CATALOGS; Catalog --------- hive system jmx (3 rows) Query 20160213_142657_00002_cfqq7, FINISHED, 1 node Splits: 1 total, 1 done (100.00%) 0:00 [0 rows, 0B] [0 rows/s, 0B/s] presto:default> USE hive; presto:hive> SHOW SCHEMAS; Schema -------------------- default events_dma events_staging information_schema (4 rows) Query 20160213_142808_00005_cfqq7, FINISHED, 1 node Splits: 2 total, 2 done (100.00%) 0:00 [4 rows, 73B] [30 rows/s, 558B/s] presto:hive> USE events_dma; presto:events_dma> SHOW TABLES; Table ------------- dim_date fact_events (2 rows) Query 20160213_142838_00008_cfqq7, FINISHED, 1 node Splits: 2 total, 2 done (100.00%) 0:00 [2 rows, 67B] [17 rows/s, 602B/s] presto:events_dma> SELECT * FROM fact_events LIMIT 5; date_tk | datetime | issue_id | count | age ----------+-------------------------+----------+-------+-------- 20150301 | 2015-03-01 13:22:00.000 | 333 | -1 | 2940 20150301 | 2015-03-01 12:33:00.000 | 333 | 1 | NULL 20150301 | 2015-03-01 13:22:00.000 | 333 | 1 | NULL 20150304 | 2015-03-04 09:01:00.000 | 333 | 1 | NULL 20150304 | 2015-03-04 09:01:00.000 | 333 | -1 | 243540 (5 rows) Query 20160213_142852_00009_cfqq7, FINISHED, 1 node Splits: 2 total, 2 done (100.00%) 0:04 [23 rows, 1.61KB] [6 rows/s, 461B/s]
While it’s fun for some time to write queries on the command line, more of then not we want to use a given database with some power tool like Pentaho Data Integration, in which case we need a JDBC Driver to connect the tool to the database. The Presto JDBC Driver can be downloaded from here.
The connection string looks like this:
Catalog and Schema are optional.
If you require the class name as part of the tool configuration, here you go:
In regards to username and password: I haven’t found any information yet on how Presto manages users and role. In client tools it seems like you can just define any value and it will connect properly. You must define though a value - whatever it is.
If you want to see a list of running and executed queries, Presto comes with a simple web interface which you can find under the URL you speficied in the
config.properties. In my case this is
Presto supports the full SQL standard. Take a look at the official documentation for more details.
Some important points:
- You cannot create a table for Hive via Presto. You always have to create the table in Hive first.
- Hive views are currently not supported.
Not all functions are documented, but there is an extremely useful
SHOW FUNCTIONS command. E.g. I was trying to find a way to subtract two dates to find out the amount of days in between:
SELECT DATE_DIFF('DAY',CAST('2015-01-01' AS TIMESTAMP), CAST('2015-01-02' AS TIMESTAMP)); SELECT DATE_DIFF('DAY',CAST('2015-01-01 11:23:00' AS TIMESTAMP), CAST('2015-01-02 12:30:00' AS TIMESTAMP));
UNIX_TIMESTAMP function commonly found in Impala and Hive does not exist as such in Presto, but there is a similar function:
SELECT TO_UNIXTIME(DATE '2015-01-02') - TO_UNIXTIME(DATE '2015-01-01');
Once you run the
SHOW FUNCTIONS command you will realise that Presto has plenty of functions to get your work done!
Now this is something rather interesting and possibly confusing: You cannot query a view, which was originally created in Hive, but you can create a view from within the Presto SQL client which will be stored as blobs by the connector within the Hive Metastore, but these views cannot be queried by Hive, only by Presto. Source
Special Hive connector settings:
Many of the features are disabled by default. Enable them by adding them to the Hive catalog properties:
||allow renaming of tables|
||configures the size of the initial split|
||configures the number of initial splits. This can be useful for speeding up small queries, which would otherwise have low parallelism. Applies to the one above as well.|
||allow dropping tables|
||sets the maximum number of connections to S3|
||maximum retry time for the Hive S3 file system|
||force local scheduling of splits.|