- A SQL query engine for a variety of non-relational datastores
  - Hive, MongoDB, HBase
  - Flat JSON files, Parquet files on HDFS, S3, Azure, Google Cloud Storage, and the local filesystem
- Based on Google's Dremel
- Not SQL-like -- actually SQL
- It has an ODBC/JDBC driver, so other tools can connect to it like any relational database (e.g. Tableau)
- Still not a relational database -- don't push it with big joins etc.
- Allows SQL analysis without having to transform and load data into a relational database
  - If you know how to write SQL, you can use Drill
  - You can even do joins between different database technologies, or with flat JSON files (see the sketch below)
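As a rough illustration (not part of the activity below), Drill's `dfs` storage plugin lets you query a flat file by its path and even join it against another plugin's tables. The JSON file path and the `movies` table here are hypothetical:

```sql
-- Hypothetical sketch: query a flat JSON file through Drill's dfs storage plugin,
-- then join it against a Hive table. The file path and the movies table are made up.
SELECT j.movie_id, j.rating
FROM dfs.`/data/extra_ratings.json` j
LIMIT 10;

SELECT h.title, AVG(j.rating) AS avg_rating
FROM hive.movielens.movies h
JOIN dfs.`/data/extra_ratings.json` j ON h.movie_id = j.movie_id
GROUP BY h.title;
```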
- Start MongoDB from Ambari
- Open Hive View from Ambari
- Import the ratings data (a SQL equivalent is sketched after these steps)
  - Create the database first (run this in the Hive View query editor):
    ```sql
    CREATE DATABASE IF NOT EXISTS movielens;
    ```
  - Click on the "Upload Table" button
  - Change the CSV Field Delimiter to `TAB`
  - Choose the `u.data` file
  - Select the `movielens` database
  - Name the table `ratings`
  - Change the column names to `user_id`, `movie_id`, `rating`, `epoch_seconds`
  - Click on "Upload"
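The Upload Table wizard above is roughly equivalent to the following Hive SQL (a sketch only; it assumes `u.data` has already been uploaded to HDFS, and the path shown is hypothetical):

```sql
-- Hypothetical SQL equivalent of the Upload Table wizard.
-- Assumes u.data is already in HDFS at the path below.
CREATE TABLE IF NOT EXISTS movielens.ratings (
  user_id INT,
  movie_id INT,
  rating INT,
  epoch_seconds BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/maria_dev/ml-100k/u.data' INTO TABLE movielens.ratings;
```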
- Import data to MongoDB
  ```bash
  ssh maria_dev@localhost -p 2222
  sudo su
  # the connector package version may vary with your Spark/Scala setup
  spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 MongoSpark.py
  ```
- Install Drill
  ```bash
  wget http://archive.apache.org/dist/drill/drill-1.12.0/apache-drill-1.12.0.tar.gz
  tar -xvf apache-drill-1.12.0.tar.gz
  ```
- Start Drill
  ```bash
  cd apache-drill-1.12.0
  bin/drillbit.sh start -Ddrill.exec.http.port=8765
  ```
- Connect through the browser: http://127.0.0.1:8765
- Connect to MongoDB and Hive
  - Click on the `Storage` tab and make sure the following are enabled: `hive`, `mongo`
  - Click `Update` next to `hive`
    - Change `hive.metastore.uris` to `thrift://localhost:9083`
  - Mongo works out of the box
- Go to http://127.0.0.1:8765
- Click on the `Query` tab
- Write and submit the following query:
  ```sql
  SHOW DATABASES;
  ```
- Query the data from Hive and MongoDB
  ```sql
  SELECT * FROM hive.movielens.ratings LIMIT 10;
  SELECT * FROM mongo.movielens.users LIMIT 10;
  ```
- Tie the two together by joining data from Hive and MongoDB (how many ratings were provided by each occupation?)
  - Join the users and ratings tables on the `user_id` column and group by `occupation`
  ```sql
  SELECT u.occupation, COUNT(*)
  FROM hive.movielens.ratings r
  JOIN mongo.movielens.users u ON r.user_id = u.user_id
  GROUP BY u.occupation;
  ```
- Stop Drill from the command line
  ```bash
  bin/drillbit.sh stop
  ```
- Stop MongoDB from Ambari
- A SQL driver for HBase
- It supports transactions
- Fast, low latency - Online Transaction Processing (OLTP) support
- Developed by Salesforce, then open sourced
- Exposes a JDBC connector for HBase
- Supports secondary indices and UDFs (user-defined functions) -- see the index sketch after this list
- Integrates with MapReduce, Spark, Hive, Pig, and Flume
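As a rough illustration of what a secondary index looks like in Phoenix (not part of the original activity; it reuses the `users` table created later in this section):

```sql
-- Hypothetical sketch: a secondary index on the users table defined further below.
-- Phoenix maintains the index automatically and can use it for queries that
-- filter on OCCUPATION without scanning the whole table.
CREATE INDEX IF NOT EXISTS users_by_occupation ON users (OCCUPATION) INCLUDE (AGE);

SELECT OCCUPATION, AGE FROM users WHERE OCCUPATION = 'programmer';
```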
Why Phoenix?
- Really fast
- Why Phoenix and not Drill?
- Choose the right tool for the job
- Focuses exclusively on HBase
- Why not HBase's native clients?
- SQL might be easier to use
- Existing applications might already be using SQL
- Phoenix can do a lot of optimizations for you
```
Phoenix Client (HBase API)
    -> HBase Region Server(s) (with Phoenix Co-Processor)
        -> HDFS
        -> ZooKeeper
```
- Command-line interface
- JDBC driver
- Java API
- Phoenix Query Server (PQS)
- Intended to eventually enable non-JVM access
- Provides JARs for MapReduce, Pig, Hive, Flume, and Spark
- Make sure HBase is running from Ambari (by default it is not)
- Log in to the VM and start Phoenix
  ```bash
  ssh maria_dev@localhost -p 2222
  sudo su
  cd /usr/hdp/current/phoenix-client/
  python bin/sqlline.py
  ```
- Create a table with US population data
  ```sql
  CREATE TABLE IF NOT EXISTS us_population (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city));
  !tables
  ```
- Insert some data (`INSERT` doesn't work, use `UPSERT`)
  ```sql
  UPSERT INTO us_population VALUES ('NY', 'New York', 8143197);
  UPSERT INTO us_population VALUES ('CA', 'Los Angeles', 3844829);
  ```
- Query the data
  ```sql
  SELECT * FROM us_population;
  SELECT * FROM us_population WHERE state = 'CA';
  ```
- Clean up
  ```sql
  DROP TABLE us_population;
  !quit
  ```
- Use the `u.user` file from the MovieLens dataset
- We will use Pig to load the data into HBase via Phoenix
- We will then query the data using Phoenix
- Create a table in Phoenix
  ```sql
  CREATE TABLE IF NOT EXISTS users (
    USERID INTEGER NOT NULL,
    AGE INTEGER,
    GENDER CHAR(1),
    OCCUPATION VARCHAR,
    ZIP VARCHAR
    CONSTRAINT my_pk PRIMARY KEY (USERID));
  ```
- Load the data into HBase via Pig and Phoenix
  - Go to the home directory: `cd /home/maria_dev`
  - Create a directory for the ml-100k dataset (`mkdir ml-100k`) if it doesn't exist, then download the data:
    ```bash
    cd ml-100k
    wget http://media.sundog-soft.com/hadoop/ml-100k/u.user
    cd ..
    ```
  - Get the Pig script:
    ```bash
    wget http://media.sundog-soft.com/hadoop/phoenix.pig
    ```
  - The script looks like this:
    ```pig
    set zookeeper.znode.parent '/hbase-unsecure';
    REGISTER /usr/hdp/current/phoenix-client/phoenix-client.jar;

    users = LOAD '/user/maria_dev/ml-100k/u.user' USING PigStorage('|')
        AS (USERID:int, AGE:int, GENDER:chararray, OCCUPATION:chararray, ZIP:chararray);

    STORE users INTO 'hbase://users'
        USING org.apache.phoenix.pig.PhoenixHBaseStorage('localhost', '-batchSize 5000');

    occupations = LOAD 'hbase://table/users/USERID,OCCUPATION'
        USING org.apache.phoenix.pig.PhoenixHBaseLoader('localhost');

    grpd = GROUP occupations BY OCCUPATION;
    cnt = FOREACH grpd GENERATE group AS OCCUPATION, COUNT(occupations);
    DUMP cnt;
    ```
- Run the Pig script
  ```bash
  pig phoenix.pig
  ```
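- Optionally, verify the load from `sqlline` before cleaning up -- a sketch that mirrors the Pig aggregation in Phoenix SQL (assumes the `users` table created above):
  ```sql
  -- Optional check: count loaded rows and reproduce the occupation counts in SQL
  SELECT COUNT(*) FROM users;
  SELECT OCCUPATION, COUNT(*) FROM users GROUP BY OCCUPATION;
  ```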
- Clean up:
  ```
  /usr/hdp/current/phoenix-client/bin/sqlline.py
  DROP TABLE users;
  ```
- Stop HBase from Ambari
- Similar to Apache Drill
- Can connect to many different "big data" databases
- Familiar SQL syntax
- Optimized for OLAP (Online Analytical Processing), data warehousing
- Exposes JDBC, CLI, and Tableau interfaces
- Made by Facebook, and still partially maintained by them
- Open source
- Can talk to Cassandra - Drill can't
- Can't talk to MongoDB - Drill can
- Why use it instead of Drill?
  - It has a Cassandra connector
  - It's used by Facebook against 30 PB of data, and also by Dropbox and Airbnb
- A single Presto query can combine data from multiple sources (see the sketch after this list):
- Cassandra
- Hive
- MongoDB
- MySQL
- PostgreSQL
- Local files
- Kafka
- JMX
- Redis
- Accumulo
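For illustration only (not part of the activity below), Presto addresses every table as `catalog.schema.table`, so one query can mix catalogs; the `mysql` catalog and its table here are hypothetical:

```sql
-- Hypothetical sketch: combine a Hive table with a table from a made-up mysql catalog.
SELECT m.plan, COUNT(*) AS ratings
FROM hive.default.ratings r
JOIN mysql.billing.accounts m ON r.user_id = m.user_id
GROUP BY m.plan;
```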
- Log in to the VM
  ```bash
  ssh maria_dev@localhost -p 2222
  sudo su
  ```
- Download and install Presto
  ```bash
  wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.165/presto-server-0.165.tar.gz
  tar -xvf presto-server-0.165.tar.gz
  ```
- Get the configuration files (Presto has a lot of configuration files; the Presto documentation explains how to set them up, but we will use pre-made ones for the HDP VM)
  ```bash
  wget https://raw.githubusercontent.com/sundog-education/hadoop/presto-hdp-config.tgz
  tar -xvf presto-hdp-config.tgz
  ```
- Download the CLI for Presto
  ```bash
  wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.165/presto-cli-0.165-executable.jar
  mv presto-cli-0.165-executable.jar presto
  chmod +x presto
  ```
- Start Presto
  ```bash
  cd presto-server-0.165
  bin/launcher start
  ```
- Connect to Presto UI: http://localhost:8090
- Run the CLI
  ```bash
  ./presto --server 127.0.0.1:8090 --catalog hive
  ```
- Run queries against Hive
  ```sql
  show tables from default;
  select * from default.ratings limit 10;
  select * from default.ratings where rating = 5 limit 10;
  select count(*) from default.ratings where rating = 1;
  ```
- Start Cassandra, enable thrift, and enter the CQL shell
  ```bash
  service cassandra start
  nodetool enablethrift
  cqlsh --cqlversion="3.4.0"
  ```
- Check if movielens keyspace exists
  ```sql
  DESCRIBE keyspaces;
  use movielens;
  describe tables;
  select * from users limit 10;
  ```
- Quit the CQL shell and set up the Cassandra connector for Presto
  ```bash
  exit
  cd /home/maria_dev/presto-server-0.165/etc/catalog
  nano cassandra.properties
  ```
- Add the following to the file
  ```
  connector.name=cassandra
  cassandra.contact-points=127.0.0.1
  ```
- Start Presto (restart it if it is still running, so it picks up the new catalog)
  ```bash
  cd /home/maria_dev/presto-server-0.165
  bin/launcher start
  ```
- Connect to the CLI
  ```bash
  ./presto --server 127.0.0.1:8090 --catalog hive,cassandra
  ```
- Query data
  ```sql
  show tables from cassandra.movielens;
  describe cassandra.movielens.users;
  select * from cassandra.movielens.users limit 10;
  select count(*) from hive.default.ratings limit 10;
  select u.occupation, count(*)
  from hive.default.ratings r
  join cassandra.movielens.users u on r.user_id = u.user_id
  group by u.occupation;
  ```