- A SQL query engine for a variety of non-relational datastores
  - Hive, MongoDB, HBase
  - Flat JSON files, Parquet files on HDFS, S3, Azure, Google Cloud Storage, and the local filesystem
- Based on Google's Dremel
- Not SQL-like -- actually SQL
- It has an ODBC/JDBC driver, so other tools can connect to it like any relational database (e.g. Tableau)
- Still not a relational database -- don't push it with big joins etc.
- Allows SQL analysis without having to transform and load data into a relational database
  - If you know how to write SQL, you can use Drill
  - You can even do joins between different database technologies, or with flat JSON files (see the sketch below)
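As a rough illustration (not part of the activity below), Drill's `dfs` storage plugin lets you query a flat file by its path and even join it against another plugin's tables. The JSON file path and the `movies` table here are hypothetical:

```sql
-- Hypothetical sketch: query a flat JSON file through Drill's dfs storage plugin,
-- then join it against a Hive table. The file path and the movies table are made up.
SELECT j.movie_id, j.rating
FROM dfs.`/data/extra_ratings.json` j
LIMIT 10;

SELECT h.title, AVG(j.rating) AS avg_rating
FROM hive.movielens.movies h
JOIN dfs.`/data/extra_ratings.json` j ON h.movie_id = j.movie_id
GROUP BY h.title;
```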
- Start MongoDB from Ambari
- Open Hive View from Ambari
- Import the ratings data (a SQL equivalent is sketched after these steps)
  - Create the database first (run this in the Hive View query editor):
    ```sql
    CREATE DATABASE IF NOT EXISTS movielens;
    ```
  - Click on the "Upload Table" button
  - Change the CSV Field Delimiter to `TAB`
  - Choose the `u.data` file
  - Select the `movielens` database
  - Name the table `ratings`
  - Change the column names to `user_id`, `movie_id`, `rating`, `epoch_seconds`
  - Click on "Upload"
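The Upload Table wizard above is roughly equivalent to the following Hive SQL (a sketch only; it assumes `u.data` has already been uploaded to HDFS, and the path shown is hypothetical):

```sql
-- Hypothetical SQL equivalent of the Upload Table wizard.
-- Assumes u.data is already in HDFS at the path below.
CREATE TABLE IF NOT EXISTS movielens.ratings (
  user_id INT,
  movie_id INT,
  rating INT,
  epoch_seconds BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/maria_dev/ml-100k/u.data' INTO TABLE movielens.ratings;
```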
- Import data to MongoDB
  ```bash
  ssh maria_dev@localhost -p 2222
  sudo su
  # the connector package version may vary with your Spark/Scala setup
  spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0 MongoSpark.py
  ```
- Install Drill
  ```bash
  wget http://archive.apache.org/dist/drill/drill-1.12.0/apache-drill-1.12.0.tar.gz
  tar -xvf apache-drill-1.12.0.tar.gz
  ```
- Start Drill
  ```bash
  cd apache-drill-1.12.0
  bin/drillbit.sh start -Ddrill.exec.http.port=8765
  ```
- Connect through the browser: http://127.0.0.1:8765
- Connect to MongoDB and Hive
  - Click on the `Storage` tab and make sure the following are enabled: `hive`, `mongo`
  - Click `Update` next to `hive`
    - Change `hive.metastore.uris` to `thrift://localhost:9083`
  - Mongo works out of the box
- Go to http://127.0.0.1:8765
- Click on the `Query` tab
- Write and submit the following query:
  ```sql
  SHOW DATABASES;
  ```
- Query the data from Hive and MongoDB
  ```sql
  SELECT * FROM hive.movielens.ratings LIMIT 10;
  SELECT * FROM mongo.movielens.users LIMIT 10;
  ```
- Tie the two together by joining data from Hive and MongoDB (how many ratings were provided by each occupation?)
  - Join the users and ratings tables on the `user_id` column and group by `occupation`
  ```sql
  SELECT u.occupation, COUNT(*)
  FROM hive.movielens.ratings r
  JOIN mongo.movielens.users u ON r.user_id = u.user_id
  GROUP BY u.occupation;
  ```
- Stop Drill from the command line
  ```bash
  bin/drillbit.sh stop
  ```
- Stop MongoDB from Ambari
- A SQL driver for HBase
- It supports transactions
- Fast, low latency - Online Transaction Processing (OLTP) support
- Developed by Salesforce, then open sourced
- Exposes a JDBC connector for HBase
- Supports secondary indices and UDFs (user-defined functions) -- see the index sketch after this list
- Integrates with MapReduce, Spark, Hive, Pig, and Flume
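As a rough illustration of what a secondary index looks like in Phoenix (not part of the original activity; it reuses the `users` table created later in this section):

```sql
-- Hypothetical sketch: a secondary index on the users table defined further below.
-- Phoenix maintains the index automatically and can use it for queries that
-- filter on OCCUPATION without scanning the whole table.
CREATE INDEX IF NOT EXISTS users_by_occupation ON users (OCCUPATION) INCLUDE (AGE);

SELECT OCCUPATION, AGE FROM users WHERE OCCUPATION = 'programmer';
```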
Why Phoenix?
- Really fast
- Why Phoenix and not Drill?
- Choose the right tool for the job
- Focuses exclusively on HBase
- Why not HBase's native clients?
- SQL might be easier to use
- Existing applications might already be using SQL
- Phoenix can do a lot of optimizations for you
```
Phoenix Client (HBase API)
    -> HBase Region Server(s) (with Phoenix Co-Processor)
        -> HDFS
        -> ZooKeeper
```
- Command-line interface
- JDBC driver
- Java API
- Phoenix Query Server (PQS)
- Intended to eventually enable non-JVM access
- Provides JARs for MapReduce, Pig, Hive, Flume, and Spark
- Make sure HBase is running from Ambari (by default it is not)
- Log in to the VM and start Phoenix
  ```bash
  ssh maria_dev@localhost -p 2222
  sudo su
  cd /usr/hdp/current/phoenix-client/
  python bin/sqlline.py
  ```
- Create a table with US population data
  ```sql
  CREATE TABLE IF NOT EXISTS us_population (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city));
  !tables
  ```
- Insert some data (`INSERT` doesn't work, use `UPSERT`)
  ```sql
  UPSERT INTO us_population VALUES ('NY', 'New York', 8143197);
  UPSERT INTO us_population VALUES ('CA', 'Los Angeles', 3844829);
  ```
- Query the data
  ```sql
  SELECT * FROM us_population;
  SELECT * FROM us_population WHERE state = 'CA';
  ```
- Clean up
  ```sql
  DROP TABLE us_population;
  !quit
  ```
- Use the `u.user` file from the MovieLens dataset
- We will use Pig to load the data into HBase via Phoenix
- We will then query the data using Phoenix
- Create a table in Phoenix
  ```sql
  CREATE TABLE IF NOT EXISTS users (
    USERID INTEGER NOT NULL,
    AGE INTEGER,
    GENDER CHAR(1),
    OCCUPATION VARCHAR,
    ZIP VARCHAR
    CONSTRAINT my_pk PRIMARY KEY (USERID));
  ```
- Load the data into HBase via Pig and Phoenix
  - Go to the home directory: `cd /home/maria_dev`
  - Create a directory for the ml-100k dataset (`mkdir ml-100k`) if it doesn't exist, then download the data:
    ```bash
    cd ml-100k
    wget http://media.sundog-soft.com/hadoop/ml-100k/u.user
    cd ..
    ```
  - Get the Pig script:
    ```bash
    wget http://media.sundog-soft.com/hadoop/phoenix.pig
    ```
  - The script looks like this:
    ```pig
    set zookeeper.znode.parent '/hbase-unsecure';
    REGISTER /usr/hdp/current/phoenix-client/phoenix-client.jar;

    users = LOAD '/user/maria_dev/ml-100k/u.user' USING PigStorage('|')
        AS (USERID:int, AGE:int, GENDER:chararray, OCCUPATION:chararray, ZIP:chararray);

    STORE users INTO 'hbase://users'
        USING org.apache.phoenix.pig.PhoenixHBaseStorage('localhost', '-batchSize 5000');

    occupations = LOAD 'hbase://table/users/USERID,OCCUPATION'
        USING org.apache.phoenix.pig.PhoenixHBaseLoader('localhost');

    grpd = GROUP occupations BY OCCUPATION;
    cnt = FOREACH grpd GENERATE group AS OCCUPATION, COUNT(occupations);
    DUMP cnt;
    ```
- Run the Pig script
  ```bash
  pig phoenix.pig
  ```
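- Optionally, verify the load from `sqlline` before cleaning up -- a sketch that mirrors the Pig aggregation in Phoenix SQL (assumes the `users` table created above):
  ```sql
  -- Optional check: count loaded rows and reproduce the occupation counts in SQL
  SELECT COUNT(*) FROM users;
  SELECT OCCUPATION, COUNT(*) FROM users GROUP BY OCCUPATION;
  ```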
- Clean up:
  ```
  /usr/hdp/current/phoenix-client/bin/sqlline.py
  DROP TABLE users;
  ```
- Stop HBase from Ambari
- Similar to Apache Drill
- Can connect to many different "big data" databases
- Familiar SQL syntax
- Optimized for OLAP (Online Analytical Processing), data warehousing
- Exposes JDBC, CLI, and Tableau interfaces
- Made by Facebook, and still partially maintained by them
- Open source
- Can talk to Cassandra - Drill can't
- Can't talk to MongoDB - Drill can
- Why use it instead of Drill?
  - It has a Cassandra connector
  - It's used by Facebook against 30 PB of data, and also by Dropbox and Airbnb
- A single Presto query can combine data from multiple sources (see the sketch after this list):
- Cassandra
- Hive
- MongoDB
- MySQL
- PostgreSQL
- Local files
- Kafka
- JMX
- Redis
- Accumulo
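For illustration only (not part of the activity below), Presto addresses every table as `catalog.schema.table`, so one query can mix catalogs; the `mysql` catalog and its table here are hypothetical:

```sql
-- Hypothetical sketch: combine a Hive table with a table from a made-up mysql catalog.
SELECT m.plan, COUNT(*) AS ratings
FROM hive.default.ratings r
JOIN mysql.billing.accounts m ON r.user_id = m.user_id
GROUP BY m.plan;
```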
- Log in to the VM
  ```bash
  ssh maria_dev@localhost -p 2222
  sudo su
  ```
- Download and install Presto
  ```bash
  wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.165/presto-server-0.165.tar.gz
  tar -xvf presto-server-0.165.tar.gz
  ```
- Get the configuration files (Presto has a lot of configuration files; the Presto documentation explains how to set them up, but we will use pre-made ones for the HDP VM)
  ```bash
  wget https://raw.githubusercontent.com/sundog-education/hadoop/presto-hdp-config.tgz
  tar -xvf presto-hdp-config.tgz
  ```
- Download the CLI for Presto
  ```bash
  wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.165/presto-cli-0.165-executable.jar
  mv presto-cli-0.165-executable.jar presto
  chmod +x presto
  ```
- Start Presto
  ```bash
  cd presto-server-0.165
  bin/launcher start
  ```
- Connect to Presto UI: http://localhost:8090
- Run the CLI
  ```bash
  ./presto --server 127.0.0.1:8090 --catalog hive
  ```
- Run queries against Hive
  ```sql
  show tables from default;
  select * from default.ratings limit 10;
  select * from default.ratings where rating = 5 limit 10;
  select count(*) from default.ratings where rating = 1;
  ```
- Start Cassandra, enable thrift, and enter the CQL shell
  ```bash
  service cassandra start
  nodetool enablethrift
  cqlsh --cqlversion="3.4.0"
  ```
- Check if movielens keyspace exists
  ```sql
  DESCRIBE keyspaces;
  use movielens;
  describe tables;
  select * from users limit 10;
  ```
- Quit the CQL shell and set up the Cassandra connector for Presto
  ```bash
  exit
  cd /home/maria_dev/presto-server-0.165/etc/catalog
  nano cassandra.properties
  ```
- Add the following to the file
  ```
  connector.name=cassandra
  cassandra.contact-points=127.0.0.1
  ```
- Start Presto (restart it if it is still running, so it picks up the new catalog)
  ```bash
  cd /home/maria_dev/presto-server-0.165
  bin/launcher start
  ```
- Connect to the CLI
  ```bash
  ./presto --server 127.0.0.1:8090 --catalog hive,cassandra
  ```
- Query data
  ```sql
  show tables from cassandra.movielens;
  describe cassandra.movielens.users;
  select * from cassandra.movielens.users limit 10;
  select count(*) from hive.default.ratings limit 10;
  select u.occupation, count(*)
  from hive.default.ratings r
  join cassandra.movielens.users u on r.user_id = u.user_id
  group by u.occupation;
  ```