Updated readme.md #1

Open
wants to merge 14 commits into
base: master
from
24 changes: 18 additions & 6 deletions README.md
@@ -1,6 +1,6 @@
# tpch-datagen-as-hive-query
This is a set of UDFs and queries that you can use with Hive to run TPCH datagen in parallel on a Hadoop cluster. You can deploy to Azure using:
<a href="https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fdharmeshkakadia%2Ftpch-datagen-as-hive-query%2Fmaster%2Fazure%2Fazuredeploy.json" target="_blank">
<a href="https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fcruizen%2Ftpch-hdinsight%2Fmaster%2Fazure%2Fazuredeploy.json" target="_blank">
<img src="http://azuredeploy.net/deploybutton.png"/>
</a>
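
The deploy link above embeds the raw template URL in percent-encoded form. A minimal sketch of building such a link for a fork, assuming a POSIX shell and an ASCII-only URL; `urlencode` and `deploy_link` are hypothetical helper names and are not part of this repo:

```shell
# urlencode STRING: percent-encode everything except RFC 3986 unreserved characters.
# Assumes ASCII input; runs in a subshell so its variables do not leak.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    s=$rest
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
  done
  printf '%s\n' "$out"
)

# deploy_link TEMPLATE_URL: prefix the encoded template URL with the portal path used above.
deploy_link() {
  printf 'https://portal.azure.com/#create/Microsoft.Template/uri/%s\n' "$(urlencode "$1")"
}

deploy_link "https://raw.githubusercontent.com/cruizen/tpch-hdinsight/master/azure/azuredeploy.json"
```

Pointing the button at a different fork then only requires changing the raw URL passed to `deploy_link`.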

@@ -9,7 +9,7 @@ This are set of UDFs and queries that you can use with Hive to use TPCH datagen
1. Clone this repo.

```shell
git clone https://github.com/dharmeshkakadia/tpch-datagen-as-hive-query/ && cd tpch-datagen-as-hive-query
git clone https://github.com/cruizen/tpch-hdinsight.git && cd tpch-hdinsight
```
2. Run TPCHDataGen.hql with the settings.hql file, setting the required config variables.
```shell
@@ -24,6 +24,11 @@ This are set of UDFs and queries that you can use with Hive to use TPCH datagen
```shell
hive -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCH/ -hiveconf DBNAME=tpch
```
For HDI 4.0, grant other users access to the generated data on the storage by running:
```shell
hdfs dfs -chmod -R 777 /HiveTPCH
```

Generate ORC tables and analyze
```shell
hive -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpch_orc -hiveconf SOURCE=tpch
@@ -39,11 +44,11 @@ This are set of UDFs and queries that you can use with Hive to use TPCH datagen
1. Clone this repo.

```shell
git clone https://github.com/dharmeshkakadia/tpch-datagen-as-hive-query/ && cd tpch-datagen-as-hive-query
git clone https://github.com/cruizen/tpch-hdinsight.git && cd tpch-hdinsight
```
2. Upload the resources to DFS.
```shell
hdfs dfs -copyFromLocal resoruces /tmp
hdfs dfs -copyFromLocal resources /tmp
```

3. Run TPCHDataGen.hql with the settings.hql file, setting the required config variables.
@@ -54,11 +59,16 @@ This are set of UDFs and queries that you can use with Hive to use TPCH datagen
`PARTS` is the number of tasks to use for datagen (parallelization),
`LOCATION` is the directory where the data will be stored on HDFS,
`TPCHBIN` is where the resources were uploaded in step 2. You can specify additional settings in the settings.hql file.
When ADLS Gen2 is used as the storage instead of Azure Blob Storage, replace `wasb` with `abfs` in the URL for `fs.defaultFS`, since ADLS Gen2 uses the `abfs://` storage scheme.

4. Now you can create tables on the generated data.
```shell
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCH/ -hiveconf DBNAME=tpch
```
For HDI 4.0, grant other users access to the generated data on the storage by running:
```shell
hdfs dfs -chmod -R 777 /HiveTPCH
```
Generate ORC tables and analyze
```shell
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpch_orc -hiveconf SOURCE=tpch
@@ -70,10 +80,12 @@ This are set of UDFs and queries that you can use with Hive to use TPCH datagen
beeline -u "jdbc:hive2://`hostname -f`:10001/tpch_orc;transportMode=http" -n "" -p "" -i settings.hql -f queries/tpch_query1.hql
```

If you want to run all the queries 10 times and measure the times it takes, you can use the following command:

If you want to run all the queries 10 times and measure the times it takes, you can use the following command

echo "Query,run,start_time,end_time,duration" >> times_orc.csv;
for f in queries/*.hql; do for i in {1..10} ; do STARTTIME="`date +%s`"; beeline -u "jdbc:hive2://`hostname -f`:10001/tpch_orc;transportMode=http" -i settings.hql -f "$f" > "$f.run_$i.out" 2>&1 ; ENDTIME="`date +%s`"; echo "$f,$i,$STARTTIME,$ENDTIME,$(($ENDTIME-$STARTTIME))" >> times_orc.csv; done; done;
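
The timing loop is easier to adapt when written as a function. The following is a sketch rather than the project's own script: `run_benchmark` is a hypothetical name, the query runner is passed in as arguments so the loop can be tried without a cluster, and the glob `queries/*.hql` assumes the query files carry the `.hql` extension seen in the file names of this change.

```shell
# run_benchmark CSV RUNS RUNNER [ARGS...]
# Runs every queries/*.hql file RUNS times through RUNNER and appends one CSV row per run.
run_benchmark() (
  csv=$1
  runs=$2
  shift 2
  # Write the header only when the file is empty, so reruns do not duplicate it.
  [ -s "$csv" ] || echo "Query,run,start_time,end_time,duration" > "$csv"
  for f in queries/*.hql; do
    [ -e "$f" ] || continue          # the glob matched nothing
    i=1
    while [ "$i" -le "$runs" ]; do
      start=$(date +%s)
      "$@" -f "$f" > "$f.run_$i.out" 2>&1
      end=$(date +%s)
      echo "$f,$i,$start,$end,$((end - start))" >> "$csv"
      i=$((i + 1))
    done
  done
)

# Example invocation on the cluster head node (same beeline options as above):
# run_benchmark times_orc.csv 10 beeline -u "jdbc:hive2://$(hostname -f):10001/tpch_orc;transportMode=http" -i settings.hql
```

Durations are whole seconds because `date +%s` has one-second resolution, the same as in the one-liner.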


## FAQ

1. Does it work with scale factor 1?
2 changes: 1 addition & 1 deletion azure/TPCH_installer.sh
@@ -3,7 +3,7 @@
wget -O /tmp/HDInsightUtilities-v01.sh -q https://hdiconfigactions.blob.core.windows.net/linuxconfigactionmodulev01/HDInsightUtilities-v01.sh && source /tmp/HDInsightUtilities-v01.sh && rm -f /tmp/HDInsightUtilities-v01.sh

if [[ `hostname -f` == `get_primary_headnode` ]]; then
wget https://github.com/dharmeshkakadia/tpch-datagen-as-hive-query/archive/master.zip
wget https://github.com/cruizen/tpch-hdinsight/archive/master.zip
unzip master.zip; cd tpch-hdinsight-master;
hive -i settings.hql -f TPCHDataGen.hql -hiveconf SCALE=$1 -hiveconf PARTS=$1 -hiveconf LOCATION=/HiveTPCH_$1/ -hiveconf TPCHBIN=resources
hive -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCH_$1/ -hiveconf DBNAME=tpch_$1
4 changes: 2 additions & 2 deletions azure/azuredeploy.json
@@ -80,7 +80,7 @@
"apiVersion": "[variables('clusterApiVersion')]",
"dependsOn": ["[concat('Microsoft.Storage/storageAccounts/',variables('clusterStorageAccountName'))]"],
"properties": {
"clusterVersion": "3.5",
"clusterVersion": "3.6",
"osType": "Linux",
"tier": "standard",
"clusterDefinition": {
@@ -116,7 +116,7 @@
"scriptActions": [
{
"name": "TPCH Benchmark",
"uri": "https://raw.githubusercontent.com/dharmeshkakadia/tpch-datagen-as-hive-query/master/azure/TPCH_installer.sh",
"uri": "https://raw.githubusercontent.com/cruizen/tpch-hdinsight/master/azure/TPCH_installer.sh",
"parameters": "[parameters('ScaleFactor')]"
}
]
2 changes: 1 addition & 1 deletion azure/azuredeploy.parameters.json
@@ -9,7 +9,7 @@
"value": "hdiuser"
},
"loginPassword": {
"value": "changeme"
"value": "Snappy123!!!"
}
}
}
4 changes: 2 additions & 2 deletions ddl/createAllORCTables.hql
@@ -23,7 +23,7 @@ L_SHIPMODE STRING,
L_COMMENT STRING)
PARTITIONED BY (L_SHIPDATE STRING)
STORED AS ORC
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB');
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB','auto.purge'='true');

INSERT OVERWRITE TABLE lineitem PARTITION(L_SHIPDATE)
SELECT
@@ -59,7 +59,7 @@ O_SHIPPRIORITY INT,
O_COMMENT STRING)
PARTITIONED BY (O_ORDERDATE STRING)
STORED AS ORC
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB');
TBLPROPERTIES('orc.bloom.filter.columns'='*','orc.compress'='ZLIB','auto.purge'='true');

INSERT OVERWRITE TABLE orders PARTITION(O_ORDERDATE)
SELECT
2 changes: 1 addition & 1 deletion queries/tpch_query1.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query1_result AS
--CREATE TABLE tpch_query1_result AS

SELECT l_returnflag
,l_linestatus
2 changes: 1 addition & 1 deletion queries/tpch_query10.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query10_result AS
--CREATE TABLE tpch_query10_result AS

SELECT c_custkey
,c_name
2 changes: 1 addition & 1 deletion queries/tpch_query11.hql
@@ -19,7 +19,7 @@ AS
SELECT sum(part_value) AS total_value
FROM q11_part_tmp_cached;

CREATE TABLE tpch_query11_result AS
--CREATE TABLE tpch_query11_result AS

SELECT ps_partkey
,part_value AS value
2 changes: 1 addition & 1 deletion queries/tpch_query12.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query12_result AS
--CREATE TABLE tpch_query12_result AS

SELECT l_shipmode
,sum(CASE
2 changes: 1 addition & 1 deletion queries/tpch_query13.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query13_result AS
--CREATE TABLE tpch_query13_result AS

SELECT c_count
,count(*) AS custdist
2 changes: 1 addition & 1 deletion queries/tpch_query14.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query14_result AS
--CREATE TABLE tpch_query14_result AS

SELECT 100.00 * sum(CASE
WHEN p_type LIKE 'PROMO%'
2 changes: 1 addition & 1 deletion queries/tpch_query15.hql
@@ -16,7 +16,7 @@ AS
SELECT max(total_revenue) AS max_revenue
FROM revenue_cached;

CREATE TABLE tpch_query15_result AS
--CREATE TABLE tpch_query15_result AS

SELECT s_suppkey
,s_name
2 changes: 1 addition & 1 deletion queries/tpch_query16.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query16_result AS
--CREATE TABLE tpch_query16_result AS

SELECT p_brand
,p_type
2 changes: 1 addition & 1 deletion queries/tpch_query17.hql
@@ -7,7 +7,7 @@ SELECT l_partkey AS t_partkey
FROM lineitem
GROUP BY l_partkey;

CREATE TABLE tpch_query17_result AS
--CREATE TABLE tpch_query17_result AS

SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM (
2 changes: 1 addition & 1 deletion queries/tpch_query19.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query19_result AS
--CREATE TABLE tpch_query19_result AS

SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM lineitem
2 changes: 1 addition & 1 deletion queries/tpch_query2.hql
@@ -16,7 +16,7 @@ WHERE p_partkey = ps_partkey
AND r_name = 'EUROPE'
GROUP BY p_partkey;

CREATE TABLE tpch_query2_result AS
--CREATE TABLE tpch_query2_result AS

SELECT s_acctbal
,s_name
2 changes: 1 addition & 1 deletion queries/tpch_query20.hql
@@ -42,7 +42,7 @@ FROM q20_tmp3_cached
WHERE ps_availqty > sum_quantity
GROUP BY ps_suppkey;

CREATE TABLE tpch_query20_result AS
--CREATE TABLE tpch_query20_result AS

SELECT s_name
,s_address
2 changes: 1 addition & 1 deletion queries/tpch_query21.hql
@@ -21,7 +21,7 @@ WHERE l_receiptdate > l_commitdate
AND l_orderkey IS NOT NULL
GROUP BY l_orderkey;

CREATE TABLE tpch_query21_result AS
--CREATE TABLE tpch_query21_result AS

SELECT s_name
,count(1) AS numwait
2 changes: 1 addition & 1 deletion queries/tpch_query22.hql
@@ -29,7 +29,7 @@ CREATE VIEW IF NOT EXISTS q22_orders_tmp_cached AS
FROM orders
GROUP BY o_custkey;

CREATE TABLE tpch_query22_result AS
--CREATE TABLE tpch_query22_result AS

SELECT cntrycode
,count(1) AS numcust
2 changes: 1 addition & 1 deletion queries/tpch_query3.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query3_result AS
--CREATE TABLE tpch_query3_result AS

SELECT l_orderkey
,sum(l_extendedprice * (1 - l_discount)) AS revenue
2 changes: 1 addition & 1 deletion queries/tpch_query4.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query4_result AS
--CREATE TABLE tpch_query4_result AS

SELECT o_orderpriority
,count(*) AS order_count
2 changes: 1 addition & 1 deletion queries/tpch_query5.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query5_result AS
--CREATE TABLE tpch_query5_result AS

SELECT n_name
,sum(l_extendedprice * (1 - l_discount)) AS revenue
2 changes: 1 addition & 1 deletion queries/tpch_query6.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query6_result AS
--CREATE TABLE tpch_query6_result AS

SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
2 changes: 1 addition & 1 deletion queries/tpch_query7.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query7_result AS
--CREATE TABLE tpch_query7_result AS

SELECT supp_nation
,cust_nation
2 changes: 1 addition & 1 deletion queries/tpch_query8.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query8_result AS
--CREATE TABLE tpch_query8_result AS

SELECT o_year
,sum(CASE
2 changes: 1 addition & 1 deletion queries/tpch_query9.hql
@@ -1,4 +1,4 @@
CREATE TABLE tpch_query9_result AS
--CREATE TABLE tpch_query9_result AS

SELECT nation
,o_year
14 changes: 13 additions & 1 deletion settings.hql
@@ -1 +1,13 @@
SET hive.tez.container.size=2048;
set hive.execution.engine=tez;
set hive.tez.container.size=4096;
set hive.tez.java.opts=-Xmx3800m;
-- set hive.auto.convert.join.noconditionaltask.size=1252698795;
set hive.vectorized.execution.enabled=true;
set hive.execution.mode=llap;
set hive.llap.execution.mode=all;
set hive.llap.io.enabled=true;
set hive.llap.io.memory.mode=cache;

-- Dynamic partitioning in Hive. We tested with the default value as well as the following turned on.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
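
The new settings pair a 4096 MB Tez container with a 3800 MB JVM heap; the heap has to stay below the container size for the task to fit in its container. A small sketch of a sanity check for that relationship, under stated assumptions: `check_tez_heap` is a hypothetical helper, it only understands `-Xmx` values with an `m` suffix, and it takes the last uncommented `set` line for each property.

```shell
# check_tez_heap FILE
# Prints the container size and heap found in FILE and succeeds only if heap < container.
# Exit status 2 means one of the two settings was not found.
check_tez_heap() (
  file=$1
  # Keep the last matching assignment, since a later SET overrides an earlier one.
  container=$(sed -n 's/^[Ss][Ee][Tt] *hive\.tez\.container\.size *= *\([0-9][0-9]*\) *;.*/\1/p' "$file" | tail -n 1)
  heap=$(sed -n 's/^[Ss][Ee][Tt] *hive\.tez\.java\.opts *=.*-Xmx\([0-9][0-9]*\)m.*/\1/p' "$file" | tail -n 1)
  if [ -z "$container" ] || [ -z "$heap" ]; then
    echo "missing setting in $file" >&2
    exit 2
  fi
  echo "container=${container}MB heap=${heap}MB"
  [ "$heap" -lt "$container" ]
)

# Example: check_tez_heap settings.hql
```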