TiSpark (version >= 2.0) on PySpark:

Note: If you are using a TiSpark version earlier than 2.0, please read this document instead.

pytispark is no longer required as of TiSpark 2.0.

Usage

There are currently two ways to use TiSpark from Python:

Directly via pyspark

This is the simplest way; a working Spark environment is all you need.

  1. Make sure you have the latest version of TiSpark and a jar with all TiSpark's dependencies.

  2. Add the required configurations listed in the README to your $SPARK_HOME/conf/spark-defaults.conf

  3. Copy ./resources/session.py to $SPARK_HOME/python/pyspark/sql/session.py

  4. Run this command in your $SPARK_HOME directory:

./bin/pyspark --jars /where-ever-it-is/tispark-core-${version}-jar-with-dependencies.jar
  5. To use TiSpark, run these commands in the PySpark shell:
# Query as you would in spark-shell
sql("show databases").show()
sql("use tpch_test")
sql("show tables").show()
sql("select count(*) from customer").show()

# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+
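
Step 2 above is easy to get wrong silently, so it can help to check spark-defaults.conf before launching pyspark. The following is a minimal sketch, not part of TiSpark: the helper names are hypothetical, and the two required keys (`spark.sql.extensions` and `spark.tispark.pd.addresses`) are assumptions based on the TiSpark 2.x README rather than anything stated on this page. It only handles the whitespace-separated `key value` form of the file.

```python
from typing import Dict, List, Sequence

# Assumed keys: based on the TiSpark 2.x README, not on this page.
REQUIRED_KEYS = ("spark.sql.extensions", "spark.tispark.pd.addresses")


def parse_spark_defaults(text: str) -> Dict[str, str]:
    """Parse spark-defaults.conf content into a {property: value} dict.

    Only the whitespace-separated "key value" form is handled;
    blank lines and lines starting with '#' are skipped.
    """
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def missing_keys(text: str, required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required properties that are absent or have an empty value."""
    props = parse_spark_defaults(text)
    return [key for key in required if not props.get(key)]
```

To use it, read `$SPARK_HOME/conf/spark-defaults.conf` into a string and pass it to `missing_keys`; an empty list means every assumed key is present.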

Via spark-submit

This way is useful when you want to execute your own Python scripts.

Because of an open issue [SPARK-25003] in Spark 2.3, running Python files with spark-submit supports only the following API:

  1. Run pip install pytispark in your console to install pytispark

  2. Create a Python file named test.py as below:

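import pytispark.pytispark as pti
from pyspark.sql import SparkSession

# Unlike the pyspark shell, spark-submit does not predefine `spark` or `sql`,
# so create (or reuse) a session explicitly.
spark = SparkSession.builder.getOrCreate()

ti = pti.TiContext(spark)

# Map the tpch_test database so its tables can be queried by name
ti.tidbMapDatabase("tpch_test")

spark.sql("select count(*) from customer").show()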

# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+
  3. Prepare your TiSpark environment as above and execute:
./bin/spark-submit --jars /where-ever-it-is/tispark-core-${version}-jar-with-dependencies.jar test.py
  4. Result:
+--------+
|count(1)|
+--------+
|     150|
+--------+
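
If the submission needs to be driven from another Python program (for example a scheduler), the spark-submit step above could be wrapped roughly as follows. This is a sketch under stated assumptions: `build_submit_command` and `run_script` are hypothetical helper names, not part of TiSpark or pytispark, and the paths passed in are whatever applies to your installation.

```python
import os
import subprocess
from typing import List


def build_submit_command(spark_home: str, tispark_jar: str, script: str) -> List[str]:
    """Return the argument list equivalent to:
    $SPARK_HOME/bin/spark-submit --jars <tispark_jar> <script>
    """
    return [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--jars",
        tispark_jar,
        script,
    ]


def run_script(spark_home: str, tispark_jar: str, script: str) -> int:
    """Run the script through spark-submit and return its exit code."""
    cmd = build_submit_command(spark_home, tispark_jar, script)
    # A list argv (no shell=True) avoids quoting problems with paths.
    return subprocess.run(cmd).returncode
```

Calling `run_script(os.environ["SPARK_HOME"], "/path/to/the/tispark/jar", "test.py")` would then launch the job and return a non-zero value if spark-submit fails; the jar path here is a placeholder.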

See pytispark for more information.