Note: If you are using a TiSpark version earlier than 2.0, please read this document instead.
pytispark is no longer necessary with TiSpark 2.0 and later.
There are currently two ways to use TiSpark on Python:
The first way, using the pyspark shell directly, is the simplest; a working Spark environment should be enough.
- Make sure you have the latest version of TiSpark and a jar with all of TiSpark's dependencies.
- Remember to add the needed configurations listed in the README to your $SPARK_HOME/conf/spark-defaults.conf
- Copy ./resources/session.py to $SPARK_HOME/python/pyspark/sql/session.py
- Run this command in your $SPARK_HOME directory:
./bin/pyspark --jars /where-ever-it-is/tispark-core-${version}-jar-with-dependencies.jar
- To use TiSpark, run these commands:
# Query as you are in spark-shell
sql("show databases").show()
sql("use tpch_test")
sql("show tables").show()
sql("select count(*) from customer").show()
# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+
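The copy and launch steps above can also be scripted. The following is a minimal sketch, not part of TiSpark itself: the helper names `install_patched_session` and `pyspark_command` and the `.orig` backup suffix are hypothetical choices made for illustration, and the paths are the ones given in the steps above.

```python
import shutil
from pathlib import Path


def install_patched_session(spark_home, patched_session):
    """Copy the patched session.py into the PySpark tree, keeping a backup.

    Hypothetical helper: the '.orig' backup name is an assumption,
    not something TiSpark requires.
    """
    target = Path(spark_home) / "python" / "pyspark" / "sql" / "session.py"
    backup = target.with_name(target.name + ".orig")
    # Back up the stock file once so the change can be undone later.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(patched_session, target)
    return target


def pyspark_command(spark_home, tispark_jar):
    """Build the argument list that launches pyspark with the TiSpark jar."""
    return [str(Path(spark_home) / "bin" / "pyspark"), "--jars", str(tispark_jar)]
```

Keeping a backup of the stock `session.py` makes it possible to restore the original Spark installation by copying the `.orig` file back.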
The second way, using spark-submit, is useful when you want to execute your own Python scripts.
Because of an open issue [SPARK-25003] in Spark 2.3, using spark-submit for Python files only supports the following API:
- Use pip install pytispark in your console to install pytispark
- Create a Python file named test.py as below:
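from pyspark.sql import SparkSession
import pytispark.pytispark as pti

# spark-submit does not predefine `spark`, so create the session explicitly
# (the app name here is only an example)
spark = SparkSession.builder.appName("tispark-test").getOrCreate()
ti = pti.TiContext(spark)
# Map the TiDB database so its tables can be queried by name
ti.tidbMapDatabase("tpch_test")
spark.sql("select count(*) from customer").show()
# Result
# +--------+
# |count(1)|
# +--------+
# |     150|
# +--------+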
- Prepare your TiSpark environment as described above and run:
./bin/spark-submit --jars /where-ever-it-is/tispark-core-${version}-jar-with-dependencies.jar test.py
- Result:
+--------+
|count(1)|
+--------+
|     150|
+--------+
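If the submit step is run repeatedly, it may help to wrap it in a small launcher. The sketch below is an illustration under stated assumptions: `spark_submit_command` and `submit` are hypothetical helper names, and the only command-line options used are the ones shown above (`spark-submit --jars <jar> <script>`).

```python
import subprocess
from pathlib import Path


def spark_submit_command(spark_home, tispark_jar, script):
    """Build the spark-submit argument list for a Python script."""
    return [
        str(Path(spark_home) / "bin" / "spark-submit"),
        "--jars", str(tispark_jar),
        str(script),
    ]


def submit(spark_home, tispark_jar, script):
    """Run the script with spark-submit, failing early if the jar is missing."""
    if not Path(tispark_jar).is_file():
        raise FileNotFoundError(f"TiSpark jar not found: {tispark_jar}")
    # check=True raises CalledProcessError if spark-submit exits non-zero.
    return subprocess.run(
        spark_submit_command(spark_home, tispark_jar, script), check=True
    )
```

Checking for the jar before launching avoids waiting for Spark to start only to fail on a mistyped path.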
See pytispark for more information.