Instructions and code for the workshop "From Big Data to NLP Insights: Unlocking the Power of PySpark and Spark NLP"
We will run the training code on Databricks Community Edition. Create your account by following the instructions provided in the official documentation. Please complete this step before moving forward.
You can now create a Databricks workspace with the required Jupyter notebooks using this link. The steps are shown in the GIF below.
From the left-hand side navbar, click on Workspace > click on the dropdown > click on Import > choose the URL option and enter the link > click on Import.
We will now create a compute cluster that we will run our code on.
- Click on the Compute tab in the navbar, then click the "Create Compute" button. You will be taken to the "New Cluster" configuration view.
- Assign the cluster a name. From the "Databricks runtime version" dropdown, choose "Runtime: 12.2 LTS (Scala 2.12, Spark 3.3.2)".
- Click on the "Spark" tab. Add the following lines to the "Spark config" field.
spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
- Click on "Create Cluster". It may take a few minutes before the cluster gets created.
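Once the cluster is running, you may want to confirm that the two Spark config lines were picked up. The following is a minimal sketch, not part of the workshop notebooks: the helper names `parse_spark_config` and `find_mismatches` are hypothetical, and a plain dictionary stands in for the cluster so the code runs anywhere. It also assumes that the settings can be read back through `spark.conf.get` in a notebook attached to the cluster.

```python
# The two settings from the "Spark config" field, one "key value" pair per line.
SPARK_CONFIG_TEXT = """
spark.kryoserializer.buffer.max 2000M
spark.serializer org.apache.spark.serializer.KryoSerializer
"""


def parse_spark_config(text):
    """Turn 'key value' lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            key, value = line.split(None, 1)
            settings[key] = value.strip()
    return settings


def find_mismatches(expected, get_value):
    """Return {key: (expected, actual)} for every setting that differs.

    get_value is any callable that maps a config key to its current value.
    """
    mismatches = {}
    for key, want in expected.items():
        actual = get_value(key)
        if actual != want:
            mismatches[key] = (want, actual)
    return mismatches


expected = parse_spark_config(SPARK_CONFIG_TEXT)

# In a Databricks notebook, `spark` is predefined, so the check would be
# (assumption: the cluster reports the values exactly as typed):
#     find_mismatches(expected, lambda key: spark.conf.get(key, ""))
# Here a plain dict stands in for the cluster so the sketch runs anywhere.
simulated_cluster = dict(expected)
print(find_mismatches(expected, simulated_cluster.get))
```

An empty result means every setting matches. Any entry in the result shows the expected value next to the value the cluster reported.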
At this point, you can successfully run the code in module 1's notebook. For the next 2 modules, we need to install the Spark NLP library in our cluster.
In the Libraries tab of your cluster, follow these steps:
- Install New -> PyPI -> spark-nlp -> Install
- Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.1 -> Install
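The two installs above can be sanity-checked from a notebook cell. The sketch below is an illustration under stated assumptions, not part of the workshop code: `parse_maven_coordinates` and `installed_spark_nlp_version` are hypothetical helpers. It assumes that the `_2.12` suffix in the Maven artifact is the Scala version (matching the Scala 2.12 runtime chosen earlier) and that it is safest for the PyPI package and the Maven JAR to share the same version number.

```python
from importlib import metadata

# Maven coordinates exactly as entered in the Libraries tab.
MAVEN_COORDINATES = "com.johnsnowlabs.nlp:spark-nlp_2.12:4.4.1"


def parse_maven_coordinates(coordinates):
    """Split 'group:artifact:version' and pull the Scala suffix off the artifact."""
    group, artifact, version = coordinates.split(":")
    name, _, scala_version = artifact.rpartition("_")
    return {"group": group, "name": name, "scala": scala_version, "version": version}


def installed_spark_nlp_version():
    """Return the installed spark-nlp PyPI package version, or None if absent."""
    try:
        return metadata.version("spark-nlp")
    except metadata.PackageNotFoundError:
        return None


jar = parse_maven_coordinates(MAVEN_COORDINATES)
python_version = installed_spark_nlp_version()

if python_version is None:
    print("spark-nlp Python package not found; install it from the Libraries tab.")
elif python_version != jar["version"]:
    print(f"Python package {python_version} differs from JAR {jar['version']}.")
else:
    print(f"Spark NLP {python_version} is installed and matches the JAR.")
```

If the versions differ, one option is to pin the PyPI package to the JAR version when installing it, for example by entering `spark-nlp==4.4.1` as the package name.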
Voila! You're all set to start now.
The workshop code is distributed across 3 Jupyter notebooks, each of which corresponds to a workshop module. They are:
- Module 1: Basics of PySpark and the DataFrame API
- Module 2: PySpark for NLP
- Module 3: Advanced NLP with Spark NLP
They should be in your workspace if you have successfully completed the setup steps. They are present in this repository too if you want to go through them after the workshop.
Note: A conceptual introduction to Jupyter notebooks can be found here.