

Merge pull request #73 from fishtown-analytics/feature/update-readme-0160

Update readme (0.16.0)
jtcohen6 authored Apr 7, 2020
2 parents 975ec66 + 8f8b451 commit cc17166
Showing 1 changed file with 9 additions and 29 deletions.
README.md: 38 changes (9 additions & 29 deletions)
@@ -1,7 +1,9 @@
## dbt-spark

### Documentation
-For more information on using Spark with dbt, consult the [dbt documentation](https://docs.getdbt.com/docs/profile-spark).
+For more information on using Spark with dbt, consult the dbt documentation:
+- [Spark profile](https://docs.getdbt.com/docs/profile-spark)
+- [Spark specific configs](https://docs.getdbt.com/docs/spark-configs)

### Installation
This plugin can be installed via pip:
@@ -36,17 +38,11 @@ A dbt profile can be configured to run against Spark using the following configu

AWS and Azure Databricks differ in how they handle connections, likely because the two services generate their URLs differently.

-To connect to an Azure Databricks cluster, you will need to obtain your organization ID, which is a unique ID Azure Databricks generates for each customer workspace. To find the organization ID, see https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect#step-2-configure-connection-properties. When connecting to Azure Databricks, the organization tag is required to be set in the profiles.yml connection file, as it will be defaulted to 0 otherwise, and will not connect to Azure. This connection method follows the databricks-connect package's semantics for connecting to Databricks.

-Of special note is the fact that organization ID is treated as a string by dbt-spark, as opposed to a large number. While all examples to date have contained numeric digits, it is unknown how long that may continue, and what the upper limit of this number is. If you do have a leading zero, please include it in the organization tag and dbt-spark will pass that along.

-dbt-spark has also been tested against AWS Databricks, and it has some differences in the URLs used. It appears to default the positional value where organization lives in AWS connection URLs to 0, so dbt-spark does the same for AWS connections (i.e. simply leave organization-id out when connecting to the AWS version and dbt-spark will construct the correct AWS URL for you). Note the missing reference to organization here: https://docs.databricks.com/dev-tools/databricks-connect.html#step-2-configure-connection-properties.

-Please ignore all references to port 15001 in the databricks-connect docs as that is specific to that tool; port 443 is used for dbt-spark's https connection.

-Lastly, the host field for Databricks can be found at the start of your workspace or cluster url (but don't include https://): region.azuredatabricks.net for Azure, or account.cloud.databricks.com for AWS.
+**Organization:** To connect to an Azure Databricks cluster, you will need your organization ID, a unique ID that Azure Databricks generates for each customer workspace. To find it, see https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect#step-2-configure-connection-properties. dbt-spark treats this as a string field; if your ID has a leading zero, be sure to include it.

+**Port:** Ignore the references to port 15001 in the databricks-connect docs; that port is specific to that tool. dbt-spark uses port 443 for its HTTPS connection.

+**Host:** The host field for Databricks can be found at the start of your workspace or cluster URL: `region.azuredatabricks.net` for Azure, or `account.cloud.databricks.com` for AWS. Do not include `https://`.

**Usage with Amazon EMR**

@@ -55,24 +51,7 @@ To connect to Spark running on an Amazon EMR cluster, you will need to run `sudo

**Example profiles.yml entries:**

-**http, e.g. AWS Databricks**
-```
-your_profile_name:
-  target: dev
-  outputs:
-    dev:
-      method: http
-      type: spark
-      schema: analytics
-      host: yourorg.sparkhost.com
-      port: 443
-      token: abc123
-      cluster: 01234-23423-coffeetime
-      connect_retries: 5
-      connect_timeout: 60
-```

-**Azure Databricks, via http**
+**http, e.g. Databricks**
```
your_profile_name:
  target: dev
@@ -82,9 +61,9 @@ your_profile_name:
      type: spark
      schema: analytics
      host: yourorg.sparkhost.com
+     organization: 1234567891234567 # Azure Databricks ONLY
      port: 443
      token: abc123
-     organization: 1234567891234567
      cluster: 01234-23423-coffeetime
      connect_retries: 5
      connect_timeout: 60
@@ -123,6 +102,7 @@ The following configurations can be supplied to models run with the dbt-spark pl
| clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | `cluster_1` |
| buckets | The number of buckets to create while clustering | Required if `clustered_by` is specified | `8` |
| incremental_strategy | The strategy to use for incremental models (`insert_overwrite` or `merge`). Note `merge` requires `file_format` = `delta` and `unique_key` to be specified. | Optional (default: `insert_overwrite`) | `merge` |
+| persist_docs | Whether dbt should include the model description as a table `comment` | Optional | `{'relation': true}` |
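
As an illustration, several of these options can be combined in a model's `config()` block. A minimal sketch, assuming a hypothetical `pageviews` model built on a hypothetical `raw_events` model (none of these names come from this repo):

```
-- models/pageviews.sql (hypothetical model; a sketch, not a dbt-spark example)
{{ config(
    materialized='table',
    file_format='parquet',
    partition_by='date_day',
    clustered_by='user_id',
    buckets=8,
    persist_docs={'relation': true}
) }}

select
    date_day,
    user_id,
    count(*) as pageviews
from {{ ref('raw_events') }}  -- hypothetical upstream model
group by 1, 2
```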


**Incremental Models**
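
Per the table above, the `merge` strategy requires `file_format` = `delta` and a `unique_key`. A hedged sketch of a model satisfying those constraints, again with hypothetical model and column names:

```
-- models/events.sql (hypothetical model; a sketch under the merge constraints above)
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='delta',
    unique_key='event_id'
) }}

select event_id, user_id, event_time
from {{ ref('raw_events') }}  -- hypothetical upstream model

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than the current max
  where event_time > (select max(event_time) from {{ this }})
{% endif %}
```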
