From 93041c837f508d7f50c1d268a4e0e7ba689e633e Mon Sep 17 00:00:00 2001 From: XuzhouQin <17144939+qxzzxq@users.noreply.github.com> Date: Thu, 20 Aug 2020 17:07:12 +0200 Subject: [PATCH 1/5] Update Connector.md --- docs/Connector.md | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/docs/Connector.md b/docs/Connector.md index 81d860fe..223758fc 100644 --- a/docs/Connector.md +++ b/docs/Connector.md @@ -98,6 +98,33 @@ To use `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider`: | fs.s3a.secret.key | your_s3a_secret_key | | fs.s3a.session.token | your_s3a_session_token | +| key | value | +| ------ | ------ | +| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | +| fs.s3a.access.key | your_s3a_access_key | +| fs.s3a.secret.key | your_s3a_secret_key | +| fs.s3a.session.token | your_s3a_session_token | +| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | +| fs.s3a.access.key | your_s3a_access_key | +| fs.s3a.secret.key | your_s3a_secret_key | +| fs.s3a.session.token | your_s3a_session_token | +| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | +| fs.s3a.access.key | your_s3a_access_key | +| fs.s3a.secret.key | your_s3a_secret_key | +| fs.s3a.session.token | your_s3a_session_token | +| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | +| fs.s3a.access.key | your_s3a_access_key | +| fs.s3a.secret.key | your_s3a_secret_key | +| fs.s3a.session.token | your_s3a_session_token | +| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | +| fs.s3a.access.key | your_s3a_access_key | +| fs.s3a.secret.key | your_s3a_secret_key | +| fs.s3a.session.token | your_s3a_session_token | +| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | +| fs.s3a.access.key | your_s3a_access_key | +| fs.s3a.secret.key | your_s3a_secret_key | +| fs.s3a.session.token | your_s3a_session_token | + To use `com.amazonaws.auth.InstanceProfileCredentialsProvider`: | key | value | From 8a7ddfb89e55bf4900997373db218af354834c48 Mon Sep 17 00:00:00 2001 From: XuzhouQin <17144939+qxzzxq@users.noreply.github.com> Date: Thu, 20 Aug 2020 17:10:05 +0200 Subject: [PATCH 2/5] Update Connector.md --- docs/Connector.md | 53 +++++++++++++++++++++-------------------------- 1 file changed, 24 insertions(+), 29 deletions(-) diff --git a/docs/Connector.md b/docs/Connector.md index 223758fc..715fe163 100644 --- a/docs/Connector.md +++ b/docs/Connector.md @@ -24,9 +24,11 @@ trait Connector extends Logging { The **Connector** trait was inherited by two abstract classes: **FileConnector** and **DBConnector** ## Implementation + 
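Because every concrete connector eventually extends `Connector`, they all expose the same basic contract and can be used interchangeably behind that trait. As a minimal sketch (assuming only the `read(): DataFrame` and `write(df: DataFrame): Unit` members of the `Connector` trait; the helper itself is not part of SETL), copying data between two storages could look like this:

```scala
import com.jcdecaux.setl.storage.connector.Connector
import org.apache.spark.sql.DataFrame

// Hypothetical helper: it only relies on the read/write contract of the
// Connector trait, so any concrete connector (CSVConnector, ParquetConnector,
// CassandraConnector, ...) can be plugged in as source or sink.
def copyData(source: Connector, sink: Connector): Unit = {
  val data: DataFrame = source.read()
  sink.write(data)
}
```

The class hierarchy is summarized in the diagram below.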
[![](https://mermaid.ink/img/eyJjb2RlIjoiICBncmFwaCBURDtcblxuICBDb25uZWN0b3IgLS0-IEZpbGVDb25uZWN0b3I7XG4gIENvbm5lY3RvciAtLT4gREJDb25uZWN0b3I7XG5cbiAgRmlsZUNvbm5lY3RvciAtLT4gQ1NWQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBKU09OQ29ubmVjdG9yO1xuICBDb25uZWN0b3IgLS0-IEV4Y2VsQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBQYXJxdWV0Q29ubmVjdG9yO1xuXG4gIERCQ29ubmVjdG9yIC0tPiBDYXNzYW5kcmFDb25uZWN0b3I7XG4gIERCQ29ubmVjdG9yIC0tPiBEeW5hbW9EQkNvbm5lY3RvcjsiLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCJ9fQ)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiICBncmFwaCBURDtcblxuICBDb25uZWN0b3IgLS0-IEZpbGVDb25uZWN0b3I7XG4gIENvbm5lY3RvciAtLT4gREJDb25uZWN0b3I7XG5cbiAgRmlsZUNvbm5lY3RvciAtLT4gQ1NWQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBKU09OQ29ubmVjdG9yO1xuICBDb25uZWN0b3IgLS0-IEV4Y2VsQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBQYXJxdWV0Q29ubmVjdG9yO1xuXG4gIERCQ29ubmVjdG9yIC0tPiBDYXNzYW5kcmFDb25uZWN0b3I7XG4gIERCQ29ubmVjdG9yIC0tPiBEeW5hbW9EQkNvbm5lY3RvcjsiLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCJ9fQ) ## FileConnector + [**FileConnector**](https://github.com/SETL-Developers/setl/tree/master/src/main/scala/com/jcdecaux/setl/storage/connector/FileConnector.scala) could be used to access files stored in the different file systems ### Functionalities @@ -38,38 +40,47 @@ val fileConnector = new FileConnector(spark, options) where `spark` is the current **SparkSession** and `options` is a `Map[String, String]` object. #### Read + Read data from persistence storage. Need to be implemented in a concrete **FileConnector**. #### Write + Write data to persistence storage. Need to be implemented in a concrete **FileConnector**. #### Delete + Delete a file if the value of `path` defined in **options** is a file path. If `path` is a directory, then delete the directory with all its contents. Use it with care! #### Schema + The schema of data could be set by adding a key `schema` into the **options** map of the constructor. The schema must be a DDL format string: > partition1 INT, partition2 STRING, clustering1 STRING, value LONG #### Partition + Data could be partitioned before saving. To do this, call `partitionBy(columns: String*)` before `write(df)` and *Spark* will partition the *DataFrame* by creating subdirectories in the root directory. #### Suffix + A suffix is similar to a partition, but it is defined manually while calling `write(df, suffix)`. **Connector** handles the suffix by creating a subdirectory with the same naming convention as Spark partition (by default it will be `_user_defined_suffix=suffix`. >:warning: Currently (v0.3), you **can't** mix with-suffix write and non-suffix write when your data are partitioned. An **IllegalArgumentException** will be thrown in this case. The reason for which it's not supported is that, as suffix is handled by *Connector* and partition is handled by *Spark*, a suffix may confuse Spark when the latter tries to infer the structure of DataFrame. #### Multiple files reading and name pattern matching + You can read multiple files at once if the `path` you set in **options** is a directory (instead of a file path). You can also filter files by setting a regex pattern `filenamePattern` in **options**. #### File system support + - Local file system - AWS S3 - Hadoop File System #### S3 Authentication + To access S3, if *authentication error* occurs, you may have to provide extra settings in **options** for its authentication process. There are multiple authentication methods that could be set by changing **Authentication Providers**. 
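These authentication settings go into the same `options` map that is used to build the connector, alongside the usual file options. The sketch below is only an illustration: the bucket name and credentials are placeholders, and the `new FileConnector(spark, options)` call mirrors the constructor example shown earlier in this section. The exact keys expected by each provider are listed just after.

```scala
import com.jcdecaux.setl.storage.connector.FileConnector
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Placeholder values: replace the bucket, keys and token with your own.
val s3Options: Map[String, String] = Map(
  "path" -> "s3a://my-bucket/input",
  "filenamePattern" -> "(.+)\\.csv",
  "fs.s3a.aws.credentials.provider" ->
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
  "fs.s3a.access.key" -> "your_s3a_access_key",
  "fs.s3a.secret.key" -> "your_s3a_secret_key",
  "fs.s3a.session.token" -> "your_s3a_session_token"
)

val s3Connector = new FileConnector(spark, s3Options)
```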
To configure authentication, you can: @@ -98,33 +109,6 @@ To use `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider`: | fs.s3a.secret.key | your_s3a_secret_key | | fs.s3a.session.token | your_s3a_session_token | -| key | value | -| ------ | ------ | -| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | -| fs.s3a.access.key | your_s3a_access_key | -| fs.s3a.secret.key | your_s3a_secret_key | -| fs.s3a.session.token | your_s3a_session_token | -| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | -| fs.s3a.access.key | your_s3a_access_key | -| fs.s3a.secret.key | your_s3a_secret_key | -| fs.s3a.session.token | your_s3a_session_token | -| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | -| fs.s3a.access.key | your_s3a_access_key | -| fs.s3a.secret.key | your_s3a_secret_key | -| fs.s3a.session.token | your_s3a_session_token | -| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | -| fs.s3a.access.key | your_s3a_access_key | -| fs.s3a.secret.key | your_s3a_secret_key | -| fs.s3a.session.token | your_s3a_session_token | -| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | -| fs.s3a.access.key | your_s3a_access_key | -| fs.s3a.secret.key | your_s3a_secret_key | -| fs.s3a.session.token | your_s3a_session_token | -| fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider | -| fs.s3a.access.key | your_s3a_access_key | -| fs.s3a.secret.key | your_s3a_secret_key | -| fs.s3a.session.token | your_s3a_session_token | - To use `com.amazonaws.auth.InstanceProfileCredentialsProvider`: | key | value | @@ -134,14 +118,17 @@ To use `com.amazonaws.auth.InstanceProfileCredentialsProvider`: More information could be found [here](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods) ## DBConnector + [DBConnector](https://github.com/SETL-Developers/setl/tree/master/src/main/scala/com/jcdecaux/setl/storage/connector/DBConnector.scala) could be used for accessing databases. ### Functionalities #### Read + Read data from a database. Need to be implemented in a concrete **DBConnector**. #### Create + Create a table in a database. Need to be implemented in a concrete **DBConnector**. #### Write @@ -153,6 +140,7 @@ Send a delete request. ## CSVConnector ### Options + | name | default | | ------ | ------- | | path | | @@ -174,7 +162,9 @@ Send a delete request. For other options, please refer to [this doc](https://docs.databricks.com/spark/latest/data-sources/read-csv.html). 
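As an illustrative sketch of how these options fit together (the values are examples only, and the `(spark, options)` constructor mirrors the FileConnector constructor example earlier in this document, so it may differ across SETL versions), writing a partitioned CSV dataset and reading it back could look like this:

```scala
import com.jcdecaux.setl.storage.connector.CSVConnector
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Example values only; options left out fall back to the defaults listed above.
val csvOptions: Map[String, String] = Map(
  "path" -> "src/main/resources/test_csv",
  "inferSchema" -> "true",
  "delimiter" -> ";",
  "header" -> "true",
  "saveMode" -> "Overwrite"
)

val csvConnector = new CSVConnector(spark, csvOptions)

val data: DataFrame = Seq(
  (1, "a", "x", 1L),
  (2, "b", "y", 2L)
).toDF("partition1", "partition2", "clustering1", "value")

// Partition the output by "partition1" (see the Partition paragraph above),
// then write the DataFrame and read it back.
csvConnector.partitionBy("partition1")
csvConnector.write(data)
val reloaded: DataFrame = csvConnector.read()
```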
## JSONConnector + ### Options + | name | default | | ------ | ------- | | path | | @@ -196,7 +186,9 @@ For other options, please refer to [this doc](https://docs.databricks.com/spark/ ## ParquetConnector + ### Options + | name | default | | ------ | ------- | | path | | @@ -205,6 +197,7 @@ For other options, please refer to [this doc](https://docs.databricks.com/spark/ ## ExcelConnector ### Options + | name | default | | ------ | ------- | | path | | @@ -218,11 +211,12 @@ For other options, please refer to [this doc](https://docs.databricks.com/spark/ | addColorColumns | `false` | | dateFormat | `yyyy-MM-dd` | | timestampFormat | `yyyy-mm-dd hh:mm:ss.000` | -| maxRowsInMemory | None | +| maxRowsInMemory | `None` | | excerptSize | 10 | -| workbookPassword | None | +| workbookPassword | `None` | ## DynamoDBConnector + ### Options | name | default | @@ -232,6 +226,7 @@ For other options, please refer to [this doc](https://docs.databricks.com/spark/ | saveMode | | ## CassandraConnector + ### Options | name | default | From 0871022079e434de57fa47d7a0ac68cf6bf821a2 Mon Sep 17 00:00:00 2001 From: XuzhouQin <17144939+qxzzxq@users.noreply.github.com> Date: Thu, 20 Aug 2020 17:21:47 +0200 Subject: [PATCH 3/5] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5fe0c0e9..ac8c67f8 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ [![codecov](https://codecov.io/gh/SETL-Developers/setl/branch/master/graph/badge.svg)](https://codecov.io/gh/SETL-Developers/setl) [![Maven Central](https://img.shields.io/maven-central/v/com.jcdecaux.setl/setl_2.11.svg?label=Maven%20Central&color=blue)](https://mvnrepository.com/artifact/com.jcdecaux.setl/setl) [![javadoc](https://javadoc.io/badge2/com.jcdecaux.setl/setl_2.11/javadoc.svg)](https://javadoc.io/doc/com.jcdecaux.setl/setl_2.11) -[![Gitter](https://badges.gitter.im/setl-by-jcdecaux/community.svg)](https://gitter.im/setl-by-jcdecaux/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) +[![documentation](https://img.shields.io/badge/docs-passing-1f425f.svg)](https://setl-developers.github.io/setl/) If you’re a **data scientist** or **data engineer**, this might sound familiar while working on an **ETL** project: From 534759f56dbab6a009cd7bdbb5a8ab96588da852 Mon Sep 17 00:00:00 2001 From: XuzhouQin <17144939+qxzzxq@users.noreply.github.com> Date: Thu, 20 Aug 2020 17:23:09 +0200 Subject: [PATCH 4/5] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ac8c67f8..028d93bf 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ ![logo](docs/img/logo_setl.png) ---------- -![build](https://github.com/SETL-Developers/setl/workflows/build/badge.svg?branch=master) +[![build](https://github.com/SETL-Developers/setl/workflows/build/badge.svg?branch=master)](https://github.com/SETL-Developers/setl/actions) [![codecov](https://codecov.io/gh/SETL-Developers/setl/branch/master/graph/badge.svg)](https://codecov.io/gh/SETL-Developers/setl) [![Maven Central](https://img.shields.io/maven-central/v/com.jcdecaux.setl/setl_2.11.svg?label=Maven%20Central&color=blue)](https://mvnrepository.com/artifact/com.jcdecaux.setl/setl) [![javadoc](https://javadoc.io/badge2/com.jcdecaux.setl/setl_2.11/javadoc.svg)](https://javadoc.io/doc/com.jcdecaux.setl/setl_2.11) From 6819da5462d4102625c7a8eb7e63da453897d74c Mon Sep 17 00:00:00 2001 From: XuzhouQin <17144939+qxzzxq@users.noreply.github.com> Date: Thu, 20 Aug 2020 17:32:37 +0200 Subject: 
[PATCH 5/5] Update README.md --- README.md | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 028d93bf..8e1745d8 100644 --- a/README.md +++ b/README.md @@ -18,9 +18,11 @@ If you’re a **data scientist** or **data engineer**, this might sound familiar ## Use SETL ### In a new project + You can start working by cloning [this template project](https://github.com/qxzzxq/setl-template). ### In an existing project + ```xml com.jcdecaux.setl @@ -48,7 +50,9 @@ To use the SNAPSHOT version, add Sonatype snapshot repository to your `pom.xml` ``` ## Quick Start + ### Basic concept + With SETL, an ETL application could be represented by a `Pipeline`. A `Pipeline` contains multiple `Stages`. In each stage, we could find one or several `Factories`. The class `Factory[T]` is an abstraction of a data transformation that will produce an object of type `T`. It has 4 methods (*read*, *process*, *write* and *get*) that should be implemented by the developer. @@ -58,6 +62,7 @@ The class `SparkRepository[T]` is a data access layer abstraction. It could be u The entry point of a SETL project is the object `com.jcdecaux.setl.Setl`, which will handle the pipeline and spark repository instantiation. ### Show me some code + You can find the following tutorial code in [the starter template of SETL](https://github.com/qxzzxq/setl-template). Go and clone it :) Here we show a simple example of creating and saving a **Dataset[TestObject]**. The case class **TestObject** is defined as follows: @@ -67,6 +72,7 @@ case class TestObject(partition1: Int, partition2: String, clustering1: String, ``` #### Context initialization + Suppose that we want to save our output into `src/main/resources/test_csv`. We can create a configuration file **local.conf** in `src/main/resources` with the following content that defines the target datastore to save our dataset: ```txt @@ -92,6 +98,7 @@ setl.setSparkRepository[TestObject]("testObjectRepository") ``` #### Implementation of Factory + We will create our `Dataset[TestObject]` inside a `Factory[Dataset[TestObject]]`. A `Factory[A]` will always produce an object of type `A`, and it contains 4 abstract methods that you need to implement: - read - process @@ -133,6 +140,7 @@ class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession { ``` #### Define the pipeline + To execute the factory, we should add it into a pipeline. When we call `setl.newPipeline()`, **Setl** will instantiate a new **Pipeline** and configure all the registered repositories as inputs of the pipeline. Then we can call `addStage` to add our factory into the pipeline. @@ -144,12 +152,14 @@ val pipeline = setl ``` #### Run our pipeline + ```scala pipeline.describe().run() ``` The dataset will be saved into `src/main/resources/test_csv` #### What's more? + As our `MyFactory` produces a `Dataset[TestObject]`, it can be used by other factories of the same pipeline. ```scala @@ -180,6 +190,7 @@ pipeline.addStage[AnotherFactory]() ``` ### Generate pipeline diagram (with v0.4.1+) + You can generate a [Mermaid diagram](https://mermaid-js.github.io/mermaid/#/) by doing: ```scala pipeline.showDiagram() @@ -264,15 +275,19 @@ You should also provide Scala and Spark in your pom file. 
SETL is tested against | 2.3 | 2.11 | :warning: see *known issues* | ## Known issues + - `DynamoDBConnector` doesn't work with Spark version 2.3 - `Compress` annotation can only be used on Struct field or Array of Struct field with Spark 2.3 ## Test Coverage -![](https://codecov.io/gh/SETL-Developers/setl/branch/master/graphs/sunburst.svg) + +[![coverage.svg](https://codecov.io/gh/SETL-Developers/setl/branch/master/graphs/sunburst.svg)](https://codecov.io/gh/SETL-Developers/setl) ## Documentation + [https://setl-developers.github.io/setl/](https://setl-developers.github.io/setl/) ## Contributing to SETL + [Check our contributing guide](https://github.com/SETL-Developers/setl/blob/master/CONTRIBUTING.md)