Merge remote-tracking branch 'upstream/master'
qxzzxq committed Aug 21, 2020
2 parents a5c78da + 6819da5 commit b23c9ea
Showing 2 changed files with 42 additions and 5 deletions.
21 changes: 18 additions & 3 deletions README.md
![logo](docs/img/logo_setl.png)
----------

[![build](https://github.com/SETL-Developers/setl/workflows/build/badge.svg?branch=master)](https://github.com/SETL-Developers/setl/actions)
[![codecov](https://codecov.io/gh/SETL-Developers/setl/branch/master/graph/badge.svg)](https://codecov.io/gh/SETL-Developers/setl)
[![Maven Central](https://img.shields.io/maven-central/v/com.jcdecaux.setl/setl_2.11.svg?label=Maven%20Central&color=blue)](https://mvnrepository.com/artifact/com.jcdecaux.setl/setl)
[![javadoc](https://javadoc.io/badge2/com.jcdecaux.setl/setl_2.11/javadoc.svg)](https://javadoc.io/doc/com.jcdecaux.setl/setl_2.11)
[![Gitter](https://badges.gitter.im/setl-by-jcdecaux/community.svg)](https://gitter.im/setl-by-jcdecaux/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[![documentation](https://img.shields.io/badge/docs-passing-1f425f.svg)](https://setl-developers.github.io/setl/)

If you’re a **data scientist** or **data engineer**, this might sound familiar while working on an **ETL** project:

## Use SETL

### In a new project

You can start working by cloning [this template project](https://github.com/qxzzxq/setl-template).

### In an existing project

```xml
<dependency>
    <groupId>com.jcdecaux.setl</groupId>
    <artifactId>setl_2.11</artifactId>
    <!-- use the latest version shown by the Maven Central badge above -->
    <version>x.y.z</version>
</dependency>
```

To use the SNAPSHOT version, add the Sonatype snapshot repository to your `pom.xml`.

## Quick Start

### Basic concept

With SETL, an ETL application can be represented by a `Pipeline`. A `Pipeline` contains multiple `Stages`. In each stage, there can be one or several `Factories`.

The class `Factory[T]` is an abstraction of a data transformation that will produce an object of type `T`. It has 4 methods (*read*, *process*, *write* and *get*) that should be implemented by the developer.

The class `SparkRepository[T]` is a data access layer abstraction. It can be used to load a `Dataset[T]` from, or save it to, a configured datastore.

The entry point of a SETL project is the object `com.jcdecaux.setl.Setl`, which will handle the pipeline and spark repository instantiation.
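
For example, a minimal sketch of the context creation (assuming the default configuration loader) could look like this:

```scala
import com.jcdecaux.setl.Setl

// Build the entry point of the application. withDefaultConfigLoader()
// loads the default configuration files from the resources directory.
val setl: Setl = Setl.builder()
  .withDefaultConfigLoader()
  .getOrCreate()
```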

### Show me some code

You can find the following tutorial code in [the starter template of SETL](https://github.com/qxzzxq/setl-template). Go and clone it :)

Here we show a simple example of creating and saving a **Dataset[TestObject]**. The case class **TestObject** is defined as follows:

```scala
case class TestObject(partition1: Int, partition2: String, clustering1: String, value: Long)
```

#### Context initialization

Suppose that we want to save our output into `src/main/resources/test_csv`. We can create a configuration file **local.conf** in `src/main/resources` with the following content that defines the target datastore to save our dataset:

```txt
testObjectRepository {
  storage = "CSV"
  path = "src/main/resources/test_csv"
  inferSchema = "true"
  delimiter = ";"
  header = "true"
  saveMode = "Overwrite"
}
```

We can then register this repository in our **Setl** context:

```scala
setl.setSparkRepository[TestObject]("testObjectRepository")
```

#### Implementation of Factory

We will create our `Dataset[TestObject]` inside a `Factory[Dataset[TestObject]]`. A `Factory[A]` will always produce an object of type `A`, and it contains 4 abstract methods that you need to implement:
- read
- process
- write
- get

Here is an example implementation:

```scala
class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {

  import spark.implicits._

  // The SparkRepository registered in the Setl context will be delivered here by the pipeline
  @Delivery
  private[this] val repo = SparkRepository[TestObject]

  private[this] var output = spark.emptyDataset[TestObject]

  override def read(): MyFactory.this.type = this

  override def process(): MyFactory.this.type = {
    output = Seq(
      TestObject(1, "a", "A", 1L),
      TestObject(2, "b", "B", 2L)
    ).toDS()
    this
  }

  override def write(): MyFactory.this.type = {
    repo.save(output)
    this
  }

  override def get(): Dataset[TestObject] = output
}
```

#### Define the pipeline

To execute the factory, we should add it into a pipeline.

When we call `setl.newPipeline()`, **Setl** will instantiate a new **Pipeline** and configure all the registered repositories as inputs of the pipeline. Then we can call `addStage` to add our factory into the pipeline.

```scala
val pipeline = setl
  .newPipeline()
  .addStage[MyFactory]()
```

#### Run our pipeline

```scala
pipeline.describe().run()
```
The dataset will be saved into `src/main/resources/test_csv`.

#### What's more?

As our `MyFactory` produces a `Dataset[TestObject]`, it can be used by other factories of the same pipeline.

```scala
// AnotherFactory takes the Dataset[TestObject] produced by MyFactory as input:
// the pipeline delivers it into the @Delivery-annotated field.
class AnotherFactory extends Factory[String] with HasSparkSession {

  import spark.implicits._

  @Delivery
  private[this] val outputOfMyFactory = spark.emptyDataset[TestObject]

  override def read(): AnotherFactory.this.type = this

  override def process(): AnotherFactory.this.type = this

  override def write(): AnotherFactory.this.type = {
    outputOfMyFactory.show()
    this
  }

  override def get(): String = "output of AnotherFactory"
}

pipeline.addStage[AnotherFactory]()
```

### Generate pipeline diagram (with v0.4.1+)

You can generate a [Mermaid diagram](https://mermaid-js.github.io/mermaid/#/) by doing:
```scala
pipeline.showDiagram()
```

You should also provide Scala and Spark in your `pom.xml`. SETL is tested against several Spark/Scala version combinations, including:

| Spark version | Scala version | Note |
| ------------- | ------------- | ---- |
| 2.3 | 2.11 | :warning: see *known issues* |

## Known issues

- `DynamoDBConnector` doesn't work with Spark version 2.3
- The `Compress` annotation can only be used on a Struct field or an Array of Struct fields with Spark 2.3

## Test Coverage

[![coverage.svg](https://codecov.io/gh/SETL-Developers/setl/branch/master/graphs/sunburst.svg)](https://codecov.io/gh/SETL-Developers/setl)

## Documentation

[https://setl-developers.github.io/setl/](https://setl-developers.github.io/setl/)

## Contributing to SETL

[Check our contributing guide](https://github.com/SETL-Developers/setl/blob/master/CONTRIBUTING.md)

26 changes: 24 additions & 2 deletions docs/Connector.md
The **Connector** trait (declared as `trait Connector extends Logging`) is inherited by two abstract classes: **FileConnector** and **DBConnector**.

## Implementation

[![](https://mermaid.ink/img/eyJjb2RlIjoiICBncmFwaCBURDtcblxuICBDb25uZWN0b3IgLS0-IEZpbGVDb25uZWN0b3I7XG4gIENvbm5lY3RvciAtLT4gREJDb25uZWN0b3I7XG5cbiAgRmlsZUNvbm5lY3RvciAtLT4gQ1NWQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBKU09OQ29ubmVjdG9yO1xuICBDb25uZWN0b3IgLS0-IEV4Y2VsQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBQYXJxdWV0Q29ubmVjdG9yO1xuXG4gIERCQ29ubmVjdG9yIC0tPiBDYXNzYW5kcmFDb25uZWN0b3I7XG4gIERCQ29ubmVjdG9yIC0tPiBEeW5hbW9EQkNvbm5lY3RvcjsiLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCJ9fQ)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiICBncmFwaCBURDtcblxuICBDb25uZWN0b3IgLS0-IEZpbGVDb25uZWN0b3I7XG4gIENvbm5lY3RvciAtLT4gREJDb25uZWN0b3I7XG5cbiAgRmlsZUNvbm5lY3RvciAtLT4gQ1NWQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBKU09OQ29ubmVjdG9yO1xuICBDb25uZWN0b3IgLS0-IEV4Y2VsQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBQYXJxdWV0Q29ubmVjdG9yO1xuXG4gIERCQ29ubmVjdG9yIC0tPiBDYXNzYW5kcmFDb25uZWN0b3I7XG4gIERCQ29ubmVjdG9yIC0tPiBEeW5hbW9EQkNvbm5lY3RvcjsiLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCJ9fQ)

## FileConnector

[**FileConnector**](https://github.com/SETL-Developers/setl/tree/master/src/main/scala/com/jcdecaux/setl/storage/connector/FileConnector.scala) can be used to access files stored in different file systems.

### Functionalities

A **FileConnector** is built from a **SparkSession** and an options map:

```scala
val fileConnector = new FileConnector(spark, options)
```

where `spark` is the current **SparkSession** and `options` is a `Map[String, String]` object.

#### Read

Read data from persistent storage. This method needs to be implemented in a concrete **FileConnector**.

#### Write

Write data to persistent storage. This method needs to be implemented in a concrete **FileConnector**.

#### Delete

Delete a file if the value of `path` defined in **options** is a file path. If `path` is a directory, then delete the directory with all its contents.

Use it with care!

#### Schema

The schema of the data can be set by adding a key `schema` to the **options** map of the constructor. The schema must be a DDL-format string:
> partition1 INT, partition2 STRING, clustering1 STRING, value LONG
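
For example, a sketch of an **options** map carrying a DDL schema (the other keys are illustrative):

```scala
val options = Map(
  "path"   -> "src/main/resources/test_csv",
  "schema" -> "partition1 INT, partition2 STRING, clustering1 STRING, value LONG"
)
```
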
#### Partition

Data can be partitioned before saving. To do this, call `partitionBy(columns: String*)` before `write(df)`, and *Spark* will partition the *DataFrame* by creating subdirectories in the root directory.
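
A sketch, assuming `connector` is a concrete **FileConnector** and `df` is the *DataFrame* to save:

```scala
// Create one subdirectory per distinct (partition1, partition2) value
connector.partitionBy("partition1", "partition2")
connector.write(df)
```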

#### Suffix

A suffix is similar to a partition, but it is defined manually while calling `write(df, suffix)`. **Connector** handles the suffix by creating a subdirectory with the same naming convention as a Spark partition (by default it will be `_user_defined_suffix=suffix`).

>:warning: Currently (v0.3), you **can't** mix with-suffix writes and non-suffix writes when your data are partitioned; an **IllegalArgumentException** will be thrown in this case. This is not supported because the suffix is handled by *Connector* while the partition is handled by *Spark*, so a suffix may confuse Spark when it tries to infer the structure of the DataFrame.
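
A sketch, assuming the same `connector` and `df` as above and that the suffixed overload takes an `Option[String]`:

```scala
// Writes into a subdirectory _user_defined_suffix=20200821
connector.write(df, Some("20200821"))
```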

#### Multiple files reading and name pattern matching

You can read multiple files at once if the `path` you set in **options** is a directory (instead of a file path). You can also filter files by setting a regex pattern `filenamePattern` in **options**.
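
A sketch with hypothetical paths, reading only the CSV files of 2020 from a directory:

```scala
val options = Map(
  "path" -> "src/main/resources/input_dir",    // a directory instead of a single file
  "filenamePattern" -> "(.+)2020(.+)\\.csv"    // only file names matching this regex are read
)
```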

#### File system support

- Local file system
- AWS S3
- Hadoop File System

#### S3 Authentication

To access S3, if an *authentication error* occurs, you may have to provide extra settings in **options** for the authentication process. Multiple authentication methods are available; they can be selected by changing the **Authentication Provider**.

To configure authentication, you can choose among several credentials providers, for example `com.amazonaws.auth.InstanceProfileCredentialsProvider`.
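
A minimal sketch of such an **options** map (assuming the `fs.s3a.*` settings are forwarded to the underlying Hadoop configuration; the bucket path is hypothetical):

```scala
val options = Map(
  "path" -> "s3a://my-bucket/input",
  "fs.s3a.aws.credentials.provider" -> "com.amazonaws.auth.InstanceProfileCredentialsProvider"
)
```
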
More information can be found [here](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods).

## DBConnector

[DBConnector](https://github.com/SETL-Developers/setl/tree/master/src/main/scala/com/jcdecaux/setl/storage/connector/DBConnector.scala) can be used to access databases.

### Functionalities

#### Read

Read data from a database. This method needs to be implemented in a concrete **DBConnector**.

#### Create

Create a table in a database. This method needs to be implemented in a concrete **DBConnector**.

#### Write
Write data to a database. This method needs to be implemented in a concrete **DBConnector**.

#### Delete

Send a delete request.

## CSVConnector

### Options

| name | default |
| ------ | ------- |
| path | <user_input> |
For other options, please refer to [this doc](https://docs.databricks.com/spark/latest/data-sources/read-csv.html).
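
As a sketch, an **options** map for a CSV connector could look like this (apart from `path`, the keys are assumptions based on the linked documentation):

```scala
val csvOptions = Map(
  "path" -> "src/main/resources/test_csv",
  "delimiter" -> ";",
  "header" -> "true",
  "inferSchema" -> "true",
  "saveMode" -> "Overwrite"
)
```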

## JSONConnector

### Options

| name | default |
| ------ | ------- |
| path | <user_input> |


## ParquetConnector

### Options

| name | default |
| ------ | ------- |
| path | <user_input> |

## ExcelConnector
### Options

| name | default |
| ------ | ------- |
| path | <user_input> |
| addColorColumns | `false` |
| dateFormat | `yyyy-MM-dd` |
| timestampFormat | `yyyy-mm-dd hh:mm:ss.000` |
| maxRowsInMemory | `None` |
| excerptSize | 10 |
| workbookPassword | `None` |

## DynamoDBConnector

### Options

| name | default |
| ------ | ------- |
| saveMode | <user_input> |

## CassandraConnector

### Options

| name | default |
| ------ | ------- |
