Merge remote-tracking branch 'upstream/master'
qxzzxq committed Aug 21, 2020
2 parents a5c78da + 6819da5 commit b23c9ea
Showing 2 changed files with 42 additions and 5 deletions.
21 changes: 18 additions & 3 deletions README.md
![logo](docs/img/logo_setl.png)
----------

[![build](https://github.com/SETL-Developers/setl/workflows/build/badge.svg?branch=master)](https://github.com/SETL-Developers/setl/actions)
[![codecov](https://codecov.io/gh/SETL-Developers/setl/branch/master/graph/badge.svg)](https://codecov.io/gh/SETL-Developers/setl)
[![Maven Central](https://img.shields.io/maven-central/v/com.jcdecaux.setl/setl_2.11.svg?label=Maven%20Central&color=blue)](https://mvnrepository.com/artifact/com.jcdecaux.setl/setl)
[![javadoc](https://javadoc.io/badge2/com.jcdecaux.setl/setl_2.11/javadoc.svg)](https://javadoc.io/doc/com.jcdecaux.setl/setl_2.11)
[![Gitter](https://badges.gitter.im/setl-by-jcdecaux/community.svg)](https://gitter.im/setl-by-jcdecaux/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
[![documentation](https://img.shields.io/badge/docs-passing-1f425f.svg)](https://setl-developers.github.io/setl/)

If you’re a **data scientist** or **data engineer**, this might sound familiar while working on an **ETL** project:

## Use SETL

### In a new project

You can start working by cloning [this template project](https://github.com/qxzzxq/setl-template).

### In an existing project

```xml
<dependency>
    <groupId>com.jcdecaux.setl</groupId>
    <artifactId>setl_2.11</artifactId>
    <!-- use the latest version shown by the Maven Central badge above -->
    <version>x.y.z</version>
</dependency>
```

To use the SNAPSHOT version, add the Sonatype snapshot repository to your `pom.xml`.

## Quick Start

### Basic concept

With SETL, an ETL application can be represented by a `Pipeline`. A `Pipeline` contains multiple `Stages`. In each stage, there can be one or several `Factories`.

The class `Factory[T]` is an abstraction of a data transformation that will produce an object of type `T`. It has 4 methods (*read*, *process*, *write* and *get*) that should be implemented by the developer.

The class `SparkRepository[T]` is a data access layer abstraction. It can be used to load a `Dataset[T]` from, or save it to, a configured datastore.

The entry point of a SETL project is the object `com.jcdecaux.setl.Setl`, which will handle the pipeline and spark repository instantiation.
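
For example, a minimal sketch of the context creation (assuming the default configuration loader) could look like this:

```scala
import com.jcdecaux.setl.Setl

// Build the entry point of the application. withDefaultConfigLoader()
// loads the default configuration files from the resources directory.
val setl: Setl = Setl.builder()
  .withDefaultConfigLoader()
  .getOrCreate()
```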

### Show me some code

You can find the following tutorial code in [the starter template of SETL](https://github.com/qxzzxq/setl-template). Go and clone it :)

Here we show a simple example of creating and saving a **Dataset[TestObject]**. The case class **TestObject** is defined as follows:

```scala
case class TestObject(partition1: Int, partition2: String, clustering1: String, value: Long)
```

#### Context initialization

Suppose that we want to save our output into `src/main/resources/test_csv`. We can create a configuration file **local.conf** in `src/main/resources` with the following content that defines the target datastore to save our dataset:

```txt
testObjectRepository {
  storage = "CSV"
  path = "src/main/resources/test_csv"
  inferSchema = "true"
  delimiter = ";"
  header = "true"
  saveMode = "Overwrite"
}
```

We can then register this repository in our **Setl** context:

```scala
setl.setSparkRepository[TestObject]("testObjectRepository")
```

#### Implementation of Factory

We will create our `Dataset[TestObject]` inside a `Factory[Dataset[TestObject]]`. A `Factory[A]` will always produce an object of type `A`, and it contains 4 abstract methods that you need to implement:
- read
- process
- write
- get

Here is an example implementation:

```scala
class MyFactory() extends Factory[Dataset[TestObject]] with HasSparkSession {

  import spark.implicits._

  // The SparkRepository registered in the Setl context will be delivered here by the pipeline
  @Delivery
  private[this] val repo = SparkRepository[TestObject]

  private[this] var output = spark.emptyDataset[TestObject]

  override def read(): MyFactory.this.type = this

  override def process(): MyFactory.this.type = {
    output = Seq(
      TestObject(1, "a", "A", 1L),
      TestObject(2, "b", "B", 2L)
    ).toDS()
    this
  }

  override def write(): MyFactory.this.type = {
    repo.save(output)
    this
  }

  override def get(): Dataset[TestObject] = output
}
```

#### Define the pipeline

To execute the factory, we should add it into a pipeline.

When we call `setl.newPipeline()`, **Setl** will instantiate a new **Pipeline** and configure all the registered repositories as inputs of the pipeline. Then we can call `addStage` to add our factory into the pipeline.

```scala
val pipeline = setl
  .newPipeline()
  .addStage[MyFactory]()
```

#### Run our pipeline

```scala
pipeline.describe().run()
```
The dataset will be saved into `src/main/resources/test_csv`.

#### What's more?

As our `MyFactory` produces a `Dataset[TestObject]`, it can be used by other factories of the same pipeline.

```scala
// AnotherFactory takes the Dataset[TestObject] produced by MyFactory as input:
// the pipeline delivers it into the @Delivery-annotated field.
class AnotherFactory extends Factory[String] with HasSparkSession {

  import spark.implicits._

  @Delivery
  private[this] val outputOfMyFactory = spark.emptyDataset[TestObject]

  override def read(): AnotherFactory.this.type = this

  override def process(): AnotherFactory.this.type = this

  override def write(): AnotherFactory.this.type = {
    outputOfMyFactory.show()
    this
  }

  override def get(): String = "output of AnotherFactory"
}

pipeline.addStage[AnotherFactory]()
```

### Generate pipeline diagram (with v0.4.1+)

You can generate a [Mermaid diagram](https://mermaid-js.github.io/mermaid/#/) by doing:
```scala
pipeline.showDiagram()
```

You should also provide Scala and Spark in your `pom.xml`. SETL is tested against several Spark/Scala version combinations, including:

| Spark version | Scala version | Note |
| ------------- | ------------- | ---- |
| 2.3 | 2.11 | :warning: see *known issues* |

## Known issues

- `DynamoDBConnector` doesn't work with Spark version 2.3
- The `Compress` annotation can only be used on a Struct field or an Array of Struct fields with Spark 2.3

## Test Coverage

[![coverage.svg](https://codecov.io/gh/SETL-Developers/setl/branch/master/graphs/sunburst.svg)](https://codecov.io/gh/SETL-Developers/setl)

## Documentation

[https://setl-developers.github.io/setl/](https://setl-developers.github.io/setl/)

## Contributing to SETL

[Check our contributing guide](https://github.com/SETL-Developers/setl/blob/master/CONTRIBUTING.md)

26 changes: 24 additions & 2 deletions docs/Connector.md
The **Connector** trait (declared as `trait Connector extends Logging`) is inherited by two abstract classes: **FileConnector** and **DBConnector**.

## Implementation

[![](https://mermaid.ink/img/eyJjb2RlIjoiICBncmFwaCBURDtcblxuICBDb25uZWN0b3IgLS0-IEZpbGVDb25uZWN0b3I7XG4gIENvbm5lY3RvciAtLT4gREJDb25uZWN0b3I7XG5cbiAgRmlsZUNvbm5lY3RvciAtLT4gQ1NWQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBKU09OQ29ubmVjdG9yO1xuICBDb25uZWN0b3IgLS0-IEV4Y2VsQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBQYXJxdWV0Q29ubmVjdG9yO1xuXG4gIERCQ29ubmVjdG9yIC0tPiBDYXNzYW5kcmFDb25uZWN0b3I7XG4gIERCQ29ubmVjdG9yIC0tPiBEeW5hbW9EQkNvbm5lY3RvcjsiLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCJ9fQ)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiICBncmFwaCBURDtcblxuICBDb25uZWN0b3IgLS0-IEZpbGVDb25uZWN0b3I7XG4gIENvbm5lY3RvciAtLT4gREJDb25uZWN0b3I7XG5cbiAgRmlsZUNvbm5lY3RvciAtLT4gQ1NWQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBKU09OQ29ubmVjdG9yO1xuICBDb25uZWN0b3IgLS0-IEV4Y2VsQ29ubmVjdG9yO1xuICBGaWxlQ29ubmVjdG9yIC0tPiBQYXJxdWV0Q29ubmVjdG9yO1xuXG4gIERCQ29ubmVjdG9yIC0tPiBDYXNzYW5kcmFDb25uZWN0b3I7XG4gIERCQ29ubmVjdG9yIC0tPiBEeW5hbW9EQkNvbm5lY3RvcjsiLCJtZXJtYWlkIjp7InRoZW1lIjoiZGVmYXVsdCJ9fQ)

## FileConnector

[**FileConnector**](https://github.com/SETL-Developers/setl/tree/master/src/main/scala/com/jcdecaux/setl/storage/connector/FileConnector.scala) can be used to access files stored in different file systems.

### Functionalities

A **FileConnector** is built from a **SparkSession** and an options map:

```scala
val fileConnector = new FileConnector(spark, options)
```

where `spark` is the current **SparkSession** and `options` is a `Map[String, String]` object.

#### Read

Read data from persistent storage. This method needs to be implemented in a concrete **FileConnector**.

#### Write

Write data to persistent storage. This method needs to be implemented in a concrete **FileConnector**.

#### Delete

Delete a file if the value of `path` defined in **options** is a file path. If `path` is a directory, then delete the directory with all its contents.

Use it with care!

#### Schema

The schema of the data can be set by adding a key `schema` to the **options** map of the constructor. The schema must be a DDL-format string:
> partition1 INT, partition2 STRING, clustering1 STRING, value LONG
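
For example, a sketch of an **options** map carrying a DDL schema (the other keys are illustrative):

```scala
val options = Map(
  "path"   -> "src/main/resources/test_csv",
  "schema" -> "partition1 INT, partition2 STRING, clustering1 STRING, value LONG"
)
```
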
#### Partition

Data can be partitioned before saving. To do this, call `partitionBy(columns: String*)` before `write(df)`, and *Spark* will partition the *DataFrame* by creating subdirectories in the root directory.
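
A sketch, assuming `connector` is a concrete **FileConnector** and `df` is the *DataFrame* to save:

```scala
// Create one subdirectory per distinct (partition1, partition2) value
connector.partitionBy("partition1", "partition2")
connector.write(df)
```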

#### Suffix

A suffix is similar to a partition, but it is defined manually while calling `write(df, suffix)`. **Connector** handles the suffix by creating a subdirectory with the same naming convention as a Spark partition (by default it will be `_user_defined_suffix=suffix`).

>:warning: Currently (v0.3), you **can't** mix with-suffix writes and non-suffix writes when your data are partitioned; an **IllegalArgumentException** will be thrown in this case. This is not supported because the suffix is handled by *Connector* while the partition is handled by *Spark*, so a suffix may confuse Spark when it tries to infer the structure of the DataFrame.
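
A sketch, assuming the same `connector` and `df` as above and that the suffixed overload takes an `Option[String]`:

```scala
// Writes into a subdirectory _user_defined_suffix=20200821
connector.write(df, Some("20200821"))
```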

#### Multiple files reading and name pattern matching

You can read multiple files at once if the `path` you set in **options** is a directory (instead of a file path). You can also filter files by setting a regex pattern `filenamePattern` in **options**.
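
A sketch with hypothetical paths, reading only the CSV files of 2020 from a directory:

```scala
val options = Map(
  "path" -> "src/main/resources/input_dir",    // a directory instead of a single file
  "filenamePattern" -> "(.+)2020(.+)\\.csv"    // only file names matching this regex are read
)
```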

#### File system support

- Local file system
- AWS S3
- Hadoop File System

#### S3 Authentication

To access S3, if an *authentication error* occurs, you may have to provide extra settings in **options** for the authentication process. Multiple authentication methods are available; they can be selected by changing the **Authentication Provider**.

To configure authentication, you can choose among several credentials providers, for example `com.amazonaws.auth.InstanceProfileCredentialsProvider`.
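
A minimal sketch of such an **options** map (assuming the `fs.s3a.*` settings are forwarded to the underlying Hadoop configuration; the bucket path is hypothetical):

```scala
val options = Map(
  "path" -> "s3a://my-bucket/input",
  "fs.s3a.aws.credentials.provider" -> "com.amazonaws.auth.InstanceProfileCredentialsProvider"
)
```
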
More information can be found [here](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A_Authentication_methods).

## DBConnector

[DBConnector](https://github.com/SETL-Developers/setl/tree/master/src/main/scala/com/jcdecaux/setl/storage/connector/DBConnector.scala) can be used to access databases.

### Functionalities

#### Read

Read data from a database. This method needs to be implemented in a concrete **DBConnector**.

#### Create

Create a table in a database. This method needs to be implemented in a concrete **DBConnector**.

#### Write
Write data to a database. This method needs to be implemented in a concrete **DBConnector**.

#### Delete

Send a delete request.

## CSVConnector

### Options

| name | default |
| ------ | ------- |
| path | <user_input> |
For other options, please refer to [this doc](https://docs.databricks.com/spark/latest/data-sources/read-csv.html).
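
As a sketch, an **options** map for a CSV connector could look like this (apart from `path`, the keys are assumptions based on the linked documentation):

```scala
val csvOptions = Map(
  "path" -> "src/main/resources/test_csv",
  "delimiter" -> ";",
  "header" -> "true",
  "inferSchema" -> "true",
  "saveMode" -> "Overwrite"
)
```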

## JSONConnector

### Options

| name | default |
| ------ | ------- |
| path | <user_input> |


## ParquetConnector

### Options

| name | default |
| ------ | ------- |
| path | <user_input> |

## ExcelConnector
### Options

| name | default |
| ------ | ------- |
| path | <user_input> |
| addColorColumns | `false` |
| dateFormat | `yyyy-MM-dd` |
| timestampFormat | `yyyy-mm-dd hh:mm:ss.000` |
| maxRowsInMemory | `None` |
| excerptSize | 10 |
| workbookPassword | `None` |

## DynamoDBConnector

### Options

| name | default |
| ------ | ------- |
| saveMode | <user_input> |

## CassandraConnector

### Options

| name | default |
| ------ | ------- |
