Data engineers are responsible for making data accessible to all the people who use it across an organization. That could mean creating a data warehouse for the analytics team, building a data pipeline for a front-end application, or summarizing massive datasets to be more user-friendly.
During this program, we will complete four courses and five projects. Throughout the projects, we will play the part of a data engineer at a music streaming company. We will work with the same type of data in each project, but with increasing data volume, velocity, and complexity. Here’s a course-by-course breakdown.
In this course, we will learn to create relational and NoSQL data models to fit the diverse needs of data consumers. In the project, we will build SQL (Postgres) and NoSQL (Apache Cassandra) data models using user activity data for a music streaming app.
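To give a rough feel for the two modeling styles, here is a minimal sketch that creates one relational table in Postgres and one query-oriented table in Cassandra. The connection settings, keyspace, and table/column names are placeholders for illustration, not the project’s actual schema.

```python
# Illustrative only: local Postgres and Cassandra instances are assumed,
# and the database, keyspace, and column names are placeholders.
import psycopg2
from cassandra.cluster import Cluster

# Relational (Postgres): a dimensional-style fact table for song plays.
pg_conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = pg_conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")
pg_conn.commit()

# NoSQL (Cassandra): the table is designed around the query it must answer,
# e.g. "which songs were played in a given session, in order?"
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")
session.set_keyspace("sparkify")
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id INT,
        item_in_session INT,
        artist TEXT,
        song_title TEXT,
        length FLOAT,
        PRIMARY KEY (session_id, item_in_session)
    );
""")
```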
In this course, we will learn to create cloud-based data warehouses. In the project, we will build an ELT pipeline that extracts data from Amazon S3, stages it in Amazon Redshift, and transforms it into a set of dimensional tables.
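The ELT pattern here boils down to two SQL steps run against Redshift: a COPY that loads raw files from S3 into a staging table, and an INSERT … SELECT that transforms staged rows into a dimensional table. The sketch below assumes placeholder values for the cluster endpoint, bucket, IAM role, and table names.

```python
# Illustrative ELT sketch: all connection details, the S3 path, the IAM role,
# and the table definitions are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com",
    dbname="dev", user="awsuser", password="********", port=5439,
)
cur = conn.cursor()

# Extract + Load: stage raw JSON event logs directly from S3 into Redshift.
cur.execute("""
    COPY staging_events
    FROM 's3://my-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""")

# Transform: populate a dimension table from the staged data.
cur.execute("""
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT user_id, first_name, last_name, gender, level
    FROM staging_events
    WHERE user_id IS NOT NULL;
""")
conn.commit()
```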
In this course, we will learn more about the big data ecosystem, how to process massive datasets with Apache Spark, and how to store big data in a data lake. In the project, we will build an ETL pipeline for a data lake using Apache Spark and S3.
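A minimal PySpark sketch of that kind of data-lake ETL: read raw JSON from S3, derive a dimension table, and write it back to S3 as partitioned Parquet. The bucket paths and column names are assumptions, and the cluster is expected to have S3 (hadoop-aws) access configured.

```python
# Illustrative PySpark ETL: input/output buckets and columns are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sparkify-data-lake-etl")
         .getOrCreate())

# Extract: raw song metadata stored as JSON in S3.
song_data = spark.read.json("s3a://my-input-bucket/song_data/*/*/*/*.json")

# Transform: keep one row per song with the columns the analytics team needs.
songs_table = (song_data
               .select("song_id", "title", "artist_id", "year", "duration")
               .dropDuplicates(["song_id"]))

# Load: write partitioned Parquet back to the data lake.
(songs_table.write
 .mode("overwrite")
 .partitionBy("year", "artist_id")
 .parquet("s3a://my-output-bucket/songs/"))
```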
In this course, we will learn to schedule, automate, and monitor data pipelines using Apache Airflow. In the project, we will continue our work on the music streaming company’s data infrastructure by creating and automating a set of data pipelines.
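In Airflow, a pipeline is expressed as a DAG of tasks plus a schedule. The sketch below shows the shape of such a DAG with two placeholder tasks (staging followed by a data-quality check); the task bodies, names, and hourly schedule are illustrative, and the imports follow the Airflow 2.x style.

```python
# Illustrative DAG: task logic is stubbed out; IDs and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stage_events(**context):
    # e.g. copy newly arrived event files from S3 into staging tables
    pass


def run_quality_checks(**context):
    # e.g. fail the run if a target table ended up empty
    pass


with DAG(
    dag_id="sparkify_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_events", python_callable=stage_events)
    checks = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)

    # Quality checks only run after staging succeeds.
    stage >> checks
```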
In the Capstone project, we combine Twitter data, World Happiness Index data, and Earth surface temperature data to explore whether there is any correlation among them. The Twitter data is dynamic, while the other two datasets are static. The general idea of the project is to extract Twitter data, analyze its sentiment, and combine the resulting scores with the other datasets to look for insights.
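As a rough sketch of that flow, the snippet below scores tweet sentiment, aggregates it per country, and joins it against the static happiness-index data. The file paths, column names, and the use of TextBlob as the sentiment scorer are assumptions made for illustration, not the project’s prescribed tooling.

```python
# Illustrative capstone flow: file names, columns, and the TextBlob scorer
# are assumptions, not the project specification.
import pandas as pd
from textblob import TextBlob

tweets = pd.read_json("tweets.json", lines=True)        # dynamic Twitter extract
happiness = pd.read_csv("world_happiness_index.csv")    # static dataset

# Score each tweet's sentiment polarity in [-1, 1].
tweets["sentiment"] = tweets["text"].apply(lambda t: TextBlob(t).sentiment.polarity)

# Aggregate sentiment per country so it can be joined with the static data.
sentiment_by_country = (tweets
                        .groupby("country", as_index=False)["sentiment"]
                        .mean())

combined = sentiment_by_country.merge(happiness, on="country", how="inner")

# Simple correlation between average tweet sentiment and happiness score.
print(combined["sentiment"].corr(combined["happiness_score"]))
```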