General Improvements and Refinements across the quick start (#1)
General improvements and refinements across the quickstart:

- Code clean-up: removed warnings and streamlined the codebase.
- Updated Python image: refreshed the producer's Python base image.
- Enhanced health checks: added health checks for the Pinot containers to better ensure their reliability.
- Documentation: refactored the README to make it more comprehensive and easier to follow, and fixed several typographical errors.
gAmUssA authored Apr 16, 2024
1 parent 778e347 commit 4dfb97e
Showing 8 changed files with 133 additions and 107 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/makefile.yml
Original file line number Diff line number Diff line change
@@ -40,12 +40,12 @@ jobs:
- name: Create Pinot Tables
run: make tables

- name: Import Data
run: make import

- name: Validate that cluster is Up and Schemas are deployed
run: make validate

- name: Import Data
run: make import

- name: Teardown
run: make destroy

7 changes: 6 additions & 1 deletion .gitignore
@@ -1,2 +1,7 @@
logs
.venv
# Python virtual environments
venv/
myenv/
*.venv
env/
ENV/
5 changes: 4 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
base: create tables import
base: create tables import info

schema:
docker run \
@@ -90,6 +90,9 @@ validate:
exit 1; \
fi

info:
@printf "🍷 Pinot Query UI - \033[4mhttp://localhost:9000\033[0m\n"

destroy:
docker compose down -v
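
The new `info` target prints the console URL using ANSI escape codes. As a quick sanity check outside `make`, the same `printf` can be run directly in a shell (a minimal sketch; the URL is the quick start's own Pinot Query UI):

```shell
# \033[4m turns on underlining; \033[0m resets all attributes.
# Passing the URL via %s keeps it out of the format string.
printf '🍷 Pinot Query UI - \033[4m%s\033[0m\n' "http://localhost:9000"
```

In a Makefile the leading `@` only suppresses command echo; the escape handling is identical.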

125 changes: 71 additions & 54 deletions README.md
@@ -1,6 +1,25 @@
# Pinot Getting Started

This repository gets you started with Apache Pinot. It loads two sources of data: a real-time stream of movie ratings and a batch source of movies. The two data sets can be joined together in Apache Pinot.
# Pinot Getting Started Guide

Welcome to the Apache Pinot Getting Started guide.
This repository will help you set up and run a demonstration that involves streaming and batch data sources.
The demonstration includes a real-time stream of movie ratings and a batch data source of movies, which can be joined in Apache Pinot for querying.

<!-- TOC -->
* [Pinot Getting Started Guide](#pinot-getting-started-guide)
* [Architecture Diagram](#architecture-diagram)
* [A Quick Shortcut](#a-quick-shortcut)
* [Step-by-Step Details](#step-by-step-details)
* [Step 1: Build and Launch with Docker](#step-1-build-and-launch-with-docker)
* [Step 2: Create a Kafka Topic](#step-2-create-a-kafka-topic)
* [Step 3: Configure Pinot Tables](#step-3-configure-pinot-tables)
* [Step 4: Load Data into the Movies Table](#step-4-load-data-into-the-movies-table)
* [Step 5: Apache Pinot Advanced Usage](#step-5-apache-pinot-advanced-usage)
* [Clean Up](#clean-up)
* [Troubleshooting](#troubleshooting)
* [Further Reading](#further-reading)
<!-- TOC -->

## Architecture Diagram

```mermaid
flowchart LR
Expand All @@ -14,47 +33,45 @@ p-->mrp[Movie Ratings]
p-->Movies
```

## Just Run It
## A Quick Shortcut

Use `make` to just see the demonstration run. Run the command below. To delve into the setup, go to [step by step](#step-by-step) section.
To quickly see the demonstration in action, you can use the following command:

```bash
make base
make
```

Skip to the [Apache Pinot](#apache-pinot) section to run the `multi-stage` join between the ratings and movies table.
For a detailed step-by-step setup, please refer to the [Step-by-Step Details](#step-by-step-details) section.

## Step-By-Step Details
If you're ready to explore the advanced features, jump directly to the [Apache Pinot Advanced Usage](#step-5-apache-pinot-advanced-usage) section to run a multi-stage join between the ratings and movies tables.

This section is a step-by-step outline of the process to get this demonstration running. It describes the steps in more detail.
## Step-by-Step Details

### Step 1 - Build and Compose Up with Docker
This section provides detailed instructions to get the demonstration up and running from scratch.

Apache Pinot's can query real-time streaming data flowing through streaming platforms like Apache Kafka.
### Step 1: Build and Launch with Docker

To mock streaming data, this quick start has a built-in stream producer that writes to Kafka using Python. All Python-related details for this producer can be found in its [Dockerfile](docker/producer/Dockerfile).
Apache Pinot queries real-time data through streaming platforms like Apache Kafka.
This setup includes a mock stream producer using Python to write data into Kafka.

Build the producer image and start all the services by running these commands.
First, build the producer image and start all services using the following commands:

```bash
docker compose build --no-cache

docker compose up -d
```

The [docker-compose](./docker-compose.yml) file starts up these containers:

- Dedicated Zookeeper for Pinot
- Pinot Controller
- Pinot Broker
- Pinot Server
- Kraft "Zookeeperless" Kafka
- The python producer
The `docker-compose.yml` file configures the following services:

- Zookeeper (dedicated to Pinot)
- Pinot Controller, Broker, and Server
- Kraft (Zookeeperless Kafka)
- Python producer

### Step 2 - Create a Kafka Topic
### Step 2: Create a Kafka Topic

Create the Kafka topic for the producer to write into and for the Pinot table to read from.
Next, create a Kafka topic for the producer to send data to, which Pinot will then read from:

```bash
docker exec -it kafka kafka-topics.sh \
@@ -63,7 +80,7 @@ docker exec -it kafka kafka-topics.sh \
--topic movie_ratings
```

At this point, the producer should be sending data to a topic in Kafka called `movie_ratings`. You can test this by running the command below.
To verify the stream, check the data flowing into the Kafka topic:

```bash
docker exec -it kafka \
@@ -72,14 +89,15 @@ docker exec -it kafka \
--topic movie_ratings
```
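
Each record on the topic is a small JSON document. As a rough illustration only (the field names below are hypothetical, not taken from the repo's schema), a rating event might look like the output of:

```shell
# Emit one illustrative rating event; movieId/rating/ratingTime are
# assumed field names for this sketch, not the repo's actual schema.
movie_id=42
rating=0.95
printf '{"movieId": %s, "rating": %s, "ratingTime": %s}\n' \
  "$movie_id" "$rating" "$(date +%s)"
```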

### Step 3 - Create the Pinot Tables
### Step 3: Configure Pinot Tables

There are two tables we need to create in Pinot:
In Pinot, create two types of tables:

- A REALTIME table called `movie_ratings`.
- An OFFLINE table called `movies`.
1. A REALTIME table for streaming data (`movie_ratings`).
2. An OFFLINE table for batch data (`movies`).

To query the Kafka topic in Pinot, we add the real-time table using the `pinot-admin` CLI, providing it with a [schema](./table/ratings.schema.json) and a [table configuration](./table/ratings.table.json). The table configuration contains the connection information to Kafka.
To query the Kafka topic in Pinot, we add the real-time table using the `pinot-admin` CLI, providing it with a [schema](./table/ratings.schema.json) and a [table configuration](./table/ratings.table.json).
The table configuration contains the connection information to Kafka.
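
Before registering the table, it can be worth sanity-checking the configuration locally. The fragment below is illustrative, not the repo's actual file — the `stream.kafka.*` entries are standard Pinot stream-config keys, and `python3 -m json.tool` simply validates that the JSON parses:

```shell
# Validate an illustrative streamConfigs fragment; prints "valid JSON" on success.
cat <<'EOF' | python3 -m json.tool > /dev/null && echo "valid JSON"
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "movie_ratings",
    "stream.kafka.broker.list": "kafka:9092"
  }
}
EOF
```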

```bash
docker exec -it pinot-controller ./bin/pinot-admin.sh \
@@ -101,61 +119,60 @@ docker exec -it pinot-controller ./bin/pinot-admin.sh \
-exec
```

Once added, the OFFLINE table will not have any data. Let's add data in the next step.
Once added, the OFFLINE table will not have any data.
Let's add data in the next step.

### Step 4 - Load the Movies Table

We again leverage the `pinot-admin.sh` CLI to load data into an OFFLINE table.
### Step 4: Load Data into the Movies Table

Use the following command to load data into the OFFLINE movies table:

```bash
docker exec -it pinot-controller ./bin/pinot-admin.sh \
LaunchDataIngestionJob \
-jobSpecFile /tmp/pinot/table/jobspec.yaml
```

In this command, we use a YAML [file](table/jobspec.yaml) that provides the specification for loading the [movies data](data/movies.json). Once this job is completed, you can query the movies table [here](http://localhost:9000/#/query?query=select+*+from+movies+limit+10&tracing=false&useMSE=false).

Now that you can query both the REALTIME and OFFLINE tables, you can perform a JOIN query in the next section.
Now, both the REALTIME and OFFLINE tables are queryable.

## Apache Pinot
### Step 5: Apache Pinot Advanced Usage

Click to open the Pinot console [here](http://localhost:9000/#/query). To perform a join, you'll need to select the `Use Multi-Stage Engine` before clicking on `RUN QUERY`.
To perform complex queries such as joins, open the Pinot console [here](http://localhost:9000/#/query) and enable `Use Multi-Stage Engine`. Example query:

```sql
select
  r.rating latest_rating,
  m.rating initial_rating,
  m.title,
  m.genres,
  m.releaseYear
from movies m
left join movie_ratings r on m.movieId = r.movieId
where r.rating > .9
order by r.rating desc
limit 10
```

You should see a result similar to this:

![Query results](./images/results.png)


## Clean Up

To destroy the demo, run the command below.
To stop and remove all services related to the demonstration, run:

```bash
docker compose down
```

## Trouble Shooting

If you get "No space left on device" when executing docker build.
## Troubleshooting

```docker system prune -f```
If you encounter "No space left on device" during the Docker build process, you can free up space with:

```bash
docker system prune -f
```
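
To see where the space went before (or after) pruning, plain `df` reports host filesystem usage; when the Docker daemon is running, `docker system df` breaks usage down by images, containers, and volumes. A generic sketch:

```shell
# Host filesystem usage for the current directory's filesystem.
# For the Docker-specific breakdown (requires a running daemon): docker system df
df -h .
```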

## Getting Started
## Further Reading

Get started for yourself by visiting StarTree developer page [here](https://dev.startree.ai/docs/pinot/getting-started/quick-start)
For more detailed tutorials and documentation, visit the StarTree developer page [here](https://dev.startree.ai/).
32 changes: 18 additions & 14 deletions docker-compose.yml
@@ -1,13 +1,11 @@
version: "3.7"
version: "3.8"

services:

producer:
build:
context: docker/producer
container_name: producer
environment:
- SCHEMA=/code/schema.json
- BOOTSTRAPSERVER=kafka:9092
- TOPIC=movie_ratings
- DATA=/tmp/movies.json
@@ -16,7 +14,7 @@ services:
- kafka
volumes:
- ./data/:/tmp/

pinot-zookeeper:
image: zookeeper:latest
container_name: pinot-zookeeper
@@ -25,7 +23,7 @@
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000

pinot-controller:
image: apachepinot/pinot:1.1.0-21-openjdk
command: "StartController -zkAddress pinot-zookeeper:2181"
@@ -39,10 +37,11 @@
depends_on:
- pinot-zookeeper
healthcheck:
test: [ "CMD", "curl", "-f", "http://localhost:9000/health" ]
test: [ "CMD-SHELL", "curl -f http://localhost:9000/health || exit 1" ]
interval: 30s
timeout: 10s
retries: 10
retries: 3
start_period: 10s
volumes:
- ./table/:/tmp/pinot/table/
- ./data/:/tmp/pinot/data/
@@ -57,12 +56,14 @@
environment:
JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log"
depends_on:
- pinot-controller
pinot-controller:
condition: service_healthy
healthcheck:
test: [ "CMD", "curl", "-f", "http://localhost:8099/health" ]
test: [ "CMD-SHELL", "curl -f http://localhost:8099/health || exit 1" ]
interval: 30s
timeout: 10s
retries: 10
retries: 3
start_period: 10s

pinot-server:
image: apachepinot/pinot:1.1.0-21-openjdk
@@ -74,14 +75,16 @@
environment:
JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log"
depends_on:
- pinot-broker
pinot-broker:
condition: service_healthy

kafka:
image: docker.io/bitnami/kafka:3.6
hostname: kafka
container_name: kafka
ports:
- "9092:9092"
- "29092:29092"
healthcheck:
test: [ "CMD", "nc", "-z", "localhost", "9092" ]
interval: 5s
@@ -93,8 +96,9 @@
- KAFKA_CFG_PROCESS_ROLES=controller,broker
- KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
# Listeners
- KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
- KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://:9092
- KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
- KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,PLAINTEXT_HOST://:29092
- KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://:9092,PLAINTEXT_HOST://localhost:29092
- KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
- KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
- KAFKA_CFG_INTER_BROKER_LISTENER_NAME=PLAINTEXT

2 changes: 1 addition & 1 deletion docker/producer/Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.11.1
FROM python:3.12.3-slim

RUN apt-get update -y && \
apt-get install -y librdkafka-dev