General Improvements and Refinements across the quick start (#1)
General improvements and refinements across the quickstart:

- Code clean-up: removed warnings and streamlined the codebase.
- Updated Python image: refreshed the producer's Python base image.
- Enhanced health checks: added health checks for the Pinot containers to better ensure their reliability.
- Documentation: refactored the README to make it more comprehensive and easier to follow, and fixed several typographical errors.
gAmUssA authored Apr 16, 2024
1 parent 778e347 commit 4dfb97e
Showing 8 changed files with 133 additions and 107 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/makefile.yml
Original file line number Diff line number Diff line change
@@ -40,12 +40,12 @@ jobs:
- name: Create Pinot Tables
run: make tables

- name: Import Data
run: make import

- name: Validate that cluster is Up and Schemas are deployed
run: make validate

- name: Import Data
run: make import

- name: Teardown
run: make destroy

7 changes: 6 additions & 1 deletion .gitignore
@@ -1,2 +1,7 @@
logs
.venv
# Python virtual environments
venv/
myenv/
*.venv
env/
ENV/
5 changes: 4 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
base: create tables import
base: create tables import info

schema:
docker run \
@@ -90,6 +90,9 @@ validate:
exit 1; \
fi

info:
@printf "🍷 Pinot Query UI - \033[4mhttp://localhost:9000\033[0m\n"

destroy:
docker compose down -v
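
The new `info` target prints the console URL using ANSI escape codes. As a quick sanity check outside `make`, the same `printf` can be run directly in a shell (a minimal sketch; the URL is the quick start's own Pinot Query UI):

```shell
# \033[4m turns on underlining; \033[0m resets all attributes.
# Passing the URL via %s keeps it out of the format string.
printf '🍷 Pinot Query UI - \033[4m%s\033[0m\n' "http://localhost:9000"
```

In a Makefile the leading `@` only suppresses command echo; the escape handling is identical.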

125 changes: 71 additions & 54 deletions README.md
@@ -1,6 +1,25 @@
# Pinot Getting Started

This repository gets you started with Apache Pinot. It loads two sources of data: a real-time stream of movie ratings and a batch source of movies. The two data sets can be joined together in Apache Pinot.
# Pinot Getting Started Guide

Welcome to the Apache Pinot Getting Started guide.
This repository will help you set up and run a demonstration that involves streaming and batch data sources.
The demonstration includes a real-time stream of movie ratings and a batch data source of movies, which can be joined in Apache Pinot for querying.

<!-- TOC -->
* [Pinot Getting Started Guide](#pinot-getting-started-guide)
* [Architecture Diagram](#architecture-diagram)
* [A Quick Shortcut](#a-quick-shortcut)
* [Step-by-Step Details](#step-by-step-details)
* [Step 1: Build and Launch with Docker](#step-1-build-and-launch-with-docker)
* [Step 2: Create a Kafka Topic](#step-2-create-a-kafka-topic)
* [Step 3: Configure Pinot Tables](#step-3-configure-pinot-tables)
* [Step 4: Load Data into the Movies Table](#step-4-load-data-into-the-movies-table)
* [Step 5: Apache Pinot Advanced Usage](#step-5-apache-pinot-advanced-usage)
* [Clean Up](#clean-up)
* [Troubleshooting](#troubleshooting)
* [Further Reading](#further-reading)
<!-- TOC -->

## Architecture Diagram

```mermaid
flowchart LR
Expand All @@ -14,47 +33,45 @@ p-->mrp[Movie Ratings]
p-->Movies
```

## Just Run It
## A Quick Shortcut

Use `make` to just see the demonstration run. Run the command below. To delve into the setup, go to [step by step](#step-by-step) section.
To quickly see the demonstration in action, you can use the following command:

```bash
make base
make
```

Skip to the [Apache Pinot](#apache-pinot) section to run the `multi-stage` join between the ratings and movies table.
For a detailed step-by-step setup, please refer to the [Step-by-Step Details](#step-by-step-details) section.

## Step-By-Step Details
If you're ready to explore the advanced features, jump directly to the [Apache Pinot Advanced Usage](#step-5-apache-pinot-advanced-usage) section to run a multi-stage join between the ratings and movies tables.

This section is a step-by-step outline of the process to get this demonstration running. It describes the steps in more detail.
## Step-by-Step Details

### Step 1 - Build and Compose Up with Docker
This section provides detailed instructions to get the demonstration up and running from scratch.

Apache Pinot's can query real-time streaming data flowing through streaming platforms like Apache Kafka.
### Step 1: Build and Launch with Docker

To mock streaming data, this quick start has a built-in stream producer that writes to Kafka using Python. All Python-related details for this producer can be found in its [Dockerfile](docker/producer/Dockerfile).
Apache Pinot queries real-time data through streaming platforms like Apache Kafka.
This setup includes a mock stream producer using Python to write data into Kafka.

Build the producer image and start all the services by running these commands.
First, build the producer image and start all services using the following commands:

```bash
docker compose build --no-cache

docker compose up -d
```

The [docker-compose](./docker-compose.yml) file starts up these containers:

- Dedicated Zookeeper for Pinot
- Pinot Controller
- Pinot Broker
- Pinot Server
- Kraft "Zookeeperless" Kafka
- The python producer
The `docker-compose.yml` file configures the following services:

- Zookeeper (dedicated to Pinot)
- Pinot Controller, Broker, and Server
- Kraft (Zookeeperless Kafka)
- Python producer

### Step 2 - Create a Kafka Topic
### Step 2: Create a Kafka Topic

Create the Kafka topic for the producer to write into and for the Pinot table to read from.
Next, create a Kafka topic for the producer to send data to, which Pinot will then read from:

```bash
docker exec -it kafka kafka-topics.sh \
@@ -63,7 +80,7 @@ docker exec -it kafka kafka-topics.sh \
--topic movie_ratings
```

At this point, the producer should be sending data to a topic in Kafka called `movie_ratings`. You can test this by running the command below.
To verify the stream, check the data flowing into the Kafka topic:

```bash
docker exec -it kafka \
@@ -72,14 +89,15 @@ docker exec -it kafka \
--topic movie_ratings
```
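
Each record on the topic is a small JSON document. As a rough illustration only (the field names below are hypothetical, not taken from the repo's schema), a rating event might look like the output of:

```shell
# Emit one illustrative rating event; movieId/rating/ratingTime are
# assumed field names for this sketch, not the repo's actual schema.
movie_id=42
rating=0.95
printf '{"movieId": %s, "rating": %s, "ratingTime": %s}\n' \
  "$movie_id" "$rating" "$(date +%s)"
```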

### Step 3 - Create the Pinot Tables
### Step 3: Configure Pinot Tables

There are two tables we need to create in Pinot:
In Pinot, create two types of tables:

- A REALTIME table called `movie_ratings`.
- An OFFLINE table called `movies`.
1. A REALTIME table for streaming data (`movie_ratings`).
2. An OFFLINE table for batch data (`movies`).

To query the Kafka topic in Pinot, we add the real-time table using the `pinot-admin` CLI, providing it with a [schema](./table/ratings.schema.json) and a [table configuration](./table/ratings.table.json). The table configuration contains the connection information to Kafka.
To query the Kafka topic in Pinot, we add the real-time table using the `pinot-admin` CLI, providing it with a [schema](./table/ratings.schema.json) and a [table configuration](./table/ratings.table.json).
The table configuration contains the connection information to Kafka.
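
Before registering the table, it can be worth sanity-checking the configuration locally. The fragment below is illustrative, not the repo's actual file — the `stream.kafka.*` entries are standard Pinot stream-config keys, and `python3 -m json.tool` simply validates that the JSON parses:

```shell
# Validate an illustrative streamConfigs fragment; prints "valid JSON" on success.
cat <<'EOF' | python3 -m json.tool > /dev/null && echo "valid JSON"
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "movie_ratings",
    "stream.kafka.broker.list": "kafka:9092"
  }
}
EOF
```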

```bash
docker exec -it pinot-controller ./bin/pinot-admin.sh \
@@ -101,61 +119,60 @@ docker exec -it pinot-controller ./bin/pinot-admin.sh \
-exec
```

Once added, the OFFLINE table will not have any data. Let's add data in the next step.
Once added, the OFFLINE table will not have any data.
Let's add data in the next step.

### Step 4 - Load the Movies Table

We again leverage the `pinot-admin.sh` CLI to load data into an OFFLINE table.
### Step 4: Load Data into the Movies Table

Use the following command to load data into the OFFLINE movies table:

```bash
docker exec -it pinot-controller ./bin/pinot-admin.sh \
LaunchDataIngestionJob \
-jobSpecFile /tmp/pinot/table/jobspec.yaml
```

In this command, we use a YAML [file](table/jobspec.yaml) that provides the specification for loading the [movies data](data/movies.json). Once this job is completed, you can query the movies table [here](http://localhost:9000/#/query?query=select+*+from+movies+limit+10&tracing=false&useMSE=false).

Now that you can query both the REALTIME and OFFLINE tables, you can perform a JOIN query in the next section.
Now, both the REALTIME and OFFLINE tables are queryable.

## Apache Pinot
### Step 5: Apache Pinot Advanced Usage

Click to open the Pinot console [here](http://localhost:9000/#/query). To perform a join, you'll need to select the `Use Multi-Stage Engine` before clicking on `RUN QUERY`.
To perform complex queries such as joins, open the Pinot console [here](http://localhost:9000/#/query) and enable `Use Multi-Stage Engine`. Example query:

```sql
select
  r.rating latest_rating,
  m.rating initial_rating,
  m.title,
  m.genres,
  m.releaseYear
from movies m
left join movie_ratings r on m.movieId = r.movieId
where r.rating > .9
order by r.rating desc
limit 10
```

You should see a result similar to this:

![Query results](./images/results.png)


## Clean Up

To destroy the demo, run the command below.
To stop and remove all services related to the demonstration, run:

```bash
docker compose down
```

## Trouble Shooting

If you get "No space left on device" when executing docker build.
## Troubleshooting

```docker system prune -f```
If you encounter "No space left on device" during the Docker build process, you can free up space with:

```bash
docker system prune -f
```
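
To see where the space went before (or after) pruning, plain `df` reports host filesystem usage; when the Docker daemon is running, `docker system df` breaks usage down by images, containers, and volumes. A generic sketch:

```shell
# Host filesystem usage for the current directory's filesystem.
# For the Docker-specific breakdown (requires a running daemon): docker system df
df -h .
```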

## Getting Started
## Further Reading

Get started for yourself by visiting StarTree developer page [here](https://dev.startree.ai/docs/pinot/getting-started/quick-start)
For more detailed tutorials and documentation, visit the StarTree developer page [here](https://dev.startree.ai/).
32 changes: 18 additions & 14 deletions docker-compose.yml
@@ -1,13 +1,11 @@
version: "3.7"
version: "3.8"

services:

producer:
build:
context: docker/producer
container_name: producer
environment:
- SCHEMA=/code/schema.json
- BOOTSTRAPSERVER=kafka:9092
- TOPIC=movie_ratings
- DATA=/tmp/movies.json
@@ -16,7 +14,7 @@ services:
- kafka
volumes:
- ./data/:/tmp/

pinot-zookeeper:
image: zookeeper:latest
container_name: pinot-zookeeper
@@ -25,7 +23,7 @@
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000

pinot-controller:
image: apachepinot/pinot:1.1.0-21-openjdk
command: "StartController -zkAddress pinot-zookeeper:2181"
@@ -39,10 +37,11 @@
depends_on:
- pinot-zookeeper
healthcheck:
test: [ "CMD", "curl", "-f", "http://localhost:9000/health" ]
test: [ "CMD-SHELL", "curl -f http://localhost:9000/health || exit 1" ]
interval: 30s
timeout: 10s
retries: 10
retries: 3
start_period: 10s
volumes:
- ./table/:/tmp/pinot/table/
- ./data/:/tmp/pinot/data/
@@ -57,12 +56,14 @@
environment:
JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-broker.log"
depends_on:
- pinot-controller
pinot-controller:
condition: service_healthy
healthcheck:
test: [ "CMD", "curl", "-f", "http://localhost:8099/health" ]
test: [ "CMD-SHELL", "curl -f http://localhost:8099/health || exit 1" ]
interval: 30s
timeout: 10s
retries: 10
retries: 3
start_period: 10s

pinot-server:
image: apachepinot/pinot:1.1.0-21-openjdk
@@ -74,14 +75,16 @@
environment:
JAVA_OPTS: "-Dplugins.dir=/opt/pinot/plugins -Xms4G -Xmx16G -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xloggc:gc-pinot-server.log"
depends_on:
- pinot-broker
pinot-broker:
condition: service_healthy

kafka:
image: docker.io/bitnami/kafka:3.6
hostname: kafka
container_name: kafka
ports:
- "9092:9092"
- "29092:29092"
healthcheck:
test: [ "CMD", "nc", "-z", "localhost", "9092" ]
interval: 5s
@@ -93,8 +96,9 @@
- KAFKA_CFG_PROCESS_ROLES=controller,broker
- KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
# Listeners
- KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
- KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://:9092
- KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
- KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,PLAINTEXT_HOST://:29092
- KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://:9092,PLAINTEXT_HOST://localhost:29092
- KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
- KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
- KAFKA_CFG_INTER_BROKER_LISTENER_NAME=PLAINTEXT

2 changes: 1 addition & 1 deletion docker/producer/Dockerfile
@@ -1,4 +1,4 @@
FROM python:3.11.1
FROM python:3.12.3-slim

RUN apt-get update -y && \
apt-get install -y librdkafka-dev