A lightweight tool based on sweetviz that generates high-density visualizations to kickstart Exploratory Data Analysis within Amazon Redshift using pyodbc, with just one line of code.
Copy the `main.py` script and install the requirements located in the `dist` folder.
```
pip install -r requirements.txt
```
We will also need to download and install the ODBC Driver for Amazon Redshift.
Download Amazon Redshift ODBC Driver
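For reference, a DSN-less pyodbc connection to Redshift looks roughly like the sketch below. The driver name and credentials are assumptions: the installed ODBC driver may register under a different name depending on platform and version, so check your ODBC data source administrator for the exact string.

```python
import pyodbc

# Minimal sketch of a DSN-less connection. "Amazon Redshift (x64)" is an
# assumed driver name; the installer may register a different one on your
# platform. Credentials below are the placeholder examples from this README.
conn = pyodbc.connect(
    "Driver={Amazon Redshift (x64)};"
    "Server=komodo-cluster-3000.abcdefghijkl.us-east-2.redshift.amazonaws.com;"
    "Port=5439;"
    "Database=dev;"
    "UID=awsuser;"
    "PWD=S3cUr3P@S$w0rD;"
)
```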
| Positional argument | Example/Description |
|---|---|
| server | komodo-cluster-3000.abcdefghijkl.us-east-2.redshift.amazonaws.com |
| user | awsuser |
| password | specifies the user password |
| database | dev, sample_data_dev |
| schema | public |
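Putting the positional arguments together, the general invocation takes this shape (inferred from the examples later in this section):

```
python main.py <server> <user> <password> <database> <schema> [options]
```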
| Option | Example/Description |
|---|---|
| -h, --help | show this help message and exit |
| -r, --rows | specifies the number of rows to sample from each table (default: 500000) |
| -l, --level | specifies the database object level at which the analysis should be executed, "s" for schema and "t" for table (default: "s") |
| -t, --table | specifies the database table name |
| --associations | indicates that a correlation graph should be generated |
| --open-browser | indicates that a web browser tab should be opened while datasets are analyzed |
By default, the script loads and analyzes the specified number of rows from each table in the selected database schema.
```
python main.py komodo-cluster-3000.abcdefghijkl.us-east-2.redshift.amazonaws.com awsuser S3cUr3P@S$w0rD dev public -r=10000
```
The program builds and saves high-density HTML visualizations locally, and generates an Excel summary with table name, table rows, data size, table index size, and parsed record count in a new folder called `obj`.
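Conceptually, the schema-level pass boils down to something like the following sketch. This is a simplification, not the actual `main.py`: the table-listing query, the sampling query, and the sweetviz calls are assumed building blocks.

```python
import pandas as pd
import sweetviz as sv

# Assumes `conn` is an open pyodbc connection (see the earlier snippet).
tables = pd.read_sql(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'public' AND table_type = 'BASE TABLE';",
    conn,
)

for table in tables["table_name"]:
    # Pull up to the requested number of rows from each table.
    df = pd.read_sql(f"SELECT * FROM public.{table} LIMIT 10000;", conn)
    # Build the high-density report and write it to the output folder.
    report = sv.analyze(df)
    report.show_html(f"obj/{table}.html", open_browser=False)
```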
If we need a correlation graph to be generated for the columns of each table, we must include the `--associations` flag.
```
python main.py komodo-cluster-3000.abcdefghijkl.us-east-2.redshift.amazonaws.com awsuser S3cUr3P@S$w0rD dev public -r=10000 --associations
```
Keep in mind that computing correlations and other associations may take quadratic time (n² in the number of columns) to complete.
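Under the hood, this flag presumably maps to sweetviz's pairwise-analysis switch; the snippet below is an assumption about how `main.py` wires it up, not a copy of its code.

```python
# pairwise_analysis="on" forces sweetviz to compute the associations graph;
# its cost grows roughly quadratically with the number of columns.
report = sv.analyze(df, pairwise_analysis="on")
report.show_html("obj/sales.html", open_browser=False)
```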
If we only need the analysis for a single table, we must specify "t" as the `-l` or `--level` argument value, with the corresponding table name in the `-t` or `--table` argument.
```
python main.py komodo-cluster-3000.abcdefghijkl.us-east-2.redshift.amazonaws.com awsuser S3cUr3P@S$w0rD dev public -r=500000 -l=t -t=sales
```
Finally, let's take into consideration that, unlike Snowflake or Azure SQL Server, Amazon Redshift does not support `TABLESAMPLE`. For this reason, the present project makes use of `RANDOM()`, which is a more expensive computation.
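For illustration, the random sampling presumably relies on a query shaped like the one below (an assumed query, not necessarily the exact one in `main.py`). Sorting the whole table by `RANDOM()` forces a full scan and sort, which is what makes it costlier than a native `TABLESAMPLE`.

```python
# Redshift has no TABLESAMPLE, so sample by sorting on RANDOM() and limiting.
# This scans and sorts the entire table before returning the sample.
query = (
    "SELECT * FROM public.sales "
    "ORDER BY RANDOM() "
    "LIMIT 500000;"
)
df = pd.read_sql(query, conn)
```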
Amazon Redshift Statistics Descriptor was tested with:
- Python: 3.7.16
- Packages:
  - pyodbc: 4.0.39
  - pandas: 1.3.5
  - sweetviz: 2.1.4
  - XlsxWriter: 3.1.0
- Anaconda: 2.4.0
This project is licensed under the MIT License.