Hi there!
I'm learning to code, focusing on data analytics and data science, and aiming to be hired by mid-2021.
This repository is for the data analysis course I enrolled in in January 2021.
The course curriculum includes the following technologies and topics that I have worked through:
- Python
- Pandas
- NumPy
- Seaborn
- Google APIs
- Git
- Airflow
- SQL
- ClickHouse
- PostgreSQL
- Redash
- Superset
- Statistics
- A/B tests
- Bootstrapping
- Amplitude
- Tableau
- DAU, MAU, ARPU, LTV, Retention, CR and other metrics
- Product Development basics
- Product Management basics
- Soft skills
List of projects:
- Taxi in NYC -- analyzing NYC taxi orders with Pandas. The read_csv, rename, groupby, agg, query, sort_values, idxmax, idxmin, value_counts and pivot methods were used for Exploratory Data Analysis.
- Hotel Bookings -- analyzing hotel bookings with Pandas. The read_csv, info, rename, groupby, agg, query, sort_values, idxmax, idxmin, value_counts and pivot methods were used for Exploratory Data Analysis. Customer churn rate was also calculated.
- User Logs -- analyzing customer data: finding the most popular platform and the most active users, and visualizing the data with Seaborn's distplot, barplot and countplot methods.
- Taxi in Peru -- analyzing taxi orders in Peru with Pandas. An Exploratory Data Analysis was performed; drivers' scores, passengers' scores, DAU and MAU metrics were calculated and plotted with Seaborn.
- Raw Data Handling -- creating a dataframe from a set of CSV files stored in various folders. Practicing Python skills to automate data handling.
- Retail in Germany -- calculating basic sales statistics for clients from Germany using a dataset of purchases made by European clients. The duplicated, drop_duplicates, groupby, agg, query, sort_values, assign, quantile and str methods were used for Exploratory Data Analysis.
- Error in Transactions Data -- finding and correcting an error while analyzing a dataset of transactions. Plotting data on a logarithmic scale, converting data to datetime format, and using the describe, isna, sum, value_counts, groupby, agg, query, sort_values, rename, min, max and pivot methods for Exploratory Data Analysis.
- Avocado Price -- comparing average, simple moving average and exponentially weighted average avocado prices. Categorizing and labeling delay data. Plotting the results with subplots and interactive Plotly charts.
- Ads Campaign -- plotting data on a logarithmic scale to identify the type of distribution and finding the ad_id with an anomalous number of views. Comparing average and simple moving average view counts. Calculating the conversion rate (CR) from client registration to publishing an ad. Categorizing and labeling client registration data. Plotting the results with an interactive Plotly chart.
- Visits by Browser -- analyzing website visits: determining the proportion of visits made by real users versus bots and finding the most popular browser for each group. Plotting the results as bar charts, downloading data via the Google Docs API and merging it into the dataframe. The read_csv, groupby, agg, query, sort_values, pivot, fillna, assign and merge methods were used for Exploratory Data Analysis.
- Telegram Bot Airflow Reporting -- reading advertising campaign data from a Google Docs spreadsheet, creating a pandas dataframe to calculate clicks, views, CTR and money spent on the campaign, calculating the day-by-day change of these metrics, writing a report with the results to a txt file and sending the file via a Telegram bot to your mobile phone. The script is executed by Airflow every Monday at 12:00 p.m. (a minimal DAG sketch is given after the project list).
- SQL Tasks -- SQL exercises completed while taking this data analysis course. ClickHouse (via Tabix) was used to solve the tasks.
- NYC taxi & timeit optimization -- calculating distance of a ride using pick-up and drop-off coordinates. Compared a couple of ways to apply distance calculation to the dataframe. The optimization helped to decrease calculation run-time about 3276 times! Checked calculation results, found outliers using boxplot graphs and descriptive statistics. Fixed dataframe by removing outliers and found the cost of the longest ride.
- Bikes rent in Chicago -- converting dates to datetime format, resampling data to aggregate by day, automatically merging data from separate files into one dataframe using os.walk(), breaking bike rentals down by user type, and finding the most popular destination points overall and by day of the week.
- Bookings in London -- used Pandahouse and SQL queries to import data from ClickHouse into a pandas dataframe. Processed the imported data and performed Exploratory Data Analysis. Built scatterplots, distplots, lineplots and heatmaps using Seaborn and Matplotlib.
- Retail dashboard -- built a series of visualizations and a dashboard using SQL queries and Redash. Calculated and checked the dynamics of MAU and AOV. Found an anomaly in the data, identified the market generating the majority of revenue, and analyzed the most popular goods sold in the store. Wrote a dashboard summary with recommendations for boosting sales.
- Video games -- analyzing video game sales dynamics with Pandas. The read_csv, head, columns, dtypes, info, isna, dropna, describe, mode, shape, groupby, agg, sort_values, rename, index, to_list and value_counts methods were used for Exploratory Data Analysis. Barplots, boxplots and lineplots were used to graph the results.
- Ads conversion -- calculating CTR, CPC and CR metrics. Plotting them using the distplot, hist, displot and histplot methods.
- Yandex Music -- analyzing song popularity on a music streaming platform and comparing music preferences and listening patterns in Moscow and Saint Petersburg. Reading and cleaning data, renaming columns, removing duplicates, handling missing data, and slicing the dataframe to query the required portion of data.
- Bikes rent in London -- loading the dataset, plotting ride-count data, resampling timestamps, describing the main trends, and looking for anomalous values by smoothing the data with a simple moving average, calculating the difference between the raw and smoothed data, finding the standard deviation and defining a 99% confidence interval. Values are then compared against the confidence interval to find spikes in the data and explain them (a sketch of this approach is given after the project list).
- Delivery A/B -- finding out how a new navigation algorithm changed the service's delivery time. Formulating null and alternative hypotheses and performing an A/B test using a t-test (a minimal t-test sketch is given after the project list).
- App interface A/B -- testing how image aspect ratio and a new order button design influence the number of orders placed by customers. Performed Levene's test to check the equality of group variances, the Shapiro-Wilk test to check groups for normality, one-way ANOVA to check for a statistically significant difference between the tested groups, Tukey's test to find which groups differ significantly, and a linear-model multivariate analysis of variance. Visualized and interpreted the results and gave recommendations on whether to put the changes into production.
- Cars sales -- predicting car sale prices using linear regression models (statsmodels.api & statsmodels.formula.api). Finding statistically significant predictors.
- Bootstrap A/B -- comparing the results of the Mann-Whitney test and of bootstrapping the mean/median on data with and without outliers (a minimal bootstrap sketch is given after the project list).
- Mobile App A/A -- running an A/A test to check that the data-splitting system works correctly. The test initially failed (the FPR was greater than the significance level), so we had to dig into the data and find the reason for the malfunction. After removing the corrupted data, the A/A test passed.
- Taxi Churn -- performing Exploratory Data Analysis, defining churn, checking distributions for normality with the Shapiro-Wilk test, plotting data using Plotly, and A/B testing four different hypotheses with the Chi-squared test, Dunn's test and the Mann-Whitney U non-parametric test.
- A/B simulation -- performed a range of A/B tests to simulate how sample size and the magnitude of the difference between samples influence A/B test performance. Investigated situations in which a false positive error could occur. Gained valuable lessons about running A/B tests.
- Sales Monthly Overview -- Tableau Public dashboard consisting of KPIs, a line chart, a bar chart, and a table by category with bar charts.
- Profit Monthly Overview -- Tableau Public dashboard consisting of KPIs, a line chart, a bar chart, a table by region with bar charts, and profit ratio by category with horizontal bar charts.
- Analytics Vacancies Overview -- Tableau Public dashboard consisting of a horizontal bar chart, a pie chart, a boxplot and a bubble chart.
- Sales Overview -- Tableau Public dashboard consisting of horizontal bar tables, sparklines, a KPI, line charts, and various filters and sortings to display the data.
- Airbnb Listings Analytics -- Tableau Public dashboard consisting of: a calculated occupation rate for rental properties; an analytical chart for choosing the best property by occupation rate, review score and price per night; a ranked table of the top 10 listings by calculated potential annual revenue; average price, average occupation rate and number-of-unique-listings KPIs; and filters by neighbourhood, occupation rate and number of reviews over the last twelve months.
- Metrics calculations -- Google Analytics data cleaning and calculation of the following metrics: number of unique users, conversion, average check, average purchases per user, ARPPU and ARPU (a short sketch of these calculations is given after the project list).
- Retention Analysis -- Tableau Public dashboard containing user retention and ARPU highlight tables.
- RFM analysis -- performed an RFM analysis, built an LTV heatmap and found insights about user segmentation (a compact RFM scoring sketch is given after the project list).
- Probabilities -- solving probability theory problems including AND/OR probabilities, Bernoulli trials and conditional probability (Bayes' theorem).
- Final project -- you're employed by a mobile game development company. A Product Manager gives you the following tasks: find and visualize retention, make a decision based on the A/B test data, and suggest a set of metrics to evaluate the results of the last monthly campaign.
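Code sketches (simplified illustrations of some techniques used in the projects above; data, column names, file paths and helper functions are hypothetical, not the exact project code):

A minimal sketch of the scheduling used in the Telegram Bot Airflow Reporting project, assuming Airflow 2.x; the report-building and sending logic is reduced to placeholders:

```python
# Minimal Airflow DAG sketch (Airflow 2.x assumed); task bodies, file paths
# and Telegram credentials are hypothetical placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def build_report():
    # Hypothetical: read campaign data, compute clicks, views, CTR and spend,
    # then write the day-by-day summary to a text file.
    with open('/tmp/campaign_report.txt', 'w') as f:
        f.write('campaign report placeholder')


def send_report():
    # Hypothetical: send the txt report to a chat via the Telegram Bot API.
    token = 'YOUR_BOT_TOKEN'
    chat_id = 'YOUR_CHAT_ID'
    with open('/tmp/campaign_report.txt', 'rb') as f:
        requests.post(
            f'https://api.telegram.org/bot{token}/sendDocument',
            data={'chat_id': chat_id},
            files={'document': f},
        )


with DAG(
    dag_id='campaign_report',
    start_date=datetime(2021, 1, 1),
    schedule_interval='0 12 * * 1',  # every Monday at 12:00
    catchup=False,
) as dag:
    build = PythonOperator(task_id='build_report', python_callable=build_report)
    send = PythonOperator(task_id='send_report', python_callable=send_report)
    build >> send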
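A sketch of the kind of speed-up explored in the NYC taxi & timeit project: a row-by-row apply of a haversine distance versus the same formula applied to whole NumPy columns at once (column names are hypothetical):

```python
# Haversine distance: row-wise apply vs. vectorized NumPy.
import numpy as np
import pandas as pd


def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers; works on scalars and NumPy arrays.
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))


df = pd.DataFrame({
    'pickup_latitude': [40.73, 40.75],
    'pickup_longitude': [-73.99, -73.98],
    'dropoff_latitude': [40.76, 40.70],
    'dropoff_longitude': [-73.97, -74.01],
})

# Slow: Python-level loop over rows.
df['distance_apply'] = df.apply(
    lambda r: haversine(r['pickup_latitude'], r['pickup_longitude'],
                        r['dropoff_latitude'], r['dropoff_longitude']),
    axis=1,
)

# Fast: the same formula evaluated on whole columns at once.
df['distance_vec'] = haversine(
    df['pickup_latitude'].values, df['pickup_longitude'].values,
    df['dropoff_latitude'].values, df['dropoff_longitude'].values,
)
```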
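The anomaly-detection idea from the Bikes rent in London project, sketched on a synthetic hourly ride-count series (the window size and data are hypothetical):

```python
# Flag values that fall outside a 99% confidence interval around a
# simple moving average (synthetic data).
import numpy as np
import pandas as pd

rides = pd.Series(
    np.random.poisson(100, 500),
    index=pd.date_range('2021-01-01', periods=500, freq='H'),
)

window = 24
smoothed = rides.rolling(window=window).mean()
deviation = rides - smoothed
std = deviation.std()

# 2.576 standard deviations correspond to a 99% confidence interval.
upper = smoothed + 2.576 * std
lower = smoothed - 2.576 * std

spikes = rides[(rides > upper) | (rides < lower)]
print(spikes.head())
```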
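A minimal version of the t-test used in the Delivery A/B project, with synthetic delivery times standing in for the real data:

```python
# Independent two-sample t-test on synthetic delivery times (minutes).
import numpy as np
from scipy import stats

np.random.seed(42)
control = np.random.normal(loc=40, scale=5, size=1000)    # old algorithm
treatment = np.random.normal(loc=38, scale=5, size=1000)  # new algorithm

t_stat, p_value = stats.ttest_ind(control, treatment)
alpha = 0.05
if p_value < alpha:
    print(f'p = {p_value:.4f}: reject H0, delivery time has changed')
else:
    print(f'p = {p_value:.4f}: fail to reject H0')
```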
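The resampling idea from the Bootstrap A/B project, sketched as a generic bootstrap of the difference in means or medians between two synthetic samples:

```python
# Bootstrap the difference of a statistic between two synthetic groups.
import numpy as np

np.random.seed(42)
group_a = np.random.exponential(scale=100, size=1000)
group_b = np.random.exponential(scale=110, size=1000)


def bootstrap_diff(a, b, stat=np.mean, n_iter=10_000):
    # Resample both groups with replacement and collect the statistic's difference.
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        diffs[i] = (stat(np.random.choice(b, size=len(b), replace=True))
                    - stat(np.random.choice(a, size=len(a), replace=True)))
    return diffs


diffs = bootstrap_diff(group_a, group_b, stat=np.median)
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f'95% CI for the median difference: [{ci_low:.2f}, {ci_high:.2f}]')
# If the interval does not contain 0, the difference is treated as significant.
```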
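The basic product metrics from the Metrics calculations project expressed in code; the event-level dataframe and its column names are hypothetical:

```python
# Conversion, ARPU, ARPPU and average check from a hypothetical events table.
import pandas as pd

events = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 4],
    'revenue': [0, 10, 0, 5, 15, 0],  # 0 means a visit without a purchase
})

total_users = events['user_id'].nunique()
payers = events.loc[events['revenue'] > 0, 'user_id'].nunique()
total_revenue = events['revenue'].sum()
purchases = (events['revenue'] > 0).sum()

conversion = payers / total_users      # share of users who paid
arpu = total_revenue / total_users     # average revenue per user
arppu = total_revenue / payers         # average revenue per paying user
avg_check = total_revenue / purchases  # average purchase amount

print(f'CR={conversion:.2%}, ARPU={arpu:.2f}, ARPPU={arppu:.2f}, check={avg_check:.2f}')
```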
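A compact sketch of quantile-based RFM scoring, in the spirit of the RFM analysis project (the orders table and column names are hypothetical):

```python
# RFM segmentation via quantile scoring (hypothetical orders table).
import pandas as pd

orders = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3, 4],
    'order_date': pd.to_datetime(['2021-01-05', '2021-02-01', '2021-01-20',
                                  '2021-01-02', '2021-01-15', '2021-02-10',
                                  '2021-02-12']),
    'amount': [50, 30, 200, 10, 20, 15, 500],
})

snapshot = orders['order_date'].max() + pd.Timedelta(days=1)
rfm = orders.groupby('user_id').agg(
    recency=('order_date', lambda d: (snapshot - d.max()).days),
    frequency=('order_date', 'count'),
    monetary=('amount', 'sum'),
)

# Score each dimension 1-3 by quantile (lower recency is better).
rfm['r'] = pd.qcut(rfm['recency'], 3, labels=[3, 2, 1])
rfm['f'] = pd.qcut(rfm['frequency'].rank(method='first'), 3, labels=[1, 2, 3])
rfm['m'] = pd.qcut(rfm['monetary'], 3, labels=[1, 2, 3])
rfm['segment'] = rfm['r'].astype(str) + rfm['f'].astype(str) + rfm['m'].astype(str)
print(rfm)
```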
I hope this repo helps you assess my coding, data analytics and SQL skills, or is simply fun to look through.
Feel free to contact me via nktn.lx@gmal.com
Follow me on Twitter: @nktn_lx
And here on GitHub: github.com/nktnlx