This dataset and repository consist of all Netflix original films released as of June 1st, 2021, including all Netflix documentaries and specials. The data was web-scraped from this Wikipedia page and then merged with a dataset of the corresponding IMDB scores. IMDB scores are voted on by community members, and the majority of the films have 1,000+ reviews. The dataset consists of: Title, Genre, Premiere date, IMDB score, Runtime, and Language.
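A minimal sketch of a first exploration step on this dataset, using a small in-memory sample with the columns described above (the titles, values, and column spellings here are illustrative assumptions, not the real scraped data):

```python
import pandas as pd

# Small stand-in sample; the real project would load the scraped/merged CSV
sample = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Genre": ["Documentary", "Drama", "Documentary"],
    "Premiere": ["2020-01-10", "2021-03-05", "2019-07-22"],
    "IMDB Score": [7.1, 6.4, 8.0],
    "Runtime": [90, 110, 85],
    "Language": ["English", "English", "Spanish"],
})

# A typical first question: average IMDB score per genre
avg_by_genre = sample.groupby("Genre")["IMDB Score"].mean()
print(avg_by_genre)
```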
This repository looks at football, carrying out a range of activities with football data, including exploratory data analysis, data visualization, and many other topics. It consists mainly of Jupyter Notebooks written in Python.
This is a natural language processing problem in which sentiment analysis is performed by separating positive tweets from negative tweets, using classification, text mining, text analysis, data analysis, and data visualization.
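A common baseline for this kind of tweet classification is TF-IDF features fed into a linear classifier; the sketch below uses scikit-learn on a tiny made-up corpus (the tweets, labels, and choice of logistic regression are assumptions for illustration, not the project's actual data or model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; the real project uses a labelled tweet dataset
tweets = ["I love this movie", "great day so far", "this is awful", "worst service ever"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TF-IDF vectorization + logistic regression as a simple sentiment baseline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)

print(model.predict(["what a great film"]))
```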
Power BI Sales Dashboard for Global Super Store • The project involves creating an interactive Power BI Sales Dashboard using Global_super_store sales data.
• The ETL process was performed to clean and transform the data using Power Query.
• DAX was used for creating calculated measures and calculated columns.
• Visualizations and reports were created using cards, charts and slicers to provide insights and easy understanding for end users.
• The tools used were Microsoft Power BI and MS Excel.
The Data Science Job Salaries dataset contains 11 columns, including:
• work_year: The year the salary was paid.
• experience_level: The experience level in the job during the year.
• employment_type: The type of employment for the role.
• job_title: The role worked in during the year.
• salary: The total gross salary amount paid.
• employee_residence: Employee's primary country of residence during the work year, as an ISO 3166 country code.
• remote_ratio: The overall amount of work done remotely.
• company_location: The country of the employer's main office or contracting branch.
• company_size: The median number of people that worked for the company during the year.
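A quick sketch of how one might start exploring these columns with pandas, using a small in-memory sample mirroring the schema above (the rows and values are invented; the real project would read the dataset's CSV file):

```python
import pandas as pd

# Stand-in sample with the documented columns; values are illustrative only
sample = pd.DataFrame({
    "work_year": [2020, 2021, 2021],
    "experience_level": ["MI", "SE", "EN"],
    "employment_type": ["FT", "FT", "PT"],
    "job_title": ["Data Scientist", "ML Engineer", "Data Analyst"],
    "salary": [85000, 120000, 40000],
    "employee_residence": ["US", "DE", "IN"],
    "remote_ratio": [100, 50, 0],
    "company_location": ["US", "DE", "IN"],
    "company_size": ["M", "L", "S"],
})

# A natural first cut: average salary by experience level
salary_by_level = sample.groupby("experience_level")["salary"].mean()
print(salary_by_level)
```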
Here are the things I have done.
• Basics of Apache Spark (architecture, transformations, actions, lazy evaluation)
• Creating a Databricks account and learning its basics
• The Structured API and how to write transformation functions
• Using SQL to analyze IPL data
• Building visualizations to gain more insights
The goal of this project is to give you an overall understanding of Apache Spark and its different functions for writing transformation blocks. On top of that, you will learn to use SQL to analyze data and build visualizations.
Data Loading and Exploration: Imported the necessary libraries and loaded the dataset from a CSV file. Explored the dataset with the head(), info(), and describe() methods and the shape attribute to understand its structure and summary statistics.
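The exploration step above might look like the following sketch; a small stand-in frame replaces the real CSV load (the filename and rows are assumptions):

```python
import pandas as pd

# Stand-in for pd.read_csv("loan_data.csv") — filename is hypothetical
df = pd.DataFrame({
    "Gender": ["Male", "Female", None],
    "ApplicantIncome": [5000, 3000, 4000],
    "LoanAmount": [130.0, None, 120.0],
    "Loan_Status": ["Y", "N", "Y"],
})

print(df.head())       # first rows
print(df.shape)        # (rows, columns)
print(df.describe())   # summary statistics for numeric columns
```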
Missing Value Handling: Identified missing values using isnull().sum(). Filled missing values in categorical columns (e.g., Gender, Married) with the mode, and in numerical columns (e.g., LoanAmount, Loan_Amount_Term) with the mean or mode as appropriate.
Feature Engineering: Created new features such as TotalIncome by summing ApplicantIncome and CoapplicantIncome. Transformed skewed data using logarithmic scaling (LoanAmount_log and TotalIncome_log).
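The imputation and feature-engineering steps above can be sketched as follows, on a small stand-in frame (the rows are invented; the real project applies the same operations to the full loan dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "CoapplicantIncome": [0, 1500, 0, 2000],
    "LoanAmount": [130.0, np.nan, 120.0, 200.0],
})

print(df.isnull().sum())  # identify missing values per column

# Categorical column: fill with the mode; numerical column: fill with the mean
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

# New feature plus log transforms to reduce skew
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["LoanAmount_log"] = np.log(df["LoanAmount"])
df["TotalIncome_log"] = np.log(df["TotalIncome"])

print(df[["TotalIncome", "LoanAmount_log", "TotalIncome_log"]])
```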
Data Visualization: Used histograms and boxplots to visualize the distribution of ApplicantIncome, CoapplicantIncome, LoanAmount, and their logarithmic transformations. Examined the relationship between Credit_History and Loan_Status using cross-tabulation.
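A sketch of the visualization step: a histogram of applicant income and a cross-tabulation of Credit_History against Loan_Status (the sample rows are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000, 6000],
    "Credit_History": [1.0, 0.0, 1.0, 1.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Histogram of income; boxplots follow the same pattern with kind="box"
df["ApplicantIncome"].plot(kind="hist", bins=10)
plt.savefig("income_hist.png")

# Relationship between credit history and loan approval
ct = pd.crosstab(df["Credit_History"], df["Loan_Status"])
print(ct)
```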
Data Preparation: Selected relevant features for model training and separated the target variable (Loan_Status). Split the data into training and testing sets using train_test_split. Encoded categorical variables into numerical values using LabelEncoder.
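The preparation step above can be sketched as follows; the feature columns and rows are stand-ins for the real loan data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"] * 5,
    "Credit_History": [1.0, 0.0, 1.0, 1.0] * 5,
    "Loan_Status": ["Y", "N", "Y", "Y"] * 5,
})

# Encode categorical variables into numerical values
le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])
df["Loan_Status"] = le.fit_transform(df["Loan_Status"])

# Separate features from the target, then split into train and test sets
X = df[["Gender", "Credit_History"]]
y = df["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```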
Model Training and Evaluation: Applied the Naive Bayes Classifier to train the model on the training set. Evaluated the model's performance on the test set, likely calculating metrics such as accuracy, precision, recall, and F1-score (though the evaluation part isn't explicitly mentioned in the provided code).
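Since the evaluation code isn't shown in the project, here is a hedged sketch of how the Naive Bayes training and scoring step might look, using synthetic data in place of the encoded loan features (the dataset, GaussianNB variant, and metric choice are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the encoded loan features and target
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train the Naive Bayes classifier and evaluate on the held-out test set
model = GaussianNB()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

precision_score, recall_score, and f1_score from sklearn.metrics slot in the same way as accuracy_score here.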