The aim of this project is to develop a tool that scrapes the full set of products listed on an e-commerce website and transforms the web view into a well-structured dataset. The tool should be able to scrape different products from the same website just by changing the keyword (the wordKey variable) defined at the beginning. The website used for this project is:
(Consulted on September 5th 2020. Results may change according to date and products available for that date)
The process followed by the project was:
- Definition of parameters: The website to scrape, the product to search for and the tools needed are defined.
- Page scraping: The relevant values are located within a single page of the website.
- Complete dataset scraping: The page scraping is repeated over the n pages of results available.
- Handling unstructured data: The raw text found is transformed into the appropriate columns.
- Data type definition: The columns are converted into their corresponding data type.
- Exploratory data analysis: An analysis is made from the information obtained from the created process.
The result of the project is the transformation of a grid view of products into a tidy dataset:
- View grid of products:
Grid view of products
- Tidy dataset:
Structured dataset
First, all the libraries needed for the project are loaded; the main scraping work is done with the Selenium library. The rest of the process relies on a few variables:
- Keyword: The wordKey variable holds the product that will be searched for on the website. In the case presented the keyword is "monitor", so the resulting dataset corresponds to the different monitor models for sale on the website.
- URL: The page selected for scraping, in this case the Newegg site. The scraping code has to change for every website, and it may also need to change if the source code of the website changes. The scraping was designed for the site's code as of September 5th, 2020.
- Path: The location on the computer of the Chrome webdriver (which must be downloaded beforehand).
# Libraries needed
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import time
import re
# Product to search in the website
wordKey = 'monitor'
# Website to scrape
url = 'https://www.newegg.com/global/mx-en/'
# Location of web driver
path = r'C:\Program Files (x86)\chromedriver.exe'
Page scraping means collecting all the data present in one page of the website. The steps are as follows:
- Open the webdriver and load the url defined in the parameters section
Page loaded in webdriver
- Search for the product selected in the wordKey variable
Search of the product
- Get the source code of the first page of results and scan it for specific elements, in this case: brand, price, rating, reviews and description.
Since only the first page is searched, we only get the first 36 products.
# Setting the web driver
driver = webdriver.Chrome(path)
# The url is entered into the webdriver
driver.get(url)
# The search box is found
search = driver.find_element_by_id("SearchBox2020")
# The word is entered into the search box
search.send_keys(wordKey)
search.send_keys(Keys.RETURN)
# The lists of elements to search for are created
titles = []
ratings = []
reviews = []
brands = []
prices = []
# Exception handling ensures the webdriver is closed even if the scan fails
try:
    # The loading wait is defined before the search
    app = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "app"))
    )
    # The source code of the items is found
    listed = app.find_element_by_class_name("list-wrap")
    items = listed.find_elements_by_class_name("item-container")
    # Every item in the source code is scanned for the elements we are looking for
    for item in items:
        title = item.find_element_by_class_name("item-title")
        titles.append(title.text)
        review = item.find_element_by_class_name("item-rating")
        reviews.append(review.text)
        rating = item.find_element_by_class_name("item-rating")
        ratings.append(rating.get_attribute("title"))
        divbrand = item.find_element_by_class_name("item-branding")
        abrand = divbrand.find_element_by_tag_name("a")
        brand = abrand.find_element_by_tag_name("img")
        brands.append(brand.get_attribute('title'))
        price = item.find_element_by_xpath("./div[contains(@class, 'item-action')]/ul/li[contains(@class, 'price-current')]/strong")
        prices.append(price.text)
finally:
    # The webdriver is closed
    driver.quit()
# A dataframe is created with the lists of elements
df = pd.DataFrame(list(zip(brands, prices, ratings, reviews, titles)), columns=['Brand','Price', 'Rating', 'Reviews', 'Description'])
df.head(10)
| | Brand | Price | Rating | Reviews | Description |
---|---|---|---|---|---|
0 | ASUS | 4,233 | Rating + 4 | (1,349) | ASUS TUF Gaming VG24VQ 24" Full HD 1920 x 1080... |
1 | ASUS | 3,787 | Rating + 5 | (76) | ASUS VG245H Black 24" 1ms (GTG) Widescreen 2x ... |
2 | GIGABYTE | 5,570 | Rating + 5 | (84) | GIGABYTE G27FC 27" 165Hz 1080P Curved Gaming M... |
3 | GIGABYTE | 5,570 | Rating + 5 | (44) | GIGABYTE G27F 27" 144Hz 1080P Gaming Monitor, ... |
4 | Acer America | 6,462 | Rating + 4 | (322) | Acer XG270HU omidpx 27" 2K 2560 x 1440 1ms 144... |
5 | GIGABYTE | 8,243 | Rating + 4 | (213) | GIGABYTE G32QC 32" (Actual size 31.5") WQHD 25... |
6 | MSI | 4,010 | Rating + 4 | (158) | MSI Optix G24C 24" (Actual size 23.6") Full HD... |
7 | BenQ | 2,673 | Rating + 5 | (14) | BenQ GL2480 24" Full HD 1920 x 1080 1ms (GTG) ... |
8 | ASUS | 8,912 | Rating + 4 | (147) | ASUS TUF GAMING VG27WQ 27" WQHD 2560 x 1440 (2... |
9 | Acer America | 2,852 | Rating + 4 | (132) | Acer Nitro Gaming Series VG220Q bmiix 22" (21.... |
# The number of elements is displayed as (# of rows, # of columns)
df.shape
(36, 5)
Once the process works for one page, it is time to repeat it for the total number of pages in the results. To do that, the "Page Scraping" section is encapsulated in a function for later iteration; in this function, try-except blocks handle values that are not available for an item. The process goes like this:
- The page is loaded and the wordKey is entered into the search box just as before.
- Before the elements are scanned for the dataset, the number of pages available for the product search is read from the bottom part of the page.
Number of pages
- Then the function that scans the information of all the items in the page is executed (findObjects function).
- The Next button is located in the bottom section of the page and clicked.
Next page button
- These steps are repeated n times, where n is the number of pages available minus 1.
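The page count and the number of clicks on the Next button follow directly from the pagination label. A minimal sketch, assuming the label reads like `1/34` (current page, a slash, total pages); `parse_page_count` and `clicks_needed` are hypothetical helper names used only for illustration, not part of Selenium:

```python
def parse_page_count(pagination_text):
    # The total number of pages is whatever follows the slash, e.g. '1/34'
    return int(pagination_text.split('/')[1].strip())


def clicks_needed(total_pages):
    # Reaching the last page from the first one takes total_pages - 1 clicks
    return max(total_pages - 1, 0)


total = parse_page_count('1/34')
print(total, clicks_needed(total))
```

Keeping this arithmetic in small functions makes it possible to check the loop bounds without opening a browser.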
# Setting the web driver
driver = webdriver.Chrome(path)
driver.get(url)
# The wordKey is searched
search = driver.find_element_by_id("SearchBox2020")
search.send_keys(wordKey)
search.send_keys(Keys.RETURN)
# The initial dataframe is created
df = pd.DataFrame()
# Function to iterate the process of finding lists of elements
def findObjects(app):
    titles = []
    ratings = []
    reviews = []
    brands = []
    prices = []
    listed = app.find_element_by_class_name("list-wrap")
    items = listed.find_elements_by_class_name("item-container")
    for item in items:
        # Each field falls back to a default value when the element is missing
        try:
            title = item.find_element_by_class_name("item-title")
            titles.append(title.text)
        except NoSuchElementException:
            titles.append('NA')
        try:
            review = item.find_element_by_class_name("item-rating")
            reviews.append(review.text)
        except NoSuchElementException:
            reviews.append(0)
        try:
            rating = item.find_element_by_class_name("item-rating")
            ratings.append(rating.get_attribute("title"))
        except NoSuchElementException:
            ratings.append('NA')
        try:
            divbrand = item.find_element_by_class_name("item-branding")
            abrand = divbrand.find_element_by_tag_name("a")
            brand = abrand.find_element_by_tag_name("img")
            brands.append(brand.get_attribute('title'))
        except NoSuchElementException:
            brands.append('NA')
        try:
            path2 = "./div[contains(@class, 'item-action')]/ul/li[contains(@class, 'price-current')]/strong"
            price = item.find_element_by_xpath(path2)
            prices.append(price.text)
        except NoSuchElementException:
            prices.append(0)
    data = list(zip(brands, prices, ratings, reviews, titles))
    dfTemp = pd.DataFrame(data, columns=['Brand','Price', 'Rating', 'Reviews', 'Description'])
    return dfTemp
# The search of lists is iterated n times, where n = number of pages - 1
try:
    app = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "app"))
    )
    numPages = app.find_element_by_class_name("list-tool-pagination-text")
    numPages = int(numPages.text.split('/')[1])
    print(numPages)
    time.sleep(5)
    # Each iteration clicks "Next" before scanning, so the scan starts on the second page of results
    for i in range(numPages - 1):
        time.sleep(10)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[title^='Next']"))).click()
        app = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "app"))
        )
        dfTemp = findObjects(app)
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        df = pd.concat([df, dfTemp])
        time.sleep(10)
finally:
    # The webdriver is closed
    driver.quit()
34
# The complete dataset is extracted from the page
df.head(20)
| | Brand | Price | Rating | Reviews | Description |
---|---|---|---|---|---|
0 | MSI | 5,706 | Rating + 4 | (9) | MSI Optix MAG270VC2 27" Full HD 1920 x 1080 1m... |
1 | SAMSUNG | 2,273 | Rating + 4 | (3) | SAMSUNG LS24F354FHNXZA 24" (Actual size 23.5")... |
2 | Acer America | 4,565 | Rating + 3 | (36) | Acer ED322QR Pbmiipx 32" (Actual size 31.5") F... |
3 | Acer America | 4,991 | Rating + 3 | (22) | Acer ED270R 27" Black 1920 x 1080 Widescreen 1... |
4 | Acer America | 2,301 | Rating + 3 | (18) | Acer KG221Q Abmix 22" (Actual size 21.5") 1ms ... |
5 | BenQ | 10,427 | Rating + 5 | (16) | BenQ EW3270U 32" (Actual size 31.5") 3840 x 21... |
6 | Acer America | 9,176 | Rating + 4 | (112) | Acer Predator XB1 XB241H bmipr 24" Full HD 192... |
7 | ViewSonic | 13,279 | Rating + 5 | (11) | ViewSonic ELITE XG270QG 27" Quad HD 2560 x 144... |
8 | MSI | 5,911 | Rating + 5 | (8) | MSI Optix G32C4 31.5" 1920 x 1080 1 ms (MPRT) ... |
9 | Acer America | 7,504 | Rating + 5 | (7) | Acer Nitro XZ322Q Pbmiiphx 31.5" FULL HD 165Hz... |
10 | LG Electronics | 9,318 | Rating + 5 | (6) | LG 27GN750-B 27'' Full HD 1920 x 1080 1ms (GTG... |
11 | Acer America | 17,206 | Rating + 4 | (42) | Acer Predator X34 Pbmiphzx 34" 3440 x 1440 Ult... |
12 | Acer America | 2,666 | Rating + 5 | (3) | Acer V227Q bi 22" (Actual size 21.5") Full HD ... |
13 | BenQ | 6,125 | Rating + 5 | (2) | BenQ ZOWIE XL2731 27" Full HD 1920 x 1080 1ms ... |
14 | SAMSUNG | 7,958 | Rating + 4 | (16) | SAMSUNG C32JG56 32" WQHD 2560 x 1440 2K Resolu... |
15 | BenQ | 4,514 | Rating + 4 | (17) | BenQ ZOWIE XL Series XL2411P Dark Gray 24" 144... |
16 | ViewSonic | 3,865 | Rating + 4 | (14) | ViewSonic VX2458-MHD 24" Full HD 1920 x 1080 1... |
17 | SAMSUNG | 17,120 | Rating + 4 | (13) | SAMSUNG LC32G75TQSNXZA 32" (Actual Size 31.5")... |
18 | ASUS | 6,806 | Rating + 4 | (10) | ASUS TUF Gaming VG328H1B 32" Full HD 1920 x 10... |
19 | ASUS | 10,094 | Rating + 4 | (10) | ASUS TUF Gaming VG32VQ 32" (Actual size 31.5")... |
# The resulting dataset contains 1174 products
df.shape
(1174, 5)
# The dataset is saved into a CSV file
df.to_csv("data.csv")
After collecting all the data, we have a file with all the products available on the site for the wordKey entered, but this information is not yet in an appropriate format for analysis. Most of the variables are in text format, and the description contains additional information, but only as raw text.
# Data file loaded
df = pd.read_csv("data.csv")
df.head()
| | Unnamed: 0 | Brand | Price | Rating | Reviews | Description |
---|---|---|---|---|---|---|
0 | 0 | MSI | 5,706 | Rating + 4 | (9) | MSI Optix MAG270VC2 27" Full HD 1920 x 1080 1m... |
1 | 1 | SAMSUNG | 2,273 | Rating + 4 | (3) | SAMSUNG LS24F354FHNXZA 24" (Actual size 23.5")... |
2 | 2 | Acer America | 4,565 | Rating + 3 | (36) | Acer ED322QR Pbmiipx 32" (Actual size 31.5") F... |
3 | 3 | Acer America | 4,991 | Rating + 3 | (22) | Acer ED270R 27" Black 1920 x 1080 Widescreen 1... |
4 | 4 | Acer America | 2,301 | Rating + 3 | (18) | Acer KG221Q Abmix 22" (Actual size 21.5") 1ms ... |
# Description of the products in raw format
for desc in df['Description'][:3]:
    print('- ' + desc + '\n')
- MSI Optix MAG270VC2 27" Full HD 1920 x 1080 1ms (MPRT) / 4ms (GTG) 165 Hz HDMI, DisplayPort FreeSync (AMD Adaptive Sync) Curved Gaming Monitor
- SAMSUNG LS24F354FHNXZA 24" (Actual size 23.5") Full HD 1920 x 1080 4ms (GTG) VGA HDMI AMD FreeSync Flicker Free Technology Super Slim Design LED Backlit Gaming Monitor
- Acer ED322QR Pbmiipx 32" (Actual size 31.5") Full HD 1920 x 1080 4ms (GTG) 144Hz 2xHDMI DisplayPort Built-in Speakers AMD FreeSync Backlit LED Curved Gaming Monitor
First of all, we will extract the information from the raw data in the 'Description' column. Although the data is unstructured, we can look for repetitive patterns such as the sequence shown below:
Sequence:
[Brand] [Model] [size] [dimension] [Frequency] [HDMI|VGA|USB]
- Brand: The first word of every description. It is a word made up of only upper and lower case letters, with no numbers.
- Model: One or more words following the brand, that is, the sequence of words between the brand and the screen size. It can contain alphanumeric values.
- Screen size: A number followed by ("), (”) or ('') as the reference for the units.
- Dimension: A number followed by an (x) and another number, with optional spaces in between (for example 1920 x 1080).
- Frequency: A number just before (Hz), written in upper case, lower case or a mix of them.
- HDMI: Whether the description contains the word HDMI.
- VGA: Whether the description contains the word VGA.
- USB: Whether the description contains the word USB.
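The sequence above can be sketched with the standard `re` module applied to a single description. This is an illustration under assumptions: the compiled patterns and the `parse_description` helper are hypothetical names, and the expressions are one possible reading of the rules in the list (including the frequency field), not necessarily the exact ones used in this project.

```python
import re

# Illustrative patterns that follow the sequence described above
BRAND = re.compile(r'^[A-Za-z]+')
SIZE = re.compile(r"(\d{1,3}(?:\.\d{1,3})?)(?:\"|”|'')")
DIMENSIONS = re.compile(r'(\d{3,4})\s?x\s?(\d{3,4})')
FREQUENCY = re.compile(r'(\d+)\s?hz', re.IGNORECASE)


def parse_description(description):
    # Each search returns None when the field is absent from the text
    brand = BRAND.search(description)
    size = SIZE.search(description)
    dims = DIMENSIONS.search(description)
    freq = FREQUENCY.search(description)
    upper = description.upper()
    return {
        'brand': brand.group(0) if brand else None,
        'size': float(size.group(1)) if size else None,
        'width': int(dims.group(1)) if dims else None,
        'height': int(dims.group(2)) if dims else None,
        'frequency': int(freq.group(1)) if freq else None,
        'hdmi': 'HDMI' in upper,
        'vga': 'VGA' in upper,
        'usb': 'USB' in upper,
    }


sample = ('MSI Optix MAG270VC2 27" Full HD 1920 x 1080 1ms (MPRT) / 4ms (GTG) '
          '165 Hz HDMI, DisplayPort FreeSync (AMD Adaptive Sync) Curved Gaming Monitor')
print(parse_description(sample))
```

Returning `None` for a missing field keeps "not found" distinct from a real value, which matters later when the columns are converted to numeric types.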
# Function that returns the first match of a pattern for every element in a list
def lookFor(elements, pattern, ifnot):
    # elements: List of elements in which the pattern is searched (element by element)
    # pattern: Regex pattern to search for in each element
    # ifnot: Fill value for elements where the pattern is not found
    patternList = []
    for element in elements:
        try:
            patternList.append(re.search(pattern, element).group(0))
        except (AttributeError, TypeError):
            # No match (re.search returned None) or a non-string element such as NaN
            patternList.append(ifnot)
    return patternList
# Searching for the selected patterns in the Description column
df['Brand2'] = lookFor(df['Description'].values, r'^[a-zA-Z]*', np.nan)
df['Model'] = lookFor(df['Description'].values, r'\s.*?(?=(?:\d{1,3}\.?\d{0,3}["”\'])|$)', np.nan)
df['Size'] = lookFor(df['Description'].values, r'\d{1,3}\.?\d{0,3}["”\']', np.nan)
df['Dimensions'] = lookFor(df['Description'].values, r'\d+\s?x\s?\d+', np.nan)
df['HDMI'] = lookFor(df['Description'].values, '[Hh][Dd][Mm][Ii]', 'No')
df['VGA'] = lookFor(df['Description'].values, '[Vv][Gg][Aa]', 'No')
df['USB'] = lookFor(df['Description'].values, '[Uu][Ss][Bb]', 'No')
After the information is extracted from the Description column, some changes are still needed to transform the data into the desired format:
- Price: Remove all commas and change the data type to float.
- Rating: Remove the string "Rating +" and change the data type to float.
- Reviews: Remove commas and the surrounding parentheses, then change the data type to integer.
- Size: Remove ("), (”) or ('') to keep only the length of the screen in inches.
- Width, Height: Split the dimensions (number x number) into two different columns named Width and Height.
- HDMI: Replace the matched text HDMI with Yes; descriptions without the feature already hold No.
- VGA: Replace the matched text VGA with Yes; descriptions without the feature already hold No.
- USB: Replace the matched text USB with Yes; descriptions without the feature already hold No.
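To make the conversions in the list concrete, here is a minimal plain-Python sketch applied to single values. The helper names are hypothetical, and the sketch assumes well-formed input strings like the ones shown in the dataset preview; it does not handle missing values.

```python
def clean_price(text):
    # '5,706' -> 5706.0
    return float(text.replace(',', ''))


def clean_rating(text):
    # 'Rating + 4' -> 4.0
    return float(text.replace('Rating +', '').strip())


def clean_reviews(text):
    # '(1,349)' -> 1349
    return int(text.strip('()').replace(',', ''))


def clean_size(text):
    # Drops the trailing inch mark, whether it is ", ” or ''
    return float(text.rstrip('"”\''))


def split_dimensions(text):
    # '1920 x 1080' -> (1920, 1080); int() tolerates the surrounding spaces
    width, height = text.split('x', 1)
    return int(width), int(height)


print(clean_price('5,706'), clean_rating('Rating + 4'), clean_reviews('(1,349)'))
print(clean_size('27"'), split_dimensions('1920 x 1080'))
```

The same rules can be applied column by column in pandas; writing them for single values first makes each rule easy to check in isolation.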
df.head()
| | Unnamed: 0 | Brand | Price | Rating | Reviews | Description | Brand2 | Model | Size | Dimensions | HDMI | VGA | USB |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | MSI | 5,706 | Rating + 4 | (9) | MSI Optix MAG270VC2 27" Full HD 1920 x 1080 1m... | MSI | Optix MAG270VC2 | 27" | 1920 x 1080 | HDMI | No | No |
1 | 1 | SAMSUNG | 2,273 | Rating + 4 | (3) | SAMSUNG LS24F354FHNXZA 24" (Actual size 23.5")... | SAMSUNG | LS24F354FHNXZA | 24" | 1920 x 1080 | HDMI | VGA | No |
2 | 2 | Acer America | 4,565 | Rating + 3 | (36) | Acer ED322QR Pbmiipx 32" (Actual size 31.5") F... | Acer | ED322QR Pbmiipx | 32" | 1920 x 1080 | HDMI | No | No |
3 | 3 | Acer America | 4,991 | Rating + 3 | (22) | Acer ED270R 27" Black 1920 x 1080 Widescreen 1... | Acer | ED270R | 27" | 1920 x 1080 | HDMI | No | No |
4 | 4 | Acer America | 2,301 | Rating + 3 | (18) | Acer KG221Q Abmix 22" (Actual size 21.5") 1ms ... | Acer | KG221Q Abmix | 22" | 1920 x 1080 | HDMI | No | No |
# Transforming the data into the desired format
df['Price'] = df['Price'].replace(',','', regex=True).astype(float)
df['Rating'] = df['Rating'].replace(r'Rating \+', '', regex=True).astype(float)
df['Reviews'] = (df['Reviews'].replace(',', '', regex=True)
                 .replace(r'[()]', '', regex=True).astype(int))
df['Size'] = df['Size'].replace(r'["”\']', '', regex=True).astype(float)
df[['Width', 'Height']] = df["Dimensions"].str.split("x", n = 1, expand = True)
df['HDMI'] = df['HDMI'].replace('HDMI','Yes', regex=True)
df['VGA'] = df['VGA'].replace('VGA','Yes', regex=True)
df['USB'] = df['USB'].replace('USB','Yes', regex=True)
df = df[['Brand','Brand2','Model','Price','Rating','Reviews','Size','Width','Height','HDMI','VGA','USB','Description']]
df.head()
| | Brand | Brand2 | Model | Price | Rating | Reviews | Size | Width | Height | HDMI | VGA | USB | Description |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MSI | MSI | Optix MAG270VC2 | 5706.0 | 4.0 | 9 | 27.0 | 1920 | 1080 | Yes | No | No | MSI Optix MAG270VC2 27" Full HD 1920 x 1080 1m... |
1 | SAMSUNG | SAMSUNG | LS24F354FHNXZA | 2273.0 | 4.0 | 3 | 24.0 | 1920 | 1080 | Yes | Yes | No | SAMSUNG LS24F354FHNXZA 24" (Actual size 23.5")... |
2 | Acer America | Acer | ED322QR Pbmiipx | 4565.0 | 3.0 | 36 | 32.0 | 1920 | 1080 | Yes | No | No | Acer ED322QR Pbmiipx 32" (Actual size 31.5") F... |
3 | Acer America | Acer | ED270R | 4991.0 | 3.0 | 22 | 27.0 | 1920 | 1080 | Yes | No | No | Acer ED270R 27" Black 1920 x 1080 Widescreen 1... |
4 | Acer America | Acer | KG221Q Abmix | 2301.0 | 3.0 | 18 | 22.0 | 1920 | 1080 | Yes | No | No | Acer KG221Q Abmix 22" (Actual size 21.5") 1ms ... |
# The tidy dataset is saved into a new file
df.to_csv("tidyData.csv", index=False)
For the analysis, the principal variables will be examined, but we'll focus mostly on a comparison of the top selling brands of monitors on the Newegg e-commerce website. We will study:
- Price range of products
- Price-rating relationship
- Principal sizes of monitors
- Top brands by offered products
- Top brands' prices
- Top brands' ratings
plt.figure(figsize=(16,6))
plt.title("Price distribution")
ax = sns.distplot(df['Price'])
ax.set(xlabel='Price (MXN)', ylabel='Frequency')
plt.show()
At a quick glance, the distribution is skewed to the right. This is caused by the most expensive equipment, but we can see that the most frequent prices range from 4,000 (MXN) to 5,000 (MXN).
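A quick numeric check of skewness is to compare the mean with the median: in a right-skewed distribution the mean is usually pulled above the median. The sketch below uses only the 20 prices shown in the dataset preview earlier in this document, so it illustrates the check and does not reproduce the full distribution.

```python
from statistics import mean, median

# Prices (MXN) of the 20 products listed in the dataset preview
prices = [5706, 2273, 4565, 4991, 2301, 10427, 9176, 13279, 5911, 7504,
          9318, 17206, 2666, 6125, 7958, 4514, 3865, 17120, 6806, 10094]

avg = mean(prices)
mid = median(prices)
print(f"mean = {avg:.2f}, median = {mid:.2f}")
# A mean larger than the median is consistent with a right-skewed distribution
print("right-skewed" if avg > mid else "not right-skewed")
```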
plt.figure(figsize=(16,6))
plt.title("Price-rating relationship")
ax = sns.boxplot(x="Rating", y="Price", data=df)
ax.set(xlabel='Star rating', ylabel='Price (MXN)')
plt.show()
A comparison is also made between price and rating. Since we don't have a direct way of measuring the quality of a product, the rating and feedback from users serve as an approximation. From the boxplot above, we can say that the 5 star rating can be found across the whole range of prices, but mainly between 2,500 (MXN) and 9,000 (MXN). Another important observation is that the 2 star rating is more frequent at higher prices than the 1 or 3 star ratings. As expected, the 1 star rating is found at lower prices, mostly below 8,000 (MXN). Despite these observations, there is no clearly observable relationship, since the boxes are not significantly different from each other.
plt.figure(figsize=(16,6))
plt.title("Principal sizes of monitors")
g = sns.distplot(df[df['Size'].notnull()]['Size'])
g.set(xticks=range(1,53))
g.set(xlabel='Size in inches', ylabel='Frequency')
plt.show()
From the plot, we can state that the most frequent sizes on sale are 21.5", 24" and 27" monitors. We can also see that there are no screen sizes of 11", 20.5", 26" and 30". Finally, there is more variation in size among the small screens than among the bigger ones. For the big screens, brands bet on the 44" size.
dfCount = pd.value_counts(df['Brand'].values, sort=True).head(10).to_frame()
dfCount.reset_index(inplace=True)
dfCount.columns = ['Brand', 'Count']
topBrands = dfCount['Brand'].values
plt.figure(figsize=(16,6))
plt.title("Most frequent brands in monitor selling")
plt.grid(alpha=0.3)
ax = sns.barplot(x="Brand", y="Count", data=dfCount, palette="GnBu_d")
Since there are a lot of different brands, we will focus on the most frequent brands in the dataset, that is, the brands with the largest number of models available in the market. As we can see, these brands correspond to the bigger companies, but there are also newer, less well-known companies entering the market, as is the case of Eyoyo. HP has the largest offer, with almost triple that of the last place, Philips.
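The "most frequent brands" ranking is a plain frequency count. As a small standard-library sketch, the same idea is applied below to the Brand column of the 20 products shown in the dataset preview earlier in this document; because that is only a small sample, its ranking differs from the one obtained on the full dataset. The `top_n` name is an illustrative assumption.

```python
from collections import Counter

# Brand column of the 20 products listed in the dataset preview
brands = ['MSI', 'SAMSUNG', 'Acer America', 'Acer America', 'Acer America',
          'BenQ', 'Acer America', 'ViewSonic', 'MSI', 'Acer America',
          'LG Electronics', 'Acer America', 'Acer America', 'BenQ', 'SAMSUNG',
          'BenQ', 'ViewSonic', 'SAMSUNG', 'ASUS', 'ASUS']

counts = Counter(brands)
top_n = 3
# most_common sorts the brands by number of listed products, largest first
for brand, count in counts.most_common(top_n):
    print(brand, count)
```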
dfTop = df[df['Brand'].isin(list(topBrands))].copy()
plt.figure(figsize=(16,6))
plt.title("Top brands' prices")
ax = sns.boxplot(x="Brand", y="Price", data=dfTop)
ax.set(xlabel='Brand', ylabel='Price (MXN)')
plt.show()
Now we can get a better sense of the data by plotting the distribution of prices of the top brands found. We can see that Samsung, ViewSonic and Philips offer the widest range of prices, with Philips being a little more expensive than Samsung. Looking at Eyoyo, we can state that it is a brand of cheaper equipment; Acer also offers equipment at a low price, but Eyoyo has the lowest prices. On the other hand, Philips and LG have the most expensive prices among them. From the position of the medians we can state that at least 50% of the products' prices are lower than 6,000 (MXN).
plt.figure(figsize=(16,6))
plt.title("Top brands' ratings")
ax = sns.boxplot(x="Brand", y="Rating", data=dfTop)
From the boxplot of ratings, we can see that for most of the brands the ratings range between 4 and 5 stars. The best rated brand is HP: most of the brands have their first quartile (25% of ratings) between 3 and 4 stars, while 100% of HP's ratings are between 4 and 5 stars. For ASUS and DELL, practically all ratings are 4 stars, which means that there are only a few cases with 5 star ratings (shown as outliers in the plot). The worst rated brand is Eyoyo, since 25% of its ratings are below 3 stars and more than 50% of its ratings are not 5 stars, which is poor in comparison with the other brands.
From all the analysis made above, we can state that:
- The most frequent price range for a monitor is between 4,000 (MXN) and 5,000 (MXN), with at least 50% of the products below 6,000 (MXN).
- For high-value monitors the 5 star rating is more frequent, while lower-priced monitors show lower star ratings.
- The most frequent screen sizes are 21.5", 24" and 27".
- Establishing the best brand is a complicated task, but we can get a better sense of it according to the price of the monitor.
- For lower prices, Eyoyo is not the best option; Acer is a little more expensive but makes up for it in quality, as shown in the ratings.
- For average prices, HP has the best satisfaction rating among all the brands, while the prices of the brand are fairly average.
- For higher prices, Philips is the best option, though not by a large margin; Philips was chosen because of the absence of low-rating outliers.