I used Beautiful Soup to collect data and make data entry into the company's WordPress site easier.
We want to get Rotten Tomatoes' audience scores and the number of audience reviews to add to our dataset. However, these values aren't easily accessible from the website, so to get them we'll need to do web scraping, which lets us extract data from websites using code.
Website data is written in HTML (HyperText Markup Language), which uses tags to structure the page. Because HTML and its tags are just text, that text can be accessed using parsers 🕵️. We'll be using a Python parser called Beautiful Soup.
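To see what that means in practice, here's a minimal sketch showing how a parser turns tagged text into something we can query. The HTML snippet and the score value in it are made up for illustration:

from bs4 import BeautifulSoup

# A tiny, made-up HTML document: plain text with tags marking its structure
html = """
<html>
  <head><title>E.T. the Extra-Terrestrial</title></head>
  <body>
    <p class="audience-score">Audience Score: 72%</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)                        # E.T. the Extra-Terrestrial
print(soup.find('p', class_='audience-score'))  # the whole <p> tag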
Let's get started by exploring the structure of HTML files.
- Manual access: The quick way to get HTML data is to save the HTML file to your computer manually. You can do this by clicking Save in your browser.
- Programmatic access: Programmatic access is preferred for scalability and reproducibility. Two options:
- Downloading the HTML file programmatically (we'll explore this code in more detail later):
import requests

url = 'https://www.rottentomatoes.com/m/et_the_extraterrestrial'

# Download the page
response = requests.get(url)

# Save HTML to file
with open("et_the_extraterrestrial.html", mode='wb') as file:
    file.write(response.content)
- Working with the response content live in your computer's memory using the Beautiful Soup HTML parser:
# Work with HTML in memory
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')
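With the page parsed, Beautiful Soup's find() and find_all() methods let you pull specific elements out of the soup. Here's a hedged sketch; the tag and class name used for the audience score are hypothetical placeholders, since Rotten Tomatoes' real markup changes over time and has to be checked by inspecting the live page:

# Every <a> tag on the page (a safe, generic query)
links = soup.find_all('a')
print(len(links), "links found")

# Hypothetical selector: placeholder tag and class name, not Rotten Tomatoes' real markup
score_tag = soup.find('span', class_='audience-score')
if score_tag is not None:
    print(score_tag.get_text(strip=True))

Checking the result for None before using it keeps the script from crashing when the markup changes.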
Also, a web page's HTML tends to change over time. Scraping code breaks easily when a site is redesigned, which makes scraping brittle and not recommended for projects that need longevity. So just use the saved HTML files and pretend you downloaded them yourself with one of the methods described above.
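If you're working from one of those saved files rather than a live request, you can hand the open file straight to Beautiful Soup. A minimal sketch, reusing the file name from the download example above:

from bs4 import BeautifulSoup

# Parse a previously saved HTML file instead of a live response
with open("et_the_extraterrestrial.html", mode='rb') as file:
    soup = BeautifulSoup(file, 'lxml')

print(soup.title.string)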
More Information ℹ️ There are some ethical issues involved in web scraping. Read this thoughtful article to learn more: Towards Data Science: Ethics in Web Scraping