Author: Peter.
Please note: This article assumes that you know the basic python syntax. Also the code is available here.
- Installing Python + libraries.
- Installing VsCode and configuring Python
- Installing libraries
- Introduction to Requests.
- Example of Requests.
- Introduction to Beautiful Soup.
- Example of Beautiful Soup.
- What is lazy-loading and how to overcome it.
- Introduction to Selenium.
- Example of Selenium.
- Displaying information using PyPlot
- Getting started with NodeJS.
- What is NodeJS?
- Why use NodeJS?
- Installing NodeJS.
- Installing NodeJS modules
- Introduction to Puppeteer.
- Example of Puppeteer.
- What is Browser Automation.
- Introduction to browser automation with Selenium
- Example of browser automation with Selenium
Installing Python is really simple. Just visit the official Python web page, download the latest version and open the installer. During installation please make sure you have checked the box that adds Python to the PATH. If you don't see the checkbox, add it manually. Here is an example. If this link is broken, just google 'How to add python to PATH'.
The next step for you is to grab your text editor so we can start coding some cool stuff. Personally I prefer using Vim or VsCode. I would recommend you use VsCode since it is more beginner-friendly. After installing VsCode from the website, open it up and you should be greeted with a fancy looking UI. Now we are going to connect Python with VsCode, otherwise our text editor (VsCode) is useless. I will be explaining how to install the extension, but if you find it difficult to follow you can also visit the official tutorial here. Upon being greeted with the main User Interface, navigate to the extensions tab. (Picture below of the extensions tab icon)
Search python in the textbox then install the first one. It might take some time to install.
After it is installed, press ctrl+shift+p (you can also go to the View tab and select Open command palette). A window should appear at the top; type in Python: Select Interpreter and press Enter or click on that option.
Then select the recommended one. Great! Now we have Python installed and connected with Vscode.
We can now create python files and run them within Vscode.
By now you should know what command prompt is. We will be using it to install the required libraries. At the start of each section I will show you how you can install the required libraries. Some might not require installation, but I will state how to install them when they do.
The Requests library is a Python library that is sometimes pre-installed. If you don't have the module installed, or just want to make sure, do the following:
- Open command prompt.
- Type in
pip install requests
- Look at what the output says. It will state whether the requirements are already met (already installed) or not. Either way it should be installed.
Requests is used to make HTTP requests to web pages / ip addresses in order to retrieve or send data. We use this library only for basic web scraping that doesn't involve the need for extreme methods of scraping. You would also need to know how to work with strings to use this library as your main method of scraping, since Requests grabs the web page's front-end code as one whole string and the string formatting has to be done manually. When we reach 4. Introduction to Beautiful Soup you'll see how we work around this tedious problem.
We can start off by creating a new folder called "Python", then create a file inside of it, open it in VsCode and call it something similar to main.py or test.py.
Now let's open the file and first import the modules/libraries we need for web scraping; in this case we are using Requests. The first line of code should look like this:
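# Pull in the Requests library so we can make HTTP requests.
import requests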
We are importing the Requests module (it's basically a python script containing pre-defined functions that you can use almost anywhere).
Now we have a list of functions we can use to do some scraping. We would like to grab a web page's code first so that we can use that information for statistical purposes, marketing research etc.
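As a rough sketch (assuming the demo site is books.toscrape.com, a practice site commonly used in scraping tutorials; the variable names are just illustrative), grabbing a page looks like this:

url = "http://books.toscrape.com/"

# Send an HTTP GET request to the page and keep the response object in r.
r = requests.get(url)

# The page's HTML source is exposed as a string on the response object.
sourceCode = r.text
print(sourceCode)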
In the code above we use the HTTP GET method, via the requests.get(url) function, to grab the html source of the url, storing the response in a variable called r. The actual page source lives in r.text, which we keep in sourceCode. If we print it and run the file, it returns the HTML code of the website url we provided to the GET request, stored as one long string.
Now that we have the data captured from the website we can use it. Also, I forgot to mention: you can run python files by pressing the run icon in the top right.
Let's go back a few steps. Web scraping requires you to know what you want to achieve. This is my thought process when it comes to scraping.
- I have a deep look into the website and its mechanics.
- I start to look for unique, important values as well as repeating values. Both can be used in different use cases.
Below is my analysis of the website we are currently testing.
The items circled in red are repeating values. The items circled in blue are important values. In this scenario we can use the important value to monitor the number of book results and use the repeating values to store information about each book. To be honest this scenario doesn't really have a great "important value", but we will use it anyway. You will see better ones as we progress to different websites to scrape.
Now we will scrape the important value instead of the repeating values, only because it's a tedious process to do a lot of complex scraping with manual string methods, as I mentioned previously, but I'll do it once.
First start off by grabbing the index where the element's opening tag begins, like this:
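# Index of the first occurrence of the opening <strong> tag in the page source.
begin = sourceCode.find("<strong>")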
What we do is get the index of the character where the <strong> element first occurs, because when you look in your browser's inspector tool (right-click, inspect element) you'll see that the important value has a tag around it called "strong", and you would also notice that it is the first occurrence of the "strong" tag. If there were earlier occurrences of that tag then this wouldn't have worked, because this code grabs the first occurrence. Hopefully I explained it in a way you can understand.
Now we will grab the "strong" closing tag, which looks like </strong>:
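# Index of the first occurrence of the closing </strong> tag.
end = sourceCode.find("</strong>")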
Now we have two numbers: we know where the element starts and ends in the whole string. Now we cut the text out like below.
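# Slice out everything between the two tags, skipping past the "<strong>" tag itself.
stringCut = sourceCode[begin + len("<strong>"):end]
print(stringCut)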
And now if we print the part that we cut out of the main string, we get 1000.
The reason I use +len("<strong>") is because it moves the index to the starting point of the element's text. If you do it without that part, like below:
stringCut = sourceCode[begin:end]
print(stringCut)
Then it will output this:
<strong>1000
But when you add the length of the word <strong>
you get:
1000
And that's it. This is webscraping in a nutshell.
"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."
To install this library you'll have to open command prompt, type in pip install beautifulsoup4 and then press Enter.
Now we will add the library to our Python script.
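# Beautiful Soup lives in the bs4 package; we import the class under a short alias.
from bs4 import BeautifulSoup as bs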
We say as bs just to give the BeautifulSoup class our own abbreviation, so now we can simply use bs everywhere.
Let's continue where we last left off. Since Beautiful Soup is used to disassemble a website's source code, it doesn't grab the source code itself; this means that we will still be using the Requests library to grab the source code and feed it to Beautiful Soup. So you should have this as your code currently:
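Roughly, keeping the same (assumed) books.toscrape.com demo site from before:

import requests
from bs4 import BeautifulSoup as bs

# Grab the page source with Requests, exactly as in the previous section.
url = "http://books.toscrape.com/"
r = requests.get(url)
sourceCode = r.text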
And now we have the variable sourceCode which stores the code, and we will read it into Beautiful Soup like this:
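# Feed the raw HTML string into Beautiful Soup. "html.parser" ships with Python;
# any other supported parser (such as lxml) works here too.
soup = bs(sourceCode, "html.parser")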
Alright. Now let's grab the "repeating values", which are the book names. We will do this by using Beautiful Soup's find_all() function, which can fetch elements by tag name. Since each name element has an h3 tag, it will be really easy to grab them all.
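# find_all("h3") returns a list of every <h3> element on the page.
titles = soup.find_all("h3")

# Loop over the elements we found and print each one.
for title in titles:
    print(title)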
In the code above we use the soup.find_all() function with the parameter set to h3 to find all the h3-tagged elements on the web page. If you don't know how I got h3: right-click on the webpage and open the developer tools (right-click -> Inspect element) and you will see that the name has an h3 tag. Now we have a loop that will go through all the elements that we found. We just want the text of those elements, not the element itself, so if we want to print the name of each book we will do this:
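for title in titles:
    # .text returns only the element's text content, not the surrounding tags.
    print(title.text)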
You have to put .text because otherwise it will print the entire element. You can change it and see the difference. And that's the basics of scraping with Beautiful Soup. You can have a look at 9. Displaying information using PyPlot if you want to see how we place the fetched information into visual representations and graphs. Again, I want to remind you that all the files and code are uploaded to this github repo which you can visit here.
There are two different types of websites: static web pages and dynamic web pages. Static web pages load all the content that you see using HTML, CSS etc. This type of content doesn't require any loading in, because it is pre-loaded, if that makes sense. Dynamic web pages have content that needs to be loaded in, and you'll see web pages like these everywhere. YouTube is an example of a dynamic web page, because the videos need to be loaded in: our computers don't have the capacity or power to show every video that exists on YouTube, it would break. That's why YouTube only loads in the videos that need to be loaded in, and thus we call it a dynamic web page. These pages require different methods of scraping, since the two libraries we used previously can only read pre-loaded code from a website, and since the videos aren't pre-loaded we can't really fetch their information. The workaround for this is to either use browser extensions and plugins which can read the content after it is loaded in, or to use a browser automation tool such as Selenium or Puppeteer, which are the two we'll be using.
Selenium is an open source umbrella project for a range of tools and libraries aimed at supporting browser automation. It provides a playback tool for authoring functional tests across most modern web browsers, without the need to learn a test scripting language. We can use this tool to overcome lazy-loading.
Let's start building our first Selenium scraper. We can create a new python script for this application called seleniumScrape.py or something similar, and open the file in VsCode. Now, as usual, let's start by installing the required modules/libraries: open up your command prompt or terminal or whatever console you use, type in pip install selenium and press Enter. Great, now we can start building our first advanced scraper!
For this module we will need a webdriver. There's a few you can choose from:
- Chrome
- Edge
- Firefox
- Safari
For this article we will be using the Chrome web driver. You can download and install it here. If you don't know how to install it or where to put the .exe file: go to your boot drive folder (normally called C:/) and place chromedriver.exe inside of it. Now we can import the modules in the new script we created:
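# webdriver gives us the classes that control a real browser.
from selenium import webdriver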
Now we can create a variable called driver which will fetch the installed driver. Please note that this method of grabbing the webdriver is deprecated. I will update this article when it stops working.
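Assuming chromedriver.exe really is sitting in C:/ as described above, the (now deprecated) shortcut looks roughly like this:

# Passing the driver path straight to webdriver.Chrome() is the older style;
# newer Selenium versions expect a Service object or locate the driver themselves.
driver = webdriver.Chrome("C:/chromedriver.exe")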
Now we can ask selenium to use the webdriver to open a website; the driver object then holds the loaded page. For this example we open the web version of youtube as our scraping website. Please note that this is for educational purposes only.
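# Open YouTube's home page in the automated browser window.
driver.get("https://www.youtube.com")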
We have a problem: since this is the base url, it will only give us recommended video results, and we want search results, so we will have to change the url. But first let's go over what a url is and how it works. This answer was given by ChatGPT btw; I do not have a deep enough understanding to give a simple answer.
Let's first understand how a URL works and how we can modify a url to give different outputs to our browser window.
A URL (Uniform Resource Locator) is a string of text that specifies the location of a resource on the internet. It is like an address that specifies where something is located. URLs are used to access web pages, but they can also be used to access other types of resources, such as images and videos.
There are several parts that make up a URL:
Protocol: This specifies the type of resource that the URL is pointing to. The most common protocol is http, which stands for HyperText Transfer Protocol. Other protocols include https, ftp, file, and mailto.
Domain: This is the main part of the URL and specifies the name of the website that the resource is located on. For example, the domain for Google is google.com.
Path: This specifies the location of the resource within the website. It is like a file path on a computer, and it can include subdirectories and individual files.
Query string: This is a set of key-value pairs that are appended to the end of the URL and are used to pass additional information to the server. They are usually separated from the rest of the URL by a ? character, and each key-value pair is separated by a & character.
Fragment: This is an optional part of the URL that specifies a specific location within a resource. It is separated from the rest of the URL by a # character.
Here is an example of a complete URL:
https://www.example.com/path/to/resource?key1=value1&key2=value2#fragment
In this example:
- The protocol is https.
- The domain is www.example.com.
- The path is /path/to/resource.
- The query string is key1=value1&key2=value2.
- The fragment is fragment.
Using the information above we will modify the url in our code to search for a specific word or sentence, in this case cats. Here's the fixed line of code:
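# The search term goes into the search_query parameter of YouTube's results page.
driver.get("https://www.youtube.com/results?search_query=cats")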
Now it will grab the source code of the website which contains the video results for the term "cats".
Now that selenium has read the site it can find elements that take time to load in (lazy-loaded elements). When looking at youtube we know that videos are one of the main elements that take time to load in. This includes the title of the video, the thumbnail, the video uploader etc. Now we are going to scrape video titles and save them in a .txt file which we can open anytime to look into. Let's start by using our web browser's inspect element tool (right-click, then choose the inspect element option) and look at what the element for the video titles looks like.
If we use the inspect element tool on a video title we can see the element has a unique id (so far we think it is unique). So let's scrape all the elements that have the same id as this one, because then we will be able to scrape all of those video titles from our search.
Now we will ask Selenium to read all the video titles from the web page by doing the following. First we use the find_elements method from the driver, where we can specify exactly what we want to search for. Don't forget to import the By class into the code as well, since we will be using it. Then use the find_elements method:
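At the time of writing, the id on YouTube's result titles is video-title; if the markup has changed since, swap in whatever id the inspector shows you.

from selenium.webdriver.common.by import By

# Grab every element on the results page that carries the video-title id.
titles = driver.find_elements(By.ID, "video-title")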
This will return a Python list containing all the elements it found. Let's iterate through the list and save all the items in a .txt file for future usage.
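for title in titles:
    # For now just print each title's text to the console.
    print(title.text)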
Now if we run this it will give us something similar to this:
This is the list of all video titles on that page. Now we will place this into a .txt file. Create a file named youtubeResults.txt, then go back to your code and change the loop. What we had simply printed the results; now we change it so that the results are saved:
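# Open (or create) the results file in write mode ("w").
file = open("youtubeResults.txt", "w")

for title in titles:
    # Write each title on its own line instead of printing it.
    file.write(title.text + "\n")

# Close the file once all the titles have been written.
file.close()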
If we break this down into multiple steps: on the first line we open the file, and you will notice the second parameter to the open() function, which is w. This indicates that we wish to write to the file and add text to it. The next step is where we replace the print function with the file.write() function, which places each video title in the file and then adds a \n after it. This means that the video titles will be placed below one another in the text file and not all on one line. Lastly we use file.close() to close the file when we are done using it. Learning this is useful for when clients want a list generated from scraped data so that it can be represented in multiple ways. Also, if you got errors in your output it may be because some video title contained a character that couldn't be written to the file. I'll leave that part for you to figure out.
The pyplot module is a part of the matplotlib library in Python, which is a popular data visualization library used for creating various types of charts and plots. We use this to represent our scraped data visually. For this example we will be using PyPlot to show the most played game currently out of a few listed ones. We will be scraping information from the website: https://steamdb.info . We will start as follows:
Make sure the matplotlib module is installed by using pip:
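pip install matplotlib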
Now that we have matplotlib, let's import the pyplot module from it:
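# plt is the conventional alias for pyplot.
import matplotlib.pyplot as plt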
We will be using Selenium once again for scraping the information, because as mentioned in the previous section we use this library to scrape dynamic web pages. This website ( https://steamdb.info ) is dynamic because it loads the numbers in automatically. Let's import Selenium and get it running:
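A sketch of the setup, assuming the same chromedriver location as before:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Same simple (deprecated) driver setup as in the YouTube example.
driver = webdriver.Chrome("C:/chromedriver.exe")

# Load the SteamDB front page, which lists the most played games.
driver.get("https://steamdb.info")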
We now visit the website ourselves to use inspect element on it. We would like to know what type of element these numbers are, so open up inspect element (by right-clicking on the numbers and choosing inspect element) and have a look at the number's code. We can see that the element has a class attribute which gives it a grouping identity. Let's see if the other numbers also have that class. Great! They do. Now we know that these are repeating values and they have a repeating attribute (the class). We will be using the class name to grab all the numbers, but first we need the names of these games too, otherwise the numbers are pretty much useless. Using inspect element again we can see that the number of players currently in a game and the name of the game are all wrapped in one html element:
So if we grab the element which keeps this information, we will have both the game's name and its current player count. We will now make a loop to grab all the games and their information using the element's class, which is app:
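# Every game row shares the class "app", so this returns one element per game.
games = driver.find_elements(By.CLASS_NAME, "app")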
This will grab all the elements with the class app and store them in a variable. Now we make the loop and print all the games' information:
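for game in games:
    # .text gives the row's visible text: the game's name, current players and peak.
    print(game.text)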
Ok great, we now get information like the game's name, the number currently playing and the peak player count. We will now sort this information so that we can use it as input for pyplot.
--- More coming soon --- --- I need to redo the Selenium installation since it changed ---