The pandas library is built for handling tabular data organized in rows and columns, and web pages often present such data as HTML tables. This guide demonstrates how to read HTML tables with pandas in a few simple steps.
To install pandas, we recommend using Anaconda. Alternatively, you can install pandas without Anaconda using pip. You can also install Jupyter Notebook with pip as follows:
pip install pandas
pip install notebook
import pandas as pd
Use the read_html function to parse tables from a web page. This function returns a list of DataFrames:
url = 'https://en.wikipedia.org/wiki/List_of_wealthiest_Americans_by_net_worth'
dfs = pd.read_html(url)
df = dfs[0]
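Because read_html returns one DataFrame per table found on the page, it is worth checking how many tables came back before picking one. Here is a minimal, self-contained sketch using an inline HTML snippet (the table contents are hypothetical) instead of a live URL:

```python
from io import StringIO
import pandas as pd

# A minimal HTML page with two tables (hypothetical data for illustration)
html = """
<table><tr><th>Name</th><th>Net worth</th></tr>
<tr><td>A</td><td>100</td></tr></table>
<table><tr><th>Year</th><th>Rank</th></tr>
<tr><td>2020</td><td>1</td></tr></table>
"""

# read_html returns one DataFrame per <table> element it finds
dfs = pd.read_html(StringIO(html))
print(len(dfs))             # 2
df = dfs[0]                 # pick the first table
print(df.columns.tolist())  # ['Name', 'Net worth']
```

Wrapping the literal HTML in StringIO matches current pandas behavior, which expects a URL, file path, or file-like object.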
You can use a different parser, such as BeautifulSoup, by setting flavor='bs4':
dfs = pd.read_html(url, flavor='bs4')
df.head()
In this example, the date of birth column also contains the age in parentheses, which needs to be cleaned up:
df['Date of birth(age)'] = df['Date of birth(age)'].str.replace(r'\(.*\)', '', regex=True)
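The same str.replace pattern can be tried on a small standalone Series; the sample values below are hypothetical, in the style of the Wikipedia column, and str.strip removes the trailing space left behind by the substitution:

```python
import pandas as pd

# Hypothetical values resembling the 'Date of birth(age)' column
s = pd.Series(['June 8, 1955 (age 68)', 'October 28, 1955 (age 68)'])

# Remove everything in parentheses, then trim the leftover whitespace
cleaned = s.str.replace(r'\(.*\)', '', regex=True).str.strip()
print(cleaned.tolist())  # ['June 8, 1955', 'October 28, 1955']
```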
Next, convert this object datatype to a datetime64 datatype as follows:
df['Date of birth(age)'] = pd.to_datetime(df['Date of birth(age)'])
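To see the conversion in isolation, here is a short sketch with hypothetical date strings; pd.to_datetime infers the format and yields a datetime64[ns] Series whose components (year, month, day) become directly accessible:

```python
import pandas as pd

# Hypothetical cleaned date strings
s = pd.Series(['June 8, 1955', 'October 28, 1955'])

dates = pd.to_datetime(s)   # object dtype -> datetime64[ns]
print(dates.dtype)          # datetime64[ns]
print(dates[0].year)        # 1955
```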
You can use the match parameter to find only the tables that contain the desired text.
url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
dfs = pd.read_html(url, flavor='bs4', match=r'Source\(s\) of wealth')
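The match parameter is a regular expression, which is why the parentheses need escaping. The filtering behavior can be verified offline with an inline snippet (hypothetical table contents): only tables whose text matches the pattern are returned.

```python
from io import StringIO
import pandas as pd

# Two hypothetical tables; only the first mentions the target text
html = """
<table><tr><th>Name</th><th>Source(s) of wealth</th></tr>
<tr><td>A</td><td>Tech</td></tr></table>
<table><tr><th>Year</th><th>Rank</th></tr>
<tr><td>2020</td><td>1</td></tr></table>
"""

# Only tables containing text matching the regex are kept
dfs = pd.read_html(StringIO(html), match=r'Source\(s\) of wealth')
print(len(dfs))  # 1
```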
read_html returns each entire table in a DataFrame. To get a specific column, use pandas filtering as follows:
df[['Name']]
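Note the double brackets: df[['Name']] returns a one-column DataFrame, while df['Name'] returns a Series. A quick sketch with a hypothetical DataFrame shows the difference:

```python
import pandas as pd

# Hypothetical data standing in for a scraped table
df = pd.DataFrame({'Name': ['A', 'B'], 'Net worth': [100, 90]})

names = df[['Name']]           # double brackets -> DataFrame
print(names.columns.tolist())  # ['Name']

name_series = df['Name']       # single brackets -> Series
print(name_series.tolist())    # ['A', 'B']
```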
See the following example:
url = 'https://en.wikipedia.org/wiki/Billionaire'
dfs = pd.read_html(url, flavor='bs4', match='known billionaires')
If you want to skip rows, you can use the skiprows parameter:
dfs = pd.read_html(url, skiprows=1)
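skiprows drops the given number of leading rows before the table is parsed. A small offline sketch (hypothetical table, no header cells so every row is treated as data) makes the effect visible:

```python
from io import StringIO
import pandas as pd

# Hypothetical table whose first row is a caption-like row we want to skip
html = """
<table>
<tr><td>Known billionaires caption row</td><td></td></tr>
<tr><td>Name</td><td>Rank</td></tr>
<tr><td>A</td><td>1</td></tr>
</table>
"""

dfs_all = pd.read_html(StringIO(html))               # all 3 rows kept
dfs_skip = pd.read_html(StringIO(html), skiprows=1)  # first row dropped
print(len(dfs_all[0]), len(dfs_skip[0]))  # 3 2
```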
In this case, we will have to remove one header level as follows (droplevel returns a new DataFrame, so assign the result):
df = df.droplevel(0, axis=1)
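read_html often produces a two-level column header when a table has merged header cells. Here is a sketch with a hypothetical MultiIndex header showing how droplevel(0, axis=1) strips the top level:

```python
import pandas as pd

# Hypothetical two-level column header, as read_html often produces
cols = pd.MultiIndex.from_tuples([('Known billionaires', 'Name'),
                                  ('Known billionaires', 'Net worth')])
df = pd.DataFrame([['A', 100]], columns=cols)

df = df.droplevel(0, axis=1)  # drop the top header level
print(df.columns.tolist())    # ['Name', 'Net worth']
```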
Use the to_csv method of the DataFrame object:
df.to_csv('file_name.csv', index=False)
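index=False prevents the row index from being written as an extra column, which makes the file round-trip cleanly. A self-contained sketch (hypothetical data, written to a temporary directory rather than the working directory) confirms this:

```python
import os
import tempfile
import pandas as pd

# Hypothetical data standing in for a scraped table
df = pd.DataFrame({'Name': ['A'], 'Rank': [1]})

path = os.path.join(tempfile.gettempdir(), 'file_name.csv')
df.to_csv(path, index=False)  # index=False omits the row index column

df2 = pd.read_csv(path)
print(df2.equals(df))  # True
```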