This is a Ruby-based demo project designed to teach beginners how to perform both basic and interactive web scraping. This project leverages Watir for browser automation and Nokogiri for HTML parsing. By following this tutorial, you'll learn how to navigate websites, handle login forms, and extract valuable data efficiently.
Purpose:
This project serves as an educational tool to help you understand the fundamentals of web scraping using Ruby. It provides hands-on experience with automating interactions on a live website and extracting structured data from it.
For this demo, we will be using the FireFrog Banking website, which is specifically designed for testing and educational purposes.
Demo Website:
https://demo.testfire.net/index.jsp
Features of the Demo Site:
- Interactive Login Page: Allows you to practice automating the login process.
- Account Overview: View account balances and recent transactions.
- Demo Data: The site contains predefined data suitable for scraping exercises.
Demo Credentials:
- Username: admin
- Password: admin
Usage:
- Navigate to the Login Page: Open the demo site at https://demo.testfire.net/index.jsp.
- Enter Credentials:
  - Username: admin
  - Password: admin
- Access Account Information: After logging in, you can navigate to various sections to practice scraping different types of data such as account balances, recent transactions, credits, and debits. A minimal code sketch of these steps follows this list.
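To show how these steps translate into code, here is a minimal sketch that logs in with Watir and hands the rendered page to Nokogiri for parsing. The element locators (the Sign In link, the uid/passw fields, the btnSubmit button) and the generic table selector are assumptions about the demo site's markup and may need adjusting to match the actual page:

```ruby
require 'watir'
require 'nokogiri'

# Start a headless Chrome session and open the demo site.
browser = Watir::Browser.new :chrome, headless: true
browser.goto 'https://demo.testfire.net/index.jsp'

# Log in with the demo credentials (locator names are assumptions).
browser.link(text: 'Sign In').click
browser.text_field(name: 'uid').set 'admin'
browser.text_field(name: 'passw').set 'admin'
browser.button(name: 'btnSubmit').click

# Parse the resulting HTML with Nokogiri and print any table rows found.
doc = Nokogiri::HTML(browser.html)
doc.css('table tr').each do |row|
  cells = row.css('td').map { |cell| cell.text.strip }
  puts cells.join(' | ') unless cells.empty?
end

browser.close
```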
Security Notice:
- The demo site is publicly accessible and intended solely for educational purposes. Do not use real personal information or credentials when interacting with this site.
Before you begin, ensure you have the following installed on your machine:
- Ruby: The programming language used for this project.
- Bundler: A Ruby gem for managing project dependencies.
- Git: For cloning the repository.
MacOS:
- Using Homebrew:
Homebrew is a popular package manager for MacOS. If you don't have Homebrew installed, you can install it by running the following command in your Terminal:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Install Ruby:
Once Homebrew is installed, you can install Ruby by running:
brew install ruby
- Update PATH:
After installation, ensure your system can locate the Ruby binaries. Add the following line to your shell configuration file (e.g., .bash_profile or .zshrc):
export PATH="/usr/local/opt/ruby/bin:$PATH"
Then, apply the changes:
source ~/.bash_profile # or source ~/.zshrc
- Verify Installation:
ruby -v
You should see the Ruby version installed.
Linux:
- Using Package Manager:
The installation command may vary based on your Linux distribution.
- Ubuntu/Debian:
sudo apt update
sudo apt install ruby-full build-essential
- Fedora:
sudo dnf install ruby ruby-devel
- Arch Linux:
sudo pacman -S ruby
- Verify Installation:
ruby -v
You should see the Ruby version installed.
Windows:
- Using RubyInstaller:
- Go to the RubyInstaller website.
- Download the latest Ruby+Devkit installer (e.g., Ruby 3.x.x).
- Run the installer and follow the on-screen instructions.
- Ensure you select the option to add Ruby executables to your PATH.
- After installation, open the Command Prompt and verify:
ruby -v
You should see the Ruby version installed.
- Installing MSYS2 (if prompted):
During the Ruby installation on Windows, you might be prompted to install MSYS2. Follow the prompts to complete the installation, which is necessary for building native Ruby gems.
Git is essential for cloning the repository. If you don't have Git installed, follow the instructions for your operating system.
- MacOS: Install via Homebrew
brew install git
- Linux: Install via Package Manager
# Ubuntu/Debian
sudo apt install git
# Fedora
sudo dnf install git
# Arch Linux
sudo pacman -S git
- Windows: Download and install from the official website.
First, clone the WebScrapingDemo repository to your local machine.
git clone https://github.com/MeetAp/BasicWebScrapingDemo.git
cd BasicWebScrapingDemo
- Install Bundler:
Bundler manages the project's Ruby gem dependencies. Install it by running:
gem install bundler
- Install Project Gems:
Navigate to the project directory and install the required gems using Bundler.
bundle install
This command reads the Gemfile and installs all the listed gems, such as Nokogiri and Watir.
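For reference, a minimal Gemfile for this kind of setup might look like the sketch below; the repository's actual Gemfile may pin versions or include additional gems:

```ruby
# Gemfile (illustrative sketch, not copied from the repository)
source 'https://rubygems.org'

gem 'nokogiri' # HTML parsing
gem 'watir'    # browser automation
```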
Here's an overview of the project's directory structure:
├── Gemfile
├── Gemfile.lock
├── README.md
└── scrapers
├── admin_page_scraper.rb
└── homepage_scraper.rb
- Gemfile & Gemfile.lock: Manage and lock gem dependencies.
- README.md: Project documentation.
- scrapers/: Holds the Ruby codebase.
  - admin_page_scraper.rb: Scraper for the admin page.
  - homepage_scraper.rb: Scraper for the homepage.
You can run the scraper scripts directly using the ruby command. This method is straightforward and works across all operating systems.
ruby scrapers/admin_page_scraper.rb
ruby scrapers/homepage_scraper.rb
Explanation:
- ruby: The Ruby interpreter.
- scrapers/admin_page_scraper.rb: Path to the Admin Page Scraper script.
When you run this command, Ruby executes the specified script, and the scraper performs its designated tasks, such as extracting data and displaying it in the console.
Common Issues:
- Script/File Not Found: Ensure you're in the project root directory when running these commands.
- Run Admin Scraper via Ruby:
ruby scrapers/admin_page_scraper.rb
- Run Homepage Scraper via Ruby:
ruby scrapers/homepage_scraper.rb
A headless browser is a web browser without a graphical user interface. It allows you to perform automated web interactions, such as navigating pages and filling out forms, without opening a visible browser window. This is particularly useful for running scripts on servers or environments where a display is not available.
By default, the scraper scripts are set to run in headless mode to improve performance and reduce resource usage. If you prefer to see the browser actions in real-time for debugging or learning purposes, you can easily enable or disable headless mode by modifying a single line of code in each scraper file.
- Locate the Scraper File:
Navigate to the scraper file you want to configure. For example:
scrapers/admin_page_scraper.rb
scrapers/homepage_scraper.rb
- Modify the Browser Initialization Line:
Find the line where the Watir browser is initialized. It should look like this:
browser = Watir::Browser.new :chrome, headless: true
- Enable Headless Mode:
To enable headless mode (browser runs in the background without a UI), ensure the line is:
browser = Watir::Browser.new :chrome, headless: true
- Disable Headless Mode:
To disable headless mode (browser window will be visible), change the line to:
browser = Watir::Browser.new :chrome, headless: false
Or simply remove the headless option, as false is the default value:
browser = Watir::Browser.new :chrome
Note: Disabling headless mode will open a new browser window each time you run the scraper, allowing you to observe the automated interactions.
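If you would rather not edit the file each time, one possible variation (not part of the repository's scripts) is to read the setting from a hypothetical HEADLESS environment variable:

```ruby
# Hypothetical toggle: headless by default, visible when HEADLESS=false is set.
headless = ENV.fetch('HEADLESS', 'true') != 'false'
browser = Watir::Browser.new :chrome, headless: headless
```

You could then run, for example, HEADLESS=false ruby scrapers/homepage_scraper.rb to watch the browser work.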
- Ruby Not Found
Cause: Ruby is not installed or not added to the system PATH.
Solution:
- Follow the Installing Ruby section to install Ruby.
- Ensure Ruby's bin directory is in your system's PATH.
- Missing Gems
Cause: Required gems are not installed.
Solution:
- Run:
bundle install
- Script Exceptions
Cause: Errors within the scraper scripts (e.g., network issues, changes in webpage structure).
Solution:
- Review error messages in the console.
- Ensure the target webpage's structure hasn't changed.
- Implement logging for better error tracking (optional); a minimal sketch follows below.
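If you want to add that optional logging, a minimal sketch using Ruby's standard Logger might look like this; the log file name and the rescued error classes are illustrative choices, not part of the repository's scripts:

```ruby
require 'logger'
require 'watir'

logger = Logger.new('scraper.log')

begin
  browser = Watir::Browser.new :chrome, headless: true
  browser.goto 'https://demo.testfire.net/index.jsp'
  logger.info "Loaded page: #{browser.title}"
  # ... scraping logic goes here ...
rescue Watir::Wait::TimeoutError, Selenium::WebDriver::Error::WebDriverError => e
  # Record the failure with enough context to debug it later.
  logger.error "Scrape failed: #{e.class}: #{e.message}"
  raise
ensure
  browser&.close
end
```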
Contributions are welcome! If you'd like to contribute:
- Fork the Repository: Click the "Fork" button at the top right of the repository page.
- Create a Feature Branch:
git checkout -b feature/YourFeatureName
- Commit Your Changes:
git commit -m "Add your message here"
- Push to the Branch:
git push origin feature/YourFeatureName
- Open a Pull Request: Go to the original repository and create a pull request with your changes.
This project is licensed under the MIT License.
Happy Scraping!