UsamaIslam/urdu_ocr_dataset_generation

urdu_ocr_dataset_generation

Dataset generation for Urdu OCR.

Requirements:

  1. Jupyter Notebook
  2. Scrapy
  3. Pandas
  4. Selenium

How to Use:

  • Download the repository.
  • Go into the folder named 'bbcurdu'.
  • Open a command prompt and run `scrapy crawl bbc -o filename.csv`. This scrapes the BBC Urdu news titles on the current page and saves them to filename.csv.
  • Copy filename.csv into the main directory.
  • Open Jupyter Notebook in the main directory. In cell In [28] you can change the column name to either "content_news" or "title_headlines".
  • Run all cells.
  • Once all cells have finished, open the "data_set.py" file and paste your Jupyter Notebook token URL into it. You will also need the driver for the particular browser you are using with Selenium: drivers for [Firefox](https://github.com/mozilla/geckodriver/releases) and for [Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads); others can be found [here](https://www.seleniumhq.org/download/). Download the driver and place it in the main directory.
  • Then run `python data_set.py`.
  • It will create two directories, "images" and "texts", containing the dataset.
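
Several of the steps above depend on files being in the right place before `python data_set.py` is run. The following is a minimal sketch of a pre-flight check, not part of the repository: the helper names `find_missing` and `has_column` are hypothetical, the driver file names are assumed to be the default executable names shipped by the Firefox and Chrome driver downloads, and the CSV is assumed to be UTF-8 with a header row.

```python
import csv
from pathlib import Path

# Assumed default executable names for the Firefox and Chrome drivers.
DRIVER_NAMES = ("geckodriver", "geckodriver.exe", "chromedriver", "chromedriver.exe")


def find_missing(main_dir, csv_name="filename.csv"):
    """Return a list of problems found in the main directory (empty if none)."""
    main_dir = Path(main_dir)
    problems = []
    # The scraped CSV must have been copied from 'bbcurdu' into the main directory.
    if not (main_dir / csv_name).is_file():
        problems.append(f"{csv_name} not found in {main_dir}")
    # data_set.py is the script that is run in the final step.
    if not (main_dir / "data_set.py").is_file():
        problems.append(f"data_set.py not found in {main_dir}")
    # At least one browser driver should sit next to data_set.py.
    if not any((main_dir / name).is_file() for name in DRIVER_NAMES):
        problems.append(f"no browser driver found in {main_dir}")
    return problems


def has_column(csv_path, column):
    """Check whether the CSV header row contains the given column name."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return column in header


if __name__ == "__main__":
    for problem in find_missing("."):
        print("Missing:", problem)
    if Path("filename.csv").is_file() and not has_column("filename.csv", "title_headlines"):
        print('Column "title_headlines" not found; try "content_news" instead.')
```

Run from the main directory, this prints one line per missing item and nothing if everything is in place. It does not check the token URL inside "data_set.py", which still has to be pasted in by hand.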
