Merge pull request #1 from amhanson9/master
Add GUI and Executable
amhanson9 authored Apr 28, 2021
2 parents 9cd1930 + 945a0f4 commit d8eb62c
Showing 5 changed files with 391 additions and 61 deletions.
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
# PyCharm files
.idea

# Files for testing
test_files

# Files for building the executable
__pycache__
build
*.spec
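
`build` directories and `*.spec` files are the artifacts PyInstaller generates, so the Windows executable is presumably built with PyInstaller; the exact build command is not shown in this commit.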
41 changes: 41 additions & 0 deletions Command Line Instructions.md
@@ -0,0 +1,41 @@
# Command Line Instructions

In addition to the available Windows executable with a Graphical User Interface (GUI), this program may be run via the command line using the dlg_json2csv.py file. This file was created in 2019 and is minimally maintained, other than to correct errors. The focus of development is the Windows executable.

This program was created to be an intermediate step that pulls item(s) from the [Digital Library of Georgia's](https://dlg.usg.edu) (DLG) API and compiles them into a CSV file. The CSV is specifically formatted for import into Omeka with the CSVImport plugin. Please take a look at `DLG_Omeka_API_Pipeline.pdf`, which explains the entire pipeline and should give you a good feel for the task this script completes. This lets you have the data for each item without having to store each item on your own; the items themselves remain hosted by the DLG.


### A Brief Description
The DLG's API returns a JSON file for the item(s) you're looking for with all of the
associated metadata and, if copyright permits, a link to the direct file. *Depending
on the copyright and the host of the item, the API may return a blank field instead
of a direct link. In that case, this program supplies the item's thumbnail link so that
Omeka's plugin will not skip the item when trying to read it in.*
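
As a rough illustration, here is a minimal sketch of one API call (the query URL and the `dcterms_title` field name are assumptions for illustration; the `response` → `docs`/`pages` structure is the one `dlg_json2csv.py` reads):

```python
# Minimal sketch of one DLG API call (hypothetical query URL).
import requests

api_url = 'https://dlg.usg.edu/search.json?q=example'  # hypothetical
json_dict = requests.get(api_url).json()

print(json_dict['response']['pages']['total_pages'])  # pagination info used by the script
for item in json_dict['response']['docs']:            # one metadata dict per DLG item
    print(item.get('dcterms_title'))                  # field name is an assumption
```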

The specifics about Omeka's CSVImport are in the `DLG_Omeka_API_Pipeline.pdf` file.

### Other Files
* DLG_Mapping.\*: If you want to make any changes to the column headers in your output CSV, either update this CSV or create your own and use the `--mapping` argument.
* sample_urls.txt: A sample file that will successfully run through the program. Each of the three URLs is a different case, illustrating that the program can handle any type of URL from the DLG website (besides https://dlg.usg.edu itself, obviously).

### Dependencies
Python 3+:
1. pandas v 0.25.1+
2. requests v 2.22.0+

The rest of the modules used come preinstalled with Python.
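If they are missing, both can be installed with `pip install pandas requests`.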


### How to Run
This program is run from the command line, so you will need to open a command
prompt in this folder and run the following command:

`python dlg_json2csv.py --input <txt file> --output <name of output csv>`

Lastly, the command line arguments:
* `--input`: REQUIRED. The txt file that contains the URL(s) to be parsed. Make sure there are no line breaks (new lines) inside a URL; there must be exactly one URL per line.
* `--output`: REQUIRED. The name of the output CSV the URLs' data will be added to.
* `--encode`: [Default: utf-8] Use this to change the encoding of the output CSV.
* `--mapping`: [Default: DLG_Mapping.csv] The CSV that contains the column mapping, used to rename the columns from DLG's field names.

For a description of these options from the script itself, run `python dlg_json2csv.py --help`.
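
For example, to run the script on the included sample file, writing the output to an arbitrarily named `dlg_items.csv`:

`python dlg_json2csv.py --input sample_urls.txt --output dlg_items.csv`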
37 changes: 19 additions & 18 deletions README.md
@@ -1,4 +1,6 @@
-This program was created to be an intermediate step to pull item(s) from the [Digital Library of Georgia's](https://dlg.usg.edu) (DLG) API and compile it into a CSV file. This script is specifically used to import the CSV with the CSVImport plugin. Please take a look at the word document `DLG_Omeka_API_Pipeline.pdf` because it explains the entire pipeline. That should give you a good feel for the task this script completes. This allows you to have the data for each item but not have to store each item on your own. You will be pulling these from DLG.
+# DLG API Parser
+
+This program was created to be an intermediate step to pull item(s) from the [Digital Library of Georgia's](https://dlg.usg.edu) (DLG) API and compile it into a CSV file. This script is specifically used to import the CSV with the CSVImport plugin, although it can be adapted for other purposes by changing the mapping csv. Please take a look at the word document `DLG_Omeka_API_Pipeline.pdf` because it explains the entire pipeline. That should give you a good feel for the task this script completes. This allows you to have the data for each item but not have to store each item on your own. You will be pulling these from DLG.
 
 
 ### A Brief Description
@@ -11,27 +13,26 @@ Omeka's plugin will not skip it when trying to read it in.*
 The specifics about Omeka's CSVImport will be in `DLG_Omeka_API_Pipeline.pdf` file.
 
 ### Other Files
-* DLG_Mapping.\* If you want to make any changes to the column headers in your output csv, either update this csv or create your own and use the `--mapping` argument.
-* sample_urls.txt is just a sample file that will succesfully run thorugh the program. Each of the three URLs are of different cases, illustrating that it can handle any type of URL from the DLG website. (Besides https://dlg.usg.edu obviously.)
-
-### Dependencies
-python 3+:
-1. pandas v 0.25.1+
-2. requests 2.22.0+
-
-The rest come preinstalled with python.
-
-
-### How to Run
-This program is ran from the command line, thus you will need to move the command
-prompt to this folder and run the following command:
-
-`python dlg_json2list.py --input <txt file> --output <name of output csv>`
-
-Lastly, the command line arguments:
-* `--input`: REQUIRED. The txt file that contains the url(s) to be parsed. Please make sure that you do not put any line breaks (or new lines) inside the url. There needs to be one url per line.
-* `--output`: REQUIRED. The name of the output csv you want these URL's to be added.
-* `--encode`: [Default: utf-8] If you want to change the encoding of the csv.
-* `--mapping`: [Default: DLG_Mapping.csv] The csv that contains the column mapping to change the column names of the csv instead of naming them what DLG names them.
-
-To get a description, just run `python dlg_json2csv.py --help` for a similar description.
+* **Command Line Instructions**: The script was originally designed to run via the command line and can still be operated that way instead of using the Windows executable for any who prefer the command line or are working in a Mac environment.
+
+* **DLG_Mapping.csv**: Indicates the fields from the DLG JSON that should be included in the CSV and what the columns should be named. If you want to make any changes to the column headers in your output CSV, either update this CSV or create your own.
+
+* **DLG_Mapping.xls**: Provides details about each field in the DLG JSON.
+
+* **DLG_Omeka_API_Pipeline**: A complete workflow using this script to export information from DLG about selected images and import it into Omeka for creating digital exhibits. The Word and PDF versions are the same information.
+
+* **sample_urls.txt** is just a sample file that will successfully run through the program. Each of the three URLs are of different cases, illustrating that it can handle any type of URL from the DLG website. (Besides https://dlg.usg.edu)
+
+### How to Run
+1. Download the executable and save it to your local machine.
+2. Save a copy of DLG_Mapping.csv or the mapping CSV you want to use in the same folder as the executable.
+3. Double-click on the executable to start the program.
+4. Enter the required information into the program.
+   * Path to file with DLG URLs: the text file with the URLs you wish to include in the CSV
+   * Folder to save output: any folder on your local machine, where the CSV and the script log are saved
+   * Name for the output CSV: whatever name the output CSV should have. You may include the file extension (.csv) or have the script add it.
+5. Click Submit.
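
The mapping CSV described above drives the column renaming; here is a hypothetical sketch of that idea with pandas (the `DLG`/`Omeka` headers and the sample pair are placeholders, not the real file's contents):

```python
# Hypothetical sketch: rename columns using a two-column mapping.
# In the real workflow the pairs would come from DLG_Mapping.csv.
import pandas as pd

mapping_df = pd.DataFrame({'DLG': ['dcterms_title'], 'Omeka': ['Title']})  # placeholder pairs
mapping = dict(zip(mapping_df['DLG'], mapping_df['Omeka']))

df = pd.DataFrame([{'dcterms_title': 'Example item'}])  # stand-in for the DLG data
print(df.rename(columns=mapping))  # the column is renamed to 'Title'
```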
71 changes: 28 additions & 43 deletions dlg_json2csv.py
@@ -59,56 +59,44 @@ def dlg_json2list(url_list):
         #every page. This entire else statement handles that.
 
         total_pages = json_dict['response']['pages']['total_pages']
-        current_page = json_dict['response']['pages']['current_page']
-        next_page = json_dict['response']['pages']['next_page']
-
-        while True:
-            #This loop will add each dictionary to the list and the prepare the
-            #next url for the next iteration
-
-            for dict in json_dict['response']['docs']:
-                list_json.append(dict)
-
-            next_page_str = 'page=' + str(next_page)
-
-            #changing the page number in the search results
-            if type(re.search('page=\d+',api_url)) == re.Match:
-                api_url = re.sub('page=\d+',next_page_str,api_url)
-            else:
-                #Should only be entered the first iteration, the remaining links
-                #should already contain 'page=\d' from previous iteration
-                next_page_str = '?' + next_page_str + '&'
-                api_url = re.sub('\?',next_page_str,api_url)
-
-            #Grabbing the response and json
-
-            try:
-                #Just continuing to check for a potential error.
-                response = requests.get(api_url)
-                json_dict = response.json()
-            except:
-                print('Something happened on page {} of this URL: {}'.format(next_page+1, re.sub('\.json','',api_url)))
-
-            #Updating variables
-            current_page = json_dict['response']['pages']['current_page']
-            next_page = json_dict['response']['pages']['next_page']
-
-            '''
-            This is the contiton that will end the while loop. So once current_page
-            is the same at total_pages, grab the last amount of dictionaries and
-            break the loop
-            '''
-            if current_page == total_pages:
-                for dict in json_dict['response']['docs']:
-                    list_json.append(dict)
-                break
+
+        # Saves the results from the first page of the API call to the list.
+        for item in json_dict['response']['docs']:
+            list_json.append(item)
+
+        # If there are multiple pages, calculates the api_url for all the other pages and adds them to the list.
+        # Stops when the total number of pages is reached.
+        if total_pages > 1:
+
+            # Range produces a sequence of numbers from 2 - last page number.
+            for page in range(2, total_pages + 1):
+
+                # Create the api_url for the next page.
+                page_str = 'page=' + str(page)
+                if type(re.search(r'page=\d+', api_url)) == re.Match:
+                    api_url = re.sub(r'page=\d+', page_str, api_url)
+                else:
+                    # For the first iteration, which doesn't have 'page=\d' yet.
+                    page_str = '?' + page_str + '&'
+                    api_url = re.sub(r'\?', page_str, api_url)
+
+                # Grabbing the response and JSON for the new api_url.
+                try:
+                    response = requests.get(api_url)
+                    json_dict = response.json()
+                except:
+                    print('Something happened on page {} of this URL: {}'.format(page, re.sub('\.json', '', api_url)))
+
+                # Saves the response to the list.
+                for item in json_dict['response']['docs']:
+                    list_json.append(item)
 
     #Error Check. list_json should have 1 or more items inside. otherwise exit.
     if len(list_json) < 1:
         print('Was not able to grab any of the URLs. Please check them.')
         sys.exit()
 
 
     '''
     This loop with iterate through each item of list_json to convert each
     item into a string so when creating the CSV, the excess qoutation marks and
@@ -144,7 +132,6 @@ def dlg_json2list(url_list):
                 item[key] = requests.get(item[key]).url
             except:
                 print(item[key])
-
     return list_json
 
 if __name__ == '__main__':
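
The `requests.get(item[key]).url` pattern in the context above works because requests follows redirects by default, so `.url` on the response object is the final address. A minimal sketch (the URL is just an example):

```python
# requests follows redirects by default; response.url is the final address.
import requests

resp = requests.get('https://dlg.usg.edu')  # any URL that may redirect
print(resp.url)                             # the post-redirect URL
```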
@@ -177,10 +164,8 @@ def dlg_json2list(url_list):
     for line in dlg_urls:
         url_list.append(line.strip())
 
-
     #Grabbing the complete list of jsons from the provided URLs
     list_json = dlg_json2list(url_list)
-
     df = pd.DataFrame.from_dict(list_json)
 
     #Initalizing the DLG Mapping dict
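
The URL rewriting in the new pagination code is easy to miss inside the diff; here is a condensed, runnable sketch of the same `page=` substitution (the query URL is hypothetical):

```python
# Condensed illustration of the page-URL rewriting in dlg_json2list.
import re

def set_page(api_url, page):
    """Return api_url with its 'page' query parameter set to the given number."""
    page_str = 'page=' + str(page)
    if re.search(r'page=\d+', api_url):
        return re.sub(r'page=\d+', page_str, api_url)
    # First rewrite: no 'page=' yet, so insert it right after the '?'.
    return re.sub(r'\?', '?' + page_str + '&', api_url, count=1)

url = 'https://dlg.usg.edu/search.json?q=athens'  # hypothetical
print(set_page(url, 2))               # ...search.json?page=2&q=athens
print(set_page(set_page(url, 2), 3))  # ...search.json?page=3&q=athens
```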
