Analysis_main.ipynb: It loads preprocessed paper information from a pickle file, calculates various statistics for each search term (including sample count, date range, token and type counts, and type-token ratio), and creates a line graph showing the number of papers published per year for each search term. The visualization covers papers from 1900 to 2023, allowing for comparison of publication trends across different medical topics over time.
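A minimal sketch of the per-term statistics and per-year plot described above; the pickle filename and the column names (`search_term`, `year`, `abstract`) are assumptions, not the notebook's exact layout.

```python
import pickle
import matplotlib.pyplot as plt

with open("papers_preprocessed.pkl", "rb") as f:   # hypothetical filename
    papers = pickle.load(f)                        # assumed: a pandas DataFrame

for term, group in papers.groupby("search_term"):
    tokens = " ".join(group["abstract"]).split()
    types = set(tokens)
    print(f"{term}: n={len(group)}, years={group['year'].min()}-{group['year'].max()}, "
          f"tokens={len(tokens)}, types={len(types)}, TTR={len(types) / len(tokens):.3f}")
    # One line per search term: papers published per year.
    counts = group["year"].value_counts().sort_index()
    plt.plot(counts.index, counts.values, label=term)

plt.xlim(1900, 2023)
plt.xlabel("Year")
plt.ylabel("Papers published")
plt.legend()
plt.show()
```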
ArxivAPI.py: It reads search terms from a file, creates an SQLite database, queries the arXiv API for each term, processes the XML responses to extract paper details, and inserts the data into the database. The script handles both single and multi-word queries, avoids duplicates, and continues until reaching a maximum number of results or processing all available papers. It automates the collection of research paper metadata from arXiv based on specified search terms, creating a local database for further analysis.
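A hedged sketch of the query-and-insert loop; the table schema, page size, and the exact handling of SearchWord.txt are assumptions.

```python
import sqlite3
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

conn = sqlite3.connect("arxiv.db")  # hypothetical database name
conn.execute("CREATE TABLE IF NOT EXISTS papers "
             "(id TEXT PRIMARY KEY, title TEXT, summary TEXT, published TEXT, term TEXT)")

def fetch(term, start=0, max_results=100):
    # Multi-word terms are quoted so arXiv treats them as a phrase.
    query = f'all:"{term}"' if " " in term else f"all:{term}"
    url = ("http://export.arxiv.org/api/query?"
           + urllib.parse.urlencode({"search_query": query, "start": start,
                                     "max_results": max_results}))
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

for term in (line.strip() for line in open("SearchWord.txt")):
    if not term:
        continue
    root = fetch(term)
    for entry in root.findall(ATOM + "entry"):
        row = (entry.findtext(ATOM + "id"), entry.findtext(ATOM + "title"),
               entry.findtext(ATOM + "summary"), entry.findtext(ATOM + "published"), term)
        # INSERT OR IGNORE skips duplicates on the primary key.
        conn.execute("INSERT OR IGNORE INTO papers VALUES (?, ?, ?, ?, ?)", row)
conn.commit()
```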
DBviewing.ipynb: It connects to SQLite databases, retrieves information about tables and their contents, and performs various operations such as merging databases, filtering data based on specific search words, and exporting the results to CSV files. The script examines table structures, counts rows, and displays sample data from both PubMed articles and arXiv papers. It then creates a new filtered database with selected search terms and exports the filtered data to CSV format for further analysis.
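A rough sketch of the inspect/merge/filter/export pattern; the database, table, and column names here are placeholders.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("pubmed.db")
conn.execute("ATTACH DATABASE 'arxiv.db' AS arxiv")  # merge-style access to a second database

# Inspect table structure and row counts.
print(pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn))
print(conn.execute("SELECT COUNT(*) FROM articles").fetchone())

# Filter by a search word and export the result to CSV.
df = pd.read_sql("SELECT * FROM articles WHERE abstract LIKE ?", conn, params=["%autism%"])
df.to_csv("filtered_articles.csv", index=False)
```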
Human_labelset.py: This code prepares data for an annotation task comparing medical research abstracts before and after 2005 for various target words. It loads preprocessed data, filters it by target word, splits it into pre-2005 and post-2005 subsets, randomly samples 10 abstracts from each period, and combines them into a single dataframe. The process is repeated for each target word, creating a dataset that pairs older and newer abstracts for comparison. The resulting data is saved as a CSV file, enabling analysis of how research focus and language use in medical abstracts have changed over time.
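A minimal sketch of the pre/post-2005 pairing; the file names, column names, and random seed are assumptions.

```python
import pandas as pd

data = pd.read_pickle("papers_preprocessed.pkl")        # assumed: DataFrame with 'term', 'year', 'abstract'
target_words = [w.strip() for w in open("SearchWord.txt") if w.strip()]

samples = []
for word in target_words:
    subset = data[data["term"] == word]
    pre = subset[subset["year"] < 2005].sample(10, random_state=0)    # 10 older abstracts
    post = subset[subset["year"] >= 2005].sample(10, random_state=0)  # 10 newer abstracts
    samples.append(pd.concat([pre, post]).assign(target_word=word))

pd.concat(samples).to_csv("human_labelset.csv", index=False)
```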
KGExtractor.py: This code extracts MeSH (Medical Subject Headings) terms from a SQLite database, processes them, and then uses these terms to query Wikidata for related information. It performs entity searches using Wikidata's API, fetches RDF triples for matched entities, and uses SPARQL queries to retrieve specific properties and values for each entity. The results are stored in a dictionary and periodically saved to a JSON file. The process includes error handling, rate limiting, and batch processing to manage large datasets efficiently. The final output is a comprehensive JSON file containing Wikidata information for the medical terms, which can be used for further analysis or data enrichment.
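A sketch of the Wikidata lookup loop under assumed inputs; the MeSH terms, batching, output filename, and saved JSON layout are simplified placeholders.

```python
import json
import time
import requests

WD_API = "https://www.wikidata.org/w/api.php"
SPARQL = "https://query.wikidata.org/sparql"

def search_entity(term):
    # Entity search via the wbsearchentities API action.
    params = {"action": "wbsearchentities", "search": term,
              "language": "en", "format": "json"}
    hits = requests.get(WD_API, params=params).json()["search"]
    return hits[0]["id"] if hits else None

def fetch_properties(qid):
    # SPARQL query for properties and values of the matched entity.
    query = f"SELECT ?prop ?value WHERE {{ wd:{qid} ?prop ?value }} LIMIT 100"
    resp = requests.get(SPARQL, params={"query": query, "format": "json"})
    return resp.json()["results"]["bindings"]

results = {}
for term in ["autism", "dementia"]:          # placeholder MeSH terms
    qid = search_entity(term)
    if qid:
        results[term] = fetch_properties(qid)
    time.sleep(1)                            # simple rate limiting

with open("wikidata_mesh.json", "w") as f:
    json.dump(results, f)
```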
LLMlabeling_7B.py: This code uses the Llama-2-7b-chat-hf model to analyze semantic drift in medical terms. It loads preprocessed data for various target words, samples abstracts from before and after 2005, and generates prompts for each pair. The Llama model then analyzes these prompts to determine if the meaning of the target word has changed over time. The results, including the model's answers, are compiled into a DataFrame and saved as a CSV file. This process automates the detection of semantic changes in medical terminology across different time periods.
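A hedged sketch of the prompting loop; the exact prompt wording, sample pairing, and generation parameters used by the script are assumptions.

```python
import pandas as pd
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf",
                     device_map="auto")

def build_prompt(word, old_abstract, new_abstract):
    # Hypothetical prompt: the real script's wording may differ.
    return (f"Compare the use of the term '{word}' in the two abstracts below "
            f"(pre-2005 vs. post-2005) and state whether its meaning has changed.\n\n"
            f"Pre-2005: {old_abstract}\n\nPost-2005: {new_abstract}\n\nAnswer:")

# Placeholder pairs; in the script these come from the sampled pre/post-2005 abstracts.
pairs = [("autism", "placeholder pre-2005 abstract", "placeholder post-2005 abstract")]

rows = []
for word, old_abs, new_abs in pairs:
    answer = generator(build_prompt(word, old_abs, new_abs),
                       max_new_tokens=100)[0]["generated_text"]
    rows.append({"word": word, "answer": answer})

pd.DataFrame(rows).to_csv("llm_labels_7b.csv", index=False)
```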
PubMedFiltering.py: This code performs data processing and sampling on medical research databases. It includes functions for sampling data from SQLite databases, parsing keywords from a file, filtering articles based on keywords, and counting keyword occurrences. The script can sample data from PubMed and arXiv databases, create filtered datasets, and generate keyword count statistics. It uses SQLite for database operations and CSV for output, with progress tracking for long-running processes. The main functionality allows for flexible extraction and analysis of medical research articles.
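A simplified sketch of the sampling and keyword-count helpers; the table and column names are placeholders.

```python
import csv
import sqlite3

def sample_rows(db_path, table, n=1000):
    # Random sample of rows from a SQLite table.
    conn = sqlite3.connect(db_path)
    return conn.execute(f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?", (n,)).fetchall()

def parse_keywords(path):
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

def keyword_counts(db_path, keywords):
    # Count how many articles mention each keyword in the abstract.
    conn = sqlite3.connect(db_path)
    return {kw: conn.execute("SELECT COUNT(*) FROM articles WHERE LOWER(abstract) LIKE ?",
                             (f"%{kw}%",)).fetchone()[0]
            for kw in keywords}

keywords = parse_keywords("SearchWord.txt")
with open("keyword_counts.csv", "w", newline="") as f:
    csv.writer(f).writerows(keyword_counts("pubmed.db", keywords).items())
```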
PubMedFTPLoad.py: This code downloads the PubMed baseline dataset from the NCBI FTP server. It connects to the server, navigates to the PubMed baseline directory, and retrieves a list of files. It then creates a local directory named "PubMed" if it doesn't exist. The script downloads each file in binary mode, skipping files that already exist locally. It uses a progress bar to show the download progress. After downloading all files, it closes the FTP connection. This script efficiently automates the process of retrieving large-scale medical research data for local analysis.
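A sketch of the FTP download loop; the progress tracking is simplified to one bar over the file list, and the local directory handling follows the summary above.

```python
import os
from ftplib import FTP
from tqdm import tqdm

ftp = FTP("ftp.ncbi.nlm.nih.gov")
ftp.login()                                  # anonymous login
ftp.cwd("/pubmed/baseline")

os.makedirs("PubMed", exist_ok=True)         # local target directory
for name in tqdm(ftp.nlst()):
    local_path = os.path.join("PubMed", name)
    if os.path.exists(local_path):           # skip files already downloaded
        continue
    with open(local_path, "wb") as f:
        ftp.retrbinary(f"RETR {name}", f.write)
ftp.quit()
```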
PubMedReading.py: This code processes PubMed XML data files and stores the extracted information in a SQLite database. It reads gzipped XML files, parses them with ElementTree to extract article details (PMID, title, abstract, publication year, and MeSH terms), and inserts the records into the database. The script handles large files efficiently, processing every file in a directory and updating the database as it goes, with a progress bar tracking completion. This automates the conversion of raw PubMed data into a structured database format, making medical research information easier to access and analyze.
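A rough sketch of the XML-to-SQLite step; the table schema and the exact element paths used by the script are assumptions based on the description.

```python
import glob
import gzip
import sqlite3
import xml.etree.ElementTree as ET
from tqdm import tqdm

conn = sqlite3.connect("pubmed.db")
conn.execute("""CREATE TABLE IF NOT EXISTS articles
                (pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT, year TEXT, mesh TEXT)""")

for path in tqdm(glob.glob("PubMed/*.xml.gz")):
    with gzip.open(path, "rb") as f:
        root = ET.parse(f).getroot()
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        title = article.findtext(".//ArticleTitle")
        abstract = article.findtext(".//AbstractText")
        year = article.findtext(".//PubDate/Year")
        mesh = ";".join(m.text for m in article.iter("DescriptorName"))
        conn.execute("INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?, ?)",
                     (pmid, title, abstract, year, mesh))
    conn.commit()
```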
RoBERTaFinetuning.py (Optional code): This code fine-tunes a RoBERTa language model on medical abstract data for each year from 1970 onwards. It uses a custom dataset class to handle the abstracts, processes the data year by year, and trains a separate model for each year. The training process includes masked language modeling, with 15% of tokens randomly masked. After each year's training, the model is saved and memory is cleared to prepare for the next year. This approach allows for the analysis of language changes in medical abstracts over time by creating year-specific models.
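A compressed sketch of the per-year masked-language-model training loop; the data loader, hyperparameters, and output paths are placeholders.

```python
import torch
from torch.utils.data import Dataset
from transformers import (RobertaForMaskedLM, RobertaTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

class AbstractDataset(Dataset):
    """Wraps a list of abstract strings as tokenized examples."""
    def __init__(self, abstracts, tokenizer):
        self.enc = tokenizer(abstracts, truncation=True, max_length=256)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}

def load_abstracts_for_year(year):
    # Hypothetical loader: the real script would pull abstracts from the database.
    return [f"placeholder abstract for {year}"] * 32

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
# 15% of tokens are randomly masked for masked language modeling.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

for year in range(1970, 2024):
    model = RobertaForMaskedLM.from_pretrained("roberta-base")
    args = TrainingArguments(output_dir=f"roberta_{year}", num_train_epochs=1,
                             per_device_train_batch_size=16, report_to="none")
    Trainer(model=model, args=args, data_collator=collator,
            train_dataset=AbstractDataset(load_abstracts_for_year(year), tokenizer)).train()
    model.save_pretrained(f"roberta_{year}")   # one year-specific model per iteration
    del model
    torch.cuda.empty_cache()                   # clear memory before the next year
```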
SearchWord.txt: The list of selected disability-related search terms.