news_extract

A Python module for extracting articles from NexisUni and Factiva export files.

Requirements

Python 3
pandas

Installation

pip install news_extract

Overview

news_extract imports the output of the NexisUni and Factiva databases into Python. Note that you must export your documents manually first! This module does not scrape the databases directly; rather, it extracts articles and associated metadata from pre-exported output files. To use it, you need a subscription to at least one of these databases, and you must export your articles from each database using the following instructions:

NexisUni export instructions

Make sure you are exporting full documents with no attachments, not just the results list.
Export in RTF format. (Note: you can export up to 100 articles at a time if you create an individual NexisUni account and change your personal settings accordingly.)
Save documents in a single file.
Uncheck all options on the "Formatting Options" tab.

Factiva export instructions

For Factiva, you must export your documents using the Firefox browser.
After conducting your search, click the "View Selected Articles" button that looks like an eye.
On the right, click the "Display Options" text and select "Full Article/Report plus Indexing."
Click the "Format for Saving" button that looks like a 3.5" floppy disk and select "Article Format."
On the resulting page, select "Save Page As..." from the Firefox menu.
In the "Save as type" dropdown, select "Text Files" and save your file.
This animated GIF shows how to do steps 2-4 (warning: the GIF is in French).

Once you've exported your file(s), you can do the following:

import news_extract as ne
nu_file = 'results1.rtf' #file exported from NexisUni
fc_file = 'results2.txt' #file exported from Factiva
nu_data = ne.nexis_rtf_extract(nu_file)
fc_data = ne.factiva_extract(fc_file)

print(nu_data[0].keys()) #view field names for first NexisUni article
print(fc_data[0].keys()) #view field names for first Factiva article

for i in nu_data:
    print(i['HEADLINE']) #show all NexisUni headlines
for i in fc_data:
    print(i['HD']) #show all Factiva headlines

Output

Both nexis_rtf_extract and factiva_extract return lists of dicts wherein each dict corresponds to an article. The dict keys are field names, while the dict values are the metadata. One major difference between the two functions is that nexis_rtf_extract outputs the same set of metadata for all articles, while factiva_extract auto-extracts the specific field names and values attached to each article. This is due to differences in how the two types of export files are formatted.
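
Because Factiva articles may not all carry the same fields, looking up a key directly can raise a KeyError for some articles. The following is a minimal sketch of collecting every field name and reading a field safely. It uses hand-made sample dicts in place of real factiva_extract output; the 'BY' field, the values, and the helper names all_field_names and field_values are illustrative assumptions, not part of news_extract.

```python
# Hand-made stand-in for factiva_extract output; real field names and values may differ.
sample_fc_data = [
    {'HD': 'First headline', 'BY': 'A. Reporter'},
    {'HD': 'Second headline'},  # this article has no 'BY' field
]

def all_field_names(articles):
    """Return the sorted union of keys across a list of article dicts."""
    names = set()
    for article in articles:
        names.update(article.keys())
    return sorted(names)

def field_values(articles, field, default=None):
    """Read one field from every article, substituting default where it is absent."""
    return [article.get(field, default) for article in articles]

fields = all_field_names(sample_fc_data)
bylines = field_values(sample_fc_data, 'BY', default='')
print(fields)
print(bylines)
```

The same helpers work on NexisUni output, where every article is expected to share one set of keys.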

Combining Factiva and NexisUni output

Converting fieldnames

You can use the function fix_fac_fieldnames to convert Factiva fieldnames to their longer and more descriptive NexisUni equivalents like so:

#note: this converts only eight common field names and leaves the rest intact
fc_converted = ne.fix_fac_fieldnames(fc_data)
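
To illustrate the general idea of such a conversion, here is a minimal sketch of renaming keys across a list of dicts. It is not the module's implementation: rename_fields and hd_mapping are hypothetical, and only the HD to HEADLINE pair is taken from the headline examples earlier in this document. The other field names that fix_fac_fieldnames converts are not listed here, so they are not guessed at.

```python
def rename_fields(articles, mapping):
    """Return new article dicts with keys renamed per mapping; other keys are kept as-is."""
    return [
        {mapping.get(key, key): value for key, value in article.items()}
        for article in articles
    ]

# Hypothetical one-entry mapping; 'HD' and 'HEADLINE' are the headline fields used earlier.
hd_mapping = {'HD': 'HEADLINE'}

# 'XX' is a made-up field name that the mapping does not cover.
sample = [{'HD': 'Example headline', 'XX': 'untouched'}]
renamed = rename_fields(sample, hd_mapping)
print(renamed)
```

The sketch builds new dicts rather than editing the input in place, so the original list stays available for comparison.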

Merging Factiva and NexisUni data into a single Pandas variable

If you want to analyze data from NexisUni and Factiva in the same project, here's how to do it:

nu_plus_fc = nu_data + fc_converted
combined = ne.news_export(nu_plus_fc)

The news_export function performs several operations, including removing duplicates (using a custom algorithm based on the Jaccard coefficient and time of publication) and resolving conflicts between articles with different metadata fields. For the latter, the function by default attempts to export all fields included in at least half the articles. This proportion can be adjusted using the field_threshold parameter, which accepts values between 0 and 1. A value of 0 will attempt to include every metadata field present in at least one article, while a value of 1 will include only those fields present in all articles.
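
The two ideas in this paragraph can be sketched in a few lines. This is an illustration under stated assumptions, not the code news_export runs: the jaccard helper compares plain lowercase word sets and ignores time of publication, fields_meeting_threshold is a hypothetical name, and the sample field names are made up. Only the parameter name field_threshold and its 0-to-1 meaning come from the description above.

```python
def jaccard(text_a, text_b):
    """Jaccard coefficient of two texts' lowercase word sets: shared words / all words."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # both texts empty: treat similarity as 0 to avoid dividing by zero
    return len(words_a & words_b) / len(union)

def fields_meeting_threshold(articles, field_threshold=0.5):
    """Return the fields present in at least field_threshold (a proportion) of articles."""
    if not articles:
        return []
    counts = {}
    for article in articles:
        for key in article:
            counts[key] = counts.get(key, 0) + 1
    total = len(articles)
    return sorted(k for k, n in counts.items() if n / total >= field_threshold)

# Made-up sample: HEADLINE is in 4 of 4 articles, BYLINE in 2 of 4, SECTION in 1 of 4.
sample = [
    {'HEADLINE': 'a', 'BYLINE': 'x', 'SECTION': 's'},
    {'HEADLINE': 'b', 'BYLINE': 'y'},
    {'HEADLINE': 'c'},
    {'HEADLINE': 'd'},
]

similarity = jaccard('the cat sat', 'the cat ran')
default_fields = fields_meeting_threshold(sample)
every_field = fields_meeting_threshold(sample, field_threshold=0)
shared_fields = fields_meeting_threshold(sample, field_threshold=1)
print(similarity, default_fields, every_field, shared_fields)
```

In this sketch a field at exactly the threshold proportion is kept, which matches the "at least half" wording for the default.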

By default, news_export returns a Pandas DataFrame containing the output data. You can save individual JSON files to disk (i.e. one article per file) by setting the to_pandas parameter to False.
