These scripts use Python to search for terms across multiple documentation repositories. (Repositories are assumed to use the metadata formats for docs.microsoft.com.)
- Make sure you have Python 3 installed. Download it from https://www.python.org/downloads.
- Run `pip install -r requirements.txt` to install the needed libraries. (If you want to use a virtual environment instead of your global environment, run `python -m venv .env` then `.env\scripts\activate` before running `pip install`.)
Inventories are driven by a JSON configuration file. This repo contains a few example configurations in `config.json`, `config_python.json`, and `config_js.json`. You can create additional files as necessary.
- Specify the repos you want to search in the `content` collection of the config file. For each element:
  - `repo` is a name for the repo (by convention, we use the GitHub org/repo name).
  - `path` is the location of the cloned repo on your local computer. Leave `path` blank to skip the repo.
  - `url` is the base URL for the published articles of the docset. The `url` is used to auto-generate full URLs in the output files.
  - `exclude_folders` is a collection of folder names to omit from the inventory, such as `includes` folders and other folders that aren't actively maintained (such as `vs2015` in the Visual Studio repo).
- In the `inventory` section, specify distinct inventories, each of which generates a separate set of inventory files.
  - `name` is a case-insensitive name for the inventory. NOTE: don't use spaces, hyphens, or any other character that's not allowed in a filename. We recommend using only letters and numbers.
  - `terms` is an array of Python regular expressions to use as search terms.
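Putting these pieces together, a minimal configuration might look like the following. The repo name, path, URL, and search terms here are illustrative placeholders; see the example config files in this repo for the authoritative shape.

```json
{
  "content": [
    {
      "repo": "MicrosoftDocs/sample-docs",
      "path": "C:\\git\\sample-docs",
      "url": "https://docs.microsoft.com/sample",
      "exclude_folders": ["includes", "vs2015"]
    }
  ],
  "inventory": [
    {
      "name": "Accessibility01",
      "terms": ["screen reader", "accessib\\w+"]
    }
  ]
}
```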
- By default, the script saves results in an `InventoryData` folder. You can customize this folder by setting the `INVENTORY_RESULTS_FOLDER` environment variable.
- At a command prompt, run `python take_inventory.py --config <config-file>`. Omitting `--config <config-file>` defaults to `config.json`.
- When the script is complete, you'll see four files in the results folder for each inventory in the config file:
  - `<name>_<date>_<sequential_int>.csv` contains one line per search term instance.
  - `<name>_<date>_<sequential_int>-metadata.csv`, generated by `extract_metadata.py` (run automatically from `take_inventory.py`), adds various metadata values extracted from the source files to the results.
  - `<name>_<date>_<sequential_int>-consolidated.csv`, generated by `consolidate.py` (also run automatically), collapses the output from `extract_metadata.py` into one line per file, with a count column for each term and a count column for each classification tag where the term is found.
  - `<name>_<date>_<sequential_int>-scored.csv`, generated by `score.py` (also run automatically), applies a scoring algorithm to the output from `consolidate.py`; see `score.py` for the details. The script adds a single "score" column to the new output file and automatically omits any file with a score of zero. The result is a file that lists the "articles of interest" for the inventory in question.
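Because the scored output is a plain CSV, you can post-process it with the standard library. This sketch assumes only that the file has a header row containing a "score" column, as described above; the other columns vary by inventory.

```python
import csv

def top_scored(path, n=10):
    """Return the n highest-scoring rows from a *-scored.csv file.

    Assumes a header row with a "score" column (the column score.py adds);
    the remaining columns depend on the inventory's metadata.
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Sort descending by numeric score; zero-score files are already omitted.
    return sorted(rows, key=lambda r: float(r["score"]), reverse=True)[:n]
```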
The `<sequential_int>` value starts at 0001 and is incremented each time you run the script on the same day, so subsequent runs on the same day produce distinct output.
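The naming convention above can be sketched as follows. This is illustrative only: the date format shown is an assumption, and the actual logic lives in `take_inventory.py`.

```python
from datetime import date
from pathlib import Path

def next_output_name(folder, inventory_name):
    """Sketch of the <name>_<date>_<sequential_int>.csv convention.

    The sequence number is zero-padded to four digits, starts at 0001,
    and increments while a file for the same inventory and day exists.
    (The real date format used by take_inventory.py may differ.)
    """
    today = date.today().isoformat()  # assumed date format
    seq = 1
    while (Path(folder) / f"{inventory_name}_{today}_{seq:04d}.csv").exists():
        seq += 1
    return f"{inventory_name}_{today}_{seq:04d}.csv"
```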