A project to build and maintain a comprehensive listing of the public websites of the U.S. federal government.

GSA/federal-website-index

Federal Website Index

The goal of this project is to assemble an accurate, up-to-date list of the public .gov websites of the federal government. There are many sources to consider, so this repository explains the process used and references the source datasets. This effort is part of the Site Scanning program.

The end product, a Federal Website Index, can be found here and is automatically updated every week on Wednesday at 6pm ET. It is then used by the Site Scanning program to serve as its list of Target URLs.

Background

Virtually all of the ~300 agencies that make up the US federal government maintain one or more websites (e.g. www.state.gov, space.commerce.gov). The .gov registry publishes which .gov domains exist and which agency operates each one, but that covers only domains (e.g. state.gov, commerce.gov), not the websites hosted on them. Each domain may have hundreds of distinct websites (e.g. statecollection.census.gov and opportunity.census.gov, each of which is a different website from www.census.gov). This project tries to assemble a comprehensive list of all distinct federal websites available to the public.

Caveats

  • The full extent of federal websites includes not just .gov sites, but also .mil websites, a small number of .fed.us websites, and some number of .com/.org/.net/etc. websites. For practical purposes, this project does not currently include those. While it is difficult to quantify the number of federal websites on those other domains, we're able to approximate their scale and do know that .gov websites make up the vast majority of federal websites. Therefore, with the caveat of not including .mil/.fed.us/.com/.org/etc. websites, this project should offer a largely complete view of the websites operated by the federal government.
  • There are also many .gov domains and websites used by state, tribal, and local governments and some are included in our source data. This project excludes those by using the list of federal .gov domains as a canonical list of the .gov domains (and thus websites) that are operated by the federal government. We then remove websites from our index if they have a base domain that is not on that list of federal .gov domains.
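The base-domain check described in the second caveat can be illustrated with a short Python sketch. This is only an approximation under stated assumptions, not the project's actual code: the function names `base_domain` and `keep_federal` are hypothetical, the sample data is invented, and the sketch assumes that the base domain of a .gov hostname is its last two labels.

```python
def base_domain(hostname: str) -> str:
    """Return the last two labels of a hostname, e.g. 'census.gov'.

    Assumption: every .gov base domain is registered directly under .gov.
    """
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def keep_federal(websites, federal_domains):
    """Keep only websites whose base domain is on the federal list."""
    federal = {d.strip().lower() for d in federal_domains}
    return [site for site in websites if base_domain(site) in federal]


# Hypothetical sample data, for illustration only.
federal_domains = ["census.gov", "state.gov"]
websites = ["www.census.gov", "opportunity.census.gov", "www.example-city.gov"]

# The non-federal www.example-city.gov entry is dropped.
print(keep_federal(websites, federal_domains))
```

Holding the federal domains in a set keeps each lookup constant-time, which matters when the candidate list is much longer than the domain list.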

Summary Of Methodology

Here's the process we use to build the website index:

  • Download, combine, and deduplicate some of the datasets listed below.
  • Remove websites that contain certain character strings that we've found almost always indicate a non-public website, such as "admin." or "staging.".
  • Use the list of federal .gov domains to assign each website an agency and bureau.
  • Use the OMB list of agency and bureau codes to match and add website agency and bureau codes.
  • Remove any websites that do not have a base domain that is on the list of federal .gov domains.
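The steps above could be sketched in Python roughly as follows. This is a minimal illustration, not the project's implementation: `build_index`, `IGNORE_STRINGS`, the dictionary shapes, and all sample values (including the placeholder codes "A1" and "B1") are assumptions made for the example, and downloading the datasets is left out.

```python
# Assumption: only the two example strings named in the text are used here.
IGNORE_STRINGS = ("admin.", "staging.")


def build_index(source_lists, federal_domains, omb_codes):
    """Build a list of website records from several hostname lists.

    federal_domains: base domain -> (agency, bureau)
    omb_codes: (agency, bureau) -> (agency_code, bureau_code)
    """
    # Step 1: combine and deduplicate, preserving first-seen order.
    seen, combined = set(), []
    for source in source_lists:
        for host in source:
            host = host.strip().lower()
            if host and host not in seen:
                seen.add(host)
                combined.append(host)

    index = []
    for host in combined:
        # Step 2: skip hostnames containing any ignore string (substring match).
        if any(s in host for s in IGNORE_STRINGS):
            continue
        # Step 3: look up agency and bureau by base domain (last two labels).
        owner = federal_domains.get(".".join(host.split(".")[-2:]))
        # Step 5: drop websites whose base domain is not on the federal list.
        if owner is None:
            continue
        # Step 4: add agency and bureau codes when a match exists.
        agency_code, bureau_code = omb_codes.get(owner, (None, None))
        index.append({"website": host, "agency": owner[0], "bureau": owner[1],
                      "agency_code": agency_code, "bureau_code": bureau_code})
    return index


# Hypothetical inputs; the codes are placeholders, not real OMB codes.
sources = [["www.census.gov", "admin.census.gov"],
           ["WWW.census.gov", "www.example-city.gov", "www.state.gov"]]
federal_domains = {"census.gov": ("Department of Commerce", "Census Bureau"),
                   "state.gov": ("Department of State", "Department of State")}
omb_codes = {("Department of Commerce", "Census Bureau"): ("A1", "B1")}

for row in build_index(sources, federal_domains, omb_codes):
    print(row["website"], row["agency_code"], row["bureau_code"])
```

In this sketch the base-domain filter runs inside the same loop as the agency lookup, since both depend on the same dictionary; the list above describes it as a separate final step, and the result is the same either way.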

A more detailed description of the process can be found here, and the actual source code is here.

The list of datasets that are currently used to build the target URL list is here.

Datasets Used To Generate The Target URL List

Other Datasets That Are Under Consideration For Use

Site Scanning Program Links

The Site Scanning program automates a wide range of scans of public federal websites and generates data about website health and best practices.

Feedback

If you have questions or want to give feedback, please leave an issue here or email site-scanning@gsa.gov.
