| title | author |
|---|---|
| Defining a common template to describe replication packages in the social sciences | Lars Vilhuber and Miklós Koren and C D and E F |
The effective reproducibility of published science has been a subject of interest, if not controversy, over the past four decades (Dewald et al. 1986, Anderson and Dewald 1994, King 1995, Hamermesh 2007, Koenker and Zeileis 2009, Camerer et al. 2016). Recent studies have shown that too few articles can be reproduced even when most code and data are required to be provided on journal websites (Chang and Li 2017, Kingi et al. 2018), as has been the practice in many economics journals since the mid-2000s (AEA 2005) and political science journals in the 2010s (AJPS 2012, QJPS ??). The latest trend is to actively enforce reproducibility prior to publication (AJPS 2015, AEA 2018, REStud 20xx, EJ 2019).
The authors of this article are responsible for conducting pre-publication checks for 11 journals in Economics. All of our journals, as a condition for publication, require verification of the completeness of the replication package and the clarity of the description provided by authors on how to reproduce the tables and figures in the paper. Some of the journals go one step further, and actually attempt to trace the provenance of data, and execute computer code that processes raw data and yields the tables and figures in the paper. Over the past 12 months, our teams have conducted pre-publication checks on over 300 articles.
In conducting these checks, we have identified a few key problems that make the process harder than it needs to be. Even though several articles were corrected as a consequence of the process, no articles have had fundamental problems that led to a revision of the editorial decision. However, many have compliance issues that make reproducibility checks onerous and costly. We believe that this is mostly due to a lack of understanding of what the reproducibility check entails, and to the absence of a common method for providing information that is at times critical to consistently reproducing the analysis reported in the paper. Because these issues require multiple rounds of review, they lengthen an already lengthy publication process, a consequence that editors are understandably concerned about.
In this article, we propose a standardized template document for describing replication packages (commonly referred to as the "README"), addressing what we have found to be the key factors to ensure transparency and reproducibility of the analysis described in social science papers. This template, once filled out, can be used in submitting to any one of the participating journals, and in fact, should be acceptable to most journals that scrutinize replication packages. It encourages authors to provide a format-free document, created early in the scientific process, but usable regardless of which journal the research ultimately gets submitted to. The proposed README should be usable by any social science article that has some computational component, whether it relies on survey data collected from a small sample, uses simulated data generated by the code directly, or leverages cloud computing resources and tera-scale data, and whether the data used are open-access, restricted-use, or confidential.^[Reference the Project TIER README documentation, as well as any other that we are aware of. Emphasis here is that the journal defines it, not one or more research groups.]
In the remainder of this document, we describe and discuss the various elements of our proposed standardized README template. The README template can be found at https://social-science-data-editors.github.io/template_README/. An acceptable README should be exhaustive in at least four elements: how to obtain the data needed for the particular analysis (provenance), the resources needed to process the data (requirements), the steps that a reproducer needs to undertake in order to combine data, code, and resources to re-generate the tables and figures that support the analysis described in the paper (instructions), and an indication of the mapping between code outputs and the tables and figures in the manuscript (output list).
Some authors might presume that the primary audience for the README is the team of replicators at a journal. While those might be the first ones to view a README, they are certainly not the last. Data editors are just that: editors of a document that will be made available to a much broader audience. We consider the audience to be anybody, not necessarily from the narrow discipline of the paper's authors, who wishes to understand the mechanics of generating the observable output from the analysis. The audience may learn certain things from the replication package: how to obtain data, how to implement a particular econometric technique, how to code a particular bootstrap procedure. To convey that information, we have found that a highly concise style is most useful. While a well-written article should have full sentences, flowing paragraphs, and an agreeable literary style, the most effective READMEs we have seen are short and focused, and use lists and bullet points extensively. They should be written in the chronological order in which the steps need to be undertaken.
The information about the provenance of the data is often referred to as the "data availability statement" (DAS). While some journals explicitly incorporate data (and code) availability statements into the article itself, most economics journals do not. Nevertheless, this information is important. The provenance section describes two possibly divergent pieces of information: how the authors obtained the data when they started the analysis, and how a replicator can obtain the data today. Often, the two are one and the same, but they can diverge, and authors should be aware of that. The Data Citation Principles (CITATION) suggest that authors cite data the same way they would cite articles that are intellectual sources for their article. For each data source, an in-text citation and an entry in the bibliography are the minimum information required. While in principle data citations should contain all elements needed to obtain the data, the citation styles used in the social sciences usually obscure such information. Thus, a more comprehensive, verbose description of how to obtain the data, combined with a data citation, is required. A DAS goes beyond a typical data citation, as it describes the additional information necessary to obtain the data. This may include required registrations, memberships, application procedures, monetary cost, or other qualifications, beyond the simple download URL that is typically part of a data citation. Authors should not assume that readers have prior knowledge about how to obtain data, as that knowledge might only be "common" within narrow definitions of discipline and time.
While the DAS describes where "data" can be found, the replication package (potentially after obtaining the data not included in the replication package because of their restricted access) will contain specific files. Not all data files will be source data. Authors might choose to provide, as a convenience, data files that are generated from the source data but are computationally costly to obtain. Some journals may require the authors to include the analysis data, but leave to their discretion whether to include the source data or a detailed description of the data cleaning process followed in order to obtain the analysis data. Not all file names will make it obvious which data source the data are related to. Therefore, all files included in the package should be listed, with notes identifying the specific source data, or whether the file is an intermediate or a final analysis data file.
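One way to keep such a file list complete is to generate it from the package itself. The following is a minimal sketch, assuming that data files live under a `data/` folder and that authors maintain a small dictionary of notes by hand; the function name `inventory`, the folder layout, and the example file names are hypothetical and are not part of the template.

```python
from pathlib import Path


def inventory(package_root, notes):
    """Return a Markdown table listing every file under <package_root>/data.

    `notes` maps a relative file path to a (source data, file type) pair,
    where file type is e.g. "source", "intermediate", or "analysis".
    Files without a note are flagged so that nothing goes undocumented.
    """
    root = Path(package_root)
    rows = ["| File | Source data | Type |", "|---|---|---|"]
    files = sorted(p for p in (root / "data").rglob("*") if p.is_file())
    for path in files:
        rel = path.relative_to(root).as_posix()
        source, kind = notes.get(rel, ("UNDOCUMENTED", "UNDOCUMENTED"))
        rows.append(f"| {rel} | {source} | {kind} |")
    return "\n".join(rows)


# Illustrative notes only: these file names and sources are assumptions.
EXAMPLE_NOTES = {
    "data/raw/survey.csv": ("Example Survey", "source"),
    "data/derived/panel.csv": ("Example Survey", "analysis"),
}
```

Calling `print(inventory(".", EXAMPLE_NOTES))` from the package root would print a table that can be pasted into the README; any row marked `UNDOCUMENTED` signals a file that still needs a note.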
While computational requirements may appear trivial for simple replication packages (a laptop and some common software), this is not always so: some packages require expensive commercial software and a supercomputer cluster. So that the complexity of the replication task can be assessed, authors should specify each of the following elements:
- Software used, including version number as used. If the code is expected to run with a lower version number, that should be added.
- Any additional packages, including their version number or similar, as used.
- The computer hardware specification as used by the author, in terms of OS, CPU generation and quantity, memory and necessary disk space. If multiple computers were used, the specification for each should be identified.
- The wall-clock time needed to run all or parts of the code, given the provided computer hardware, expressed in appropriate units (minutes, days, weeks).
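
Much of this information can be collected by the code itself rather than from memory. The sketch below assumes a project written in Python and uses only the standard library; the helper names `describe_environment` and `timed` are hypothetical. Installed memory is not available portably from the standard library, so it would still need to be recorded by hand.

```python
import os
import platform
import shutil
import time
from importlib import metadata


def describe_environment(packages, workdir="."):
    """Collect software and hardware facts for the requirements section."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        # Free disk space where the package is stored, in gigabytes.
        "free_disk_gb": round(shutil.disk_usage(workdir).free / 1e9, 1),
    }
    for name in packages:
        try:
            info[f"package:{name}"] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[f"package:{name}"] = "not installed"
    return info


def timed(step, *args):
    """Run one step and return its result with the wall-clock seconds used."""
    start = time.perf_counter()
    result = step(*args)
    return result, time.perf_counter() - start
```

Authors using other software would rely on that software's own facilities to report versions and installed packages; the point is only that the reported values come from the machine that actually ran the analysis.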
In certain disciplines, it is customary to cite software, as well as packages added to the software. In those cases, citations should be used here, and added to the bibliography.
The first two sections ensure that the data and software necessary to conduct the replication are known. Authors should then provide human-readable instructions to conduct the replication. These may be simple, or may involve many complicated steps. Either way, they should form a simple list, without excess prose, in a strictly linear sequence. A reasonable set of separate steps may be described. The instructions might specify how to prepare the computing infrastructure, for instance when using compute clusters. Importantly, authors should specify which non-standard packages are required in addition to the base software, whether these come from open-source repositories (CRAN for R, SSC for Stata, PyPI for Python, GitHub for Julia) or from commercial vendors (e.g., linear solvers or compilers). The processing should be separated into data preparation and analysis. If there are more than 4-5 individual steps, we suggest that a main program be used, which can take many forms ("project" files, make, etc.).
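As an illustration of such a main program, the sketch below runs a fixed list of step scripts in strictly linear order and stops at the first failure. It is one possible form among many, assuming the steps are Python scripts; the script names in `STEPS` and the function name `run_all` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical step scripts, listed in the order in which they must run:
# data preparation first, then analysis, then tables and figures.
STEPS = [
    "code/01_prepare_data.py",
    "code/02_analysis.py",
    "code/03_tables_figures.py",
]


def run_all(project_root, steps=STEPS):
    """Run each step with the current interpreter, in order.

    check=True makes subprocess.run raise CalledProcessError as soon as a
    step exits with a non-zero status, so later steps never run on top of
    a failed earlier step.
    """
    root = Path(project_root).resolve()
    for step in steps:
        print(f"Running {step}")
        subprocess.run([sys.executable, str(root / step)], cwd=root, check=True)
    return len(steps)
```

With a main program of this kind, the instructions in the README can shrink to a single step, for example calling `run_all(".")` from the package root.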
A particularly useful way of structuring and presenting the computational steps of the analysis is to list, for each exhibit, how it was created. Each table and figure appearing in the manuscript, and any (significant) in-text numbers not otherwise present in tables, should be listed together with the name of the script that generates them. We recommend creating each exhibit using a separate script. We also recommend saving each table and figure as a separate file, which can then be included in the manuscript. This helps authors avoid errors of copying and pasting numbers from their computer screen to their word processor (a practice that is still surprisingly common).
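
The two recommendations above can be sketched as follows: one helper writes a table to its own file, and another renders the exhibit-to-script mapping for the README. This is a minimal illustration, assuming CSV output and a hand-maintained list of exhibits; the names `EXHIBITS`, `save_table`, and `output_list`, and all file names, are hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical mapping from each exhibit to its script and output file.
EXHIBITS = [
    {"exhibit": "Table 1", "script": "code/table1.py", "output": "output/table1.csv"},
    {"exhibit": "Figure 1", "script": "code/figure1.py", "output": "output/figure1.pdf"},
]


def save_table(rows, header, path):
    """Write one table to its own CSV file, creating folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def output_list(exhibits):
    """Render the exhibit-to-script mapping as a Markdown table."""
    lines = ["| Exhibit | Script | Output file |", "|---|---|---|"]
    for item in exhibits:
        lines.append(f"| {item['exhibit']} | {item['script']} | {item['output']} |")
    return "\n".join(lines)
```

Because the manuscript then includes the saved files directly, a number in the paper can always be traced back to exactly one script.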
In general, while explicit instructions on how replicators might modify the code for their own interests are appreciated and encouraged, all analyses used for the paper should be scripted. Manual modifications to code or parameters should be limited to changing the project directory, when such changes cannot also be automated. There should be no manual interventions required unless absolutely unavoidable.
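
Even the change of project directory can often be automated. The sketch below shows one possible approach, assuming the package contains a marker file at its top level: the code searches upward from the current location for that file, with an optional environment variable as an override. The function name `project_root`, the marker `README.md`, and the variable name `PROJECT_ROOT` are assumptions made for illustration.

```python
import os
from pathlib import Path


def project_root(start=None, marker="README.md", env_var="PROJECT_ROOT"):
    """Locate the project directory without hand-editing any path.

    An environment variable, if set, takes precedence. Otherwise the
    search walks upward from `start` (default: the current working
    directory) until a folder containing `marker` is found.
    """
    override = os.environ.get(env_var)
    if override:
        return Path(override).resolve()
    here = Path(start).resolve() if start is not None else Path.cwd().resolve()
    for candidate in (here, *here.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found in {here} or its parent folders")
```

All other paths in the code can then be built relative to the returned directory, so that a replicator does not need to edit any script before running it.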
The template README is available in a variety of formats, to ensure broad adoption regardless of document format chosen by authors. We provide Word, PDF, Markdown, and
The README is under a CC-BY-NC license. Usage by commercial entities is allowed; reselling it is not.