When doing empirical analysis, you may run into the following file dependency issue.
- Suppose code.py (or code.do, code.sas, code.m, code.R, etc.) reads data from indata.csv, does some data cleaning, and then saves the intermediary data as outdata.csv.
- After using outdata.csv to run some statistical tests, you want to change the data cleaning procedure a bit, so you modify code.py.
- If there are only one code file and two data files, you will easily remember to re-run code.py to update the output data outdata.csv.
- However, suppose that outdata.csv is then used by code2.py to write finaldata.csv. Then, people may easily forget to re-run code2.py as well to update finaldata.csv.
- As a result, this may cause the illusion that results change after you run the same set of code twice. For example, you forget to update finaldata.csv initially, but then accidentally update finaldata.csv some time later. Then, you find that results change after you run the same set of code.
To resolve this issue, I built this simple package to check file dependencies
based on last modified time. Users can define file dependencies, such as
code.py using pre1.csv and pre2.csv as input to write post1.csv and post2.csv.
Then, the function in the package will check if the last modified times of both
pre1.csv and pre2.csv are before that of code.py and the last modified times of
both post1.csv and post2.csv are after that of code.py. If any file dependency
is broken, the broken ones will be printed or saved to a file.
Use pip to install the package as follows:
pip install filedep
Import the package using
import filedep
The key function is check_dep(deps, outfile=sys.stdout, reterr=False).
The first argument is a list of dependencies (defined below). The second
argument specifies where to print error information if any file dependency is
broken. The default is sys.stdout. The third argument specifies if broken
dependencies are returned from the function. This is mainly for testing
purposes. The default is False, i.e., broken dependencies are only printed.
The file dependencies have to be provided by the user using the format defined
below. In the template folder, there is a template that defines dependencies
and uses the check_dep() function to check them.
The following code creates several empty files:
import filedep
import time
import os
from os.path import join as pj
PATH = r'C:\test_check_dep'
if not os.path.exists(PATH):
    os.mkdir(PATH)
def touch(filepath):
    if os.path.exists(filepath):
        os.utime(filepath)
    else:
        with open(filepath, 'a') as f:
            pass
# Touch files in a specific order
touch(pj(PATH, 'pre11.csv'))
time.sleep(.1)
touch(pj(PATH, 'pre12.csv'))
time.sleep(.1)
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post11.csv'))
time.sleep(.1)
touch(pj(PATH, 'post12.csv'))
time.sleep(.1)
# Define dependencies
deps = [
    (
        [
            pj(PATH, 'pre11.csv'),
            pj(PATH, 'pre12.csv')
        ],
        pj(PATH, 'code1.py'),
        [
            pj(PATH, 'post11.csv'),
            pj(PATH, 'post12.csv'),
        ],
    ),
]
filedep.check_dep(deps)
In deps, we define a single dependency as follows: under the directory
C:\test_check_dep, code1.py reads pre11.csv and pre12.csv to produce
post11.csv and post12.csv. Then, the last modified times of both pre11.csv
and pre12.csv must be before that of code1.py and those of both post11.csv
and post12.csv must be after that of code1.py.
Since the dependency is satisfied by construction, the output is
All file dependencies are verified!
The following code creates several empty files and defines two broken dependencies:
import filedep
import time
import os
from os.path import join as pj
PATH = r'C:\test_check_dep'
if not os.path.exists(PATH):
    os.mkdir(PATH)
def touch(filepath):
    if os.path.exists(filepath):
        os.utime(filepath)
    else:
        with open(filepath, 'a') as f:
            pass
# Touch files in a specific order
touch(pj(PATH, 'pre11.csv'))
time.sleep(.1)
touch(pj(PATH, 'pre12.csv'))
time.sleep(.1)
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post11.csv'))
time.sleep(.1)
# Note code1.py is newer than post11.csv
touch(pj(PATH, 'code1.py'))
time.sleep(.1)
touch(pj(PATH, 'post12.csv'))
time.sleep(.1)
# Define dependencies
deps = [
    (
        [
            pj(PATH, 'pre11.csv'),
        ],
        pj(PATH, 'code1.py'),
        [
            pj(PATH, 'post11.csv'),
        ],
    ),
    (
        [
            pj(PATH, 'pre11.csv'),
            pj(PATH, 'pre12.csv'),
        ],
        pj(PATH, 'code1.py'),
        [
            pj(PATH, 'post11.csv'),
            pj(PATH, 'post12.csv'),
        ],
    )
]
filedep.check_dep(deps)
Here, we define 2 dependencies. The second one is the same as that in the
previous example, but the first one defines a simpler dependency: code1.py
uses pre11.csv to produce post11.csv. Since by construction post11.csv is
"touched" before code1.py, both dependencies are broken. Hence, the output is
There are 2 broken file dependencies!!!
[1]
Last Modified Time
Input:
C:\test_check_dep\pre11.csv : 2021-10-14 14:25:11.011976
Code:
C:\test_check_dep\code1.py : 2021-10-14 14:25:11.451668
Output:
C:\test_check_dep\post11.csv : 2021-10-14 14:25:11.342247
[2]
Last Modified Time
Input:
C:\test_check_dep\pre11.csv : 2021-10-14 14:25:11.011976
C:\test_check_dep\pre12.csv : 2021-10-14 14:25:11.125543
Code:
C:\test_check_dep\code1.py : 2021-10-14 14:25:11.451668
Output:
C:\test_check_dep\post11.csv : 2021-10-14 14:25:11.342247
C:\test_check_dep\post12.csv : 2021-10-14 14:25:11.559796
where the last modified date of each file in each broken dependency is shown.
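For readers who want to inspect a timestamp by hand, here is a minimal sketch of how a last modified time can be read and printed in a similar "path : timestamp" form. It uses only the standard library and is not the package's own code; the helper name last_modified and the fixed timestamp are assumptions made for illustration.

```python
import os
import tempfile
from datetime import datetime


def last_modified(path):
    """Return the file's last modified time as a local datetime (hypothetical helper)."""
    return datetime.fromtimestamp(os.path.getmtime(path))


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'post11.csv')
    with open(path, 'a'):
        pass
    # Set a known modification time so the example is reproducible.
    known_time = 1_634_221_511.5
    os.utime(path, (known_time, known_time))
    stamp = last_modified(path)
    line = f'{path} : {stamp}'

print(line)
```

The printed time is in the local time zone, so the exact text differs between machines.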
The first argument of check_dep() is a list of dependencies. Its format
should be as follows:
- It is a list of tuples.
- Each tuple has three elements.
  - The first element is a list of str.
  - The second element is a str.
  - The third element is a list of str.
  - Each str is an absolute path of an existing file.
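The format rules above can be checked before calling check_dep(). The following is only a sketch under the rules as listed, written with the standard library; validate_deps is a hypothetical helper and is not part of filedep.

```python
import os
import tempfile


def is_str_list(value):
    """Return True if value is a list whose items are all str."""
    return isinstance(value, list) and all(isinstance(p, str) for p in value)


def validate_deps(deps):
    """Return a list of format problems; an empty list means deps looks valid.

    Hypothetical helper, not part of filedep.
    """
    if not isinstance(deps, list):
        return ['deps must be a list']
    problems = []
    for i, dep in enumerate(deps, start=1):
        if not (isinstance(dep, tuple) and len(dep) == 3):
            problems.append(f'[{i}] must be a tuple with three elements')
            continue
        inputs, code, outputs = dep
        if not (is_str_list(inputs) and isinstance(code, str) and is_str_list(outputs)):
            problems.append(f'[{i}] must be (list of str, str, list of str)')
            continue
        for path in [*inputs, code, *outputs]:
            if not os.path.isabs(path):
                problems.append(f'[{i}] not an absolute path: {path}')
            elif not os.path.isfile(path):
                problems.append(f'[{i}] file does not exist: {path}')
    return problems


with tempfile.TemporaryDirectory() as folder:
    existing = os.path.join(folder, 'code1.py')
    with open(existing, 'a'):
        pass
    good = [([existing], existing, [existing])]
    bad = [
        (['pre1.txt'], existing, [existing]),  # relative path
        ([existing], existing),                # only two elements
    ]
    good_problems = validate_deps(good)
    bad_problems = validate_deps(bad)

print(good_problems)
print(len(bad_problems))
```

Returning a list of messages instead of raising on the first problem lets all format errors be fixed in one pass.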
As an example, the following code defines two dependencies:
deps = [
    (
        ['pre1.txt'], 'code1.py', ['post1.txt']
    ),
    (
        ['pre21.txt', 'pre22.txt'], 'code2.py', ['post21.txt', 'post22.txt']
    )
]
- The first one says that code1.py uses pre1.txt as input and outputs post1.txt. As a result, the last modified dates of the three files should satisfy pre1.txt <= post1.txt and code1.py <= post1.txt.
- The second one says that code2.py uses pre21.txt and pre22.txt as input and outputs post21.txt and post22.txt. As a result, the last modified dates of the five files should satisfy max(pre21.txt, pre22.txt, code2.py) <= min(post21.txt, post22.txt), where max (min) represents the maximum (minimum) date.
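The max/min condition above can be sketched with the standard library alone. This is an illustration of the stated rule, not the package's implementation; is_satisfied and make_file are hypothetical helpers, and the timestamps are arbitrary values chosen so that the result is deterministic.

```python
import os
import tempfile


def is_satisfied(inputs, code, outputs):
    """Return True if no input or code file is newer than any output file."""
    newest_source = max(os.path.getmtime(p) for p in [*inputs, code])
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    return newest_source <= oldest_output


def make_file(folder, name, mtime):
    """Create an empty file and set its last modified time explicitly."""
    path = os.path.join(folder, name)
    with open(path, 'a'):
        pass
    os.utime(path, (mtime, mtime))
    return path


with tempfile.TemporaryDirectory() as folder:
    pre21 = make_file(folder, 'pre21.txt', 1_000_000_000)
    pre22 = make_file(folder, 'pre22.txt', 1_000_000_010)
    code2 = make_file(folder, 'code2.py', 1_000_000_020)
    post21 = make_file(folder, 'post21.txt', 1_000_000_030)
    post22 = make_file(folder, 'post22.txt', 1_000_000_040)

    ok_before_edit = is_satisfied([pre21, pre22], code2, [post21, post22])

    # Simulate editing code2.py after its outputs were written.
    os.utime(code2, (1_000_000_050, 1_000_000_050))
    ok_after_edit = is_satisfied([pre21, pre22], code2, [post21, post22])

print(ok_before_edit, ok_after_edit)  # True False
```

Setting the times with os.utime rather than sleeping between writes keeps the example independent of file system timestamp resolution.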