-
-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Migration of Fuzz Tests & Integration Scripts From the OSS-Fuzz Project Repo #1901
Initial Migration of Fuzz Tests & Integration Scripts From the OSS-Fuzz Project Repo #1901
Conversation
Migrates the OSS-Fuzz tests and setup scripts from the OSS-Fuzz repository to GitPython's repo as discussed here: gitpython-developers#1887 (comment) These files include the changes that were originally proposed in: google/oss-fuzz#11763 Additional changes include: - A first pass at documenting the contents of the fuzzing set up in a dedicated README.md - Adding the dictionary files to this repo for improved visibility. Seed corpra zips are still located in an external repo pending further discussion regarding where those should live in the long term.
As per discussion in gitpython-developers#1889
Adds additional documentation links and fixes some typos.
- Updates the fuzzing documentation to include steps for working with locally modified versions of the gitpython repository. - Updates the build.sh script to make the fuzz target search path more specific, reducing the risk of local OSS-Fuzz builds picking up files located outside of where we expect them (for example, in a .venv directory.) - add artifacts produced by local OSS-Fuzz runs to gitignore
- Fix typos in the documentation on dictionaries - Link to the fuzzing directory in the main README where it is referenced.
Updates the gitpython project files to enable migrating and maintaining fuzz targets and build scripts upstream. Related PR in the upstream repo: gitpython-developers/GitPython#1901
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this wonderful PR! It's very well thought out and a delight. Particularly the fuzzing/README.md
is a great introduction to how it all works. >[!TIP]
is also new to me, and something I hope to be using more often in future.
About Fuzzing-Fixtures
Overall, the extra-files are just about 2MB in total, and I wonder how stable they will be. If stable, I could imagine including them in this repo.
If there is any concerns about this, gitpython-developers
could fork the repo and make you an admin on it.
About archives for publishing on PyPy
I created a new archive from the branch of this PR with python -m build --sdist --wheel
and didn't see any trace of fuzzing
in the produced archive.
It seems good.
Regarding the OSS fuzz PR
I didn't review it for correctness, but approved it in case it helps getting it merge.
Regarding my review
I just glanced at the shell scripts, and basically ignored the google-provided fuzz targets.
All that's left for me is to figure out where to put the ~2MB of corpus files.
And, thanks again for picking this up :)!
Thanks, I appreciate the kind words 🙂 And yes,
I think a dedicated repo under the Ideally, a seed corpus would be added for each new fuzz target. Obviously that means more files but it doesn't necessarily mean "10 fuzz targets = 10mb". The size & number of interesting inputs will change over time and depending on the complexity/number of code paths in the functionality under test, and it's hard to know what that equates to in terms of per target corpus size at this point. The two zips in my repo right now actually demonstrate that pretty nicely: both of them basically top out their respective coverage quickly but one zip is ~450kb while the other is ~1.5mb. I like how Bitcoin Core handles in https://github.com/bitcoin-core/qa-assets. From what I've seen, that seems like the cleanest and most flexible.
Awesome thanks! |
This repo was created after discussion in PR gitpython-developers#1901.
Thanks @Byron! I've updated the URL to unblock this PR and I'll follow up with a PR to cleanup the new qa-assets repo following the pattern of the Bitcoin one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least one and possibly two changes will be needed to comply with the terms of the Apache License. That license imposes requirements for distributing the licensed work, including, in subsection 4(a):
You must give any other recipients of the Work or Derivative Works a copy of this License
This is to say that merely preserving the license headers (even with other acknowledgement) is not sufficient, and the license text itself needs to be included somewhere in the GitPython repository.
It seems to me that the most hassle-free way to achieve this is to copy the license file from the repository the Apache 2.0 licensed Python scripts originally came from (so long as there is nothing unusual about it) into the fuzzing/
directory, and name it LICENSE-APACHE
.
The reason to put in the fuzzing/
directory rather than in the fuzzing/fuzz-targets/
subdirectory that contains those scripts, is that new files not covered by that license may end up in there later. So putting it there wouldn't really make it more specific. There is also a benefit to having it in a directory that also contains a README.md
that covers, among other code, the code it pertains to--especially if the details about what it covers are also moved from the main readme into that one, as I suggest in a review comment below. But it could go in the fuzzing/fuzz-targets/
subdirectory if you prefer.
It shouldn't go at the top level of the repository, because that could lead to confusion, and also because (at least without further changes) it would end up in built sdist and wheel packages. Currently nothing from the fuzzing/
directory is included in such packages; I have verified this (as has Byron).
The slightly trickier part is that the Apache License also requires that changes be noted. Per 4(b):
You must cause any modified files to carry prominent notices stating that You changed the files
I believe that does apply here. The subtle aspect of this is that this is often not followed, and I think is not expected to be followed, when proposing changes for inclusion in an existing project that the files came from. For example, pull requests on a project that uses the Apache License often do not, and I think are not expected to, add any self-attribution, if the contributors decide they don't need to be attributed individually.
So I think anything that has formerly been published under this license in the OSS-Fuzz repository or other such places, if it included your modifications there, should not require that anything be added to put it here.
But it looks like your changes to the Apache licensed files were not ever included there. If this is where they are first published, then I think a note has to be added. If you want it to be a copyright notice naming you, I think that is fine. If not, then it could just say "This file has been modified by" followed by your name, or "GitPython contributors," or "contributors to GitPython," or something like that.
Please definitely do say if you think I am mistaken about any of that, or even if you are not sure but are worried I may be mistaken.
Besides the above, I think there are no other blockers for merging this. I've posted a number of comments below, some recommending other changes.
- A small number pertain to how the licensing situation is described. Unlike the above, none are important enough to delay merging this. However, those are changes I might not feel comfortable making separately without checking with you first. I recommend at least reading those, and they may lead to changes in this PR. But I emphasize that they do not need to stand in the way of this being merged.
- Most others are technical, and I believe at least one identifies a minor bug, but none are anywhere near serious enough that they need to delay merging, and you can even ignore them completely if you like. For these, if you happen to find any that recommend changes that you would prefer not be made, then you may want to state that, since unlike changes to how the licensing situation is documented, I would not otherwise feel a need to exercise any special caution when making such changes later. (You can also make changes in this PR based on them, but I am not asking you to do that and I don't want to delay this unnecessarily or push you into doing any further work on this that you may not wish to do at this time.)
shutil.rmtree(git_dir) | ||
|
||
os.mkdir(git_dir) | ||
with open(head_file, "w") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless there is a specific reason to do otherwise, I recommend specifying an encoding explicitly, for example, encoding="utf-8"
. Otherwise it is unclear if the reason it is omitted is just because it is expected not to make a practical difference across different platforms, or because a difference is actually desired. (I am guessing the reason is the former, because I don't think the fuzz tests run on Windows, which is where these issues usually come into play.) If omitting it is intentional, then it might be worthwhile to include a comment to state the reason.
python3 -m pip install . | ||
|
||
# Directory to look in for dictionaries, options files, and seed corpa: | ||
SEED_DATA_DIR="$SRC/seed_data" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think SRC
comes from outside this script and is set in such a way that it is very unlikely to start with -
. I have refrained from mentioning places where --
should be added before arguments that have $SRC
as a prefix if the value of SRC
can begin, even accidentally, with -
. Most commands accept --
, but not all, and I have no objection to omitting it on the occasion it is truly known not to be needed. I'll assume such comments may not be wanted, unless you request them. The same goes for container-environment-boostrap.sh
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think
SRC
comes from outside this script and is set in such a way that it is very unlikely to start with-
.
You are correct. I explained the OSS-Fuzz provided env vars a bit more in this comment: #1901 (comment). These variables are certainly a case where --
is truly known not to be needed, because if the omission of it results in these scripts breaking then most (more likely, all) of the other OSS-Fuzz projects would break as well.
That said, I wasn't particularly happy with having the "$SRC/seed_data"
directory specified in two separate scripts, where one implicitly depends on the other creating it. If nothing else, that feels typo prone.
Initially I considered having a third script that just defined the shared variable and source
ing it in each, but that felt like overkill.
Ultimately, I think the best thing to do would be to remove the need for that directory all together. When I get around to the changes I described in a reply to you here: #1903 (comment), I should be able to do all of the dictionary & corpus zip creation inside container-environment-bootstrap.sh
(i.e., during the Docker image build step) and put everything in $OUT/
directly from there. That would allow the removal of the "$SRC/seed_data"
directory completely, and also make the fuzz harness build step more efficient because it wouldn't need to produce the fixture artifacts every time.
I'm open to alternative suggestions though 🙂
I'll leave this comment unresolved for now for my own reference
Thanks for the thorough review @EliahKagan! It's very helpful, and I'm particularly appreciative of the thoughtful explanations and suggestions you shared. I'll push some changes shortly to address your feedback, prioritizing the updates related to the license, so we can get this merged and unblock the downstream OSS-Fuzz PR. As for anything else that you explicitly noted as non-blocking: I'll either address them in a follow-up, or reply to in the |
Addresses feedback and encorperates suggestions from PR gitpython-developers#1901 to ensure that the Apache License requirements are met for the two files that they apply to, and the documentation pertaining to licensing of the files in this repository is clear and concise.
Addresses feedback and encorperates suggestions from PR gitpython-developers#1901 to ensure that the Apache License requirements are met for the two files that they apply to, and the documentation pertaining to licensing of the files in this repository is clear and concise. The contects of LICENSE-APACHE were coppied from the LICENSE file of the OSS-Fuzz repository that the two fuzz harnesses came from as of commit: https://github.com/google/oss-fuzz/blob/c2c0632831767ff05c568e7b552cef2801d739ff/LICENSE
321534c
to
945a767
Compare
@EliahKagan, commit 945a767 addresses the blocking concerns regarding the licensing and documentation of Apache licensed files. Please see the diff of that commit for specifics. Sorry about the force-push; I wanted to make sure the OSS-Fuzz commit at the time I copied the Apache license file from was captured in the commit message.1 Regarding:
I did exactly that. I also checked the diff of the contents of the file included here against https://www.apache.org/licenses/LICENSE-2.0.txt just to be sure; it looks good. Diff Results--- fuzzing/LICENSE-APACHE
+++ fuzzing/LICENSE-APACHE-from-website-src
@@ -1,3 +1,4 @@
+
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
@@ -178,7 +179,7 @@
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "{}"
+ boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
@@ -186,7 +187,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.
- Copyright {yyyy} {name of copyright owner}
+ Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -198,4 +199,4 @@
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
- limitations under the License.
+ limitations under the License.
\ No newline at end of file CC @Byron Footnotes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sorting this out!
It looks good to me, but I will wait for @EliahKagan to double-check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks!
FYI, I'm hesitant to push more commits here, including via the suggestion comments, because I don't want to keep making the review approvals stale. I'll reply to the remaining as necessary of course, but I'm going to start a new branch for them unless someone suggests otherwise. Lmk if either of you would prefer them on this PR directly. |
I agree. I think this PR is in good shape to be merged, and further refinements can be made in one or more subsequent PRs. This is the case even for future adjustments, if any, that would affect how the licensing situation is described. Everything that was blocking has been fully resolved. |
Prefer executing these files using the OSS-Fuzz or `python` command methods outlined in the `fuzzing/README`. Based on feedback and discussion on: gitpython-developers#1901
This script is meant to be sourced by the OSS-Fuzz file of the same name, rather than executed directly. The shebang may lead to the incorrect assumption that the script is meant for direct execution. Replacing it with this directive instructs ShellCheck to treat the script as a Bash script, regardless of how it is executed. Based @EliahKagan's suggestion and feedback on: gitpython-developers#1901
This script is executed directly, not sourced as is the case with `build.sh`, so it should have an executable bit set to avoid ambiguity. Based @EliahKagan's suggestion and feedback on: gitpython-developers#1901
- Make the link text for the OSS-Fuzz test status URL more descriptive - Fix capitalization of GitPython repository name Based @EliahKagan's suggestion and feedback on: gitpython-developers#1901
Replaces the null character delimiter `-d $'\0'` with the simpler empty string `-d ''` in the fuzzing harness build loop. This changes leverages the Bash `read` builtin behavior to avoid unnecessary complexity and improving script readability. Based @EliahKagan's suggestion and feedback on: gitpython-developers#1901
Based @EliahKagan's suggestion and feedback on: gitpython-developers#1901
A misspelling in the https://github.com/gitpython-developers/qa-assets repository is still present here. It will need to be fixed in that repository first. "corpora" is a difficult word to spell consistently I guess. This made for a good opportunity to improve the phrasing of two other comments at at least. Based @EliahKagan's suggestion and feedback on: gitpython-developers#1901
FWIW @EliahKagan I tackled most of the simpler changes and put up a Draft PR: #1903 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies, I read the last two comments too late.
However, what follows is still something I consider feasible, nothing is going to be stale. Instead, I prefer to have all the existing comments in one place and resolve conversations.
Never mind :D - see #1903 .
Thanks a lot for keeping working on the refinements. I thought I'd wait until the suggestions are committed (they can be batched), and a few more things are cleaned up.
Regarding the hardcoded /tmp
path that then expects a /tmp/.git
, I agree that this under no circumstances should be running in an unisolated environment. Since the documentation clearly states that docker is needed, and if it's unlikely to impossible to run outside (or to get to that line), then I think it's OK to leave it.
It seems to be common practice in the OSS-fuzz project as well, and portability is not their concern.
Thanks for bearing with us, this PR is already fantastic, and so is working with you on this!
Thanks both! I updated the branch for #1903 and marked it ready for review. I agree the state of I still plan to share in a reply on the unresolved thread here, I just haven't had a chance to yet. I should be able to in a few hours. |
To be clear, Docker is not needed, so this can be the real, unisolated However, I realize I omitted a piece of my reasoning: I think it's easier to fix this after both this PR and google/oss-fuzz#11803 are merged (the latter is not merged yet but looks like it will be merged soon), because that will make it easier to test the fix both outside and inside Docker. (The Docker case is, I believe, more important, since that's how the tests are automatically run on the OSS-Fuzz infrastructure, and running individual tests outside of Docker is, if I understand correctly, mostly of interest when working on the fuzz tests themselves.) For the same reason, I think it is justified for #1903 not to include a fix for this, since the changes in #1903 otherwise could be merged at any time and do not need to wait on google/oss-fuzz#11803. |
Ok, sorry for the delay. Below are my thoughts regarding About the OSS-Fuzz EnvironmentYou both touched on it, but for the sake of shared understanding, the scripts and tests look like they do in part because of the OSS-Fuzz container environment implementation details. The key points to understand are:
Build Stage Execution Environment
Everything else is read-only. Fuzzer Execution Stage Environment
Regarding the Fuzz TestsWhen I initially started on this, my primary objective was really only about fixing the build. I did try to clean them up a bit in the process and apply some best practices, but those changes were mostly incidental and I intentionally avoided any significant refactors. After we decided to bring them here, my focus shifted to lowering the barrier to entry for contributors because of the nature of the project. In particular, I tried to distill the useful information from various place that I haven't seen documented in one place yet. One such tip is the direct execution section, but in hindsight I overlooked the environment assumptions in On
|
Updates the gitpython project files to enable migrating and maintaining fuzz targets and build scripts upstream. Related PR in the upstream repo: gitpython-developers/GitPython#1901 `project.yaml` updates: - @Byron, the maintainer of GitPython, is added as the primary contact. - @EliahKagan and myself are added to the `auto_ccs` list as discussed with @Byron here: gitpython-developers/GitPython#1889 (comment) - @DavidKorczynski I removed what I believe is your email from the `vendor_ccs` because it looked like you were included as the default when no other contacts were listed. If this was a mistake on my part and you want to remain listed as a CC, please let me know and I'll correct it. Thanks!
Thanks so much for the follow-up - I feel well-informed now! As OSS-fuzz strongly relies on docker, maybe it would be easiest to 'go with the flow' and provide a dockerfile as part of the As second step, one could see if improving the tempdir usage would fix certain issues that you have been seeing, including to make using OSS fuzz safe for running it directly. With that said, I also think that… I'd really want you to do more with |
Adds a Dockerfile to enable easily executing the fuzz targets directly inside a container environment instead of directly on a host machine. This addresses concerns raised in PR gitpython-developers#1901 related to how `fuzz_tree.py` writes to the real `/tmp` directory of the file system it is executed on as part of setting up its own test fixtures, but also makes for an easier to use development workflow. See this related comment on PR gitpython-developers#1901 for additional context: gitpython-developers#1901 (comment)
I've added a simple Dockerfile and I instructions for using it and a warning about not using Docker in #1904. Let me know what you think!
I'm planning to take a deeper at this. Even without the
I'm definitely interested in taking you up on that! I've been interested in learning more about Rust for a while and I've found working on fizzers a great way to get familiar with new projects and languages, so you can likely expect to see me pop up in that project sooner or later 😁 In other news, the downstream OSS-Fuzz PR was merged and the Gitpython builds are succeeding! https://introspector.oss-fuzz.com/project-profile?project=gitpython The Coverage build is still reporting failures, but that is expected until ClusterFuzz generates enough corpora to run the analysis which I expect will happen in the next day or two. Exciting to see! |
These files are already BSD-3-Clause even without the headers, but adding these comments and the `LICENSE-BSD` symlink to the root level `LICENSE` file are helpful to reinforce that there are only two particular files in the `fuzzing/` that are not under BSD-3-Clause. See: gitpython-developers#1901 (comment)
While discussing adding similar license comments to the shell scripts introduced in PR gitpython-developers#1901, it was noticed that the shell scripts in the repository root directory did not have such comments and suggested that we could add them when the scripts in the `fuzzing/` directory were updated, so this commit does just that. See: gitpython-developers#1901 (comment)
As discussed in the initial fuzzing integration PR[^1], `fuzz_tree.py`'s implementation was not ideal in terms of coverage and its reading/writing to hard-coded paths inside `/tmp` was problematic as (among other concerns), it causes intermittent crashes on ClusterFuzz[^2] when multiple workers execute the test at the same time on the same machine. The changes here replace `fuzz_tree.py` completely with a completely new `fuzz_repo.py` fuzz target which: - Uses `tempfile.TemporaryDirectory()` to safely manage tmpdir creation and tear down, including during multi-worker execution runs. - Retains the same feature coverage as `fuzz_tree.py`, but it also adds considerably more from much smaller data inputs and with less memory consumed (and it doesn't even have a seed corpus or target specific dictionary yet.) - Can likely be improved further in the future by exercising additional features of `Repo` to the harness. Because `fuzz_tree.py` was removed and `fuzz_repo.py` was not derived from it, the Apache License call outs in the docs were also updated as they only apply to the singe `fuzz_config.py` file now. [^1]: gitpython-developers#1901 (comment) [^2]: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=68355
As discussed in #1889, this PR introduces the changes necessary to migrate the OSS-Fuzz fuzz tests and configuration scripts from the OSS-Fuzz repository into GitPython's repo.
Summary of Changes
Most of the changes proposed here are the same as described in the discussion thread linked above. Beyond that, there are a few minor differences and potentially non-obvious details that I want to call out here:
Directory Structure
Initially, I proposed a single, flat "
fuzzing/
" directory at the top level, but when I moved the files, I felt that a few sub-directories would help with organization.The
fuzzing/READEME.md
file introduced in this PR has a section describing them and some additional documentation.Updates to Fuzz Targets
The fuzz test files in this PR include the changes that were originally proposed in google/oss-fuzz#11763.
Seed Data
Part of efficiency improvements I proposed in google/oss-fuzz#11763 included the addition of dictionary files (described in the new README) and seed corpus zip files.
I stored these in a repo that I maintain: https://github.com/DaveLak/oss-fuzz-inputs/tree/main/gitpython, because OSS-Fuzz has understandable concerns about seed data bloating the size of their repo.
In this PR:
Next Steps
@Byron & @EliahKagan, please:
fuzzing/
directory doesn't get picked up by the publishing process, I'd appreciate it! I did what I could to test it locally but I'm not familiar enough with the release process to be confident.Thanks! I'll address feedback any feedback as quickly as possible.