Parse and import both rpm and deb packages metadata #9101
base: master
Conversation
Force-pushed from 9cf2fd5 to d12d51a
lzreposync will be a spacewalk-repo-sync replacement written in Python. It uses a src layout and a pyproject.toml. The target Python version is 3.11; compatibility with older Python versions is explicitly not a goal.
Force-pushed from 4e0c10e to 9740fbe
👋 Hello! Thanks for contributing to our project.

If you are unsure whether the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a schedule with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail but yours do, the failure is related to your code. Reference tests:

KNOWN ISSUES: Sometimes the build can fail when pulling new jar files from download.opensuse.org. This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide. Happy hacking!
This commit completes almost all the logic and use cases of the new lazy reposync. **Note** that this commit will be restructured and possibly divided into smaller and more convenient commits. This commit is for review purposes.
Seemingly, this error happened because we reached the maximum number of unclosed DB connections. We thought this might be because the close() method in the Database class was not implemented, so rhnSQL.closeDB() was not closing any connection. However, we are still hesitant about whether this is the root cause of the problem, because the old (current) reposync was using it without any error.
Force-pushed from 6c07a22 to ea003b7
This is the latest and almost final version of the lzreposync service (the GPG signature check is not complete yet). It contains pretty much all the necessary tests, including the ones for updates/patches import. Some of the remaining TODOs are either code enhancements or unclear concepts that will be discussed with the team. Of course, this commit will be split into smaller ones later, after rebase.
Force-pushed from ea003b7 to 913f21c
path VARCHAR(1000),
remote_path VARCHAR(1000),
Is there a reason, other than 'we did it like that before', to limit path to 1000 characters?
In PostgreSQL, VARCHAR is implemented using the TEXT datatype with a string-length check on top of it. So using a limited VARCHAR is actually slightly slower than just using TEXT and has no space or indexing benefit.
So if there is no reason to limit this to 1000 characters, please use TEXT as the datatype here.
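A minimal sketch of the suggested change, reusing the column names from the snippet above (everything else unchanged):

```sql
-- Suggested: plain TEXT instead of VARCHAR(1000); in PostgreSQL both are
-- stored identically, TEXT just skips the length check.
path TEXT,
remote_path TEXT,
```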
Absolutely, thanks for the review. I did not have any specific reason for using VARCHAR(1000); I'll change it to TEXT instead.
This is the first part; I will continue, but I want to send some comments already. Besides what I wrote, I suggest going over the pylint comments and trying to resolve the complaints instead of disabling the checks. lzreposync is a new program and should meet our current standards better than the old code 😅
from lzreposync.repo_dto import RepoDTO
from spacewalk.common.rhnConfig import cfg_component
from spacewalk.server import rhnSQL, rhnChannel
# stolen from python/spacewalk/server/test/misc_functions.py
Anything copied will need to be maintained in two places. Can you add a TODO comment to move this to a common place? You don't have to do the move, a comment we can easily search for later is good.
finally:
    rhnSQL.closeDB()
# Stolen from python/spacewalk/satellite_tools/reposync.py
Same here
def get_compatible_arches(channel_id):
    """Return a list of compatible package arch labels for this channel"""
def get_compatible_arches(channel_label):
    """Return a list of compatible package arch labels for the given channel"""
    rhnSQL.initDB()
    h = rhnSQL.prepare(
This SQL query could be updated to use JOIN ... ON.
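For illustration, a query of this shape rewritten with explicit joins could look like the following. The table and column names here are assumptions based on the usual Spacewalk schema in this area, not necessarily the exact query in the PR:

```sql
-- Illustrative only: explicit JOIN ... ON instead of comma-joins in WHERE.
SELECT pa.label
  FROM rhnChannel c
  JOIN rhnChannelPackageArchCompat cpac
    ON cpac.channel_arch_id = c.channel_arch_id
  JOIN rhnPackageArch pa
    ON pa.id = cpac.package_arch_id
 WHERE c.label = :channel_label
```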
``` sh
$ pip install -e .
```
3. Install other required dependencies (required by spacewalk and other modules)
We should add these to pyproject.toml as dependencies, then step 2 will install them as well.
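A hedged sketch of what that could look like in pyproject.toml, using the package names from the install step below (this is an illustrative fragment, not the project's actual file):

```toml
# Illustrative: declaring runtime dependencies so `pip install -e .`
# pulls them in automatically.
[project]
dependencies = [
    "rpm",
    "psycopg2-binary",
]
```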
pip install rpm
pip install psycopg2-binary
```
4. Add a path configuration file (**Important!**) |
The correct fix will be that we build python311 versions of these libraries and install them as RPMs. We'll do that later, it's not for you (and especially not for this PR).
def main():
    parser = argparse.ArgumentParser(
In the future, i.e. not in this PR, we should align the CLI interface as much as possible with spacewalk-repo-sync. We don't have to support all the flags, a subset is fine. Right now we have conflicting flags (e.g. -d means debug here but dry-run in spacewalk-repo-sync).
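As a purely illustrative sketch of resolving such a clash (the flag names are assumptions, not the actual spacewalk-repo-sync interface), one option is to keep -d for dry-run and expose debug only as a long option:

```python
import argparse

# Hypothetical sketch: keep -d free for dry-run (as in spacewalk-repo-sync)
# and offer debug only under its long name.
parser = argparse.ArgumentParser(prog="lzreposync")
parser.add_argument("--debug", action="store_true", help="verbose logging")
parser.add_argument("-d", "--dry-run", action="store_true", dest="dry_run",
                    help="parse metadata but do not import anything")

args = parser.parse_args(["--debug", "-d"])
```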
# pylint: disable-next=consider-using-f-string
arch = "(noarch|{})".format(args.arch)
Instead of silencing pylint, we should follow the rules we have set for it. In this case it's as simple as:
- # pylint: disable-next=consider-using-f-string
- arch = "(noarch|{})".format(args.arch)
+ arch = f"(noarch|{args.arch})"
from lzreposync.packages_parser import PackagesParser
from lzreposync.translation_parser import TranslationParser
from spacewalk.satellite_tools.syncLib import log, log2
Just a heads-up: logging in reposync currently suffers from a split-brain problem. We have rhnLog (with syncLib defining wrapper functions) on the one hand and logging on the other. The latter is also used by 3rd-party libraries (e.g. requests). It's one of the things on our list to clean up, so it does not matter much what you use now. That said, I suggest sticking to one of them to make it easy for yourself.
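If lzreposync sticks with the standard logging module, the usual pattern is one named logger per module (a sketch only; the logger name and function below are made up for illustration, not the project's actual setup):

```python
import logging

# One named logger per module; handlers and levels are configured once at
# startup, so output from third-party loggers (e.g. requests) lands in the
# same place.
logger = logging.getLogger("lzreposync.import")

def report_batch(index, failed):
    logger.info("batch %d finished, %d packages failed", index, failed)
```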
# pylint: disable-next=missing-class-docstring
class DEBMetadataParser:
This class only has one method; I would move it to a "free" function at the module level. Adding another layer makes the name longer but does not give us any benefit.
I wanted to add some abstraction when using OOP. For example, in the function import_packages_in_batch_from_parser() (used in the tests) we make use of this abstraction to import packages using abstract parser objects that implement the parse_packages_metadata() method.
I'm afraid that using free functions might add some complexity, especially when passing a function as a parameter to another function instead of an object, in the above scenario.
However, I'm not sure whether it's the best way to approach things in our case.
(pylint also complains about a class having only one method: R0903: Too few public methods (1/2).)
To use import_packages_in_batch_from_parser, you first instantiate a parser object. The parser object is only used to then call the single public method it has. In other words, at the call time of import_packages_in_batch_from_parser, the caller has to decide which parser to pass (rpm or deb). Therefore the caller could also pass specific functions (e.g. parse_rpm_metadata() or parse_deb_metadata()) to import_packages_in_batch_from_parser, or call these generator functions themselves and pass the generator object instead.
In simplified/pseudo code:
```python
# parsing code
def parse_rpm_metadata(primary_xml, repository_url, arch_filter, filelists_xml, cache_dir):
    primary_parser = PrimaryParser(primary_xml, repository_url, arch_filter)
    filelists_parser = FilelistsParser(filelists_xml, cache_dir, arch_filter)
    filelists_parser.parse_filelists()
    for package in primary_parser.parse():
        ...
        yield package

# import code
def import_in_batches(packages, batch_size, channel=None, compatible_archs=None):
    total = 0
    for i, batch in enumerate(batched(packages, batch_size)):
        failed, *_ = import_package_batch(
            to_process=batch,
            compatible_archs=compatible_archs,
            batch_index=i,
            to_link=True,
            channel=channel,
        )
        total += failed
    return total

# test code
def test_parse_and_import():
    packages = parse_rpm_metadata(primary, repo_url, arch_filter, filelists, cache_dir)
    failed = import_in_batches(packages, 20)
    assert ...
```
I like this more than passing a function because we decouple the different concerns. Import code shouldn't need to bother with getting the data to import, that's someone else's job.
That said, in Python it's not a problem to pass functions around, they are first-class objects, e.g.:
```python
>>> import random
>>> def foo():
...     print("Hello, I'm Foo")
...
>>> def maybe(f):
...     if random.randint(0, 1) == 1:
...         f()
...     else:
...         print("No luck. Maybe next time")
...
>>> for _ in range(10):
...     maybe(foo)
...
Hello, I'm Foo
No luck. Maybe next time
No luck. Maybe next time
No luck. Maybe next time
No luck. Maybe next time
Hello, I'm Foo
No luck. Maybe next time
Hello, I'm Foo
Hello, I'm Foo
No luck. Maybe next time
```
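One practical note on the pseudo-code above: it uses batched(), which only landed in itertools in Python 3.12, while the stated target for lzreposync is 3.11. A small stand-in generator (a sketch, not necessarily what the PR does) would be:

```python
from itertools import islice

def batched(iterable, n):
    # Minimal stand-in for itertools.batched (Python 3.12+), since the
    # target interpreter here is 3.11. Yields tuples of up to n items.
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch
```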
Very good explanation, and the approach is much better too. Thanks.
# pylint: disable-next=missing-class-docstring
class DebRepo(Repo):
We should consolidate the debian repo parsing. Right now it's split into repo_plugins.deb_src, common.repo and this deb_repo (which appears to be a copy of the class in deb_src).
- Removed some todos.
- Changed some SQL queries to equivalent ones using JOIN...ON.
- Some other minor cleanup.
Simplified some code by replacing classes and methods with free functions in some places. Consolidated the debian repo parsing.
Completed the GPG signature check for rpm repositories, mainly for the repomd.xml file. This is done by downloading the signature file from the remote rpm repo and running gpg's verify operation to check the repomd.xml file against its signature, using the GPG keys already added on the filesystem. So if you haven't already added the required GPG keyring on your system, you will not be able to verify the repo. You should ideally run this version directly on the uyuni-server, because the GPG keyring will probably be present there.
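The verification step described here boils down to something like the following (the file paths are illustrative; the actual signature file name depends on the repository):

``` sh
# Illustrative: check repomd.xml against its detached signature, using the
# keys already imported into the default GnuPG keyring.
gpg --verify repodata/repomd.xml.asc repodata/repomd.xml
```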
makedirs() in uyuni.common.fileutils now accepts relative paths that consist of only a directory name or paths with trailing slashes.
Refactor: allow more input variants in makedirs()
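A hedged sketch of what accepting these extra variants can mean in practice (this is illustrative only, not the actual uyuni.common.fileutils implementation):

```python
import os

def makedirs(path, mode=0o755):
    # Illustrative sketch: os.path.normpath collapses trailing slashes, and a
    # bare directory name like "cache" is simply a relative path, so both
    # variants reduce to the ordinary recursive-create case.
    path = os.path.normpath(path)
    os.makedirs(path, mode=mode, exist_ok=True)
```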
Completed the GPG signature check for debian repositories. If you haven't already added the required GPG keyring on your system, you will not be able to verify the repo and will normally get a GeneralRepoException. You should ideally run this version directly on the uyuni-server, because the GPG keyring will probably be present there.
What does this PR change?
Parse both rpm and deb packages/repositories' metadata files and import the data to the database.
GUI diff
No difference.
Documentation
No documentation needed: only internal and user-invisible changes
DONE
Test coverage
No tests: still in progress
DONE
Links
Issue(s): #
Port(s): # add downstream PR(s), if any
Changelogs
Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository
If you don't need a changelog check, please mark this checkbox:
If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below).
Re-run a test
If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:
Before you merge
Check How to branch and merge properly!