Parse and import both rpm and deb packages metadata #9101

waterflow80 · 2024-07-25T04:55:34Z

What does this PR change?

Parse both rpm and deb packages/repositories' metadata files and import the data to the database.

GUI diff

No difference.

DONE

Documentation

No documentation needed: only internal and user invisible changes
DONE

Test coverage

No tests: still in progress
DONE

Links

Issue(s): #
Port(s): # add downstream PR(s), if any

DONE

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

Re-run test "changelog_test"
Re-run test "backend_unittests_pgsql"
Re-run test "java_pgsql_tests"
Re-run test "schema_migration_test_pgsql"
Re-run test "susemanager_unittests"
Re-run test "javascript_lint"
Re-run test "spacecmd_unittests"

Before you merge

Check How to branch and merge properly!

lzreposync will be a spacewalk-repo-sync replacement written in Python. It uses a src layout and a pyproject.toml. The target Python version is 3.11, compatibility with older Python versions is explicitly not a goal.

github-actions · 2024-08-02T16:10:06Z

👋 Hello! Thanks for contributing to our project.
Acceptance tests will take some time (aprox. 1h), please be patient ☕
You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/9101/checks
Once tests finish, if they fail, you can check 👀 the cucumber report. See the link at the output of the action.
You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/9101/checks.

If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.

Reference tests:

KNOWN ISSUES

Sometimes the build can fail when pulling new jar files from download.opensuse.org . This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide.

Happy hacking!
⚠️ You should not merge if acceptance tests fail to pass. ⚠️

This commit completes almost all the logic and use cases of the new lazy reposync. **Note** that this commit will be restructured and possibly divided into smaller and more convenient commits. This commit is for review purposes.

Seemingly this error happened because we reached the maximum number of unclosed db connections. And thought that this might be due to the fact that the close() method in the Database class was not implemented, and the rhnSQL.closeDB() was not closing any connection. However, we're still hesitating about whether this is the root cause of the problem, because the old(current) reposync is was using it without any error.

This is the latest and almost the final version of the lzreposync service. (gpg sig check not complete) It contains pretty much all the necessary tests, including the ones for updates/patches import. Some of the remaining 'todos' are either for code enhancements or some unclear concepts that will be discussed with the team. Of course, this commit will be split into smaller ones later after rebase.

aaannz · 2024-08-20T09:42:52Z

schema/spacewalk/common/tables/rhnPackage.sql

    path             VARCHAR(1000),
+    remote_path	     VARCHAR(1000),


Is there a reason, except 'we did it like that before', to limit path to 1000 characters?
In postgresql varchar is implemented using TEXT datatype with string length check on top of it. So using limited VARCHAR is actually a tiny little slower then just using TEXT and has no space or indexing benefit.

So if there is no reason to limit this to 1000 characters, please use TEXT as a datatype here.

Absolutely..thanks for the review,
I did not have any specific reason for using VARCHAR(1000)...I'll change it to TEXT instead.

agraul

This is the first part, I will continue but want to send some comments already. Besides what I wrote, I suggest to go over the pylint comments and try to resolve the complaints instead of disabling the checks. lzreposync is a new program and should meet our current standards better than the old code 😅

agraul · 2024-08-20T09:01:48Z

python/lzreposync/src/lzreposync/db_utils.py


 from lzreposync.repo_dto import RepoDTO
 from spacewalk.common.rhnConfig import cfg_component
 from spacewalk.server import rhnSQL, rhnChannel


+# stolen from python/spacewalk/server/test/misc_functions.py


Anything copied will need to be maintained in two places. Can you add a TODO comment to move this to a common place? You don't have to do the move, a comment we can easily search for later is good.

agraul · 2024-08-20T09:03:58Z

python/lzreposync/src/lzreposync/db_utils.py

+    finally:
+        rhnSQL.closeDB()
+
+
 # Stolen from python/spacewalk/satellite_tools/reposync.py


agraul · 2024-08-20T09:05:02Z

python/lzreposync/src/lzreposync/db_utils.py

-def get_compatible_arches(channel_id):
-    """Return a list of compatible package arch labels for this channel"""
+def get_compatible_arches(channel_label):
+    """Return a list of compatible package arch labels for the given channel"""
    rhnSQL.initDB()
    h = rhnSQL.prepare(


This SQL query could use an update to JOIN ... ON

agraul · 2024-08-22T09:54:00Z

python/lzreposync/README.md

+``` sh
+$ pip install -e .
+```
+3. Install other required dependencies (required by spacewalk and other modules)


We should add these to pyproject.toml as dependencies, then step 2 will install them as well.

agraul · 2024-08-22T09:56:46Z

python/lzreposync/README.md

+pip install rpm
+pip install psycopg2-binary
+```
+4. Add a path configuration file (**Important!**)


The correct fix will be that we build python311 versions of these libraries and install them as RPMs. We'll do that later, it's not for you (and especially not for this PR).

agraul · 2024-08-22T10:01:28Z

python/lzreposync/src/lzreposync/__init__.py

+
+
+def main():
+    parser = argparse.ArgumentParser(


In the future, i.e. not in this PR, we should align the CLI interface as much as possible with spacewalk-repo-sync. We don't have to support all the flags, a subset is fine. Right now we have conflicting flags (e.g. -d means debug here but dry-run in spacewalk-repo-sync)

agraul · 2024-08-22T10:03:31Z

python/lzreposync/src/lzreposync/__init__.py

+        # pylint: disable-next=consider-using-f-string
+        arch = "(noarch|{})".format(args.arch)


Instead of silencing pylint, we should follow the rules we have set it for it. In this case it's as simple as:

Suggested change

# pylint: disable-next=consider-using-f-string

arch = "(noarch|{})".format(args.arch)

arch = f"(noarch|{args.arch})"

agraul · 2024-08-22T10:07:24Z

python/lzreposync/src/lzreposync/deb_metadata_parser.py

+
+from lzreposync.packages_parser import PackagesParser
+from lzreposync.translation_parser import TranslationParser
+from spacewalk.satellite_tools.syncLib import log, log2


Just a heads-up: logging in reposync currently suffers from a split-brain problem. We have rhnLog (with syncLib defining wrapper functions) one the one hand and logging on the other. The latter is also used by 3rd-party libraries (e.g. requests). It's one of the things on our list to clean up, so it does not matter much what you use now. That said, I suggest to stick to one of them to make it easy for yourself.

agraul · 2024-08-22T10:09:05Z

python/lzreposync/src/lzreposync/deb_metadata_parser.py

+
+
+#  pylint: disable-next=missing-class-docstring
+class DEBMetadataParser:


This class only has one method, I would move that to a "free" function at the module level. Adding another layer makes the name longer but does not give us any benefit

I wanted to add some abstraction when using OOP...
For example in function import_packages_in_batch_from_parser() (used in the tests) we're making use of this abstraction to import packages using abstract parser objects that implement the parse_packages_metadata() method.

And I'm afraid that using "free functions might add some complexity especially when passing a function as a parameter to another function instead of an object, in the above scenario.

However I'm not sure whether it's the best way to approach things in our case.
(pylint also complains about using one method within a class R0903: Too few public methods (1/2))

To use import_packages_in_batch_from_parser, you first instantiate a parser object. The parser object is only used to then call the single public method it has. In other words, at the call time of import_packages_in_batch_from_parser, the caller has to decide which parser to pass (rpm or deb). Therefore the caller could also pass specific functions (e.g. parse_rpm_metadata() or parse_deb_metadata()) to import_packages_in_batch_from_parser, or call these generator functions themselves and pass the generator object instead.

In simplified/pseudo code:

# parsing code def parse_rpm_metadata(primary_xml, repository_url, arch_filter, filelists_xml, cache_dir): primary_parser = PrimaryParser(primary_xml, repository_url, arch_filter) filelists_parser = FilelistsParser(filelists_xml, cache_dir, arch_filter) filelitsts_parser.parse_filelists() for package in primary_parser.parse(): ... yield rpm_package # import code def import_in_batches(packages, batch_size, channel=None, compatible_archs=None): total = 0 for i, batch in enumerate(batched(packages, batch_size)): failed, *_ = import_package_batch( to_process=batch, compatible_archs=compatible_archs, batch_index=i, to_link=True, channel=channel ) total += failed return total # test code def test_parse_and_import(): packages = parse_rpm_metadata(primary, repo_url, arch_filter, filelists) failed = import_in_batches(packages, 20) assert ...

I like this more than passing a function because we decouple the different concerns. Import code shouldn't need to bother with getting the data to import, that's someone else's job.
That said, in Python it's not a problem to pass functions around, they are first-class objects, e.g.:

>>> import random >>> def foo(): ... print("Hello, I'm Foo") ... >>> def maybe(f): ... if random.randint(0, 1) == 1: ... f() ... else: ... print("No luck. Maybe next time") ... >>> for _ in range(10): ... maybe(foo) ... Hello, I'm Foo No luck. Maybe next time No luck. Maybe next time No luck. Maybe next time No luck. Maybe next time Hello, I'm Foo No luck. Maybe next time Hello, I'm Foo Hello, I'm Foo No luck. Maybe next time

Very good explanation..and the approach is much better too.
Thanks.

agraul · 2024-08-22T10:10:37Z

python/lzreposync/src/lzreposync/deb_repo.py

+
+
+#  pylint: disable-next=missing-class-docstring
+class DebRepo(Repo):


We should consolidate the debian repo parsing. Right now it's split into repo_plugins.deb_src, common.repo and this deb_repo (which appears to be a copy of the class in deb_src

- Removed some todos. - Changed some sql queries with equivalent ones using JOIN...ON. - Some other minor cleanup

Optimized some code by changing classes and methods in some logics with free functions. Consolidated the debian repo parsing.

Completed the gpg signature check for rpm repositories, mainly for the repomd.xml file. This is done by downloading the signature file from the remote rpm repo, and executing 'gpg verify' to verify the repomd.xml file against its signature using the already added gpg keys on the filesystem. So, if you haven't already added the required gpg keyring on your system, you'll not be able to verify the repo. You should ideally run this version directly on the uyuni- server, because the gpg keyring will probably be present there.

makedirs() in uyuni.common.fileutils now accepts relative paths that consist of only a directory name or paths with trailing slashes.

Refactor: Allow more input variants in makedirs()

Completed the gpg signature check for debian repositories. If you haven't already added the required gpg keyring on your system, you'll not be able to verify the repo, and you'll normally get a GeneralRepoException. You should ideally run this version directly on the uyuni- server, because the gpg keyring will probably be present there.

github-actions bot added schema_migration_test_pgsql database schema_migration_test_oracle labels Jul 25, 2024

waterflow80 force-pushed the parse-and-import-debian-packages-md branch 3 times, most recently from 9cf2fd5 to d12d51a Compare August 2, 2024 01:44

waterflow80 changed the title ~~Parse and import debian packages metadata~~ Parse and import both rpm and deb packages metadata Aug 2, 2024

Create lzreposync subdirectory

860a02f

lzreposync will be a spacewalk-repo-sync replacement written in Python. It uses a src layout and a pyproject.toml. The target Python version is 3.11, compatibility with older Python versions is explicitly not a goal.

waterflow80 force-pushed the parse-and-import-debian-packages-md branch from 4e0c10e to 9740fbe Compare August 2, 2024 16:09

github-actions bot added python3 java java_lint_checkstyle java_pgsql_tests proxy API susemanager_unittests ui javascript_lint secondary-features ci-pipelines testing ruby_rubocop webapp salt-actions-implementations data-model xmlrpc test-framework old-ui build-validation-features labels Aug 2, 2024

waterflow80 requested a review from m-czernek August 4, 2024 17:15

agraul requested review from agraul and removed request for m-czernek August 5, 2024 08:17

waterflow80 added 3 commits August 15, 2024 03:00

Complete code for lzreposync version 0.1

4c7db58

This commit completes almost all the logic and use cases of the new lazy reposync. **Note** that this commit will be restructured and possibly divided into smaller and more convenient commits. This commit is for review purposes.

Complete tests for lzreposync service

4f8a070

waterflow80 force-pushed the parse-and-import-debian-packages-md branch from 6c07a22 to ea003b7 Compare August 19, 2024 16:53

waterflow80 force-pushed the parse-and-import-debian-packages-md branch from ea003b7 to 913f21c Compare August 19, 2024 16:58

aaannz reviewed Aug 20, 2024

View reviewed changes

agraul reviewed Aug 22, 2024

View reviewed changes

waterflow80 and others added 14 commits August 26, 2024 01:46

Optimize code and do some cleanup

8a49313

- Removed some todos. - Changed some sql queries with equivalent ones using JOIN...ON. - Some other minor cleanup

Optimize and consolidate code

6ccb3bf

Optimized some code by changing classes and methods in some logics with free functions. Consolidated the debian repo parsing.

Fix cachedir path formatting issue

2157d56

fixup! Complete lzreposync service entry point

dceccec

fixup! Optimize code and do some cleanup

2a79e72

fixup! Optimize and consolidate code

2f4c998

fixup! Complete latest version

9a95e5a

fixup! Optimize and consolidate code

fed31fe

fixup! Add remote_path column

b890b1c

Refactor: Allow more input variants in makedirs()

7d0c57c

makedirs() in uyuni.common.fileutils now accepts relative paths that consist of only a directory name or paths with trailing slashes.

Merge pull request #1 from agraul/refactor-makedirs

7aa602c

Refactor: Allow more input variants in makedirs()

fixup! Refactor: Allow more input variants in makedirs()

4042fc4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse and import both rpm and deb packages metadata #9101

Parse and import both rpm and deb packages metadata #9101

waterflow80 commented Jul 25, 2024 •

edited

Loading

github-actions bot commented Aug 2, 2024

aaannz Aug 20, 2024

waterflow80 Aug 20, 2024

agraul left a comment

agraul Aug 20, 2024

agraul Aug 20, 2024

agraul Aug 20, 2024

agraul Aug 22, 2024

agraul Aug 22, 2024

agraul Aug 22, 2024

agraul Aug 22, 2024

agraul Aug 22, 2024

agraul Aug 22, 2024

waterflow80 Aug 26, 2024

agraul Aug 26, 2024

waterflow80 Aug 26, 2024

agraul Aug 22, 2024

		# pylint: disable-next=consider-using-f-string
		arch = "(noarch\|{})".format(args.arch)

	# pylint: disable-next=consider-using-f-string
	arch = "(noarch\|{})".format(args.arch)
	arch = f"(noarch\|{args.arch})"



		# pylint: disable-next=missing-class-docstring
		class DEBMetadataParser:



		# pylint: disable-next=missing-class-docstring
		class DebRepo(Repo):

Parse and import both rpm and deb packages metadata #9101

Are you sure you want to change the base?

Parse and import both rpm and deb packages metadata #9101

Conversation

waterflow80 commented Jul 25, 2024 • edited Loading

What does this PR change?

GUI diff

Documentation

Test coverage

Links

Changelogs

Re-run a test

Before you merge

github-actions bot commented Aug 2, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agraul left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

waterflow80 commented Jul 25, 2024 •

edited

Loading