Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse and import both rpm and deb packages metadata #9101

Open
wants to merge 39 commits into
base: master
Choose a base branch
from

Conversation

waterflow80
Copy link
Collaborator

@waterflow80 waterflow80 commented Jul 25, 2024

What does this PR change?

Parse both rpm and deb packages/repositories' metadata files and import the data to the database.

GUI diff

No difference.

  • DONE

Documentation

  • No documentation needed: only internal and user invisible changes

  • DONE

Test coverage

  • No tests: still in progress

  • DONE

Links

Issue(s): #
Port(s): # add downstream PR(s), if any

  • DONE

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

  • No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

  • Re-run test "changelog_test"
  • Re-run test "backend_unittests_pgsql"
  • Re-run test "java_pgsql_tests"
  • Re-run test "schema_migration_test_pgsql"
  • Re-run test "susemanager_unittests"
  • Re-run test "javascript_lint"
  • Re-run test "spacecmd_unittests"

Before you merge

Check How to branch and merge properly!

@waterflow80 waterflow80 force-pushed the parse-and-import-debian-packages-md branch 3 times, most recently from 9cf2fd5 to d12d51a Compare August 2, 2024 01:44
@waterflow80 waterflow80 changed the title Parse and import debian packages metadata Parse and import both rpm and deb packages metadata Aug 2, 2024
lzreposync will be a spacewalk-repo-sync replacement written in Python.
It uses a src layout and a pyproject.toml. The target Python version is
3.11, compatibility with older Python versions is explicitly not a goal.
@waterflow80 waterflow80 force-pushed the parse-and-import-debian-packages-md branch from 4e0c10e to 9740fbe Compare August 2, 2024 16:09
Copy link
Contributor

github-actions bot commented Aug 2, 2024

👋 Hello! Thanks for contributing to our project.
Acceptance tests will take some time (aprox. 1h), please be patient ☕
You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/9101/checks
Once tests finish, if they fail, you can check 👀 the cucumber report. See the link at the output of the action.
You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/9101/checks.

If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.

Reference tests:

KNOWN ISSUES

Sometimes the build can fail when pulling new jar files from download.opensuse.org . This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide.

Happy hacking!
⚠️ You should not merge if acceptance tests fail to pass. ⚠️

@agraul agraul requested review from agraul and removed request for m-czernek August 5, 2024 08:17
This commit completes almost all the logic and use cases
of the new lazy reposync.

**Note** that this commit will be restructured and possibly
divided into smaller and more convenient commits.
This commit is for review purposes.
Seemingly this error happened because we reached the maximum number
of unclosed db connections. And thought that this might be due to
the fact that the close() method in the Database class was not
implemented, and the rhnSQL.closeDB() was not closing any connection.

However, we're still hesitating about whether this is the root cause
of the problem, because the old(current) reposync is was using it
without any error.
@waterflow80 waterflow80 force-pushed the parse-and-import-debian-packages-md branch from 6c07a22 to ea003b7 Compare August 19, 2024 16:53
This is the latest and almost the final version of the
lzreposync service. (gpg sig check not complete)

It contains pretty much all the necessary tests,
including the ones for updates/patches import.

Some of the remaining 'todos' are either for code
enhancements or some unclear concepts that
will be discussed with the team.

Of course, this commit will be split into smaller
ones later after rebase.
@waterflow80 waterflow80 force-pushed the parse-and-import-debian-packages-md branch from ea003b7 to 913f21c Compare August 19, 2024 16:58
Comment on lines 58 to 59
path VARCHAR(1000),
remote_path VARCHAR(1000),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason, except 'we did it like that before', to limit path to 1000 characters?
In postgresql varchar is implemented using TEXT datatype with string length check on top of it. So using limited VARCHAR is actually a tiny little slower then just using TEXT and has no space or indexing benefit.

So if there is no reason to limit this to 1000 characters, please use TEXT as a datatype here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely..thanks for the review,
I did not have any specific reason for using VARCHAR(1000)...I'll change it to TEXT instead.

Copy link
Member

@agraul agraul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first part, I will continue but want to send some comments already. Besides what I wrote, I suggest to go over the pylint comments and try to resolve the complaints instead of disabling the checks. lzreposync is a new program and should meet our current standards better than the old code 😅


from lzreposync.repo_dto import RepoDTO
from spacewalk.common.rhnConfig import cfg_component
from spacewalk.server import rhnSQL, rhnChannel


# stolen from python/spacewalk/server/test/misc_functions.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything copied will need to be maintained in two places. Can you add a TODO comment to move this to a common place? You don't have to do the move, a comment we can easily search for later is good.

finally:
rhnSQL.closeDB()


# Stolen from python/spacewalk/satellite_tools/reposync.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

def get_compatible_arches(channel_id):
"""Return a list of compatible package arch labels for this channel"""
def get_compatible_arches(channel_label):
"""Return a list of compatible package arch labels for the given channel"""
rhnSQL.initDB()
h = rhnSQL.prepare(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This SQL query could use an update to JOIN ... ON

``` sh
$ pip install -e .
```
3. Install other required dependencies (required by spacewalk and other modules)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add these to pyproject.toml as dependencies, then step 2 will install them as well.

pip install rpm
pip install psycopg2-binary
```
4. Add a path configuration file (**Important!**)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The correct fix will be that we build python311 versions of these libraries and install them as RPMs. We'll do that later, it's not for you (and especially not for this PR).



def main():
parser = argparse.ArgumentParser(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, i.e. not in this PR, we should align the CLI interface as much as possible with spacewalk-repo-sync. We don't have to support all the flags, a subset is fine. Right now we have conflicting flags (e.g. -d means debug here but dry-run in spacewalk-repo-sync)

Comment on lines 140 to 141
# pylint: disable-next=consider-using-f-string
arch = "(noarch|{})".format(args.arch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of silencing pylint, we should follow the rules we have set it for it. In this case it's as simple as:

Suggested change
# pylint: disable-next=consider-using-f-string
arch = "(noarch|{})".format(args.arch)
arch = f"(noarch|{args.arch})"


from lzreposync.packages_parser import PackagesParser
from lzreposync.translation_parser import TranslationParser
from spacewalk.satellite_tools.syncLib import log, log2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads-up: logging in reposync currently suffers from a split-brain problem. We have rhnLog (with syncLib defining wrapper functions) one the one hand and logging on the other. The latter is also used by 3rd-party libraries (e.g. requests). It's one of the things on our list to clean up, so it does not matter much what you use now. That said, I suggest to stick to one of them to make it easy for yourself.



# pylint: disable-next=missing-class-docstring
class DEBMetadataParser:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class only has one method, I would move that to a "free" function at the module level. Adding another layer makes the name longer but does not give us any benefit

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to add some abstraction when using OOP...
For example in function import_packages_in_batch_from_parser() (used in the tests) we're making use of this abstraction to import packages using abstract parser objects that implement the parse_packages_metadata() method.

And I'm afraid that using "free functions might add some complexity especially when passing a function as a parameter to another function instead of an object, in the above scenario.

However I'm not sure whether it's the best way to approach things in our case.
(pylint also complains about using one method within a class R0903: Too few public methods (1/2))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use import_packages_in_batch_from_parser, you first instantiate a parser object. The parser object is only used to then call the single public method it has. In other words, at the call time of import_packages_in_batch_from_parser, the caller has to decide which parser to pass (rpm or deb). Therefore the caller could also pass specific functions (e.g. parse_rpm_metadata() or parse_deb_metadata()) to import_packages_in_batch_from_parser, or call these generator functions themselves and pass the generator object instead.

In simplified/pseudo code:

# parsing code
def parse_rpm_metadata(primary_xml, repository_url, arch_filter, filelists_xml, cache_dir):
    primary_parser = PrimaryParser(primary_xml, repository_url, arch_filter)
    filelists_parser = FilelistsParser(filelists_xml, cache_dir, arch_filter)
    filelitsts_parser.parse_filelists()
    for package in primary_parser.parse():
        ...
        yield rpm_package

# import code
def import_in_batches(packages, batch_size, channel=None, compatible_archs=None):
    total = 0
    for i, batch in enumerate(batched(packages, batch_size)):
        failed, *_ = import_package_batch(
            to_process=batch,
            compatible_archs=compatible_archs,
            batch_index=i,
            to_link=True,
            channel=channel
        )
        total += failed
    return total

# test code
def test_parse_and_import():
    packages = parse_rpm_metadata(primary, repo_url, arch_filter, filelists)
    failed = import_in_batches(packages, 20)
    assert ...

I like this more than passing a function because we decouple the different concerns. Import code shouldn't need to bother with getting the data to import, that's someone else's job.
That said, in Python it's not a problem to pass functions around, they are first-class objects, e.g.:

>>> import random
>>> def foo():
...     print("Hello, I'm Foo")
... 
>>> def maybe(f):
...     if random.randint(0, 1) == 1:
...        f() 
...     else:
...        print("No luck. Maybe next time")
... 
>>> for _ in range(10):
...     maybe(foo)
... 
Hello, I'm Foo
No luck. Maybe next time
No luck. Maybe next time
No luck. Maybe next time
No luck. Maybe next time
Hello, I'm Foo
No luck. Maybe next time
Hello, I'm Foo
Hello, I'm Foo
No luck. Maybe next time

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good explanation..and the approach is much better too.
Thanks.



# pylint: disable-next=missing-class-docstring
class DebRepo(Repo):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consolidate the debian repo parsing. Right now it's split into repo_plugins.deb_src, common.repo and this deb_repo (which appears to be a copy of the class in deb_src

waterflow80 and others added 14 commits August 26, 2024 01:46
- Removed some todos.
- Changed some sql queries with equivalent ones using
JOIN...ON.
- Some other minor cleanup
Optimized some code by changing classes and methods
in some logics with free functions.

Consolidated the debian repo parsing.
Completed the gpg signature check for rpm repositories,
mainly for the repomd.xml file.

This is done by downloading the signature file from the
remote rpm repo, and executing 'gpg verify' to verify the
repomd.xml file against its signature using the already
added gpg keys on the filesystem.

So, if you haven't already added the required gpg keyring
on your system, you'll not be able to verify the repo.

You should ideally run this version directly on the uyuni-
server, because the gpg keyring will probably be present
there.
makedirs() in uyuni.common.fileutils now accepts relative paths that
consist of only a directory name or paths with trailing slashes.
Refactor: Allow more input variants in makedirs()
Completed the gpg signature check for debian repositories.

If you haven't already added the required gpg keyring
on your system, you'll not be able to verify the repo,
and you'll normally get a GeneralRepoException.

You should ideally run this version directly on the uyuni-
server, because the gpg keyring will probably be present
there.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment