Synchronizing rpm repositories using channel lookup #9052

Closed
Commits (67)
4af1dd0
importing the batches of parsed packages
waterflow80 Jun 17, 2024
bcd6506
update the import_package_batch function
waterflow80 Jun 17, 2024
67a6d2a
setting a todo for checking the existence of the package before parsing
waterflow80 Jun 17, 2024
8d1f676
added a set of ignored attributes that will be handled later
waterflow80 Jun 17, 2024
4f34429
update the attribute mapping
waterflow80 Jun 17, 2024
bb0d1c8
map the dependency attribute name with correct name
waterflow80 Jun 17, 2024
6c145e5
cast some of the data to the correct format
waterflow80 Jun 17, 2024
b6f9bb5
update the parser using rpm_header
waterflow80 Jun 17, 2024
5d9d2ce
fix typo
waterflow80 Jun 17, 2024
2236cd2
temporary changes in importLib classes in order to make the import wo…
waterflow80 Jun 17, 2024
3183789
updated the ignored attribute list
waterflow80 Jun 17, 2024
7f0a84f
fixed xml element's value read problem
waterflow80 Jun 19, 2024
5fae52f
some updates in fake data usage for the parser
waterflow80 Jun 19, 2024
38c98d5
created a filelists parser that parses the content of the filelists.x…
waterflow80 Jun 19, 2024
52f3143
created a metadata_parser that uses both primary_parser and filelists…
waterflow80 Jun 19, 2024
f96f8c4
update the import call arguments
waterflow80 Jul 1, 2024
e207e36
ignoring packages with arch: "aarch64_ilp32". importLib couldn't proc…
waterflow80 Jul 1, 2024
7217738
some comments and logs
waterflow80 Jul 1, 2024
046dc88
check for batch size before import
waterflow80 Jul 1, 2024
e939546
setting some exceptions and todos
waterflow80 Jul 1, 2024
7e17747
update imports
waterflow80 Jul 1, 2024
d779b6f
updated the way we handle complex rpm attributes
waterflow80 Jul 1, 2024
4476ce5
tiny reformat for readability
waterflow80 Jul 1, 2024
39b0884
set the non-empty condition for the packages dict objects before pars…
waterflow80 Jul 1, 2024
4324fc7
update some comments and logging
waterflow80 Jul 1, 2024
81d20a7
updated the way we parse complex attribute elements
waterflow80 Jul 1, 2024
47c0de3
updated the way we set text element in primary parser
waterflow80 Jul 1, 2024
f1b0f0f
tiny reformat of the primary parser
waterflow80 Jul 1, 2024
f4908af
setting fake data for currently unknown attributes
waterflow80 Jul 1, 2024
060c0d6
removed unused exception
waterflow80 Jul 1, 2024
433ecad
Mapping the rpm flag values to the right numbers
waterflow80 Jul 2, 2024
f836d51
added information about the number of failed packages
waterflow80 Jul 2, 2024
bdd9984
separated the creation/formatting of packages from the importing of p…
waterflow80 Jul 2, 2024
37eb735
update exceptions
waterflow80 Jul 2, 2024
71a3d0e
set back the initial __init__DB() parameters (previously we've tempor…
waterflow80 Jul 3, 2024
551a4eb
set back the initial pg db parameters (previously we've temporarily h…
waterflow80 Jul 3, 2024
f92bea5
update logging
waterflow80 Jul 3, 2024
838b53b
update imports
waterflow80 Jul 3, 2024
270da57
currently ignore any batch importing error and just skip to the next
waterflow80 Jul 3, 2024
6f703d8
add the import_signature boolean parameter in PackageImport class con…
waterflow80 Jul 3, 2024
4e4f42c
reformatted the code using pylint
waterflow80 Jul 4, 2024
5ba0ce6
reformatted the packageImport.py code using pylint
waterflow80 Jul 4, 2024
3116c48
updated the exception handling for failing packages - still needs review
waterflow80 Jul 4, 2024
2948e0e
fixed linting format
waterflow80 Jul 4, 2024
8ed594e
removed old tests and planned new tests
waterflow80 Jul 4, 2024
512cdf7
fix linting for test file
waterflow80 Jul 4, 2024
a5458d2
ignoring fullFileList function
waterflow80 Jul 4, 2024
eecd794
reformat logging and fixed linting
waterflow80 Jul 4, 2024
d7f50ec
added an important todo
waterflow80 Jul 4, 2024
28a6643
code optimization
waterflow80 Jul 5, 2024
8ed0baf
update logging
waterflow80 Jul 5, 2024
df85696
added a todo
waterflow80 Jul 5, 2024
a846bc7
update the cache dir path for filelists
waterflow80 Jul 9, 2024
c47c44f
removed unused primary test file
waterflow80 Jul 9, 2024
f1f72e5
added test cases for primary parser, filelist parser, and metadata pa…
waterflow80 Jul 9, 2024
dfe82bb
deleted unnecessary xml test files
waterflow80 Jul 9, 2024
0c71bf5
implemented the 'filter with arch' functionality for rpm packages import
waterflow80 Jul 9, 2024
752ba34
added unit tests for the rpm packages import with arch filter
waterflow80 Jul 9, 2024
d8d3b36
added 'noarch' to be always included in the import
waterflow80 Jul 9, 2024
5e52527
completed parsing the location/href => remote_path attribute from pri…
waterflow80 Jul 10, 2024
40895cd
Corrected an error log message
waterflow80 Jul 10, 2024
e436da1
updated the given url path, and updated the remote_url value with the…
waterflow80 Jul 11, 2024
12452a9
some url formatting
waterflow80 Jul 14, 2024
13a4a7e
explicitly specify argument names when calling a method
waterflow80 Jul 14, 2024
b29ef17
synchronizing repositories of a given channel given the channel id
waterflow80 Jul 14, 2024
49aee7c
extracted the import into a reusable function
waterflow80 Jul 15, 2024
209696e
change the arch filter to use the channel_arch value instead
waterflow80 Jul 15, 2024
2 changes: 1 addition & 1 deletion python/lzreposync/pyproject.toml
@@ -1,6 +1,6 @@
[build-system]
requires = ["setuptools", "setuptools-scm"]
-bild-backend = "setuptools.build_meta"
+build-backend = "setuptools.build_meta"

[project]
name = "lzreposync"
72 changes: 65 additions & 7 deletions python/lzreposync/src/lzreposync/__init__.py
@@ -1,14 +1,18 @@
 # pylint: disable=missing-module-docstring

 import argparse
 import logging
 from itertools import islice

+from lzreposync import db_utils
+from lzreposync.import_utils import import_package_batch
 from lzreposync.rpm_repo import RPMRepo


 # TODO: put this function in a better location
 def batched(iterable, n):
     if n < 1:
-        raise ValueError('n must be at least one')
+        raise ValueError("n must be at least one")
     iterator = iter(iterable)
     while batch := tuple(islice(iterator, n)):
         yield batch
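For readers unfamiliar with the pattern, the `batched` helper added here slices any iterable into fixed-size tuples so packages can be imported in chunks. A standalone sketch of the same logic:

```python
from itertools import islice


def batched(iterable, n):
    """Yield successive n-sized tuples from any iterable (the last may be shorter)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # The walrus expression stops cleanly once islice() yields an empty tuple
    while batch := tuple(islice(iterator, n)):
        yield batch


packages = (f"pkg-{i}" for i in range(7))  # stand-in for the package generator
print([len(b) for b in batched(packages, 3)])  # → [3, 3, 1]
```

Because the input is consumed lazily, this works on the package-metadata generator without loading the whole repository into memory.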
@@ -25,9 +29,10 @@ def main():
         "--url",
         "-u",
         help="The target url of the remote repository of which we'll "
         "parse the metadata",
         dest="url",
         type=str,
+        default=None,
     )

parser.add_argument(
@@ -66,11 +71,64 @@ def main():
         type=int,
     )

+    parser.add_argument(
+        "-a",
+        "--arch",
+        help="A filter for package architecture. Can be a regex, for example: 'x86_64', '(x86_64|arch_64)'",
+        default=".*",
+        dest="arch",
+        type=str,
+    )
+
+    parser.add_argument(
+        "--channel",
+        help="The channel id of which you want to synchronize repositories",
[Review comment, Contributor]: Not the channel id, but the channel label

+        dest="channel",
+        type=int,
+        default=None,
+    )
+
+    # TODO encapsulate everything in a class LzRepoSync
+
     args = parser.parse_args()
+    arch = args.arch
+    if arch != ".*":
+        # pylint: disable-next=consider-using-f-string
+        arch = "(noarch|{})".format(args.arch)

     logging.getLogger().setLevel(args.loglevel)
-    rpm_repository = RPMRepo(args.name, args.cache, args.url)  # TODO args.url should be args.repo, no ?
-    packages = rpm_repository.get_packages_metadata()  # packages is a generator
-    for batch in batched(packages, args.batch_size):
-        print(f"Importing a batch of {len(batch)} packages...")
-        # TODO: complete the import
+    if args.url:
+        rpm_repository = RPMRepo(args.name, args.cache, args.url, arch)
+        packages = rpm_repository.get_packages_metadata()  # packages is a generator
+        failed = 0
+        for i, batch in enumerate(batched(packages, args.batch_size)):
+            failed += import_package_batch(batch, i)
+        logging.debug("Completed import with %d failed packages", failed)
[Review comment, Contributor]: You can probably move this into a function to share it with the channel case.
+    else:
+        # No url specified
+        if args.channel:
+            channel_id = args.channel
+            target_repos = db_utils.get_repositories_by_channel_id(channel_id)
[Review comment, Contributor]: No one knows the channel ID, you cannot ask that as a command line parameter. What users will pass here is the channel label.
+            for repo in target_repos:
+                if repo.repo_type == "yum":
+                    rpm_repository = RPMRepo(
+                        repo.repo_label, args.cache, repo.source_url, arch
[Review comment, Contributor]: Use the channel arch instead of arch here: the architecture for a channel is provided by the database.
+                    )
+                    logging.debug("Importing package for repo %s", repo.repo_label)
+                    failed = 0
+                    packages = rpm_repository.get_packages_metadata()
+                    for i, batch in enumerate(batched(packages, args.batch_size)):
+                        failed += import_package_batch(batch, i)
+                    logging.debug(
+                        "Completed import for repo %s with %d failed packages",
+                        repo.repo_label,
+                        failed,
+                    )
+
+                else:
+                    # TODO: handle repositories other than rpm
+                    logging.debug("Not supported repo type: %s", repo.repo_type)
+                    continue
+        else:
+            logging.error("Either --url or --channel must be specified")
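The `--arch` handling above wraps any non-default pattern so that 'noarch' packages always pass the filter. A minimal sketch of that behavior (the `build_arch_filter` name is illustrative, not from the PR):

```python
import re


def build_arch_filter(arch=".*"):
    """Mirror the CLI logic: wrap a non-default pattern so 'noarch' always matches."""
    if arch != ".*":
        # pylint: disable-next=consider-using-f-string
        arch = "(noarch|{})".format(arch)
    return arch


pattern = build_arch_filter("x86_64")
print(pattern)                                 # → (noarch|x86_64)
print(bool(re.fullmatch(pattern, "noarch")))   # → True
print(bool(re.fullmatch(pattern, "aarch64")))  # → False
```

Note that `re.fullmatch` (used later in the filelists parser) anchors the whole string, so "aarch64" is rejected even though it contains "x86_64"'s alternation siblings nowhere; a bare `re.search` would behave differently.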
16 changes: 16 additions & 0 deletions python/lzreposync/src/lzreposync/channel_dto.py
@@ -0,0 +1,16 @@
# pylint: disable=missing-module-docstring

from typing import List

from lzreposync.repo_dto import RepoDTO


class ChannelDTO:
    """
    A temporary data structure to hold some minor channel information
    """

    def __init__(self, label, repositories: List[RepoDTO], channel_arch=None):
        self.label = label
        self.repositories = repositories
        self.channel_arch = channel_arch
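A hypothetical illustration of how the two DTOs compose — the `RepoDTO` stand-in and the channel label below are made up for the example, since only `ChannelDTO` is shown in this file:

```python
from typing import List


class RepoDTO:  # simplified stand-in for lzreposync.repo_dto.RepoDTO
    def __init__(self, repo_label, source_url):
        self.repo_label = repo_label
        self.source_url = source_url


class ChannelDTO:
    """A temporary data structure to hold some minor channel information"""

    def __init__(self, label, repositories: List[RepoDTO], channel_arch=None):
        self.label = label
        self.repositories = repositories
        self.channel_arch = channel_arch


# A channel aggregates its repositories plus the arch used for filtering
channel = ChannelDTO(
    "example-channel-x86_64",
    [RepoDTO("repo-1", "https://example.com/repo1")],
    channel_arch="x86_64",
)
print(len(channel.repositories))  # → 1
```

Carrying `channel_arch` on the DTO is what lets the sync code filter packages by the channel's architecture rather than a user-supplied regex, as the review comments suggest.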
37 changes: 37 additions & 0 deletions python/lzreposync/src/lzreposync/db_utils.py
@@ -0,0 +1,37 @@
# pylint: disable=missing-module-docstring

from lzreposync.repo_dto import RepoDTO
from spacewalk.server import rhnSQL


def get_repositories_by_channel_id(channel_id):
    """
    Fetch repositories information from the database, and return a list of
    RepoDTO objects
    """
    rhnSQL.initDB()
    h = rhnSQL.prepare(
        """
        select s.id, s.source_url, s.metadata_signed, s.label as repo_label, cst.label as repo_type_label
        from rhnContentSource s,
             rhnChannelContentSource cs,
             rhnContentSourceType cst
        where s.id = cs.source_id
          and cst.id = s.type_id
          and cs.channel_id = :channel_id"""
[Review comment, Contributor]: You could join with the rhnChannel table and get the repositories by channel label. In which case you would need to copy the channel architecture in each RepoDTO.
[Reply, Collaborator (author)]: done
    )
    h.execute(channel_id=int(channel_id))
    sources = h.fetchall_dict()
    repositories = map(
        lambda source: RepoDTO(
            repo_id=source["id"],
            repo_label=source["repo_label"],
            repo_type=source["repo_type_label"],
            source_url=source["source_url"],
            metadata_signed=source["metadata_signed"],
        ),
        sources,
    )
    rhnSQL.closeDB()

    return list(repositories)
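The row-to-DTO mapping can be exercised without a database by feeding `fetchall_dict()`-shaped rows through the same `map`/`lambda` pattern. The `RepoDTO` stand-in and sample row below are illustrative, not the project's real class:

```python
class RepoDTO:  # minimal stand-in mirroring the fields used in db_utils
    def __init__(self, repo_id, repo_label, repo_type, source_url, metadata_signed):
        self.repo_id = repo_id
        self.repo_label = repo_label
        self.repo_type = repo_type
        self.source_url = source_url
        self.metadata_signed = metadata_signed


# Rows shaped like rhnSQL's fetchall_dict() output for the query above
sources = [
    {"id": 1, "repo_label": "repo-a", "repo_type_label": "yum",
     "source_url": "https://example.com/a", "metadata_signed": "Y"},
]

repositories = list(
    map(
        lambda source: RepoDTO(
            repo_id=source["id"],
            repo_label=source["repo_label"],
            repo_type=source["repo_type_label"],
            source_url=source["source_url"],
            metadata_signed=source["metadata_signed"],
        ),
        sources,
    )
)
print(repositories[0].repo_type)  # → yum
```

Materializing the `map` with `list(...)` before `rhnSQL.closeDB()` matters: a lazy map over cursor rows evaluated after the connection closes would fail.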
120 changes: 120 additions & 0 deletions python/lzreposync/src/lzreposync/filelists_parser.py
@@ -0,0 +1,120 @@
# pylint: disable=missing-module-docstring

import gzip
import logging
import os
import re
import shutil
import xml.etree.ElementTree as ET
from xml.dom import pulldom


def map_attribute(attr):
    attr_map = {"ver": "version", "rel": "release"}
    return attr_map.get(attr, attr)


def cache_xml_node(node, cache_dir):
    """
    Save the content of the given xml node into an xml file in the given cache directory
    node: of type xml.dom.minidom.Element
    """
    pkgid = node.getAttributeNode("pkgid").value

    xml_content = node.toxml()
    cache_file = os.path.join(cache_dir, pkgid)

    if not os.path.exists(cache_dir):
        logging.debug("Creating cache directory: %s", cache_dir)
        os.makedirs(cache_dir)

    with open(cache_file, "w", encoding="utf-8") as pkg_files:
        logging.debug("Caching file %s", cache_file)
        pkg_files.write(xml_content)

# pylint: disable-next=missing-class-docstring
class FilelistsParser:
    def __init__(self, filelists_file, cache_dir="./.cache", arch_filter=".*"):
        """
        filelists_file: In gzip format
        """
        self.filelists_file = filelists_file
        self.cache_dir = cache_dir
        self.arch_filter = arch_filter
        self.num_packages = -1  # The number of packages in the given filelists file
        self.num_parsed_packages = 0  # The number of packages parsed
        self.parsed = False  # Tells whether the filelists file has been parsed or not

    def parse_filelists(self):
        """
        Parse the given filelists.xml file (in gzip format) and save the filelist information
        of each package in a separate file, where the name of the file is the 'pkgid' with no
        extension, e.g.: 1c51349b5b35baa58f4941528d25a1306e84b71109051705138dc3577a38bad4
        """

        with gzip.open(self.filelists_file) as gz_filelists:
            doc = pulldom.parse(gz_filelists)
            for event, node in doc:
                if event == pulldom.START_ELEMENT and node.tagName == "filelists":
                    # Save the number of packages contained in the filelists file
                    num_packages = node.getAttributeNode("packages").value
                    self.num_packages = num_packages
                elif event == pulldom.START_ELEMENT and node.tagName == "package":
                    doc.expandNode(node)
                    pkg_arch = node.getAttributeNode("arch").value
                    if re.fullmatch(self.arch_filter, pkg_arch):  # Filter by arch
                        # Save the package's filelist info in the cache directory
                        cache_xml_node(node, self.cache_dir)
                        self.num_parsed_packages += 1

        self.parsed = True

    def get_package_filelist(self, pkgid):
        """
        Read the filelist information for the package with the given pkgid,
        parse the information and return a dict containing the filelist info
        """

        filelist_path = os.path.join(self.cache_dir, pkgid)

        # Read the cached filelist file
        if not os.path.exists(filelist_path):
            logging.debug("No filelist file found for package %s", pkgid)
            if not self.parsed:
                logging.debug("Parsing filelists file...")
                self.parse_filelists()
                self.parsed = True
            else:
                logging.error("Couldn't find filelist file for package %s", pkgid)
                return None

        with open(filelist_path, "r", encoding="utf-8") as filelist_xml:
            tree = ET.parse(filelist_xml)
            root = tree.getroot()

        filelist = {}
        filelist["pkgid"] = pkgid
        filelist["files"] = []
        # Set version information (normally the same as in the primary.xml file for the same package)
        for attr in ("ver", "epoch", "rel"):
            try:
                filelist[map_attribute(attr)] = root[0].attrib[attr]
            except KeyError as key:
                logging.debug("missing %s information for package %s", key, pkgid)

        for file in root[1:]:
            filelist["files"].append(file.text)

        return filelist

    def clear_cache(self):
        """
        Remove the cached filelist files from the cache directory, including the cache directory itself
        """
        if os.path.exists(self.cache_dir):
            logging.debug("Removing %s directory and its content", self.cache_dir)
            shutil.rmtree(self.cache_dir)
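The streaming approach used by `FilelistsParser` can be demonstrated end to end on a tiny synthetic filelists.xml: this sketch gzips an in-memory document to a temp file and runs the same pulldom event loop with the arch filter. The file contents and pkgids are made up, and the real filelists.xml carries an xmlns that is omitted here for brevity:

```python
import gzip
import re
import tempfile
from xml.dom import pulldom

XML = """<filelists packages="2">
  <package pkgid="abc" name="foo" arch="x86_64">
    <version epoch="0" ver="1.0" rel="1"/>
    <file>/usr/bin/foo</file>
  </package>
  <package pkgid="def" name="bar" arch="s390x">
    <version epoch="0" ver="2.0" rel="1"/>
    <file>/usr/bin/bar</file>
  </package>
</filelists>"""

# Write a gzipped filelists file, as the parser expects
with tempfile.NamedTemporaryFile(suffix=".xml.gz", delete=False) as tmp:
    tmp.write(gzip.compress(XML.encode("utf-8")))
    path = tmp.name

matched = []
with gzip.open(path) as gz:
    doc = pulldom.parse(gz)
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.tagName == "package":
            doc.expandNode(node)  # pull the whole <package> subtree into memory
            if re.fullmatch("(noarch|x86_64)", node.getAttributeNode("arch").value):
                matched.append(node.getAttributeNode("pkgid").value)

print(matched)  # → ['abc']
```

Only one `<package>` subtree is expanded at a time, which is what keeps memory bounded on multi-hundred-megabyte filelists files; the s390x package is skipped by the same `re.fullmatch` arch filter the parser uses.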
97 changes: 0 additions & 97 deletions python/lzreposync/src/lzreposync/importUtils.py

This file was deleted.
