Commit

Merge pull request #39 from MITLibraries/maintenance-09-2024
Maintenance 09 2024
ehanson8 authored Sep 20, 2024
2 parents 50b8009 + 0021d04 commit d171f9b
Showing 15 changed files with 1,223 additions and 651 deletions.
35 changes: 13 additions & 22 deletions .github/pull-request-template.md
@@ -1,39 +1,30 @@
### What does this PR do?

Describe the overall purpose of the PR changes. Doesn't need to be as specific as the
individual commits.

### Helpful background context

Describe any additional context beyond what the PR accomplishes if it is likely to be
useful to a reviewer.

Delete this section if it isn't applicable to the PR.
### Purpose and background context
Describe the overall purpose of the PR changes and any useful background context.

### How can a reviewer manually see the effects of these changes?

Explain how to see the proposed changes in the application if possible.

Delete this section if it isn't applicable to the PR.

### Includes new or updated dependencies?
YES | NO

### Changes expectations for external applications?
YES | NO

### What are the relevant tickets?

Include links to Jira Software and/or Jira Service Management tickets here.
- Include links to Jira Software and/or Jira Service Management tickets here.

### Developer

- [ ] All new ENV is documented in README (or there is none)
- [ ] All new ENV is documented in README
- [ ] All new ENV has been added to staging and production environments
- [ ] All related Jira tickets are linked in commit message(s)
- [ ] Stakeholder approval has been confirmed (or is not needed)

### Code Reviewer

- [ ] The commit message is clear and follows our guidelines
(not just this pull request message)
### Code Reviewer(s)
- [ ] The commit message is clear and follows our guidelines (not just this PR message)
- [ ] There are appropriate tests covering any new functionality
- [ ] The documentation has been updated or is unnecessary
- [ ] The changes have been verified
- [ ] The provided documentation is sufficient for understanding any new functionality introduced
- [ ] Any manual tests have been performed **or** provided examples verified
- [ ] New dependencies are appropriate or there were no changes

2 changes: 1 addition & 1 deletion .python-version
@@ -1 +1 @@
3.11
3.12
8 changes: 4 additions & 4 deletions Dockerfile
@@ -14,10 +14,10 @@ RUN apt-get update \
&& add-apt-repository ppa:deadsnakes/ppa \
&& apt-get update

# Install Python 3.11
RUN apt-get install -y python3.11 python3.11-venv python3.11-dev
# Install Python
RUN apt-get install -y python3.12 python3.12-venv python3.12-dev

# Install pip for Python 3.11
# Install pip for Python
RUN apt-get install -y python3-pip

# Upgrade pip and install pipenv
@@ -28,7 +28,7 @@ RUN pip3 install --upgrade pip \
# Setup python virtual environment
WORKDIR /browsertrix-harvester
COPY Pipfile /browsertrix-harvester/Pipfile
RUN pipenv install --python 3.11
RUN pipenv install --python 3.12

# Copy full browstrix-harvester app
COPY pyproject.toml /browsertrix-harvester/
4 changes: 4 additions & 0 deletions Makefile
@@ -7,6 +7,10 @@ ECR_URL_DEV:=222053980223.dkr.ecr.us-east-1.amazonaws.com/browsertrix-harvester-
SHELL=/bin/bash
DATETIME:=$(shell date -u +%Y%m%dT%H%M%SZ)

help: ## Print this message
@awk 'BEGIN { FS = ":.*##"; print "Usage: make <target>\n\nTargets:" } \
/^[-_[:alpha:]]+:.?*##/ { printf " %-15s%s\n", $$1, $$2 }' $(MAKEFILE_LIST)
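
As a rough illustration of what the new `help` target does, here is a minimal Python sketch of the same selection and formatting logic. It is an approximation, not the project's code: `format_help` and `sample` are hypothetical names, `[-_A-Za-z]` stands in for awk's `[-_[:alpha:]]`, and `.*` replaces the awk-style `.?*`, which Python's `re` module rejects.

```python
import re

# Matches "target: ## description". Only lines carrying a double "##" qualify.
HELP_LINE = re.compile(r"^([-_A-Za-z]+):.*##(.*)$")


def format_help(makefile_text: str) -> str:
    """Build the help text that the awk one-liner would print (approximation)."""
    lines = ["Usage: make <target>", "", "Targets:"]
    for line in makefile_text.splitlines():
        match = HELP_LINE.match(line)
        if match:
            target, description = match.groups()
            # Mirrors awk's printf " %-15s%s\n": left-justify the target in 15 columns.
            lines.append(f" {target:<15}{description}")
    return "\n".join(lines)


# Hypothetical sample built from lines visible in this diff.
sample = (
    "help: ## Print this message\n"
    "install: # install python dependencies\n"
    "\tpipenv install --dev\n"
)
print(format_help(sample))
```

Because the pattern requires `##`, a target documented with a single `#`, such as `install` in the context lines above, would not be listed.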

### Dependency commands ###
install: # install python dependencies
pipenv install --dev
16 changes: 8 additions & 8 deletions Pipfile
@@ -4,30 +4,30 @@ verify_ssl = true
name = "pypi"

[packages]
bs4 = "*"
click = "*"
sentry-sdk = "*"
warcio = "*"
requests = "*"
pandas = "*"
bs4 = "*"
requests = "*"
sentry-sdk = "*"
smart-open = {version = "*", extras = ["s3"]}
warcio = "*"
yake = "*"

[dev-packages]
black = "*"
coverage = "*"
coveralls = "*"
ipython = "*"
mypy = "*"
pandas-stubs = "*"
pre-commit = "*"
pytest = "*"
ruff = "*"
safety= "*"
pre-commit = "*"
ipython = "*"
types-beautifulsoup4 = "*"
pandas-stubs = "*"

[requires]
python_version = "3.11"
python_version = "3.12"

[scripts]
harvester = "python -c \"from harvester.cli import main; main()\""
1,740 changes: 1,150 additions & 590 deletions Pipfile.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions harvester/cli.py
@@ -1,4 +1,5 @@
"""harvester.cli"""

# ruff: noqa: FBT001

import logging
1 change: 1 addition & 0 deletions harvester/exceptions.py
@@ -1,4 +1,5 @@
"""browsertrix_harvest.exceptions."""

# ruff: noqa: N818


24 changes: 14 additions & 10 deletions harvester/metadata.py
@@ -1,9 +1,10 @@
"""harvester.parse"""

# ruff: noqa: N813

import logging
import time
import xml.etree.ElementTree as etree
import xml.etree.ElementTree as ET

import pandas as pd
import smart_open # type: ignore[import]
@@ -118,15 +119,16 @@ def generate_metadata(self, include_fulltext: bool = False) -> "CrawlMetadataRec

# augment with metadata parsed from the website's HTML content
html_content = wacz_client.get_website_content(
row.filename, row.offset, decode=True
str(row.filename), str(row.offset), decode=True
)
html_metadata = self.get_html_content_metadata(html_content)
metadata.update(html_metadata)

# augment again with data parsed from, and including, HTML fulltext
metadata.update(
self.parse_fulltext_fields(
row.text, include_fulltext=include_fulltext
row.text, # type:ignore[arg-type]
include_fulltext=include_fulltext,
)
)

@@ -198,9 +200,11 @@ def parse_fulltext_fields(
fulltext = self._remove_fulltext_whitespace(raw_fulltext)
return {
"fulltext": fulltext if include_fulltext else None,
"fulltext_keywords": self._generate_fulltext_keywords(fulltext)
if include_fulltext_keywords
else None,
"fulltext_keywords": (
self._generate_fulltext_keywords(fulltext)
if include_fulltext_keywords
else None
),
}

@property
@@ -242,15 +246,15 @@ def to_xml(self) -> bytes:
<record>...</record>, ...
</records>
"""
root = etree.Element("records")
root = ET.Element("records")
for _, row in self.metadata_df.iterrows():
item = etree.Element("record")
item = ET.Element("record")
root.append(item)
for col in self.metadata_df.columns:
cell = etree.Element(col)
cell = ET.Element(col)
cell.text = str(row[col])
item.append(cell)
return etree.tostring(root, encoding="utf-8", method="xml")
return ET.tostring(root, encoding="utf-8", method="xml")
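
The renamed `ET` alias does not change how `to_xml` behaves. A minimal standalone sketch of the same records/record/cell structure follows. It assumes plain dictionaries in place of the DataFrame rows, and `records_to_xml` is a hypothetical helper name, not part of the harvester.

```python
import xml.etree.ElementTree as ET


def records_to_xml(rows: list[dict[str, object]]) -> bytes:
    """Serialize rows as <records><record><col>value</col>...</record></records>.

    Assumes every key is a valid XML element name.
    """
    root = ET.Element("records")
    for row in rows:
        item = ET.SubElement(root, "record")
        for column, value in row.items():
            cell = ET.SubElement(item, column)
            # Every value is stringified, as in the method above.
            cell.text = str(value)
    return ET.tostring(root, encoding="utf-8", method="xml")


# Hypothetical sample row for illustration only.
xml_bytes = records_to_xml([{"url": "https://example.org/", "title": "Home"}])
print(xml_bytes.decode("utf-8"))
```

Since every value passes through `str()`, a missing value stored as `None` would be written as the literal text `None`.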

def write(self, filepath: str) -> None:
"""Serialize metadata records in various file formats.
1 change: 1 addition & 0 deletions harvester/utils.py
@@ -1,4 +1,5 @@
"""harvester.utils"""

# ruff: noqa: ANN401

import os
33 changes: 20 additions & 13 deletions pyproject.toml
@@ -15,7 +15,15 @@ markers = [
]

[tool.ruff]
target-version = "py311"
target-version = "py312"

# set max line length
line-length = 90

# enumerate all fixed violations
show-fixes = true

[tool.ruff.lint]
select = ["ALL", "PT"]

ignore = [
@@ -35,6 +43,7 @@ ignore = [
"D103",
"D104",
"D415",
"G004",
"PLR0912",
"PLR0913",
"PLR0915",
@@ -47,23 +56,21 @@ ignore = [
# allow autofix behavior for specified rules
fixable = ["E", "F", "I", "Q"]

# set max line length
line-length = 90

# enumerate all fixed violations
show-fixes = true

[tool.ruff.flake8-annotations]
[tool.ruff.lint.flake8-annotations]
mypy-init-return = true

[tool.ruff.flake8-pytest-style]
[tool.ruff.lint.flake8-pytest-style]
fixture-parentheses = false

[tool.ruff.per-file-ignores]
"tests/**/*" = ["ANN", "ARG001", "S101"]
[tool.ruff.lint.per-file-ignores]
"tests/**/*" = [
"ANN",
"ARG001",
"S101",
]

[tool.ruff.pycodestyle]
[tool.ruff.lint.pycodestyle]
max-doc-length = 90

[tool.ruff.pydocstyle]
[tool.ruff.lint.pydocstyle]
convention = "google"
6 changes: 3 additions & 3 deletions tests/fixtures/lib-website-homepage.yaml
@@ -8,12 +8,12 @@ exclude:
# prevent RESOURCES / ASSETS from getting retrieved; URL requests
blockRules:
- ".*googlevideo.com.*"
- ".*cdn.pw-60-mitlib-wp-network.pantheonsite.io/media/.*"
- ".*cdn.libraries.mit.edu/media/.*"
- "\\.(jpg|png)$"
depth: 1
maxPageLimit: 20
timeout: 30
scopeType: "domain"
seeds:
- url: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
sitemap: https://pw-60-mitlib-wp-network.pantheonsite.io/sitemap.xml
- url: https://www-test.libraries.mit.edu/sitemap.xml
sitemap: https://www-test.libraries.mit.edu/sitemap.xml
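
To see which URLs the updated `blockRules` patterns would catch, the following sketch applies them with Python's `re.search`. It assumes the crawler applies each rule as an unanchored regular expression against the full URL. That matching behavior is an assumption made for illustration and is not confirmed by this diff. `BLOCK_RULES` and `is_blocked` are hypothetical names.

```python
import re

# The three patterns from the new fixture. The YAML string "\\.(jpg|png)$"
# unescapes to the regex \.(jpg|png)$
BLOCK_RULES = [
    r".*googlevideo.com.*",
    r".*cdn.libraries.mit.edu/media/.*",
    r"\.(jpg|png)$",
]


def is_blocked(url: str) -> bool:
    """Return True if any rule matches somewhere in the URL (assumed semantics)."""
    return any(re.search(pattern, url) for pattern in BLOCK_RULES)


# Hypothetical URLs for illustration only.
for url in (
    "https://cdn.libraries.mit.edu/media/banner.gif",
    "https://www-test.libraries.mit.edu/images/logo.png",
    "https://www-test.libraries.mit.edu/sitemap.xml",
):
    print(url, is_blocked(url))
```

The dots in the first two patterns are unescaped, so they match any character. Under the same assumption, the `$` anchor in the third pattern means an image URL with a query string, such as `logo.png?v=2`, would not be blocked.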
1 change: 1 addition & 0 deletions tests/test_cli.py
@@ -1,4 +1,5 @@
"""tests.test_parser"""

# ruff: noqa: S108

from unittest.mock import call, mock_open, patch
1 change: 1 addition & 0 deletions tests/test_metadata.py
@@ -1,4 +1,5 @@
"""tests.test_metadata"""

# ruff: noqa: SLF001, PD002, PD901

from unittest.mock import mock_open, patch
1 change: 1 addition & 0 deletions tests/test_wacz.py
@@ -1,4 +1,5 @@
"""tests.test_wacz"""

# ruff: noqa: SLF001, PD901

import logging