This is a Python package that holds commonly re-implemented utilities, such as IO operations.
From the root directory of the repo, run:

```bash
git config --local core.hooksPath .githooks/
```

This sets the repo up to use the git hooks in the `.githooks/` directory. The hook runs `ruff format --check` and `ruff check` to prevent commits that are incorrectly formatted or contain lint errors. The hook intentionally does not alter the files; it informs the user which command to run instead.
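For reference, a minimal sketch of a pre-commit hook along these lines (the real hook in `.githooks/` may differ):

```bash
#!/usr/bin/env bash
# Sketch of a pre-commit hook of this kind; the actual hook in
# .githooks/ may differ. It only checks, it never rewrites files.

if ! ruff format --check .; then
    echo "Formatting problems found. Run: ruff format ."
    exit 1
fi

if ! ruff check .; then
    echo "Lint errors found. Run: ruff check ."
    exit 1
fi
```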
This package is configured with optional dependencies, depending on what you are doing with the code.

As a user, you would install the code with only the dependencies needed to run it:

```bash
pip install .
```

To work on the docs:

```bash
pip install -e .[docs]
```

To work on tests:

```bash
pip install -e .[tests]
```

To run the linter and git hook:

```bash
pip install -e .[lint]
```

The docs, tests, and lint dependencies can be installed together with:

```bash
pip install -e .[dev]
```

Some utilities need additional packages that aren't relevant to all projects. To install everything, run:

```bash
pip install -e .[all]
```

or, to include just the datetime utilities:

```bash
pip install -e .[datetime]
```
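For context, extras like these are declared under `[project.optional-dependencies]` in `pyproject.toml`. A sketch only; the actual dependency lists in this repo differ:

```toml
# Sketch of how such extras are declared; not the repo's actual lists.
[project.optional-dependencies]
docs = ["sphinx"]
tests = ["pytest"]
lint = ["ruff"]
dev = ["sphinx", "pytest", "ruff"]
```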
You are likely to include this package in another project; in that case, you should use the git URL when installing. For manual installs:

```bash
pip install "dri-utils[all] @ git+https://github.com/NERC-CEH/dri-utils.git"
```

or, if including it in your project's dependencies:

```toml
dependencies = [
    "another-package",
    ...
    "dri-utils[all] @ git+https://github.com/NERC-CEH/dri-utils.git"
]
```
The `S3Reader` class reads files from an S3 object store using the `boto3` client. It needs only a client to initialize; provisioning credentials for that `s3_client` is left to the user.

```python
import boto3

from driutils.io.aws import S3Reader

client = boto3.client("s3")
reader = S3Reader(client)

# Request a file from AWS S3
object_bytes = reader.read(bucket="my-bucket", key="Path/to/file")
```
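Because credential provisioning is left to the user, one common pattern (a sketch, not part of this package) is to build the client from temporary credentials obtained by assuming a role via STS; the role ARN and session name below are placeholders:

```python
import boto3

# Sketch: build an S3 client from temporary credentials obtained via STS.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/my-read-role",  # placeholder
    RoleSessionName="dri-utils-example",  # placeholder
)["Credentials"]

client = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```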
The `S3Writer` operates in the same way as the reader but supplies a `write` method instead of `read`. AWS expects the `body` argument to be a `bytes` object, which is left to the user to provide.

```python
import boto3

from driutils.io.aws import S3Writer

client = boto3.client("s3")
body = b"I'm a byte encoded document"
writer = S3Writer(client)

# Submit a file to AWS S3
writer.write(bucket="my-bucket", key="Path/to/file", body=body)
```
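Since `write` expects `bytes`, other payloads need encoding first; for example (a sketch reusing the `writer` above):

```python
import json

# Strings and JSON-serializable objects must be encoded to bytes
# before being passed as the body.
text_body = "Just a plain string".encode("utf-8")
json_body = json.dumps({"site": "ABC", "value": 1.23}).encode("utf-8")

writer.write(bucket="my-bucket", key="data/record.json", body=json_body)
```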
The `S3ReaderWriter` behaves the same as the prior classes but supplies both `read` and `write` methods in one class.
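A round-trip sketch, assuming `S3ReaderWriter` lives alongside the reader and writer in `driutils.io.aws` (bucket and key are placeholders):

```python
import boto3

from driutils.io.aws import S3ReaderWriter

client = boto3.client("s3")
readwriter = S3ReaderWriter(client)

# Write a document, then read it straight back.
readwriter.write(bucket="my-bucket", key="Path/to/file", body=b"round trip")
object_bytes = readwriter.read(bucket="my-bucket", key="Path/to/file")
```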
The DuckDB classes use the DuckDB Python interface to read files from local storage or S3 object storage, with support for custom S3 endpoints.

To read a local file:

```python
from driutils.io.duckdb import DuckDBFileReader

reader = DuckDBFileReader()
query = "SELECT * FROM READ_PARQUET('myfile.parquet');"
result = reader.read(query)

# result is a DuckDBPyConnection object; convert it to your
# desired format, such as polars:
df = result.pl()

# or pandas:
df = result.df()

# Close the connection
reader.close()
```
Alternatively, use a context manager to close the connection automatically:

```python
...

with DuckDBFileReader() as reader:
    df = reader.read(query, params).df()
```
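The `params` argument holds values for placeholders in the query; a sketch, assuming `read` forwards them to DuckDB's standard `?` substitution:

```python
from driutils.io.duckdb import DuckDBFileReader

# Sketch: a parameterized query; the file name and site value
# are placeholders.
query = "SELECT * FROM READ_PARQUET('myfile.parquet') WHERE site_id = ?;"
params = ["ABC123"]

with DuckDBFileReader() as reader:
    df = reader.read(query, params).df()
```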
To read from an S3 storage location there is more configuration available, and three use cases are supported:

- Automatic credential loading from current environment variables
- Automatic credential loading from an assumed role
- Authentication to a custom S3 endpoint, e.g. LocalStack. This currently assumes that credentials aren't needed (for now, they aren't)
The reader is instantiated like this:

```python
from driutils.io.duckdb import DuckDBS3Reader

# Automatic authentication from your environment
auto_auth_reader = DuckDBS3Reader("auto")

# Automatic authentication from your assumed role
sts_auth_reader = DuckDBS3Reader("sts")

# Custom url for localstack
endpoint = "http://localhost:<port>"
custom_url_reader = DuckDBS3Reader(
    "custom_endpoint",
    endpoint_url=endpoint,
    use_ssl=False
)

# Custom url using https protocol
endpoint = "https://a-real.s3.endpoint"
custom_url_reader = DuckDBS3Reader(
    "custom_endpoint",
    endpoint_url=endpoint,
    use_ssl=True
)
```
Under the hood, `reader.read()` forwards a DuckDB SQL query along with any parameters used to fill placeholders in the query.
The `S3Writer` uploads files to S3 using a pre-existing `S3Client`, which is left to the user to provision, but is commonly set up as:

```python
import boto3

from driutils.io.aws import S3Writer

s3_client = boto3.client("s3", endpoint_url="an_optional_url")
content = b"Just a lil string"
writer = S3Writer(s3_client)

writer.write(
    bucket="target-bucket",
    key="path/to/upload/destination",
    body=content
)
```
There is a logging module that defines the base logging format used for all projects. To use it, add:

```python
from driutils import logger

logger.setup_logging()
```
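Assuming `setup_logging` configures the root logger, standard-library loggers then pick up the shared format:

```python
import logging

from driutils import logger

logger.setup_logging()

# Standard-library loggers now emit records in the shared base format,
# assuming setup_logging configures the root logger.
log = logging.getLogger(__name__)
log.info("Logging configured")
```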
The `driutils.datetime` module contains common utilities for working with dates and times in Python; currently these are simple validation methods. Some of them require additional packages that are not needed for all projects, so ensure the package is installed with `pip install .[datetime]` or `pip install .[all]`.
The `driutils.utils` module contains utility methods that didn't fit anywhere else, such as ensuring that a value is always returned as a list and removing protocols from URLs.
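As an illustration only, helpers of that kind typically behave like the sketch below; the function names here are hypothetical, so check `driutils.utils` for the real ones:

```python
# Hypothetical function names for illustration; the real helpers in
# driutils.utils may be named differently.
from driutils.utils import ensure_list, remove_protocol_from_url

assert ensure_list("single value") == ["single value"]
assert ensure_list(["already", "a", "list"]) == ["already", "a", "list"]
assert remove_protocol_from_url("https://example.com/path") == "example.com/path"
```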