Convert to the new Safir Kafka metrics system
Log events via the new Safir Kafka metrics system instead of using
OpenTelemetry. This uses Safir library support to create Avro schemas
for Kafka and export structured Pydantic events to Kafka.
rra committed Oct 17, 2024
1 parent 4ab901d commit 82287f1
Showing 47 changed files with 1,051 additions and 831 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -9,7 +9,7 @@ repos:
- id: trailing-whitespace

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.9
rev: v0.7.0
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
5 changes: 5 additions & 0 deletions alembic/gafaelfawr.yaml
@@ -31,3 +31,8 @@ knownScopes:

# A minimal mapping sufficient for Gafaelfawr to start.
groupMapping: {}

# A minimal metrics configuration so that Gafaelfawr will start.
metrics:
  enabled: false
  appName: "gafaelfawr"
2 changes: 1 addition & 1 deletion changelog.d/20240814_105528_rra_DM_45518.md
@@ -1,6 +1,6 @@
### New features

- Add support for exporting metrics to an OpenTelemetry collector. The initial set of metrics is limited to login metrics, token delegation, and counts of active sessions and user tokens.
- Add support for exporting metrics to Kafka using the new event metrics support in [Safir](https://safir.lsst.io/). The initial set of events is limited to login metrics, authentications to services, and counts of active sessions and user tokens.

### Bug fixes

2 changes: 1 addition & 1 deletion docs/_rst_epilog.rst
@@ -3,13 +3,13 @@
.. _Keycloak: https://www.keycloak.org/
.. _Kopf: https://kopf.readthedocs.io/en/stable/
.. _mypy: https://mypy.readthedocs.io/en/stable/
.. _OpenTelemetry: https://opentelemetry.io/
.. _Phalanx: https://phalanx.lsst.io/
.. _pre-commit: https://pre-commit.com
.. _pytest: https://docs.pytest.org/en/latest/
.. _RFC 2307bis: https://datatracker.ietf.org/doc/html/draft-howard-rfc2307bis-02
.. _Ruff: https://docs.astral.sh/ruff/
.. _Safir: https://safir.lsst.io/
.. _Sasquatch: https://sasquatch.lsst.io/
.. _scriv: https://scriv.readthedocs.io/en/latest/
.. _semver: https://semver.org/
.. _structlog: https://www.structlog.org/en/stable/
6 changes: 3 additions & 3 deletions docs/dev/internals.rst
@@ -29,6 +29,9 @@ Python internal API
.. automodapi:: gafaelfawr.dependencies.return_url
:include-all-objects:

.. automodapi:: gafaelfawr.events
:include-all-objects:

.. automodapi:: gafaelfawr.exceptions
:include-all-objects:

@@ -38,9 +41,6 @@ Python internal API
.. automodapi:: gafaelfawr.keypair
:include-all-objects:

.. automodapi:: gafaelfawr.metrics
:include-all-objects:

.. automodapi:: gafaelfawr.middleware.state
:include-all-objects:

5 changes: 3 additions & 2 deletions docs/documenteer.toml
@@ -1,6 +1,6 @@
[project]
title = "Gafaelfawr"
copyright = "2020-2022 Association of Universities for Research in Astronomy, Inc. (AURA)"
copyright = "2020-2024 Association of Universities for Research in Astronomy, Inc. (AURA)"

[project.openapi]
openapi_path = "_static/openapi.json"
@@ -33,6 +33,8 @@ nitpick_ignore = [
# are generated from the type signatures and can't be avoided. These are
# intentionally listed specifically because I've caught documentation bugs
# by having Sphinx complain about a new symbol.
["py:class", "dataclasses_avroschema.pydantic.main.AvroBaseModel"],
["py:class", "dataclasses_avroschema.main.AvroModel"],
["py:class", "fastapi.applications.FastAPI"],
["py:class", "fastapi.datastructures.DefaultPlaceholder"],
["py:class", "fastapi.exceptions.HTTPException"],
@@ -84,7 +86,6 @@ bonsai = "https://bonsai.readthedocs.io/en/latest"
cryptography = "https://cryptography.io/en/latest"
jwt = "https://pyjwt.readthedocs.io/en/latest"
kopf = "https://kopf.readthedocs.io/en/stable"
opentelemetry = "https://opentelemetry-python.readthedocs.io/en/latest"
python = "https://docs.python.org/3"
redis = "https://redis-py.readthedocs.io/en/stable"
safir = "https://safir.lsst.io"
34 changes: 28 additions & 6 deletions docs/user-guide/helm.rst
@@ -619,17 +619,39 @@ See :ref:`client-ips` for more details.
Metrics
========

Gafaelfawr can export metrics to an OpenTelemetry_ collector.
Currently, it only supports the insecure gRPC mechanism for sending metrics, and therefore should use a collector within the same Kubernetes cluster.

To enable metrics collection and reporting, set the URL of the metrics collector:
Gafaelfawr can export events and metrics to Sasquatch_, the metrics system for Rubin Observatory.
Metrics reporting is disabled by default.
To enable it, set ``config.metrics.metricsEvents.disable`` to false:

.. code-block:: yaml

   config:
     metricsUrl: "http://telegraf.telegraf:4317"
     metrics:
       metricsEvents:
         disable: false

Gafaelfawr will then use the Kafka user ``gafaelfawr`` to authenticate to Kafka and push various events.
For a list of all of the events Gafaelfawr exports, see :doc:`metrics`.

There are some additional configuration settings, which normally will not need to be changed:

``config.metrics.metricsEvents.appName``
Name of the application under which to log metrics.
Default: ``gafaelfawr``

``config.metrics.metricsEvents.topicPrefix``
The prefix for events topics.
Generally the only reason to change this is if you're experimenting with new events in a development environment.
Default: ``lsst.square.metrics.events``

``config.metrics.schemaManager.registryUrl``
URL to the Confluent-compatible Kafka schema registry, used to register the schemas for events during startup.
Default: Use the Sasquatch schema registry in the local cluster.

For a list of all of the metrics Gafaelfawr exports, see :doc:`metrics`.
``config.metrics.schemaManager.suffix``
Suffix to add to all registered subjects.
This avoids conflicts with existing registered schemas and may be useful when experimenting with possible event schema changes that are not backwards-compatible.
Default: no suffix

.. _slack-alerts:

54 changes: 26 additions & 28 deletions docs/user-guide/metrics.rst
@@ -2,51 +2,49 @@
Metrics
#######

Gafaelfawr optionally supports exporting metrics to an OpenTelemetry_ collector.
Gafaelfawr optionally supports exporting events to Sasquatch_.
To enable this support, see :ref:`config-metrics`.

All metrics are logged with a service name of ``gafaelfawr``.
By default, metrics are logged with an application name of ``gafaelfawr`` and a topic prefix of ``lsst.square.metrics.events``.

If metrics collection is enabled, the following metrics will currently be logged.
More metrics will likely be added in the future.
If event exporting is enabled, the following events will currently be logged.
More events will likely be added in the future.

Frontend metrics
================

The following metrics are logged with the meter name of ``frontend``:
The following events are logged by the Gafaelfawr frontend:

login.attempts (counter)
Count of times Gafaelfawr sends a user to the identity provider to authenticate, not including duplicate redirects when the user already has an authentication in progress.
auth
A user was successfully authenticated to a service.
The username is present as the ``username`` tag.
The service name is present as the ``service`` tag, if known.

login_attempt
Gafaelfawr sent a user to the identity provider to authenticate, not including duplicate redirects when the user already has an authentication in progress.
Duplicates are suppressed by not counting redirects if the ``state`` attribute of the user's cookie is already set.

login.enrollment (counter)
Count of the times Gafaelfawr redirects a user to the enrollment flow.
login_enrollment
Gafaelfawr redirected a user to the enrollment flow.

login.failures (counter)
Count of the times a login fails at the Gafaelfawr end, meaning that either something went wrong in Gafaelfawr itself, with the request to the remote authentication service, or via an error reported by the remote authentication service.
login_failure
A login failed at the Gafaelfawr end, meaning that something went wrong in Gafaelfawr itself or in the request to the remote authentication service, or that the remote authentication service reported an error.
This does not count cases where the authentication service never returns the user to us.
It also does not count redirects to the enrollment flow.

login.successes (counter)
Count of the times Gafaelfawr successfully authenticates a user and creates a new session.
The username will be attached as the ``username`` attribute.

login.success_time (gauge)
Total elapsed time in floating point seconds from when Gafaelfawr redirected the user for authentication to when the user successfully authenticated.
The username will be attached as the ``username`` attribute.

request.auth (counter)
Count of successful authentication attempts to a service.
Currently, this only counts authentications to a service that requests delegated tokens.
The username is attached as the ``username`` attribute and the service name is attached as the ``service`` attribute.
login_success
Gafaelfawr successfully authenticated a user and created a new session.
The username is present as the ``username`` tag.
The length of time from initial redirect to successful authentication is present as the ``elapsed`` field, as a float number of seconds.

State metrics
=============

The following metrics are logged by the Gafaelfawr maintenance cron job with the meter name of ``state``.
The following metrics are logged by the Gafaelfawr maintenance cron job.
These are also logged as events, since current Rubin Observatory infrastructure only supports events, but they are actually metrics and will switch to a metrics system once one is available.

sessions.active_users (gauge)
Number of users with unexpired sessions.
active_user_sessions
Number of users with unexpired sessions, sent in the ``count`` field.

user_tokens.active
Number of active (unexpired) user tokens.
active_user_tokens
Number of active (unexpired) user tokens, sent in the ``count`` field.
9 changes: 4 additions & 5 deletions pyproject.toml
@@ -38,17 +38,16 @@ dependencies = [
"kopf",
"kubernetes-asyncio",
"jinja2",
"opentelemetry-api",
"opentelemetry-exporter-otlp-proto-grpc",
"opentelemetry-sdk",
"pydantic>2",
"pydantic-settings",
"pydantic-settings!=2.6.0",
"pyjwt",
"pyyaml",
"redis>=4.2.0",
"safir[db,kubernetes]>=6.4.0",
"safir[db,kubernetes] @ git+https://github.com/lsst-sqre/safir@tickets/DM-46821#subdirectory=safir",
"sqlalchemy>=2.0.0",
"structlog",
# Temporary constraint to avoid faststream bug.
"anyio!=4.6.2.post1",
]
dynamic = ["version"]
