[ADR] domain configuration (re)design #90

Open
2 of 9 tasks
jadudm opened this issue Jan 14, 2025 · 0 comments
Labels
documentation Improvements or additions to documentation

Areas of impact

  • Compliance
  • Content
  • CX
  • Design
  • Engineering
  • Policy
  • Product
  • Process
  • UX

Related documents/links

Context

Jemison will initially deploy driven entirely by configuration files.

The existing design, written in Jsonnet, emerged organically. It looks neither intentional nor maintainable or scalable.

Decision

A table-driven design that can be edited and verified via tooling makes more sense. The configuration should be transformed from something people can work with into something that the application can work with. That implies a pipeline of the form:

CSV -> ... -> ... -> JSON
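
To make the shape of that pipeline concrete, here is a minimal sketch of a single CSV-to-JSON step written in Jsonnet. The column names and rows are illustrative assumptions, not the project's real table. The CSV text is inlined so the example stands alone, and the intermediate stages elided above are not modeled.

```jsonnet
// Illustrative table only; in practice the CSV text would come from a file.
local csv = |||
  tld,domain,indexFrequency
  gov,gsa,weekly
  gov,nasa,daily
|||;

// Naive parsing: split on newlines and commas. Quoted fields and
// embedded commas are NOT handled; real tooling would need a proper parser.
local lines = [l for l in std.split(csv, "\n") if l != ""];
local header = std.split(lines[0], ",");
local rows = [std.split(l, ",") for l in lines[1:]];

// Emit one object per CSV row, keyed by the header names.
[
  { [header[i]]: row[i] for i in std.range(0, std.length(header) - 1) }
  for row in rows
]
```

Evaluating this file with the `jsonnet` command-line tool emits a JSON array with one object per row, which is the form the application would consume.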

What follows is a proposed design, for discussion, to take into account the needs and challenges we are aware of.

The problem

Jemison needs configuration for multiple reasons.

  1. Services. Jemison is composed of multiple services, each of which may have any number of tunable parameters.
  2. Environments. Jemison runs in multiple environments (local, test (e.g. Circle/CI), dev, staging, and prod). There may be different configuration values for each of those environments.
  3. Domains. Jemison crawls content on domains. We need to be able to identify the domains that we expect Jemison to crawl.
  4. Domain configuration. Any given domain might have parameters or filters that need to be set. For example, how often do we crawl a given domain? Are there subdomains we should ignore (when there is no sitemap.xml to guide us)? In some respects, the domain configuration is a critical part of the crawler configuration.

At the beginning, we need to be able to edit and maintain these configurations. We want them to be

  1. Robust.
  2. Verifiable/testable.
  3. Computable.

But we also want them to be human-editable.

Possible solutions

  1. GUI. We could run something like Directus in our local stack. We would then have to keep the DB it creates in the repo as a kind of shared state.
  2. Jsonnet. Maintain everything in Jsonnet. This gives us structured data as well as the ability to write functions to build and test the configuration.
  3. JSON. Maintain everything in raw JSON. Tooling would be built externally.
  4. CSV. Maintain everything in CSV. Tooling would be built externally.

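Option 1 provides a path to an editing environment, but is complex to maintain.

Options 2, 3, and 4 are the most compatible with Git.

Option 2 provides the richest tooling for manipulation within the configuration language itself.

Options 3 and 4 are fragile: neither format can validate itself, so correctness depends entirely on external tooling.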

Proposed solution

We'll start with Jsonnet, which is where we are now. It can easily be compiled to JSON, and can be manipulated directly, via libraries, in all the languages we are likely to work in.

Design

Jemison effectively needs to be driven off a set of databases, which (ideally) change infrequently. We are, therefore, encoding a set of databases as Jsonnet.

Flatter data (wide tables) are easier to work with, in some ways, because the relationships are clear. However, they typically lead to data duplication.

Hashes/maps/dictionaries are easier to comprehend, but lead to more linking. Those links can be verified, however.

We'll start with dictionaries.
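
The trade-off can be seen in a small sketch: the dictionary form stores each fact once, and the wide-table form can be derived from it when needed. The field names mirror the shape proposed below, and the data is a cut-down illustration rather than the real configuration.

```jsonnet
// Dictionary form: each 2LD-level fact (tld, indexFrequency) is stored once.
local domains = {
  gsa: {
    tld: "gov",
    indexFrequency: "weekly",
    subdomains: { _root: 0, www: 1 },
  },
};

{
  // Wide-table form, derived from the dictionaries: one row per subdomain.
  // Note how tld and indexFrequency repeat on every row.
  rows: [
    {
      tld: domains[d].tld,
      domain: d,
      subdomain: s,
      indexFrequency: domains[d].indexFrequency,
    }
    for d in std.objectFields(domains)
    for s in std.objectFields(domains[d].subdomains)
  ],
}
```

Going in this direction (dictionaries to rows) is mechanical, which is one argument for keeping dictionaries as the source of truth.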

TLDs

Jemison needs to know about TLDs.

tlds.libsonnet:

{
  "gov": 1,
  "mil": 2,
}

Second-level domains

We need to know about our 2LDs. We will ultimately need to know a lot about our 2LDs.

We'll do one file per 2LD. This will ultimately lead to thousands of source configuration files. However:

  1. The files can easily be combined via Jsonnet.
  2. Properties can be asserted across all of the files, as well as within files.

We will name files in RFQDN (Reverse Fully Qualified Domain Name) style, so the file for gsa.gov is gov.gsa.libsonnet.

gov.gsa.libsonnet:

{
  id: 1,

  tld: "gov", // a key into tlds.libsonnet

  // We must explicitly encode the root with the magic key `_root`.
  // This is an object (not an array) because it maps names to ids.
  subdomains: {
    _root: 0,
    www: 1,
    acquisition: 2,
  },

  // We will use Go-style camel-case.
  indexFrequency: "weekly",

  // We can imagine adding per-domain
  // configuration as needed to this design.
  ignorePaths: [
    "/something",
    "/another_thing",
  ],
}
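
As a rough illustration of the per-file checks this layout permits, the sketch below wraps a 2LD object in a hypothetical `checked` function. The two files are inlined so the example stands alone, and the list of allowed `indexFrequency` values is an assumption; the proposal only shows `weekly`.

```jsonnet
// In the repository these would be `import "tlds.libsonnet"` and
// `import "gov.gsa.libsonnet"`; they are inlined here for self-containment.
local tlds = { gov: 1, mil: 2 };

local gsa = {
  id: 1,
  tld: "gov",
  subdomains: { _root: 0, www: 1, acquisition: 2 },
  indexFrequency: "weekly",
  ignorePaths: ["/something", "/another_thing"],
};

// Assumed vocabulary; not specified in the proposal.
local frequencies = ["daily", "weekly", "monthly"];

// Hypothetical helper: returns the 2LD unchanged, or fails evaluation
// with a message if any of the assertions does not hold.
local checked(d) =
  local ids = std.objectValues(d.subdomains);
  assert std.objectHas(tlds, d.tld) : "unknown TLD: " + d.tld;
  assert std.objectHas(d.subdomains, "_root") : "missing _root subdomain";
  assert std.length(std.set(ids)) == std.length(ids) : "duplicate subdomain id";
  assert std.member(frequencies, d.indexFrequency) : "unknown indexFrequency: " + d.indexFrequency;
  d;

checked(gsa)
```

Because a failed `assert` aborts evaluation, a broken link (for example, a `tld` that is not a key in tlds.libsonnet) would stop the build instead of reaching the application.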

Assertions can then be built around these files. For example, we can assert that there is no duplication of

Generated files

We can then use this to functionally generate additional configuration within Jsonnet.

For example, we can now import these files, and generate a list of all of our domains. We can then write assertions that guarantee uniqueness of domain64 values, and that they are all the correct size, etc.
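
A sketch of that generation step might look like the following. The second 2LD is hypothetical and exists only to exercise the check. Since the domain64 encoding is not defined in this issue, the combination of TLD, 2LD id, and subdomain id is used as a stand-in for it.

```jsonnet
// Inlined stand-ins for `import "gov.gsa.libsonnet"` and friends,
// keyed by their RFQDN-style file names.
local files = {
  "gov.gsa": { id: 1, tld: "gov", subdomains: { _root: 0, www: 1, acquisition: 2 } },
  // Hypothetical second 2LD, present only to exercise the checks.
  "gov.example": { id: 2, tld: "gov", subdomains: { _root: 0, www: 1 } },
};

// Build "www.gsa.gov" from the key "gov.gsa" and a subdomain name.
local fqdn(rfqdn, sub) =
  local parts = std.split(rfqdn, ".");
  local host = parts[1] + "." + parts[0];
  if sub == "_root" then host else sub + "." + host;

// One entry per (2LD, subdomain) pair across all files.
local domains = [
  {
    name: fqdn(k, s),
    tld: files[k].tld,
    domainId: files[k].id,
    subdomainId: files[k].subdomains[s],
  }
  for k in std.objectFields(files)
  for s in std.objectFields(files[k].subdomains)
];

// Stand-in for domain64 uniqueness: no two entries share the same triple.
local keys = ["%s:%d:%d" % [d.tld, d.domainId, d.subdomainId] for d in domains];
assert std.length(std.set(keys)) == std.length(keys) : "duplicate domain identifier";

{ domains: domains }
```

If two files were accidentally given the same `id` under the same TLD, the assertion would fail and no generated file would be produced.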

Generated files become templated Jsonnet files that are built via a Makefile. Ultimately, they will want library support in Jemison so that they can be loaded and accessed easily.

Service configs

The current configuration design for the services themselves seems to be working well, and does not (strictly) need a redesign.

Additional configurations

Beyond this, there are other files that might need to be introduced.

  • Admin API access control. A simple file mapping api.data.gov key IDs to email addresses for API access control within the team.
  • CF/Cloud.gov access control. For controlling who has access to the CF/Cloud.gov spaces for deployment purposes.
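
As an illustration of how small the first of these files could be, here is a hedged sketch. The key IDs and addresses are placeholders, and the well-formedness check is an assumed policy rather than anything decided in this issue.

```jsonnet
// Hypothetical access-control file: api.data.gov key ID -> email address.
// Both the key IDs and the addresses below are placeholders.
local admins = {
  "key-id-0001": "alice@example.gov",
  "key-id-0002": "bob@example.gov",
};

local emails = std.objectValues(admins);

// Assumed sanity check: every value contains exactly one "@".
local malformed = [e for e in emails if std.length(std.findSubstr("@", e)) != 1];
assert std.length(malformed) == 0 : "malformed email address(es): " + std.join(", ", malformed);

admins
```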

There may be others, but these are likely 1) small and 2) obvious, compared to the core drivers of the application as a whole.

Consequences

The consequence of a change is that it is not strictly forward motion. That said, not having a good design to start will likely slow us down over the coming months.

The benefit to static configuration files is that they are easy to validate/code against. So, this change is not likely to be disruptive. Or, any failures will

@jadudm jadudm added the documentation Improvements or additions to documentation label Jan 14, 2025