Context
Jemison's initial deployment will be driven entirely by configuration files.
The existing design, in Jsonnet, emerged organically. However, it looks neither intentional nor maintainable or scalable.
Decision
A table-driven design that can be edited and verified via tooling makes more sense. The configuration should be transformed from something people can work with into something that the application can work with. That implies a pipeline along the lines of:
CSV -> ... -> ... -> JSON
What follows is a proposed design, for discussion, to take into account the needs and challenges we are aware of.
The problem
Jemison needs configuration for multiple reasons.
Services. Jemison is composed of multiple services, each of which may have any number of tunable parameters.
Environments. Jemison runs in multiple environments (local, test (e.g. Circle/CI), dev, staging, and prod). There may be different configuration values for each of those environments.
Domains. Jemison crawls content on domains. We need to be able to identify the domains that we expect Jemison to crawl.
Domain configuration. Any given domain might have parameters or filters that need to be set. For example, how often do we crawl a given domain? Are there subdomains we should ignore when there is no sitemap.xml to guide us? In some respects, the domain configuration is a critical part of the crawler configuration.
At the beginning, we need to be able to edit and maintain these configurations. We want them to be
Robust.
Verifiable/testable.
Computable.
But we also want them to be human-editable.
Possible solutions
1. GUI. We could run something like Directus in our local stack. We would then have to keep the DB it creates in the repo as a kind of shared state.
2. Jsonnet. Maintain everything in Jsonnet. This gives us structured data as well as the ability to write functions to build and test the configuration.
3. JSON. Maintain everything in raw JSON. Tooling would be built externally.
4. CSV. Maintain everything in CSV. Tooling would be built externally.
Option 1 provides a path to an editing environment, but is complex to maintain.
Options 2, 3, and 4 are the most compatible with Git.
Option 2 provides the richest tooling for manipulation within the configuration language.
Options 3 and 4 are fragile.
Proposed solution
We'll start with Jsonnet, which is where we are now. It can easily be compiled to JSON, and the result can be manipulated directly, via libraries, in all the languages we are likely to work in.
Design
Jemison effectively needs to be driven off a set of databases, which (ideally) change infrequently. We are, therefore, encoding a set of databases as Jsonnet.
Flatter data (wide tables) is easier to work with, in some ways, because the relationships are clear. However, it typically leads to data duplication.
Hashes/maps/dictionaries are easier to comprehend, but lead to more linking. Those links can be verified, however.
We'll start with dictionaries.
TLDs
Jemison needs to know about TLDs.
tlds.libsonnet:
{
  "gov": 1,
  "mil": 2,
}
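As a minimal sketch of how this table could be checked, the snippet below assumes the numbers are identifiers that must not repeat (the text above does not say so explicitly). The data is inlined so the snippet evaluates on its own; in the repository it would presumably come from `import 'tlds.libsonnet'`.

```jsonnet
// Assumption: the numeric values are IDs and must be unique.
// Inlined here for illustration; normally `import 'tlds.libsonnet'`.
local tlds = {
  gov: 1,
  mil: 2,
};

// Collect the number assigned to each TLD.
local ids = [tlds[name] for name in std.objectFields(tlds)];

// std.set sorts and de-duplicates, so a shorter result means a repeated ID.
assert std.length(std.set(ids)) == std.length(ids) : 'duplicate TLD id in tlds.libsonnet';

tlds
```

If the assertion fails, evaluation stops with the message, so a broken table never reaches the generated JSON.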
Second-level domains
We need to know about our 2LDs. We will ultimately need to know a lot about our 2LDs.
We'll do one file per 2LD. This will ultimately lead to thousands of source configuration files. However:
The files can easily be combined via Jsonnet.
Properties can be asserted across all of the files, as well as within files.
We will name files in an RFQDN (Reverse Fully Qualified Domain Name) style.
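To make the naming rule concrete, here is a small sketch that derives the file name from a domain. The helper name `rfqdnFile` is hypothetical and not part of the existing configuration.

```jsonnet
// Hypothetical helper: reverse the dot-separated labels of a domain
// and append the .libsonnet extension.
local rfqdnFile(domain) =
  std.join('.', std.reverse(std.split(domain, '.'))) + '.libsonnet';

{
  // 'gsa.gov' maps to 'gov.gsa.libsonnet'.
  'gsa.gov': rfqdnFile('gsa.gov'),
}
```

One effect of the reversed order is that files for the same TLD sort next to each other in a directory listing.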
gov.gsa.libsonnet:
{
  id: 1,
  tld: "gov",  // a key into tlds.libsonnet
  // We must explicitly encode the root with the magic key `_root`.
  // This is an object (not an array) because it maps names to IDs.
  subdomains: {
    _root: 0,
    www: 1,
    acquisition: 2,
  },
  // We will use Go-style camel-case.
  indexFrequency: "weekly",
  // We can imagine adding per-domain
  // configuration as needed to this design.
  ignorePaths: [
    "/something",
    "/another_thing",
  ],
}
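The link from `tld` into the TLD table can be verified mechanically. The sketch below is one possible shape for such a check, not the project's actual tooling: `checkDomain` is a hypothetical helper, the data is inlined rather than imported, and `subdomains` is assumed to be an object mapping names to IDs.

```jsonnet
// Inlined stand-ins; normally these would be imported from
// tlds.libsonnet and gov.gsa.libsonnet.
local tlds = { gov: 1, mil: 2 };

local gsa = {
  id: 1,
  tld: 'gov',
  subdomains: { _root: 0, www: 1, acquisition: 2 },
};

// Hypothetical helper: returns the record unchanged if every check passes.
local checkDomain(d) =
  // The tld must be a key in the TLD table.
  assert std.objectHas(tlds, d.tld) : 'unknown tld: ' + d.tld;
  // The root must be encoded explicitly.
  assert std.objectHas(d.subdomains, '_root') : 'missing _root subdomain';
  // Subdomain IDs must not repeat within a file.
  local subIds = [d.subdomains[k] for k in std.objectFields(d.subdomains)];
  assert std.length(std.set(subIds)) == std.length(subIds) : 'duplicate subdomain id';
  d;

checkDomain(gsa)
```

Because the helper returns its input, it can wrap each imported file without changing the shape of the data.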
Assertions can then be built around these files. For example, we can assert that there is no duplication of
Generated files
We can then use this to functionally generate additional configuration within Jsonnet.
For example, we can now import these files, and generate a list of all of our domains. We can then write assertions that guarantee uniqueness of domain64 values, and that they are all the correct size, etc.
Generated files become templated Jsonnet files that are built via a Makefile, and will ultimately want library support in Jemison for loading and accessing them easily.
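A rough sketch of such a generated file follows. The domain64 encoding is not described here, so this example checks uniqueness of the `id` field instead. The second record, the `fqdn` helper, and the output shape are all hypothetical, and the records are inlined where the real file would import the per-2LD files.

```jsonnet
// Stand-ins for the imported per-2LD files, keyed by their RFQDN name.
// 'gov.example' is hypothetical and exists only to exercise the checks.
local domains = {
  'gov.gsa': { id: 1, tld: 'gov', subdomains: { _root: 0, www: 1, acquisition: 2 } },
  'gov.example': { id: 2, tld: 'gov', subdomains: { _root: 0 } },
};

// Hypothetical helper: build a fully qualified name from an RFQDN key and
// a subdomain. `_root` denotes the bare second-level domain.
local fqdn(key, sub) =
  local base = std.join('.', std.reverse(std.split(key, '.')));
  if sub == '_root' then base else sub + '.' + base;

// Assert across all files that no two 2LDs share an id.
local ids = [domains[k].id for k in std.objectFields(domains)];
assert std.length(std.set(ids)) == std.length(ids) : 'duplicate 2LD id';

{
  // One entry per (2LD, subdomain) pair.
  domains: [
    fqdn(k, s)
    for k in std.objectFields(domains)
    for s in std.objectFields(domains[k].subdomains)
  ],
}
```

A Makefile target could then compile a file like this to JSON for the services to load.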
Service configs
The current configuration design for the services themselves seems to be working well, and does not (strictly) need a redesign.
Additional configurations
Beyond this, there are other files that might need to be introduced.
Admin API access control. A simple file mapping api.data.gov key IDs to email addresses for API access control within the team.
CF/Cloud.gov access control. For controlling who has access to the CF/Cloud.gov spaces for deployment purposes.
There may be others, but these are likely 1) small and 2) obvious, compared to the core drivers of the application as a whole.
Consequences
The cost of this change is that it is not strictly forward motion. That said, not having a good design to start with will likely slow us down over the coming months.
The benefit to static configuration files is that they are easy to validate/code against. So, this change is not likely to be disruptive. Or, any failures will