Updates, including a "new service howto"
Adding a story about how to add new services. This will help guide the devs as we add new services to Jemison.

Showing 6 changed files with 628 additions and 167 deletions.

This database is not (yet) well designed. Currently, there is a notion of a `raw_content` table, which is where `pack` deposits text.

```sql
CREATE TABLE raw_content (
    id BIGSERIAL PRIMARY KEY,
    host_path BIGINT REFERENCES guestbook(id),
    tag TEXT DEFAULT '',  -- empty-string default (assumed)
    content TEXT
);
```

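As an illustration, a row deposited by `pack` might look like the sketch below; the `host_path` id and the `tag` value are made up for the example.

```sql
-- hypothetical values; 42 stands in for an existing guestbook id
INSERT INTO raw_content (host_path, tag, content)
VALUES (42, 'title', 'General Services Administration');
```
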
From there, it is unclear how best to structure and optimize the content.

There are two early-stage ideas. Both have tradeoffs in terms of performance and implementation complexity, and it is not clear yet which to pursue.

### one idea: inheritance

https://www.postgresql.org/docs/current/tutorial-inheritance.html

We could define a searchable table as `gov`.

```sql
create table gov (
    id ...,
    host_path ...,
    tag ...,
    content ...
);
```

From there, we could have *empty* inheritance tables.

```sql
create table gsa () inherits (gov);
create table hhs () inherits (gov);
create table nih () inherits (gov);
```

and, from there, the next level down:

```sql
create table cc () inherits (nih);
create table nccih () inherits (nih);
create table nia () inherits (nih);
```

Then, insertions happen at the **leaves**. That is, we only insert at the lowest level of the hierarchy. However, we can then query tables higher up and get results from the entire tree, as sketched below.

This does a few things:

1. It lets queries against a given domain happen naturally. If we want to query `nia.nih.gov`, we target that table with our query.
2. If we want to query all of `nih`, then we query the `nih` table.
3. If we want to query everything, we target `gov` (or another TLD).

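A minimal sketch of that behavior, assuming the `gov`/`nih`/`nia` tables above have been given real (`raw_content`-style) columns; the inserted values are made up:

```sql
-- insert at a leaf table
INSERT INTO nia (host_path, tag, content)
VALUES (42, 'title', 'National Institute on Aging');

-- querying an ancestor returns rows from it and all of its descendants,
-- so this sees the row inserted into nia
SELECT content FROM nih WHERE content ILIKE '%aging%';

-- ONLY restricts a query to the named table, excluding its children
SELECT count(*) FROM ONLY nih;
```
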
Given that we are going to treat these tables as build artifacts, we can always regenerate them. Adding a new table through a migration is also easy; we just add a new `create table` statement.

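For instance, assuming dbmate migrations (as used elsewhere in Jemison), bringing a new, hypothetical agency table online is a two-line migration file:

```sql
-- migrate:up
create table nsf () inherits (gov);  -- 'nsf' is a hypothetical example

-- migrate:down
drop table nsf;
```
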
(See [this article](https://medium.com/miro-engineering/sql-migrations-in-postgresql-part-1-bc38ec1cbe75) about partitioning/inheritance, indexing, and migrations. It's gold.)

### declarative partitioning

Another approach is to use `PARTITION`s.

This suggests our root table should have a column we can use to drive the derivative partitions.

```sql
create table gov (
    id ...,
    domain64 BIGINT,
    host_path ...,
    tag ...,
    content ...
) partition by range (domain64);
```

To encode all of the TLDs, domains, and subdomains we will encounter, we'll use a `domain64` encoding. Why? It maps the entire hostname space into a single, 64-bit number (or, `BIGINT`).

```
FF:FFFFFF:FFFFFF:FF
```

or

```
tld:domain:subdomain:subsub
```

This is described in more detail in [domain64.md](domain64.md).

As an example:

| tld | domain | sub | hex                | dec               |
|-----|--------|-----|--------------------|-------------------|
| gov | gsa    | _   | #x0100000100000000 | 72057598332895232 |
| gov | gsa    | tts | #x0100000100000100 | 72057598332895488 |
| gov | gsa    | api | #x0100000100000200 | 72057598332895744 |

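Assuming the layout above (8 bits of TLD, 24 of domain, 24 of subdomain, 8 of subsub) and the IDs gov=1, gsa=1, tts=1 implied by the table, a `domain64` value is just shifts and ORs:

```sql
-- build gov:gsa:tts = #x0100000100000100
SELECT (1::bigint << 56)   -- tld: gov
     | (1::bigint << 32)   -- domain: gsa
     | (1::bigint << 8)    -- subdomain: tts
       AS domain64;
-- => 72057598332895488, matching the table above
```
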
GSA spans the range #x0100000100000000 -> #x01000001FFFFFFFF, or 72057598332895232 -> 72057602627862527 (a diff of 4294967295). Nothing else can be in that range, because we're using the bitstring to partition off ranges of numbers.

Now, everything becomes bitwise operations on 64-bit integers, which will be fast everywhere... and, our semantics map well to our domain.

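For example, pulling the pieces back out of a `domain64` value is a shift and a mask (the masks are shown as hex in the comments; the sample value is gov:gsa:tts from the table above):

```sql
SELECT (d >> 56) & 255      AS tld,        -- 0xFF
       (d >> 32) & 16777215 AS domain,     -- 0xFFFFFF
       (d >> 8)  & 16777215 AS subdomain,  -- 0xFFFFFF
       d & 255              AS subsub      -- 0xFF
FROM (VALUES (72057598332895488::bigint)) AS t(d);
-- => tld = 1 (gov), domain = 1 (gsa), subdomain = 1 (tts), subsub = 0
```
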
Partitioning to get a table with only GSA entries looks like this:

```sql
-- FROM is inclusive and TO is exclusive, so this covers
-- #x0100000100000000 through #x01000001FFFFFFFF
CREATE TABLE govgsa PARTITION OF gov
    FOR VALUES FROM (72057598332895232) TO (72057602627862528);
```

Or, just one subdomain in the space:

```sql
-- again, TO is exclusive: this covers #x0100000100000100 through #x01000001000001FF
CREATE TABLE govgsatts PARTITION OF gov
    FOR VALUES FROM (72057598332895488) TO (72057598332895744);
```

or we can keep the hex representation:

```sql
-- bit-string literals cast to BIGINT; expressions in partition bounds
-- require PostgreSQL 12 or later
CREATE TABLE govgsatts PARTITION OF gov
    FOR VALUES FROM (x'0100000100000100'::bigint) TO (x'0100000100000200'::bigint);
```

All table operations (insert, etc.) are on the top-level table, the indexes and whatnot are inherited automatically, and we can search the TLD, domain, or subdomain without difficulty, because it all becomes a question of what range the `domain64` value is in.

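For example, a sketch of "everything under gsa.gov", using the GSA range worked out above; the planner can use the same predicate to prune partitions:

```sql
SELECT host_path, tag, content
  FROM gov
 WHERE domain64 >= 72057598332895232   -- #x0100000100000000
   AND domain64 <  72057602627862528;  -- #x0100000200000000 (exclusive)
```
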
# work db

The "work" DB is where day-to-day application work takes place. Data supporting the crawling/indexing work, for example, lives in this database. It is separate from the queues (which handle high-frequency, small-data transactions) and search (which is read-heavy).

[//]: # ( o| - Zero or one )
[//]: # ( || - One and only one )
[//]: # ( o{ - Zero or many )
[//]: # ( |{ - One or many )

```mermaid
erDiagram
    HOSTS {
        INT id
        BIGINT domain64 UK
    }
    GUESTBOOK
    HOSTS ||--o{ GUESTBOOK : "Has"
```

## guestbook

The guestbook is where we keep track of URLs that have been (or should be) fetched. These tables live in the `cmd/migrate` app, which handles our migrations on every deploy. [These are dbmate migrations](https://github.com/GSA-TTS/jemison/tree/main/cmd/migrate/work_db/db/migrations).

```sql
create table guestbook (
    id bigint generated always as identity primary key,
    domain64 bigint not null,
    last_modified timestamp,
    last_fetched timestamp,
    next_fetch timestamp not null,
    scheme integer not null default 1,
    content_type integer not null default 1,
    content_length integer not null default 0,
    path text not null,
    unique (domain64, path)
);
```

The dates drive a significant part of the `entree`/`fetch` algorithms.

* `last_modified` is EITHER the timestamp provided by the remote webserver for any given page, OR, if not present, a value we assign in `fetch`, setting it to the last-fetched timestamp.
* `last_fetched` is the time that the page was fetched. This is updated every time we fetch the page.
* `next_fetch` is a computed value; if a page is intended to be fetched weekly, then `fetch` will set this to the current time plus one week at the time the page is fetched (see the sketch below).

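A sketch of that bookkeeping at fetch time, assuming a weekly schedule; the `domain64` and `path` values are illustrative, and the columns are from the `guestbook` table above:

```sql
UPDATE guestbook
   SET last_fetched  = now(),
       -- keep a server-provided value when we already have one;
       -- otherwise fall back to the fetch timestamp
       last_modified = coalesce(last_modified, now()),
       next_fetch    = now() + interval '1 week'
 WHERE domain64 = 72057598332895488   -- gov:gsa:tts
   AND path = '/index.html';
```
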
## hosts

```sql
create table hosts (
    id bigint generated always as identity primary key,
    domain64 bigint,
    next_fetch timestamp not null,
    unique (id),
    unique (domain64),
    constraint domain64_domain
        check (domain64 > 0 and domain64 <= max_bigint())
);
```

Like the `guestbook`, this table plays a role in determining whether a given domain should be crawled. If we want to crawl a domain *right now*, we set the `next_fetch` value in this table to yesterday, which makes all URLs under this domain eligible to be fetched.

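For example, a sketch of forcing a re-crawl of everything under one domain (the `domain64` value is the gsa.gov example from the search-db notes):

```sql
-- move the host's next_fetch into the past so every URL under it
-- becomes eligible to fetch immediately
UPDATE hosts
   SET next_fetch = now() - interval '1 day'
 WHERE domain64 = 72057598332895232;  -- gov:gsa
```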