
feat: update exporter #4327

Merged — sywhb merged 8 commits into main on Aug 29, 2024
Conversation

@sywhb (Contributor) commented Aug 26, 2024

  • update exported files in the following file structure (see the archive-assembly sketch after this list):

    exports/{userId}/{date}/{uuid}.zip
      - metadata_{start_page}_to_{end_page}.json
      - /content
        - {slug}.html
      - /highlights
        - {slug}.md

  • add an export job and a GET API
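A minimal sketch of how an archive with this layout could be assembled with `archiver`; the `ExportItem` shape and the `writeExportArchive` helper are illustrative assumptions, not the PR's actual code:

```ts
import * as fs from 'fs'
import archiver from 'archiver'

// Hypothetical item shape; the field names are assumptions for illustration.
interface ExportItem {
  slug: string
  content?: string // readable HTML
  highlights?: string // highlights rendered as markdown
}

// Write a zip matching the layout above: content/{slug}.html and
// highlights/{slug}.md alongside the metadata JSON files.
export const writeExportArchive = async (
  items: ExportItem[],
  zipPath: string
): Promise<void> => {
  const output = fs.createWriteStream(zipPath)
  const archive = archiver('zip', { zlib: { level: 9 } })
  archive.pipe(output)

  for (const item of items) {
    if (item.content) {
      archive.append(item.content, { name: `content/${item.slug}.html` })
    }
    if (item.highlights) {
      archive.append(item.highlights, { name: `highlights/${item.slug}.md` })
    }
  }

  await archive.finalize()
}
```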

Comment on lines +197 to +202:

```ts
if (content) {
  // Add content files to /content
  archive.append(content, {
    name: `content/${slug}.html`,
  })
}
```
sywhb (Contributor, Author):

@jacksonh I wonder if I should call the /content API to get the readable content instead, so we can avoid long-running queries in the DB

jacksonh (Contributor):

Yeah that could help. It would hit the cache I guess?

sywhb (Contributor, Author):

Yeah, it would
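A rough sketch of that idea: fetch the readable content over HTTP so responses can be served from the cache in front of the API. The endpoint path, base URL, and bearer-token auth below are assumptions for illustration, not the actual Omnivore API:

```ts
// Hypothetical helper: fetch readable content over HTTP instead of querying
// the DB directly, so the response can come from the API's cache.
export const fetchReadableContent = async (
  baseUrl: string,
  libraryItemId: string,
  token: string
): Promise<string | undefined> => {
  const res = await fetch(`${baseUrl}/content/${libraryItemId}`, {
    headers: { Authorization: `Bearer ${token}` },
  })
  if (!res.ok) {
    // Skip content for this item rather than failing the whole export.
    return undefined
  }
  return res.text()
}
```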


```ts
export const EXPORT_JOB_NAME = 'export'

export const exportJob = async (jobData: ExportJobData) => {
```
sywhb (Contributor, Author):

@jacksonh Another question: should we just use async jobs to export the data instead of calling the Cloud Run service?

jacksonh (Contributor):

@sywhb Maybe we should start by trying an async job and get a performance baseline in demo. Both our accounts are quite large there, so it would give us an idea.
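For reference, enqueueing the export as an async job with BullMQ could look like the sketch below. The queue name and Redis connection are assumptions; the `jobId` and `removeOn*` options mirror the ones added later in this PR:

```ts
import { Queue } from 'bullmq'

const EXPORT_JOB_NAME = 'export'
const JOB_VERSION = 'v1' // assumed version tag

// Assumed queue wiring; the real queue name and connection live elsewhere.
const queue = new Queue('backendQueue', {
  connection: { host: 'localhost', port: 6379 },
})

export const enqueueExportJob = async (userId: string) => {
  return queue.add(
    EXPORT_JOB_NAME,
    { userId },
    {
      // One job id per user per version: re-adding while a job with this id
      // still exists is a no-op, which deduplicates concurrent exports.
      jobId: `${EXPORT_JOB_NAME}_${userId}_${JOB_VERSION}`,
      removeOnComplete: true,
      removeOnFail: true,
    }
  )
}
```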

```ts
}))

archive.append(JSON.stringify(metadata, null, 2), {
  name: `metadata_${cursor}_to_${cursor + size}.json`,
```
sywhb (Contributor, Author):

@jacksonh Here I export the paginated metadata into multiple JSON files (see the sketch below) to avoid producing one very large JSON file
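A sketch of that pagination loop, matching the `metadata_{cursor}_to_{cursor + size}` naming above; `searchLibraryItems` and the item fields are stand-ins for the real query helper:

```ts
import archiver from 'archiver'

type Archiver = ReturnType<typeof archiver>

// Stand-in for the real query helper; the signature is an assumption.
declare function searchLibraryItems(
  userId: string,
  cursor: number,
  size: number
): Promise<Array<{ id: string; slug: string; title: string; url: string }>>

// Page through the library and append one metadata JSON file per page,
// so no single file grows unboundedly.
export const appendMetadataPages = async (
  archive: Archiver,
  userId: string,
  size = 100
): Promise<void> => {
  let cursor = 0
  for (;;) {
    const items = await searchLibraryItems(userId, cursor, size)
    if (items.length === 0) break

    const metadata = items.map((item) => ({
      id: item.id,
      slug: item.slug,
      title: item.title,
      url: item.url,
    }))

    archive.append(JSON.stringify(metadata, null, 2), {
      name: `metadata_${cursor}_to_${cursor + size}.json`,
    })

    cursor += size
  }
}
```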

Comment on lines +1086 to +1088:

```ts
jobId: `${EXPORT_JOB_NAME}_${userId}_${JOB_VERSION}`,
removeOnComplete: true,
removeOnFail: true,
```
sywhb (Contributor, Author):

The export jobs are deduplicated, but we probably also want to limit the number of exports per user per day, e.g. by checking the number of zip files in cloud storage

jacksonh (Contributor):

Yeah agreed, or even create an entry in postgres, as annoying as that can be

sywhb (Contributor, Author):

Yeah, makes sense to me
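Checking a per-day quota against such a postgres entry could look roughly like the sketch below (using `pg`; the limit value and the `omnivore.export` access pattern are assumptions based on the migration later in this PR):

```ts
import { Pool } from 'pg'

const pool = new Pool() // connection settings come from the environment
const MAX_EXPORTS_PER_DAY = 3 // assumed limit, not from the PR

// Returns true if the user is still under today's export quota, counting
// rows in the omnivore.export table (see the migration in the Squawk report).
export const canStartExport = async (userId: string): Promise<boolean> => {
  const result = await pool.query(
    `SELECT COUNT(*) AS count
       FROM omnivore.export
      WHERE user_id = $1
        AND created_at >= date_trunc('day', CURRENT_TIMESTAMP)`,
    [userId]
  )
  return Number(result.rows[0].count) < MAX_EXPORTS_PER_DAY
}
```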

sywhb (Contributor, Author) left a review comment:

@jacksonh Could you help me clarify a few questions in this PR? Thank you!

github-actions bot commented Aug 27, 2024

Squawk Report

🚒 4 violations across 1 file(s)


packages/db/migrations/0186.do.create_export_table.sql

```sql
-- Type: DO
-- Name: create_export_table
-- Description: Create a table to store the export information

BEGIN;

CREATE TABLE omnivore.export (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v1mc(),
    user_id UUID NOT NULL REFERENCES omnivore.user(id) ON DELETE CASCADE,
    state TEXT NOT NULL,
    total_items INT DEFAULT 0,
    processed_items INT DEFAULT 0,
    task_id TEXT,
    signed_url TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX export_user_id_idx ON omnivore.export(user_id);

GRANT SELECT, INSERT, UPDATE, DELETE ON omnivore.export TO omnivore_user;

COMMIT;
```

🚒 Rule Violations (4)

packages/db/migrations/0186.do.create_export_table.sql:6:2: warning: prefer-big-int (reported twice)
packages/db/migrations/0186.do.create_export_table.sql:6:2: warning: prefer-bigint-over-int (reported twice)

All four violations point at the CREATE TABLE statement above, i.e. the INT columns total_items and processed_items.

  note: Hitting the max 32 bit integer is possible and may break your application.
  help: Use 64bit integer values instead to prevent hitting this limit.

📚 More info on rules

⚡️ Powered by Squawk (1.1.2), a linter for PostgreSQL, focused on migrations

@sywhb (Contributor, Author) commented Aug 27, 2024

Could you take a look at this when you have time?
No rush 🙏 Thank you @jacksonh

jacksonh (Contributor) left a review comment:

I think this looks good, we definitely need to test with some larger libraries in demo. I'm a little worried it could time out and then just get stuck in a loop restarting from the beginning.

@sywhb
Copy link
Contributor Author

sywhb commented Aug 29, 2024

> I think this looks good, we definitely need to test with some larger libraries in demo. I'm a little worried it could time out and then just get stuck in a loop restarting from the beginning.

Yeah, I agree. We should probably create a batch of sub-tasks so we could recover better and rate limit the number of export tasks
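One possible shape for that, sketched with BullMQ (the queue name, sub-job name, and sizes are all assumptions): enqueue one sub-job per page range, so a timeout retries a single slice instead of restarting the whole export.

```ts
import { Queue } from 'bullmq'

const queue = new Queue('backendQueue', {
  connection: { host: 'localhost', port: 6379 },
})

// Fan the export out into per-page sub-jobs; a timeout or crash then retries
// only one slice instead of restarting the whole export from the beginning.
export const enqueueExportBatches = async (
  userId: string,
  totalItems: number,
  batchSize = 500
): Promise<void> => {
  for (let cursor = 0; cursor < totalItems; cursor += batchSize) {
    await queue.add(
      'export-batch', // assumed sub-job name
      { userId, cursor, size: batchSize },
      { jobId: `export-batch_${userId}_${cursor}`, attempts: 3 }
    )
  }
}
```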

@sywhb sywhb merged commit 32f4b68 into main Aug 29, 2024
7 checks passed
@sywhb sywhb deleted the feature/exporter branch August 29, 2024 06:19