
feat: update exporter #4327

Merged — sywhb merged 8 commits into main on Aug 29, 2024
Conversation

@sywhb (Contributor) commented Aug 26, 2024

  • update exported files in the following file structure (see the archive-assembly sketch after this list):

    exports/{userId}/{date}/{uuid}.zip
      - metadata_{start_page}_to_{end_page}.json
      - /content
        - {slug}.html
      - /highlights
        - {slug}.md

  • add an export job and a GET API
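A minimal sketch of how an archive with this layout could be assembled with `archiver`; the `ExportItem` shape and the `writeExportArchive` helper are illustrative assumptions, not the PR's actual code:

```ts
import * as fs from 'fs'
import archiver from 'archiver'

// Hypothetical item shape; the field names are assumptions for illustration.
interface ExportItem {
  slug: string
  content?: string // readable HTML
  highlights?: string // highlights rendered as markdown
}

// Write a zip matching the layout above: content/{slug}.html and
// highlights/{slug}.md alongside the metadata JSON files.
export const writeExportArchive = async (
  items: ExportItem[],
  zipPath: string
): Promise<void> => {
  const output = fs.createWriteStream(zipPath)
  const archive = archiver('zip', { zlib: { level: 9 } })
  archive.pipe(output)

  for (const item of items) {
    if (item.content) {
      archive.append(item.content, { name: `content/${item.slug}.html` })
    }
    if (item.highlights) {
      archive.append(item.highlights, { name: `highlights/${item.slug}.md` })
    }
  }

  await archive.finalize()
}
```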

Comment on lines +197 to +202:

```ts
if (content) {
  // Add content files to /content
  archive.append(content, {
    name: `content/${slug}.html`,
  })
}
```
sywhb (Contributor, Author):

@jacksonh I wonder if I should call the /content API to get the readable content instead, so we can avoid long-running queries in the DB

jacksonh (Contributor):

Yeah that could help. It would hit the cache I guess?

sywhb (Contributor, Author):

Yeah, it would
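A rough sketch of that idea: fetch the readable content over HTTP so responses can be served from the cache in front of the API. The endpoint path, base URL, and bearer-token auth below are assumptions for illustration, not the actual Omnivore API:

```ts
// Hypothetical helper: fetch readable content over HTTP instead of querying
// the DB directly, so the response can come from the API's cache.
export const fetchReadableContent = async (
  baseUrl: string,
  libraryItemId: string,
  token: string
): Promise<string | undefined> => {
  const res = await fetch(`${baseUrl}/content/${libraryItemId}`, {
    headers: { Authorization: `Bearer ${token}` },
  })
  if (!res.ok) {
    // Skip content for this item rather than failing the whole export.
    return undefined
  }
  return res.text()
}
```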


```ts
export const EXPORT_JOB_NAME = 'export'

export const exportJob = async (jobData: ExportJobData) => {
```
sywhb (Contributor, Author):

@jacksonh Another question: should we just use async jobs to export the data instead of calling the Cloud Run service?

jacksonh (Contributor):

@sywhb Maybe we should start by trying an async job and get a performance baseline in demo. Both our accounts are quite large there, so it would give us an idea.
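For reference, enqueueing the export as an async job with BullMQ could look like the sketch below. The queue name and Redis connection are assumptions; the `jobId` and `removeOn*` options mirror the ones added later in this PR:

```ts
import { Queue } from 'bullmq'

const EXPORT_JOB_NAME = 'export'
const JOB_VERSION = 'v1' // assumed version tag

// Assumed queue wiring; the real queue name and connection live elsewhere.
const queue = new Queue('backendQueue', {
  connection: { host: 'localhost', port: 6379 },
})

export const enqueueExportJob = async (userId: string) => {
  return queue.add(
    EXPORT_JOB_NAME,
    { userId },
    {
      // One job id per user per version: re-adding while a job with this id
      // still exists is a no-op, which deduplicates concurrent exports.
      jobId: `${EXPORT_JOB_NAME}_${userId}_${JOB_VERSION}`,
      removeOnComplete: true,
      removeOnFail: true,
    }
  )
}
```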

```ts
}))

archive.append(JSON.stringify(metadata, null, 2), {
  name: `metadata_${cursor}_to_${cursor + size}.json`,
```
sywhb (Contributor, Author):

@jacksonh Here I export the paginated metadata into multiple JSON files (see the sketch below) to avoid producing one very large JSON file
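A sketch of that pagination loop, matching the `metadata_{cursor}_to_{cursor + size}` naming above; `searchLibraryItems` and the item fields are stand-ins for the real query helper:

```ts
import archiver from 'archiver'

type Archiver = ReturnType<typeof archiver>

// Stand-in for the real query helper; the signature is an assumption.
declare function searchLibraryItems(
  userId: string,
  cursor: number,
  size: number
): Promise<Array<{ id: string; slug: string; title: string; url: string }>>

// Page through the library and append one metadata JSON file per page,
// so no single file grows unboundedly.
export const appendMetadataPages = async (
  archive: Archiver,
  userId: string,
  size = 100
): Promise<void> => {
  let cursor = 0
  for (;;) {
    const items = await searchLibraryItems(userId, cursor, size)
    if (items.length === 0) break

    const metadata = items.map((item) => ({
      id: item.id,
      slug: item.slug,
      title: item.title,
      url: item.url,
    }))

    archive.append(JSON.stringify(metadata, null, 2), {
      name: `metadata_${cursor}_to_${cursor + size}.json`,
    })

    cursor += size
  }
}
```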

Comment on lines +1086 to +1088:

```ts
jobId: `${EXPORT_JOB_NAME}_${userId}_${JOB_VERSION}`,
removeOnComplete: true,
removeOnFail: true,
```
sywhb (Contributor, Author):

The export jobs are deduplicated, but we probably also want to limit the number of exports per user per day, e.g. by checking the number of zip files in cloud storage

jacksonh (Contributor):

Yeah agreed, or even create an entry in postgres, as annoying as that can be

sywhb (Contributor, Author):

Yeah, makes sense to me
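Checking a per-day quota against such a postgres entry could look roughly like the sketch below (using `pg`; the limit value and the `omnivore.export` access pattern are assumptions based on the migration later in this PR):

```ts
import { Pool } from 'pg'

const pool = new Pool() // connection settings come from the environment
const MAX_EXPORTS_PER_DAY = 3 // assumed limit, not from the PR

// Returns true if the user is still under today's export quota, counting
// rows in the omnivore.export table (see the migration in the Squawk report).
export const canStartExport = async (userId: string): Promise<boolean> => {
  const result = await pool.query(
    `SELECT COUNT(*) AS count
       FROM omnivore.export
      WHERE user_id = $1
        AND created_at >= date_trunc('day', CURRENT_TIMESTAMP)`,
    [userId]
  )
  return Number(result.rows[0].count) < MAX_EXPORTS_PER_DAY
}
```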

sywhb (Contributor, Author) left a review comment:

@jacksonh Could you help me clarify a few questions in this PR? Thank you!

github-actions bot commented Aug 27, 2024

Squawk Report

🚒 4 violations across 1 file(s)


packages/db/migrations/0186.do.create_export_table.sql

```sql
-- Type: DO
-- Name: create_export_table
-- Description: Create a table to store the export information

BEGIN;

CREATE TABLE omnivore.export (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v1mc(),
    user_id UUID NOT NULL REFERENCES omnivore.user(id) ON DELETE CASCADE,
    state TEXT NOT NULL,
    total_items INT DEFAULT 0,
    processed_items INT DEFAULT 0,
    task_id TEXT,
    signed_url TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX export_user_id_idx ON omnivore.export(user_id);

GRANT SELECT, INSERT, UPDATE, DELETE ON omnivore.export TO omnivore_user;

COMMIT;
```

🚒 Rule Violations (4)

packages/db/migrations/0186.do.create_export_table.sql:6:2: warning: prefer-big-int (reported twice)
packages/db/migrations/0186.do.create_export_table.sql:6:2: warning: prefer-bigint-over-int (reported twice)

All four violations point at the CREATE TABLE statement above, i.e. the INT columns total_items and processed_items.

  note: Hitting the max 32 bit integer is possible and may break your application.
  help: Use 64bit integer values instead to prevent hitting this limit.

📚 More info on rules

⚡️ Powered by Squawk (1.1.2), a linter for PostgreSQL, focused on migrations

@sywhb (Contributor, Author) commented Aug 27, 2024

Could you take a look at this when you have time?
No rush 🙏 Thank you @jacksonh

jacksonh (Contributor) left a review comment:

I think this looks good, we definitely need to test with some larger libraries in demo. I'm a little worried it could time out and then just get stuck in a loop restarting from the beginning.

@sywhb
Copy link
Contributor Author

sywhb commented Aug 29, 2024

> I think this looks good, we definitely need to test with some larger libraries in demo. I'm a little worried it could time out and then just get stuck in a loop restarting from the beginning.

Yeah, I agree. We should probably create a batch of sub-tasks so we could recover better and rate limit the number of export tasks
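One possible shape for that, sketched with BullMQ (the queue name, sub-job name, and sizes are all assumptions): enqueue one sub-job per page range, so a timeout retries a single slice instead of restarting the whole export.

```ts
import { Queue } from 'bullmq'

const queue = new Queue('backendQueue', {
  connection: { host: 'localhost', port: 6379 },
})

// Fan the export out into per-page sub-jobs; a timeout or crash then retries
// only one slice instead of restarting the whole export from the beginning.
export const enqueueExportBatches = async (
  userId: string,
  totalItems: number,
  batchSize = 500
): Promise<void> => {
  for (let cursor = 0; cursor < totalItems; cursor += batchSize) {
    await queue.add(
      'export-batch', // assumed sub-job name
      { userId, cursor, size: batchSize },
      { jobId: `export-batch_${userId}_${cursor}`, attempts: 3 }
    )
  }
}
```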

@sywhb sywhb merged commit 32f4b68 into main Aug 29, 2024
7 checks passed
@sywhb sywhb deleted the feature/exporter branch August 29, 2024 06:19