@sanity/import

Imports documents from an ndjson-stream to a Sanity dataset

Installing

npm install --save @sanity/import

Usage

const fs = require('fs')
const sanityClient = require('@sanity/client')
const sanityImport = require('@sanity/import')

const client = sanityClient({
  projectId: '<your project id>',
  dataset: '<your target dataset>',
  token: '<token-with-write-perms>',
  useCdn: false,
})

// Input can be a readable stream (for a `.tar.gz` or `.ndjson` file), a folder location (string), or an array of documents
const input = fs.createReadStream('my-documents.ndjson')

const options = {
  /**
   * A Sanity client instance, preconfigured with the project ID and dataset
   * you want to import data to, and with a token that has write access.
   */
  client: client,

  /**
   * Which mutation type to use for creating documents:
   * `create` (default)  - throws an error if a document ID already exists
   * `createOrReplace`   - replaces documents with the same IDs
   * `createIfNotExists` - skips documents whose IDs already exist
   *
   * Optional.
   */
  operation: 'create',

  /**
   * Function called when making progress. Gets called with an object of
   * the following shape:
   * `step` (string) - the current step name of the import process
   * `current` (number) - the current progress of the step, only present on some steps
   * `total` (number) - total items before complete, only present on some steps
   */
  onProgress: (progress) => {
    /* report progress */
  },

  /**
   * Whether or not to allow assets in different datasets. This is usually
   * an error in the export, where asset documents are part of the export.
   *
   * Optional, defaults to `false`.
   */
  allowAssetsInDifferentDataset: false,

  /**
   * Whether or not to allow failing assets due to download/upload errors.
   *
   * Optional, defaults to `false`.
   */
  allowFailingAssets: false,

  /**
   * Whether or not to replace any existing assets with the same hash.
   * Setting this to `true` will regenerate image metadata on the server,
   * but slows down the import.
   *
   * Optional, defaults to `false`.
   */
  replaceAssets: false,

  /**
   * Whether or not to skip cross-dataset references. This may be required
   * when importing a dataset with cross-dataset references to a different
   * project, unless a dataset with the referenced name exists.
   *
   * Optional, defaults to `false`.
   */
  skipCrossDatasetReferences: false,

  /**
   * Whether or not to import system documents (like permissions and custom
   * retention). This is usually not necessary, and may cause conflicts if the
   * target dataset already contains these documents. On a new dataset, it is
   * recommended to re-create roles and any custom retention policies manually.
   *
   * Optional, defaults to `false`.
   */
  allowSystemDocuments: false,
}

sanityImport(input, options)
  .then(({numDocs, warnings}) => {
    console.log('Imported %d documents', numDocs)
    // Note: There might be warnings! Check `warnings`
  })
  .catch((err) => {
    console.error('Import failed: %s', err.message)
  })
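
The `onProgress` callback above receives `step`, and on some steps also `current` and `total`. As a minimal sketch of how such events could be turned into log lines, the helper below formats one event as a string. `formatProgress` is a hypothetical name and not part of `@sanity/import`, and the step name in the usage comment is made up for illustration.

```javascript
// Hypothetical helper (not part of @sanity/import): turns one progress
// event into a single human-readable line.
function formatProgress({step, current, total}) {
  // `current` and `total` are only present on some steps, so fall back
  // to printing just the step name when either is missing.
  if (typeof current !== 'number' || typeof total !== 'number' || total === 0) {
    return step
  }
  const percent = Math.floor((current / total) * 100)
  return `${step}: ${current}/${total} (${percent}%)`
}

// Possible wiring into the import options:
// const options = {client, onProgress: (progress) => console.log(formatProgress(progress))}
console.log(formatProgress({step: 'example step', current: 5, total: 20}))
```

Keeping the formatting in a separate function means the callback itself stays a one-liner and the output format can be checked without running an import.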

CLI-tool

This functionality is built into the sanity package as sanity dataset import, but is also usable through the sanity-import CLI tool, part of this package:

$ sanity-import --help

  CLI tool that imports documents from an ndjson file or URL

  Usage
    $ sanity-import -p <projectId> -d <dataset> -t <token> sourceFile.ndjson

  Options
    -p, --project <projectId> Project ID to import to
    -d, --dataset <dataset> Dataset to import to
    -t, --token <token> Token to authenticate with
    --asset-concurrency <concurrency> Number of parallel asset imports
    --replace Replace documents with the same IDs
    --missing Skip documents that already exist
    --allow-failing-assets Skip assets that cannot be fetched/uploaded
    --replace-assets Skip reuse of existing assets
    --skip-cross-dataset-references Skips references to other datasets
    --help Show this help

  Examples
    # Import "./my-dataset.ndjson" into dataset "staging"
    $ sanity-import -p myPrOj -d staging -t someSecretToken my-dataset.ndjson

    # Import into dataset "test" from stdin, read token from env var
    $ cat my-dataset.ndjson | sanity-import -p myPrOj -d test -

  Environment variables (fallbacks for missing flags)
    --token = SANITY_IMPORT_TOKEN
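
The `--replace` and `--missing` descriptions read like the `createOrReplace` and `createIfNotExists` values of the `operation` option described under Usage. The sketch below shows one way a script could map such flags to an `operation` value. The mapping is inferred from the descriptions only, `getOperation` is a hypothetical name, and rejecting both flags together is an assumption rather than documented behavior.

```javascript
// Hypothetical helper: picks an `operation` value from CLI-style flags.
// The flag-to-operation mapping is inferred from the help text and the
// option descriptions; it is not taken from the tool's source.
function getOperation({replace = false, missing = false} = {}) {
  if (replace && missing) {
    // Assumption: the two flags contradict each other, so refuse both.
    throw new Error('Cannot use both --replace and --missing')
  }
  if (replace) {
    return 'createOrReplace'
  }
  if (missing) {
    return 'createIfNotExists'
  }
  return 'create' // documented default for `operation`
}

console.log(getOperation({replace: true}))
```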

Future improvements

  • When documents are imported, record which IDs are actually touched
    • Only upload assets for documents that are still within that window
    • Only strengthen references for documents that are within that window
    • Only count number of imported documents from within that window
  • Asset uploads and strengthening can be done in parallel, but we need a way to cancel the operations if one of the operations fails
  • Introduce retrying of asset uploads based on hash + indexing delay
  • Validate that dataset exists upon start
  • Reference verification
    • Create a set of all document IDs in import file
    • Create a set of all document IDs in references
    • Create a set of referenced IDs that do not exist locally
    • Batch-wise, check if documents with missing IDs exist remotely
    • When all missing IDs have been cross-checked with the remote API (or a max of say 100 items have been found missing), reject with useful error message.
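
The first three reference verification steps above could be sketched as follows. This is only an illustration of the idea, not code from this package: `collectRefs` and `findLocallyMissingRefs` are hypothetical names, and the sketch assumes references appear as objects with a string `_ref` property. The remote, batch-wise check is left out because it depends on the client API.

```javascript
// Recursively collect every string `_ref` value found inside a value.
function collectRefs(value, refs = new Set()) {
  if (Array.isArray(value)) {
    value.forEach((item) => collectRefs(item, refs))
  } else if (value && typeof value === 'object') {
    if (typeof value._ref === 'string') {
      refs.add(value._ref)
    }
    Object.values(value).forEach((child) => collectRefs(child, refs))
  }
  return refs
}

// Return the referenced IDs that have no matching document in the input.
function findLocallyMissingRefs(documents) {
  // Set of all document IDs in the import
  const documentIds = new Set(documents.map((doc) => doc._id))

  // Set of all document IDs used in references
  const referencedIds = new Set()
  documents.forEach((doc) => collectRefs(doc, referencedIds))

  // Referenced IDs that do not exist locally
  return [...referencedIds].filter((id) => !documentIds.has(id))
}

const docs = [
  {_id: 'a', _type: 'post', author: {_type: 'reference', _ref: 'b'}},
  {_id: 'b', _type: 'person', friends: [{_type: 'reference', _ref: 'c'}]},
]
console.log(findLocallyMissingRefs(docs))
```

The returned IDs are the candidates that would still need to be checked against the remote dataset before the import is rejected.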

License

MIT-licensed. See LICENSE.