canvas-data-embulk-configs

YAML configs for importing Canvas Data with Embulk

Essentially, these are provided as a starting point for your own workflow, to manage Canvas Data in YAML instead of in code.

Visit Managing Canvas Data with Embulk on the CanvasLMS Community for discussions and workflow ideas.

Canvas Data schema v4.2.5

Embulk

Embulk is an open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services. https://www.embulk.org/docs/

with support for multiple output databases, and with features useful for Canvas Data:

  • Decoding of gzipped files
  • The ability to intelligently guess the format and data types of CSV files
  • Parallel execution of tasks: multi-threading per CPU core, with a task for each batch file
  • The CSV Input Plugin as Embulk's default input
  • Filtering of data with Filter Plugins, https://plugins.embulk.org/#filter
  • Output of data to SQL
    • Insert, Insert Direct, Replace, Merge, Truncate, and Truncate Insert modes
    • Timestamp formatting
    • Time zone conversion from UTC for date/time columns
    • before_load and after_load config options to run queries before the import (e.g. truncate) and after it (e.g. indexes), and more
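
For example, on the output side those features map to plugin options roughly like the following. This is a minimal sketch assuming the embulk-output-sqlserver plugin; the host, credentials, table, and index names are placeholders, not the repo's actual values.

  out:
    type: sqlserver                            # or mysql, postgresql, oracle
    host: localhost
    user: embulk
    password: ''
    database: canvas_data
    table: course_dim
    mode: replace                              # other modes: insert, insert_direct, merge, truncate_insert
    default_timezone: 'America/Los_Angeles'    # format timestamp columns in this zone instead of UTC
    column_options:
      created_at: {type: 'datetime2', timestamp_format: '%Y-%m-%d %H:%M:%S'}
    after_load: 'CREATE INDEX ix_course_dim_id ON course_dim (id)'   # recreate indexes after a replace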

For more details, see the Wiki docs.

Config files?

Embulk uses a YAML config file for each task. For Canvas Data, this means each input source (a table's batch files) and its output destination (a database table) is one file, including the differences between staging, test, and production destinations. 100-plus config files may seem like an odd workflow at first, but it's a lot less work to manage than generating DDLs from schema.json, and plugins are a lot easier than coding custom sorting and filtering tasks.
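
As a rough sketch of what one of those files looks like, assuming gzipped, tab-separated extracts and a SQL Server destination (the paths, credentials, and truncated column list below are placeholders; the configs in this repo spell out every column from the Canvas Data schema):

  in:
    type: file
    path_prefix: /data/canvas/course_dim/      # this table's batch files
    decoders:
      - {type: gzip}                           # extracts arrive gzipped
    parser:
      type: csv
      charset: UTF-8
      newline: LF
      delimiter: "\t"
      null_string: '\N'
      skip_header_lines: 0
      columns:
        - {name: id, type: long}
        - {name: canvas_id, type: long}
        - {name: name, type: string}
        - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
        # ... remaining columns per the schema
  out:
    type: sqlserver
    host: localhost
    user: embulk
    password: ''
    database: canvas_data
    table: course_dim
    mode: replace                              # rebuild the table on every run; see the output options above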

Embulk can recreate the whole table each time the config is run. This means the config file is the only thing you ever edit, leaving

  • less importer coding
  • less DDL scripting
  • zero overrides

Maintainers

I will attempt to keep these configs up to date, tagged with each schema version so you can use them in your own workflow. However, it's unlikely I'll be able to maintain and test the configs for 4 databases regularly, nor can I see the data produced by each institution's use of CanvasLMS. You may see scenarios, data, and values others have not. I'm currently using MS SQL Server, and would appreciate help from anyone using these configs in maintaining them as Canvas Data changes.

Contributing

If you use this repository, please consider submitting a Pull Request or Issue for the following:

  • If you experience issues with the CSV parsing parameters
  • If you hit errors getting data and rows into SQL
  • If you can improve the SQL
  • If you find enumerable values that are not listed
  • If you find column lengths that are inaccurate, or your own data is longer than defined
  • We do not use Canvas Catalog, so the configs for those tables are not prepared or tested
  • The Oracle configs are currently only set up for insert_method: normal and not oci. OCI greatly improves import speed. If you can help support this, please consider testing and documenting (a sketch of the change follows this list). https://github.com/embulk/embulk-output-jdbc/tree/master/embulk-output-oracle#insert-methods
  • Oracle has some compatibility and identifier length issues between versions, and I am currently unable to get the after_load: indexes working with my dev version, though they do work directly in a SQL editor.
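
For reference, per the embulk-output-oracle docs linked above, switching to OCI should be roughly a one-line change on the output side. This is untested here, requires Oracle's OCI client libraries on the Embulk host, and the connection details below are placeholders:

  out:
    type: oracle
    host: localhost
    user: canvas
    password: ''
    database: XE
    table: COURSE_DIM
    mode: replace
    insert_method: oci    # default is normal; oci loads rows through the native OCI client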