Skip to content
Friedrich Lindenberg edited this page Jun 6, 2015 · 1 revision

The following documentations aims to outline the way in which SpenDB stores and queries data. Note that this documentation is aimed at developers who want to modify core functions of the platform. For anyone who wants to simply load data or use publicly accessible APIs, see the help under Web API.

Offline analytics in SpenDB

SpenDB provides a multi-tenant OLAP-style data store, a concept sometimes referred to as a data mart. The system aims to allow the addition of further datasets at run-time and via the web, while keeping loaded data immutable. For each dataset, a reference is managed within the core data model, and a specific fact table is generated to keep the actual data.

While OpenSpending used a star schema representation with multiple tables generated per dataset, SpenDB does not attempt to normalise the data as it is loaded.

Modeling/mapping schema

SpenDB keeps an extensive set of metadata for each Dataset. The metadata is used to query the physical model and to provide general information about the dataset.

The metadata is stored in a specified object structure, which is often represented as JSON. The basic layout is this:

"dataset": {
  ... basic dataset attributes ...
},
"model": {
  "dimensions": {
    ... dimension descriptions ...
  },
  "measures": {
    ... measure descriptions ...
  }
}

Each of these sections is documented below.

Dataset core metadata

The core Dataset attributes are very generic and easily explained:

"dataset": {
  "name": "machine-name",
  "label": "Nicer, human-friendly Title",
  "currency": "EUR",
  "description": "This can be Markdown-formatted"
}

The name of the dataset will be part of each URL that refers to it, so it makes sense to choose a concise term without any special characters, such as spaces, symbols or text with accents or umlauts.

currency is expected to be a valid, three-letter currency code, e.g. EUR or USD. All measures are by default assumed to be specified in this currency, unless otherwise noted.

Dimension and measures model

The second section of the model, model, serves a duplicate function: it is both used to define how the data should be modelled in SpenDB and how values for each attribute can be located within the generated fact table. Future versions of SpenDB may break this up, defining both a model and mapping.

For dimensions and measures, a simple model format is available:

"model": {
  "measures": {
    "amount": {
      "label": "Amount paid",
      "description": "...",
      "column": "amt",
      "type": "float"
    }
  },
  "dimensions": {
    "time": {
      "label": "Time of transaction",
      "description": "...",
      "attributes": {
        "year": {
          "type": "integer",
          "column": "year_paid"
        }
      }
    },
    "transaction": {
      "type": "value",
      "label": "Transaction ID",
      "description": "12-digit identifier for each entry.",
      "attributes": {
        "label": {
          "type": "string",
          "column": "tx_id",
          "key": true
        }
      }
    }
  }
}

The model above defines three concepts, one measure and two dimensions. The meaning of type, label and description are somewhat self-explanatory. column is used to define the source column, based on a normalised form of the source data file column name.

A valid input CSV file for the model defined above might look like this:

tx_id,year_paid,amt
D38DEF-ZZ,2008,5044.0
AAA372-39,2011,43.5
(missing),2009,2854922.0

Physical model

When loading a :py:class:~.Dataset, SpenDB will generate a single table (and columns) to represent the data. A table called <dataset_name>__fact will be generated for each dataset with an id column. The id is auto-generated for each entry.

Attribute name conventions

SpenDB also gives special importance to a set of other attributes of compound dimensions so that it makes sense to define as many of them as possible:

  • name must be a unique, identifying key for each member of the dimension.
  • label is assumed to be a human-readable identifier that will be used as a title and heading for the dimension member pahe and references to the member in general.
  • uri contains a unified URI for the entity mentioned in this dimension, e.g. an OpenCorporates URI or a reference to a classificiation scheme.
  • color will be used when the dimension member is included in visualizations. If no color is set, it will be selected from a pre-defined palette.
  • parent is reserved for future use.