DataHub is made up of a generic backend and a React-based UI. The original DataHub blog post discusses the design extensively and describes some of DataHub's features. Our open sourcing blog post also compares the features of LinkedIn's production DataHub with those of open source DataHub. Below is a list of the features currently available in DataHub, as well as those that will become available soon.
- Search: full-text & advanced search, search ranking
- Browse: browsing through a configurable hierarchy
- Schema: table & document schema in tabular and JSON format
- Coarse grain lineage: support for lineage at the dataset level, tabular & graphical visualization of downstreams/upstreams
- Ownership: surfacing owners of a dataset, viewing datasets you own
- Dataset life-cycle management: deprecate/undeprecate, surface removed datasets and tag them as "removed"
- Institutional knowledge: support for adding free form doc to any dataset
- Fine grain lineage: support for lineage at the field level [coming soon]
- Social actions: likes, follows, bookmarks [coming soon]
- Compliance management: field level tag based compliance editing [coming soon]
- Top users: frequent users of a dataset [coming soon]
- Globally defined: tags provide a standardized set of labels that can be shared across all your entities
- Supports entities and schemas: tags can be applied at the entity level or, for datasets, attached to schema fields
- Searchable: entities can be searched and filtered by tag
- Search: full-text & advanced search, search ranking
- Browse: browsing through a configurable hierarchy [coming soon]
- Profile editing: LinkedIn style professional profile editing such as summary, skills
- Search: full-text & advanced search, search ranking
- Basic information: ownership, location. Link to external service for viewing the dashboard.
- Institutional knowledge: support for adding free form doc to any dashboard [coming soon]
- Search: full-text & advanced search, search ranking
- Browse: browsing through a configurable hierarchy
- Schema history: view and diff historic versions of schemas
- GraphQL: visualization of GraphQL schemas
- Search: full-text & advanced search, search ranking
- Browse: browsing through a configurable hierarchy
- Basic information:
- Execution history: Executions and their status. Link to external service for viewing full info.
- Search: full-text & advanced search, search ranking
- Browse: browsing through a configurable hierarchy
- Basic information: ownership, dimensions, formula, input & output datasets, dashboards
- Institutional knowledge: support for adding free form doc to any metric
There's a basic, Java-oriented overview of metadata ingestion.
We also have a Python-based ingestion framework which supports the following sources:
- Hive
- Kafka
- RDBMS (MySQL, Oracle, Postgres, MS SQL, etc.)
- Data warehouse (Snowflake, BigQuery, etc.)
- LDAP
That ingestion framework is extensible, so you can easily create new sources of metadata. You just need to transform the metadata into our standard MCE format, and the framework will help ingest it into DataHub.
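
To make the idea of a custom source more concrete, here is a minimal, self-contained sketch of the transform step. It does not use the real ingestion framework's classes: `TableInfo`, `InMemorySource`, `to_mce`, and `run_pipeline` are hypothetical names invented for illustration, and the dictionary it produces is a simplified stand-in for an MCE, not the actual schema. The URN layout is also an assumption about how dataset URNs are typically written.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, List


@dataclass
class TableInfo:
    """Hypothetical raw metadata record, as a custom source might read it."""
    platform: str
    name: str
    owners: List[str]
    env: str = "PROD"


def dataset_urn(platform: str, name: str, env: str) -> str:
    # Assumed URN layout for a dataset; check the real URN rules before relying on it.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def to_mce(table: TableInfo) -> Dict[str, Any]:
    """Transform one raw record into a simplified, MCE-shaped dictionary."""
    owners = [
        {"owner": f"urn:li:corpuser:{user}", "type": "DATAOWNER"}
        for user in table.owners
    ]
    return {
        "proposedSnapshot": {
            "urn": dataset_urn(table.platform, table.name, table.env),
            "aspects": [{"ownership": {"owners": owners}}],
        }
    }


class InMemorySource:
    """Hypothetical source: yields one MCE-shaped event per table it knows about."""

    def __init__(self, tables: Iterable[TableInfo]) -> None:
        self._tables = list(tables)

    def events(self) -> Iterator[Dict[str, Any]]:
        for table in self._tables:
            yield to_mce(table)


def run_pipeline(source: InMemorySource, sink: List[Dict[str, Any]]) -> int:
    """Stand-in for the framework: drain the source into a sink, return the event count."""
    count = 0
    for event in source.events():
        sink.append(event)
        count += 1
    return count
```

In this sketch, a new source only has to implement `events()`; everything downstream works on the common event shape, which is the property that makes the framework extensible.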