Send Process and ProcessTemplate objects from DataStage #54

Closed
cmgrote opened this issue Aug 29, 2019 · 9 comments

@cmgrote
Member

cmgrote commented Aug 29, 2019

Scenarios:

  1. If a Process's inputs and outputs are not fixed at design time (i.e. virtual assets appear as an input or output in IGC), send a ProcessTemplate.
  2. If a Process's inputs and outputs are fixed at design time (i.e. no virtual assets as input or output in IGC), send a Process. This includes the case where a lineage mapping ("alias" / same-as) has been created for the inputs / outputs after deployment (but still before run-time).
  3. If a Process's inputs and outputs are resolved at run-time (e.g. the file name includes a date), send a Process with a relationship to the Process(Template) above. Run-time stats (elapsed time, record counts, etc.) would go only to Open Lineage services (not via Egeria).
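
A minimal sketch of the three scenarios above as a decision function, in case it helps the discussion. All names here (`Payload`, `Decision`, `decide`) are hypothetical and are not part of the DataStage connector or of Egeria; the two boolean flags are assumed to be derivable from what IGC knows about the job.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Payload(Enum):
    PROCESS = "Process"
    PROCESS_TEMPLATE = "ProcessTemplate"


@dataclass(frozen=True)
class Decision:
    # Hypothetical result type: what to send, and what it relates to.
    payload: Payload
    # Scenario 3 only: the design-time object this run-time Process relates to.
    relates_to: Optional[Payload] = None


def design_time_payload(has_virtual_assets: bool) -> Payload:
    """Scenarios 1 and 2: choose by whether any input/output is a virtual asset."""
    return Payload.PROCESS_TEMPLATE if has_virtual_assets else Payload.PROCESS


def decide(has_virtual_assets: bool, resolved_at_run_time: bool) -> Decision:
    design = design_time_payload(has_virtual_assets)
    if resolved_at_run_time:
        # Scenario 3: a run-time Process related to the design-time object.
        return Decision(Payload.PROCESS, relates_to=design)
    return Decision(design)


print(decide(has_virtual_assets=True, resolved_at_run_time=False).payload.value)
```

Run-time stats are deliberately left out of the sketch, since per scenario 3 they would not travel via Egeria at all.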
@cmgrote cmgrote added the enhancement New feature or request label Aug 29, 2019
@cmgrote cmgrote self-assigned this Aug 29, 2019
@cmgrote
Member Author

cmgrote commented Oct 4, 2019

On further discussion at the offsite in October, it seemed more likely that DataStage will always send only Process objects and never a ProcessTemplate...

TBC based on conclusion in odpi/egeria#1576

@popa-raluca
Contributor

What about the virtual assets from IGC? Will DataStage only send processes that have the assets resolved?

@cmgrote
Member Author

cmgrote commented Oct 4, 2019

This is why I'm trying to get to the bottom of what we agreed for a ProcessTemplate. Virtual assets are simply assets that are not formally resolved in IGC, but they are still fully-described in the sense that they have names, data types, etc.

For example, take a database_column virtual asset. It will have:

  • a name
  • a data type
  • a parent database_table
  • the database_table will have a parent database_schema
  • the database_schema will have a parent database
  • etc
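
To make that concrete, here is a rough sketch of such a parent chain and of walking it upwards. The `IgcAsset` class, its fields and `context_path` are hypothetical illustrations (not IGC's REST model); only the asset type names come from the list above, and the sample names are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IgcAsset:
    # Hypothetical, simplified view of an IGC asset.
    asset_type: str
    name: str
    parent: Optional["IgcAsset"] = None
    data_type: Optional[str] = None   # only meaningful for columns
    virtual: bool = False             # True when not formally resolved in IGC


def context_path(asset: IgcAsset) -> List[str]:
    """Walk the parent links and return the chain from the top-level asset down."""
    path = []
    node: Optional[IgcAsset] = asset
    while node is not None:
        path.append(f"{node.asset_type}:{node.name}")
        node = node.parent
    path.reverse()
    return path


# Made-up example values, purely for illustration.
database = IgcAsset("database", "SALES_DB", virtual=True)
schema = IgcAsset("database_schema", "PUBLIC", parent=database, virtual=True)
table = IgcAsset("database_table", "ORDERS", parent=schema, virtual=True)
column = IgcAsset("database_column", "ORDER_ID", parent=table,
                  data_type="INTEGER", virtual=True)

print(" > ".join(context_path(column)))
```

The point is that the walk works the same way whether `virtual` is true or false; nothing in the chain depends on the asset being formally resolved.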

So from the perspective of representing the input or output (via a PortImplementation), I'd suggest they're as fully-descriptive as any other "real" input or output to a Process. They simply won't ever have a SemanticAssignment associated with them, and there won't ever be an event for them that can be picked up by the EventMapper and sent as an entity instance to the rest of the cohort...

However, we could still create these as new SchemaTypes as part of the payload I send along to the Data Engine OMAS, to ensure that the Data Engine OMAS knows about them despite not being synced at the underlying OMRS level (?)

The reality is that they are still likely to be useful from a design lineage perspective, I think, so I wouldn't want to drop them out entirely (and my vague memory of our discussion in Huizen was that ProcessTemplate was something that never went into lineage on its own; it could only get into lineage by being used in a Process).

@popa-raluca
Contributor

The issue that we had with the virtual assets was related to Asset Lineage OMAS. When building the graph, it cannot retrieve the whole context for the virtual asset. It can only go up to the SchemaType, and it needs the whole context for all the assets involved, not only for the ones that have a SemanticAssignment. @DimitriosMaimaris please correct me if I'm wrong :)

@cmgrote
Member Author

cmgrote commented Oct 4, 2019

If by "the whole context" you mean the column, table, schema, etc., I should be able to provide that (though I likely wasn't in the past), assuming our OMAS-level interface would allow all of that to be communicated through it (not sure?)

@DimitriosMaimaris

The problem we had was exactly what Raluca said. By "whole context" we mean everything: being able to take, let's say, the Vertical Lineage for the asset up to the Connection level, whether or not it has a Glossary Term attached to it.

@popa-raluca
Contributor

@cmgrote just to confirm, would the DataStage proxy be able to create the "whole context" if Data Engine OMAS provides the corresponding endpoint for creating it?

@cmgrote
Member Author

cmgrote commented Oct 7, 2019

My intention would indeed be to create:

  • RelationalColumn
  • NestedSchemaAttribute
  • RelationalTable
  • AttributeForSchema
  • RelationalDBSchemaType
  • AssetSchemaType
  • DeployedDatabaseSchema
  • DataContentForDataSet
  • Database

I'd need to check whether I could actually produce something above that (ie. ConnectionToAsset, Connection, ConnectionConnectorType, ConnectorType, ConnectionEndpoint and Endpoint) -- worst case perhaps it makes sense to generate a "placeholder" for those where I always use the same generated placeholder values for virtual assets (?)
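
As a rough illustration of what I mean, the list above can be read as a chain that alternates entity and relationship types, from the column up to the database. The sketch below assumes that reading (it is my interpretation of the list, not a confirmed mapping), and the placeholder values for the connection-level types are entirely made up; `as_triples` and `placeholder_connection` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

# Assumed reading of the list: entity, relationship, entity, relationship, ...
CONTEXT_CHAIN: List[str] = [
    "RelationalColumn", "NestedSchemaAttribute",
    "RelationalTable", "AttributeForSchema",
    "RelationalDBSchemaType", "AssetSchemaType",
    "DeployedDatabaseSchema", "DataContentForDataSet",
    "Database",
]


def as_triples(chain: List[str]) -> List[Tuple[str, str, str]]:
    """Split an alternating chain into (entity, relationship, entity) triples."""
    return [(chain[i], chain[i + 1], chain[i + 2])
            for i in range(0, len(chain) - 2, 2)]


def placeholder_connection() -> Dict[str, Dict[str, str]]:
    """Return the same generated placeholder values for every virtual asset.

    Keys are the connection-level entity types; the qualified names are
    invented here purely for illustration.
    """
    return {
        "Connection": {"qualifiedName": "virtual-asset::connection"},
        "ConnectorType": {"qualifiedName": "virtual-asset::connector-type"},
        "Endpoint": {"qualifiedName": "virtual-asset::endpoint"},
    }


for child, relationship, parent in as_triples(CONTEXT_CHAIN):
    print(f"{child} -[{relationship}]-> {parent}")
```

Using constant placeholder values means every virtual asset would resolve to the same Connection, ConnectorType and Endpoint, which keeps the context complete up to the Connection level without inventing per-asset connection details.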

@cmgrote
Member Author

cmgrote commented Oct 17, 2019

Replacing with #93 as this seems to have moved away from Process and ProcessTemplate to how we can handle virtual assets like any other asset...

@cmgrote cmgrote closed this as completed Oct 17, 2019