Data Stage proxy governance rule creation #432

Open
AlexCostoiu opened this issue Nov 13, 2020 · 1 comment

Hello,

I'm opening this issue regarding the Information Governance Rules created for synchronization in the IGC source environment. As per their definitions these rules are meant to govern other assets and to show that those specific assets comply with certain business requirements.

So this glossary item type is not used as it was meant to be by the proxy. Because of this, the Glossary Author role is needed, and glossary items are created in the source catalog (which probably shouldn't happen, as they are not relevant to their metadata). Another risk is that business users might delete this rule, in which case the proxy would lose its last synchronization time.

Considering all of the above, is there a plan to record and store the timestamp of the last proxy run for each project on the proxy side or on the target (Egeria) side? Can this be changed?

Best Regards,
Alex

@AlexCostoiu AlexCostoiu changed the title IGC proxy governance rule creation Data Stage proxy governance rule creation Nov 13, 2020

cmgrote commented Nov 13, 2020

Hi @AlexCostoiu,

I'm not sure I necessarily agree with your assessment:

As per their definitions these rules are meant to govern other assets and to show that those specific assets comply with certain business requirements. So this glossary item type is not used as it was meant to be by the proxy.

Metadata about data processing is an asset itself (to compliance teams and potentially other business areas).

  1. It is important to share this information about data processing, to inform end-to-end lineage including systems and other assets that are beyond the boundary of this particular system (IGC / DataStage)
  2. It is important to understand the breadth of information / systems / assets covered
  3. It is important to understand how "fresh" or up-to-date this information is

These governance rules therefore capture that (1) the information is shared, (2) its scope (the projects included), and (3) when it was last shared. By your definition, the rule therefore shows that this specific asset (data processing information) complies with certain business requirements (1, 2, and 3). You might even suggest that there could be a governance policy, under which these rules are placed, that defines the need for providing "end-to-end data lineage".

But of course you can disagree with this assessment. To your question about changing the behaviour: in the simplest terms, the connector code here is provided under an open source license, so you are free to fork it and do whatever you like with it 😉

With regard to changing its current design here in this repository, we'd need to determine what that revised design should be, and then there are various avenues we could pursue to achieve it (pull requests, etc.).
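
To make the three points concrete, here is a minimal sketch of the information such a rule carries. The `SyncRule` class, its field names, and the sample values are hypothetical and only illustrate the idea; they are not the connector's actual representation of a governance rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SyncRule:
    """Hypothetical stand-in for a governance rule that tracks synchronization."""

    name: str              # (1) the rule's existence shows the information is shared
    projects: tuple        # (2) scope: the projects covered
    last_synced: datetime  # (3) freshness: when it was last shared

    def is_fresh(self, now: datetime, max_age: timedelta) -> bool:
        """True when the last synchronization is no older than max_age."""
        return now - self.last_synced <= max_age


# Illustrative values only.
rule = SyncRule(
    name="Job metadata synchronization (illustrative)",
    projects=("projectA", "projectB"),
    last_synced=datetime(2020, 11, 13, 8, 0, tzinfo=timezone.utc),
)
```

A compliance team could then check freshness with something like `rule.is_fresh(now, timedelta(hours=24))`.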

Regarding your suggestions above: we have not considered adding such recording in the proxy itself, as the proxy itself is basically stateless (the state is all managed through the repository it proxies -- IGC itself). Adding a separate layer of state management would fundamentally change the operating characteristics of the proxy (requiring additional components and persistent storage to be available at all times the proxy itself is running). I believe this would actually change the behaviour of proxies in Egeria in general, so would be a fairly fundamental change not just to this connector but to Egeria as well.
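
As a rough illustration of what "stateless" means here, the sketch below keeps the last synchronization time in the proxied repository and nowhere else. `SourceRepository`, `run_sync`, and the job and project names are hypothetical in-memory stand-ins, not the connector's real classes or IGC's API.

```python
from datetime import datetime, timezone


class SourceRepository:
    """Hypothetical in-memory stand-in for the proxied repository."""

    def __init__(self):
        self.jobs = {}     # project -> {job name: last-modified time}
        self.markers = {}  # project -> last synchronization time


def run_sync(repo, project, now):
    """One stateless pass: every piece of state is read from and written to repo."""
    since = repo.markers.get(project)
    changed = sorted(
        name
        for name, modified in repo.jobs.get(project, {}).items()
        if since is None or modified > since
    )
    repo.markers[project] = now  # record the run in the repository itself
    return changed


def at(hour):
    """Helper producing illustrative UTC timestamps on a single day."""
    return datetime(2020, 11, 13, hour, 0, tzinfo=timezone.utc)


repo = SourceRepository()
repo.jobs["projectA"] = {"load_customers": at(8), "load_orders": at(9)}

first = run_sync(repo, "projectA", at(10))   # no marker yet: everything is sent
second = run_sync(repo, "projectA", at(11))  # nothing changed since the marker

del repo.markers["projectA"]                 # someone deletes the marker
third = run_sync(repo, "projectA", at(12))   # the proxy starts over from scratch
```

The last two lines mirror the risk raised in the issue: if the marker is deleted, the proxy has no memory of its own to fall back on and resends everything.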

I'm also not sure it makes sense to put this state management on the "target" side, as the proxy's responsibility is for reading from the source and broadcasting outwards, not how that broadcast is ultimately distributed to one (or potentially many) targets. With many potential targets, such an approach would quickly enter the territory of distributed systems challenges like quorums, split brain handling, etc.
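
A small sketch of why target-side state gets awkward, assuming each target records the time it last received an update. The function and target names are hypothetical, and the sketch deliberately ignores quorum and split-brain handling.

```python
from datetime import datetime, timezone


def safe_resume_point(target_markers):
    """Return the time to resend from, or None when a full resend is needed.

    target_markers maps a target name to the last time that target recorded,
    or to None when the target is unreachable or has no record.
    """
    if not target_markers or any(m is None for m in target_markers.values()):
        return None
    # The slowest target dictates how far back the proxy must go.
    return min(target_markers.values())


def at(hour):
    """Helper producing illustrative UTC timestamps on a single day."""
    return datetime(2020, 11, 13, hour, 0, tzinfo=timezone.utc)


all_reachable = safe_resume_point({"targetA": at(10), "targetB": at(7)})
one_unreachable = safe_resume_point({"targetA": at(10), "targetB": None})
```

Even in this simplified form, a single lagging or unreachable target changes what the proxy has to resend to everyone.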

I would suggest that both approaches would therefore add significant complexity beyond the current implementation.

So perhaps it is better to first consider general options on approaches:

  • Is there any alternative you would suggest for persisting the state in the proxied repository itself (e.g. not using governance rules, but some other type or storage mechanism of IGC)?
  • Is there some alternative state management mechanism outside the proxied repository that you would suggest, which would not carry additional operational constraints and complexities (persistent storage, components that must be online at exactly the same times as the proxy, etc.)?
