Push-model data - How does OPAL handle when an OPA container is restarted or gets out-of-sync with source system #385

phil-lee-kb · 2023-02-13T23:15:37Z

phil-lee-kb
Feb 13, 2023

Hi there,

I'm just looking into OPAL as an option for the push-model scenario, where we have a relatively large (5Gb) source database that gets updated in real-time, and we're looking at whether to use the pull-model in OPA or use OPAL to handle the push-model. We would likely use Kafka to mediate between the source db and our cloud services such as OPAL, meaning there would be a Kafka topic containing all events from the db to read from.

Something I'm finding it hard to understand is how OPA instances handle startup and restart scenarios under the push-model, and also where there might be, say, network issues which result in the OPA instance not having the latest data. This also applies if we have OPAL in the picture. If we consider the scenarios one by one:

Startup - am I right in thinking that the OPA instance would need access to an up-to-date static document containing the full source dataset? If so, does that get stored in OPAL itself, or elsewhere? We're looking at Kafka for storing all change events for the db, so can OPAL read from that and rehydrate the OPA instance from that, or does it need a static version of the data somewhere?
Restart - what happens if an OPA container/pod is restarted? Does OPAL re-register it and then perform the same data initialization process?
Data out-of-sync - let's say the socket from OPAL > OPA instance was dropped for a period, or something else happened which meant one of the OPA instances was out-of-sync. How does OPAL handle this and make sure that the OPA client is kept up-to-date when connectivity is restored?

Happy to be pointed at the relevant sections of the documentation if these are already answered :-)

Thanks,
Phil

orweis · 2023-02-14T00:07:29Z

orweis
Feb 14, 2023
Maintainer

Hi @phil-lee-kb - there is quite a bit to unpack here; I'll try my best to cover all the key points.

we have relatively large (5Gb) source database that gets updated in real-time

In the best of cases OPA can handle up to 5GB of data; 2GB is already a struggle in most cases.
Note - you almost never need all of your application's data for authorization, only a very small subset of it (usually just IDs, the relationships between them, and a few select attributes for ABAC policies).
If you do have a lot of data, you can address it by sharding (via topics with OPAL) between multiple OPA instances.
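A minimal sketch of what sharding via topics could look like, assuming data is partitioned per tenant. The topic naming scheme, `NUM_SHARDS`, and both helper functions are hypothetical and not part of OPAL. The only OPAL-specific assumption is that each OPAL client subscribes to a comma-separated topic list through `OPAL_DATA_TOPICS`.

```python
import hashlib
from typing import List

NUM_SHARDS = 4  # hypothetical: one OPA instance (or group) per shard


def topic_for_tenant(tenant_id: str, num_shards: int = NUM_SHARDS) -> str:
    """Map a tenant to a stable topic name using a deterministic hash."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return f"tenants_shard_{shard}"


def data_topics_env(shards: List[int]) -> str:
    """Build the comma-separated value a client would subscribe with
    (assumed to be passed as OPAL_DATA_TOPICS)."""
    return ",".join(f"tenants_shard_{s}" for s in shards)


# Updates for a tenant are published on its shard topic; only the clients
# subscribed to that topic fetch and load the data.
print(topic_for_tenant("acme"))
print(data_topics_env([0, 1]))
```

The hash keeps a tenant on the same shard across restarts, so each OPA instance only holds the slice of data it is subscribed to.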

to use the pull-model in OPA, or use OPAL to handle the push-model

To be frank, I think the pull model in OPA is very bad - I wouldn't touch it unless you really have to.
More details here: https://www.permit.io/blog/load-external-data-into-opa

meaning there would be a Kafka topic containing all events from the db to read from

Cool, OPAL can listen in directly to Kafka: https://docs.opal.ac/tutorials/run_opal_with_kafka#triggering-events-directly-from-kafka

how OPA instances handle startup and restart scenarios under the push-model, and also where there might be, say, network issues which result in the OPA instance not having the latest data. This also applies if we have OPAL in the picture. If we consider the scenarios one by one

OPA instances don't handle any of this - that's where OPAL comes in.
OPAL clients retry with (configurable) smart backoffs on all network components: the OPAL server, OPA itself, data sources, policy sources, etc.
When data updates are sent, the OPAL server sends only the event and instructions on how to get the data, rather than the data itself, through its lightweight pub/sub channel (Read more here). This means clients learn of changes very quickly and can react to them even before getting the actual data (via a special policy, transactions, callbacks, and health-check routes).
You can use the health-check policy as part of your own policy.
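To make the "event plus instructions" idea concrete, here is a sketch of what such an update notification might contain. The field names (`entries`, `url`, `topics`, `dst_path`, `reason`) follow my recollection of the OPAL data-update examples and should be checked against the current docs. The URL and paths are made up, and `build_data_update` is a hypothetical helper.

```python
import json
from typing import Dict, List


def build_data_update(url: str, dst_path: str, topics: List[str], reason: str) -> Dict:
    """Describe WHERE to fetch changed data and where to store it in OPA.

    Note that the payload carries no data, only fetch instructions.
    """
    return {
        "entries": [
            {
                "url": url,            # where the OPAL client should fetch from
                "topics": list(topics),  # which subscribed clients should react
                "dst_path": dst_path,  # where the result lands in OPA's data tree
            }
        ],
        "reason": reason,
    }


update = build_data_update(
    url="https://api.example.internal/users/42/roles",  # hypothetical source
    dst_path="/users/42/roles",
    topics=["policy_data"],
    reason="user 42 role change",
)
print(json.dumps(update, indent=2))
```

Because the message is this small, it can fan out to many clients cheaply, and each client then fetches the actual data directly from the source.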

Startup - am I right in thinking that the OPA instance would need access to an up-to-date static document containing the full source dataset? If so, does that get stored in OPAL itself, or elsewhere? We're looking at Kafka for storing all change events for the db, so can OPAL read from that and rehydrate the OPA instance from that, or does it need a static version of the data somewhere?

As I wrote, OPA just needs a subset of your data: the part relevant to the policy (you can transform the data as needed using custom data fetchers).
The information is stored within OPA alone (the OPAL client loads it into OPA for you).
You can bring the data from wherever you want as long as it's accessible to the OPAL clients; you might need to write data fetchers, though often the built-in HTTP fetcher will cover your needs.
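As an illustration of reducing application data to an authorization subset, here is a sketch of the kind of transformation a custom data fetcher might apply before the data reaches OPA. The record shape and `to_authz_subset` are hypothetical, and this is plain Python rather than the OPAL fetcher API.

```python
from typing import Dict, Iterable


def to_authz_subset(users: Iterable[Dict]) -> Dict[str, Dict]:
    """Keep only IDs, relationships and the few attributes policies need."""
    return {
        str(user["id"]): {
            "org_id": user.get("org_id"),             # relationship
            "roles": sorted(user.get("roles", [])),   # RBAC input
            "region": user.get("region"),             # ABAC attribute
        }
        for user in users
    }


# Hypothetical application records with fields a policy never needs.
records = [
    {"id": 1, "email": "a@example.com", "bio": "...", "org_id": "o1",
     "roles": ["viewer", "admin"], "region": "eu"},
    {"id": 2, "email": "b@example.com", "bio": "...", "org_id": "o2",
     "roles": [], "region": "us"},
]
print(to_authz_subset(records))
```

Dropping everything the policy does not read is usually the first step in keeping OPA's in-memory document small.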

Restart - what happens if an OPA container/pod is restarted? Does OPAL re-register it and then perform the same data initialization process?

Yes, basically - you can manage these pods / clusters via the health checks OPAL exposes (just don't forget to turn them on).
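A sketch of how a readiness probe could interpret such a health check, assuming the health-check policy is enabled and exposes a boolean at `data.system.opal.ready` that is queried through OPA's Data API. That path is my understanding of the OPAL docs and should be verified; `is_ready` is a hypothetical helper that only parses the response body.

```python
import json

# Assumed path of the OPAL health-check document in OPA (verify in the docs).
READY_PATH = "/v1/data/system/opal/ready"


def is_ready(response_body: str) -> bool:
    """Return True only if OPA answered {"result": true}.

    OPA omits "result" when the document is undefined, which is the case
    for a freshly restarted instance that has not been loaded yet.
    """
    try:
        doc = json.loads(response_body)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and doc.get("result") is True


print(is_ready('{"result": true}'))
print(is_ready("{}"))
```

Wiring a check like this into the pod's readiness probe keeps traffic away from an OPA instance until its policy and data have been reloaded.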

Data out-of-sync - let's say the socket from OPAL > OPA instance was dropped for a period, or something else happened which meant one of the OPA instances was out-of-sync. How does OPAL handle this and make sure that the OPA client is kept up-to-date when connectivity is restored?

It assumes that state is lost and loads everything afresh according to OPAL_DATA_CONFIG_SOURCES (which, by the way, can point to another server to dynamically control the sources for each client and each load).
In the future OPAL might support differential updates.
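For orientation, a sketch of the two shapes `OPAL_DATA_CONFIG_SOURCES` can take as I understand them: a static list of entries, or a pointer to an external server that returns the entries per client. The key names (`config`, `entries`, `external_source_url`) are recalled from the OPAL docs and should be double-checked; all URLs are invented.

```python
import json
from typing import Dict

# Static variant: every (re)connecting client fetches these entries in full.
static_sources = {
    "config": {
        "entries": [
            {
                "url": "https://api.example.internal/authz/users",  # hypothetical
                "topics": ["policy_data"],
                "dst_path": "/users",
            }
        ]
    }
}

# Dynamic variant: another server decides the entries for each client and load.
dynamic_sources = {
    "external_source_url": "https://config.example.internal/opal/sources"  # hypothetical
}


def as_env_value(sources: Dict) -> str:
    """Serialize the config compactly for use as an environment variable."""
    return json.dumps(sources, separators=(",", ":"))


print("OPAL_DATA_CONFIG_SOURCES=" + as_env_value(static_sources))
print("OPAL_DATA_CONFIG_SOURCES=" + as_env_value(dynamic_sources))
```

Since a reconnecting client reloads everything these entries describe, together they need to yield the complete, current dataset.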

OPAL has a lot of interesting little elements in its design, which address most concerns people have when starting to think about the problem space; I suggest going through the tutorials one by one and gradually learning about OPAL.
Also consider watching this video: https://youtu.be/A5adHlkmdC0

Hope this helps,
Or

1 reply

phil-lee-kb Feb 14, 2023
Author

Thanks for such a quick and thorough reply!

In the best of cases OPA can handle up to 5GB of data; 2GB is already a struggle in most cases.

Yes, I found the "small/medium/large" datasets on https://www.openpolicyagent.org/docs/latest/external-data/ a bit hard to quantify. In some systems, 5Gb is regarded as tiny. But given your experience, would you say that 2Gb qualifies as a "large" dataset in the context of OPA?

To be frank, I think the pull model in OPA is very bad - I wouldn't touch it unless you really have to.

I'm interested in your opinion here, as this is a fascinating topic of push vs pull - probably one of the biggest decisions we have to make for our first OPA use-case. Do you reckon it would be better for us to start with a push-model approach and only fall back to pull-model if we can't make that work?
When looking into existing uses of OPA, I noted that some orgs such as Chime have approached it from the "optimise the pull-model" angle as per https://www.styra.com/resources/videos/https-youtu-be-qhvh7ilygqk/. They built their own ~~sidecar app~~ OPA module which handles the actual data-fetching, which actually looks similar conceptually to what OPAL does - except it does it during policy evaluation rather than out-of-band.

It assumes that state is lost and loads everything afresh according to OPAL_DATA_CONFIG_SOURCES (which, by the way, can point to another server to dynamically control the sources for each client and each load).
In the future OPAL might support differential updates.

So in that instance, it could load all the data by replaying the topic rather than needing to have a separate readonly copy of the full dataset?

Thanks,
Phil

orweis · 2023-02-14T01:19:09Z

orweis
Feb 14, 2023
Maintainer

But given your experience, would you say that 2Gb qualifies as a "large" dataset in the context of OPA?

I'd say 2GB (gigabytes, to be clear, not gigabits) is quite a lot for OPA on an average setup (note that data structures in OPA tend to inflate in memory), but it all depends on the underlying machine you're running on.

Do you reckon it would be better for us to start with a push-model approach and only fallback to pull-model if we can't make that work?

The pull model is very risky: you couple the stability, latency, performance, and availability of your services to a data source (e.g. an application-data SQL database) in one of your most critical chains. This can make the entire behavior of your application unpredictable. It becomes extremely apparent when you take into account that most of these data sources weren't built for critical performance, just for run-of-the-mill application data queries. Even something simple like changing the schema or indexing on such a source can throw the entire thing out of balance.
I'd rather not speak to specific implementations, as you can definitely get this to work if you engineer for it (i.e. engineer the data source to be critical-chain compatible), but that would be more work in most cases. And as you hinted, you end up recreating the replicator pattern, which is what OPAL does for you.

So in that instance, it could load all the data by replaying the topic rather than needing to have a separate readonly copy of the full dataset?

In the future, yes. For now, you don't need to have all the data in one static source, but you do need to be able to point the client at multiple sources that, fetched in aggregate, yield the up-to-date picture.
Pro tip: this data source can be another OPA instance that is still up to date.
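One detail to keep in mind when using a peer OPA as the source: OPA's Data API (`GET /v1/data/<path>`) wraps the document as `{"result": ...}` and omits `result` when the document is undefined. A sketch of unwrapping that response before storing it follows. `unwrap_opa_result` is a hypothetical helper, and whether the unwrapping happens in a custom fetcher or elsewhere is left open here.

```python
from typing import Any, Dict


def unwrap_opa_result(response: Dict[str, Any]) -> Any:
    """Extract the stored document from an OPA Data API response."""
    if "result" not in response:
        # An undefined document means the peer has no data at that path,
        # so it should not be treated as an up-to-date source.
        raise KeyError("document is undefined in the source OPA")
    return response["result"]


# Hypothetical body from GET http://peer-opa:8181/v1/data/users
peer_response = {"result": {"42": {"roles": ["admin"]}}}
print(unwrap_opa_result(peer_response))
```

Failing loudly on a missing `result` avoids seeding a restarted instance from a peer that is itself empty.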

5 replies

phil-lee-kb Feb 14, 2023
Author

I'd rather not speak to specific implementations, as you can definitely get this to work if you engineer for it (i.e. engineer the data source to be critical-chain compatible), but that would be more work in most cases. And as you hinted, you end up recreating the replicator pattern, which is what OPAL does for you.

Correct - so far we'd been looking only at pull-model and have a future POC to look at replicating a subset of our on-prem data to a high-performance cloud store close to our OPAs to make this more performant/resilient. Basically doing as you said with replicator... Fairly easy for one data store, but maybe not when we have many? However it sounds like you've already been through this and OPAL is the result. Also I liked the Netflix video where they take a similar approach as you mention :-)

Thanks again - we'll give OPAL a go!

phil-lee-kb Feb 14, 2023
Author

One more question around that pro tip.... ;-) I assume that means the OPAL data fetcher calling the OPA /data endpoints? We'd considered disabling the /data endpoints (at least to any requests from outside the pod) given that the data inside OPA is sensitive and we don't want it leaking out. However, is it possible to secure the OPA /data endpoints so that, for instance, only OPAL could call them in order to retrieve a copy of the current data? Not really an OPAL-specific question but relevant since it would form part of our solution if using OPAL

orweis Feb 14, 2023
Maintainer

This option of using OPA itself as a data source is of course optional.

You can protect OPA endpoints through OPA's own auth features (https://www.openpolicyagent.org/docs/latest/security/#authentication-and-authorization), or through a reverse proxy placed in front of it.
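A sketch of the caller's side when OPA runs with token authentication (`--authentication=token` together with an authorization policy in `system.authz`): the caller presents a bearer token, which OPA hands to the policy as `input.identity`. The host name, token value, and `build_opa_data_request` are hypothetical, and the request is only constructed here, not sent.

```python
import urllib.request


def build_opa_data_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Prepare an authenticated GET against OPA's Data API."""
    url = f"{base_url.rstrip('/')}/v1/data/{path.strip('/')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


req = build_opa_data_request("http://peer-opa:8181", "/users", "s3cr3t")
print(req.full_url)
print(req.get_header("Authorization"))
```

The `system.authz` policy on the OPA side would then allow `/v1/data` reads only for the identity used by OPAL and deny everything else.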

phil-lee-kb Feb 14, 2023
Author

This option of using OPA itself as a data source is of course optional

Optional but awfully convenient as an alternative to setting up another readonly data source :-)

orweis Feb 14, 2023
Maintainer

True, hence the "pro-tip" 😉

phil-lee-kb · 2023-02-14T21:43:37Z

phil-lee-kb
Feb 14, 2023
Author

One more question - whilst I can see the benefits of having OPAL pull directly from a git repo for policy, in our organisation it's likely we would require a separate release process for policies. I.e. policy changes get made and merged into main, but we want to have them released via a separate pipeline. We've done a POC of that with vanilla OPA using the bundle API, where we deploy policies via a separate pipeline and then OPA picks them up in due course. I was wondering if OPAL supports this, i.e. disable OPAL's auto-retrieval of policies from a git repo, and instead leave OPA to retrieve policy from a separately updated bundle API (either something like nginx, or directly from S3 or another http-accessible file store).

1 reply

orweis Feb 15, 2023
Maintainer

Yes. 😊
https://docs.opal.ac/tutorials/track_an_api_bundle_server
