Performance and Load Tips #68

Open
ghost opened this issue Jun 21, 2016 · 3 comments

Comments

@ghost

ghost commented Jun 21, 2016

Do you have any tips or ideas concerning the various possible data consumption patterns enabled by RWS?

For example, I am faced with 10 distinct eCRFs (forms) that contain data relevant to me. They are similar, structurally, but not identical. To process the data I need, I could:

  • Loop through the subjects and use a SubjectDatasetRequest for each subject, reading out only the XML nodes that are of interest
  • Loop through the subjects and, for each subject, loop through form-constrained SubjectDatasetRequests
  • Loop through the forms, using FormDataRequest for each form, consuming the data in either XML or CSV

I think any of those patterns could be made to work for me, though maybe some are more elegant than others, but I'd like to have minimal impact on the environment. I'm given to understand the RWS service uses the same computing resources as the regular browser UI, and I cannot take away any responsiveness from the end users. I'm unaware of any benchmarks showing how RWS use affects the users, though. Still, you can imagine that if we have 100 subjects, each of which has, possibly, 10 forms' worth of data, you end up with quite a few service calls in a fairly short time.
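To put rough numbers on the three patterns, here is a back-of-the-envelope count of service calls per full pass. It assumes the 100 subjects and 10 forms from the example above and exactly one request per unit of work (no paging, no retries). The function name is made up for illustration and is not part of rwslib.

```python
def count_service_calls(n_subjects: int, n_forms: int) -> dict:
    """Count requests per full pass for each pattern.

    Assumes one call per subject, per subject/form pair, or per form,
    with no paging and no retries.
    """
    return {
        "per_subject": n_subjects,                      # one SubjectDatasetRequest each
        "per_subject_per_form": n_subjects * n_forms,   # form-constrained subject requests
        "per_form": n_forms,                            # one FormDataRequest each
    }


calls = count_service_calls(n_subjects=100, n_forms=10)
for pattern, n in calls.items():
    print(f"{pattern}: {n} calls")
```

The call count says nothing about how expensive each call is on the server side, so it is only one input to the decision.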

My guess is that using the FormDataRequest is going to be most efficient, probably with the XML output, given that this pattern appears to use the fewest service calls. There's a lot I don't know about Medidata internals, caching and optimization, so I could easily be wrong (e.g. if the database indexing does not support form-oriented lookups as well as subject-oriented lookups, the form lookups could end up being much more expensive).

Any advice? I can't necessarily just delay until the dead of night... OR, is this all unnecessary fretting? Should I stop worrying as long as I keep the service calls down to a dull roar?

@isparks
Collaborator

isparks commented Jun 22, 2016

Hi John. A bit of background first. Rave is primarily a transactional system with what I would describe as a follow-along reporting subsystem called Clinical Views. The Clinical View system scans for changes in the transactional tables and pulls them into a de-normalized reporting table. The lag between data entry and availability for reporting is generally low, but if there is a lot going on you can get longer delays, up to hours. The services you are talking about all read from these clinical views. A header included with any response tells you the last time the views were updated, which can help with scheduling further calls. For example, if you see that a clinical view has not been updated in 4 hours, you might choose to repeat a request at intervals until you get more up-to-date data.
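The staleness check described here could be sketched roughly as follows. This assumes you have already parsed the last-updated value from the response header into a timezone-aware `datetime`. The header name is not given in this thread, so it is left out. The function name, the 4-hour threshold default, and the 15-minute retry interval are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def seconds_until_retry(
    last_updated: datetime,
    now: datetime,
    max_staleness: timedelta = timedelta(hours=4),    # assumed threshold
    retry_interval: timedelta = timedelta(minutes=15),  # assumed polling interval
) -> Optional[float]:
    """Return None if the view is fresh enough, else seconds to wait before repeating."""
    if now - last_updated <= max_staleness:
        return None
    return retry_interval.total_seconds()


now = datetime(2016, 6, 22, 12, 0, tzinfo=timezone.utc)
fresh = seconds_until_retry(now - timedelta(minutes=30), now)
stale = seconds_until_retry(now - timedelta(hours=5), now)
print(fresh, stale)
```

A caller would sleep for the returned number of seconds and repeat the request, stopping once the function returns `None`.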

Since all the approaches you have suggested read from these clinical views, the load on Rave should not be significant: you are making requests from what are essentially materialized views in the database, pre-computed extract tables. These views are organized by form, so a subject-level or study-level request has to do a lot more joins. That said, for a very large study, requesting all data for a particular form in a single hit could get you a timeout if it takes more than 1 hour to stream the data. If you think the studies won't become so large that you'll hit this limit, then by-form may be the most efficient; otherwise by-subject would likely be safer. You can also request by subject/form combination, but that may leave you with a lot of requests.

There are ways to get just incremental changes from a last date/time, but the client has to set up their clinical view configuration in a certain way and that is not guaranteed, so I would not rely on it.

Lastly, you could combine the ClinicalAuditRecord dataset with these requests. You'd poll the ClinicalAuditRecords service for data changes (see the AuditEvent sub-project in the extras directory of rwslib) to detect changes in subjects/forms. Then, on your polling interval, you could request only those forms/subjects that you know have had changes, which could reduce the total number of requests. Bear in mind my caveat about the update timestamps on the clinical views: the audits will be written before the clinical views reflect those changes.
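As a rough sketch of the change-detection step, the snippet below collapses a batch of audit events into the distinct form/subject pairs that need re-requesting. The event shape (plain dicts with `subject` and `form_oid` keys) and the form OIDs are hypothetical placeholders, not what the rwslib AuditEvent extra actually emits. You would adapt the keys to whatever your audit reader produces.

```python
from collections import defaultdict


def changed_subjects_by_form(audit_events):
    """Group audit events into {form_oid: sorted list of subjects with changes}.

    Each event is assumed to be a dict with 'subject' and 'form_oid' keys
    (a hypothetical shape chosen for this example).
    """
    changed = defaultdict(set)
    for event in audit_events:
        changed[event["form_oid"]].add(event["subject"])
    return {form: sorted(subjects) for form, subjects in changed.items()}


events = [
    {"subject": "001", "form_oid": "VITALS"},
    {"subject": "002", "form_oid": "VITALS"},
    {"subject": "001", "form_oid": "VITALS"},  # repeat edit, same request covers it
    {"subject": "003", "form_oid": "AE"},
]
todo = changed_subjects_by_form(events)
n_requests = sum(len(subjects) for subjects in todo.values())
print(todo, n_requests)
```

Because audits are written before the clinical views catch up, a request made for a changed pair may still return the old data until the views have been refreshed.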

But my advice overall, from knowledge of how the system works rather than really extensive experience of using these various services, is to go by form if you know data volumes are not in the hundreds of thousands of records, or by subject otherwise.

I know the above doesn't give you a definitive answer, but hopefully it helps you understand the risks and tradeoffs.

@vagarwal77

I am looking for sample filled-in eCRFs (forms) for reference and better understanding. Could you suggest where I can get them?

@iansparks
Collaborator

@vagarwal77 A Google image search for "medidata rave eCRF" will provide a lot of example screenshots. I don't think we can help you further than that.


3 participants