Vitro Data Ingest Options
Note: these options are not listed in any order of preference.
The Vitro SparqlUpdateApi code is accessible via a Tomcat servlet that takes a request with an update param. The request string is literally the update=INSERT DATA {} wrapper syntax for the inserted triples. The SparqlUpdateApiController class takes the value of update as a string and performs the update using the Jena SPARQL API.
Sample code:
vitro/api/src/main/java/edu/cornell/mannlib/vitro/webapp/controller/api/SparqlUpdateApiController.java
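To make the request shape concrete, here is a minimal sketch of building the form-encoded POST body that carries the update param. Only the update=INSERT DATA {} wrapper comes from the description above; the email and password parameters, the optional GRAPH clause, and all example values are assumptions that should be checked against your Vitro installation. The sketch builds the body only and does not send anything.

```python
from urllib.parse import urlencode, parse_qs


def build_update_body(triples, email, password, graph_uri=None):
    """Return a form-encoded body whose update param wraps the triples."""
    data = " ".join(triples)
    if graph_uri is not None:
        # Assumption: the target graph is named with a GRAPH clause.
        data = "GRAPH <%s> { %s }" % (graph_uri, data)
    # The INSERT DATA { ... } wrapper described above.
    update = "INSERT DATA { %s }" % data
    # Assumption: credentials travel as email/password form fields.
    return urlencode({"email": email, "password": password, "update": update})


body = build_update_body(
    ['<http://example.org/n1> <http://www.w3.org/2000/01/rdf-schema#label> "Example" .'],
    email="admin@example.org",             # hypothetical account
    password="secret",                     # hypothetical password
    graph_uri="http://example.org/graph",  # hypothetical graph URI
)
print(body)
```

The resulting string would be sent as the body of an application/x-www-form-urlencoded POST to the servlet.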
This would involve modifying or extending the Vitro code that controls how data gets loaded into Vitro via the webapp, or using them as guides. The data is loaded from dataset files.
https://jena.apache.org/documentation/tdb/java_api.html https://jena.apache.org/documentation/tdb/tdb_transactions.html
Sample code:
https://jena.apache.org/documentation/query/update.html https://jena.apache.org/documentation/query/cmds.html
Sample code:
Sample code:
Anything put in the abox/firsttime or tbox/firsttime directories will get loaded into the store the first time (hence the name), i.e. when those models are empty or on startup. The obvious drawback here is that it would require an application restart every time we want the data loaded.
Use Jena's command line tool (tdbloader) to load files directly and efficiently into a TDB-backed model. tdbloader2 can only be used to create a new database, not to add data to an existing one.
https://jena.apache.org/documentation/tdb/commands.html
The directory of the TDB database is one of these(?)
/usr/local/vivo/home> ls -ld tdb*
drwxr-x--- 31 jgreben admin 992 May 8 15:51 tdbContentModels
drwxr-x--- 31 jgreben admin 992 May 7 22:28 tdbModels
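As a rough sketch of what a tdbloader invocation might look like when driven from a script, the helper below only assembles the command line. The --loc option is tdbloader's way of naming the database directory; the choice of tdbContentModels as the target and the data.nt file name are assumptions for illustration, since it is not yet confirmed which directory is the right one. TDB should not be written by two processes at once, so this would presumably be run with Tomcat stopped.

```python
import shlex


def tdbloader_command(tdb_dir, rdf_files, loader="tdbloader"):
    """Build the argument list for a Jena TDB bulk load."""
    if not rdf_files:
        raise ValueError("at least one RDF file is required")
    # --loc names the TDB database directory to load into.
    return [loader, "--loc=%s" % tdb_dir, *rdf_files]


cmd = tdbloader_command(
    "/usr/local/vivo/home/tdbContentModels",  # assumed target directory
    ["data.nt"],                              # hypothetical input file
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```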
Fuseki is a SPARQL server. It provides REST-style SPARQL HTTP Update, SPARQL Query, and SPARQL Update using the SPARQL protocol over HTTP. This would be similar to loading data via the Vitro sparql api without using Vitro...
https://jena.apache.org/documentation/serving_data/
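For comparison with the Vitro servlet, a SPARQL Update sent to Fuseki under the SPARQL protocol can carry the update string directly as the request body with the application/sparql-update content type. The sketch below only constructs the request; the host, port, dataset name "ds", and the "update" service path are assumptions that depend on how the Fuseki server is configured.

```python
from urllib.request import Request


def fuseki_update_request(base_url, dataset, update):
    """Build (but do not send) a SPARQL Update POST for a Fuseki dataset."""
    # Assumption: the dataset exposes its update service at /<dataset>/update.
    url = "%s/%s/update" % (base_url.rstrip("/"), dataset)
    return Request(
        url,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


req = fuseki_update_request(
    "http://localhost:3030",  # hypothetical Fuseki address
    "ds",                     # hypothetical dataset name
    'INSERT DATA { <http://example.org/n1> <http://example.org/p> "v" . }',
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```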
- Globally stop or pause the indexer when bulk loading. The following appears in /usr/share/tomcat/logs/vitro.all.log as records are being loaded one at a time:
...
2018-05-15 16:37:21,950 INFO [IndexHistory] PAUSE, 5/15/18 4:24 PM, []
2018-05-15 16:37:21,964 INFO [IndexHistory] UNPAUSE, 5/15/18 4:24 PM, []
2018-05-15 16:37:22,053 INFO [IndexHistory] PAUSE, 5/15/18 4:24 PM, []
2018-05-15 16:37:22,067 INFO [IndexHistory] UNPAUSE, 5/15/18 4:24 PM, []
...
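One way to gauge how much pause/unpause churn a load produces is to count the IndexHistory events in the log. This is a small sketch that assumes every event line follows the format of the excerpt above; the sample lines are copied from that excerpt, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines like:
# 2018-05-15 16:37:21,950 INFO [IndexHistory] PAUSE, 5/15/18 4:24 PM, []
EVENT = re.compile(r"^\S+ \S+ +INFO +\[IndexHistory\] (PAUSE|UNPAUSE),")


def count_index_events(lines):
    """Count PAUSE and UNPAUSE IndexHistory events in log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2018-05-15 16:37:21,950 INFO [IndexHistory] PAUSE, 5/15/18 4:24 PM, []",
    "2018-05-15 16:37:21,964 INFO [IndexHistory] UNPAUSE, 5/15/18 4:24 PM, []",
    "2018-05-15 16:37:22,053 INFO [IndexHistory] PAUSE, 5/15/18 4:24 PM, []",
    "2018-05-15 16:37:22,067 INFO [IndexHistory] UNPAUSE, 5/15/18 4:24 PM, []",
]
counts = count_index_events(sample)
print(dict(counts))
```

To run it against the real log, pass an open file handle for vitro.all.log instead of the sample list.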
From Huda at Cornell:
We should also keep in mind that at Cornell, when they do an update they do have to stop things for some hours, and there's also search index rebuilding time to take into account.
Re: accessing Jena directly:
The Vitro java code ^^ is accessible via a tomcat servlet that takes a request with an update param. The request string is literally the update=INSERT DATA {} wrapper syntax for the inserted triples. The SparqlUpdateApiController.java class takes the value of update as a string and sends that to a Jena library class to perform the update.
Basically I think that hacking the java code to load data will only buy us the avoidance of passing a potentially large string over http to the tomcat servlet and its associated class. The size of the string is not really a major issue because tomcat can be configured to handle large post requests with the maxPostSize connector attribute and additional tomcat tuning.
Otherwise, we could probably write a java wrapper class to call the SparqlUpdateApiController with the same update string as an argument, but I'm not so sure that seeking to avoid the servlet layer will give us much of a serious performance boost. It would probably be better to just tune tomcat to be performant in case we run into issues.
From Jim Blake at Cornell:
Configuring Vitro to use Jena TDB is easy (applicationSetup.n3)
There are good reasons to use Jena TDB instead of Jena SDB. TDB might well be the best choice. Don’t know.
For standard VIVO installation, Jena SDB is still the default.
I don’t know of anyone who is using Vitro or VIVO in production with a large triple-store based on Jena TDB – the community will not be able to provide input with regard to performance or reliability.
If you use methods 1., 2., or 5., you are working with Vitro. This means that your changes will be noted by listeners that do (a) simple inferencing, and (b) search index update.
If you use methods 3., 4., 6., or 7., you are working around Vitro. This means that you must ensure correct inferencing and index updating on your own. Some VIVO sites operate this way. Most do not.
For high availability, one can imagine building an updated triple-store and an updated search index, and then swapping them into a live Vitro instance. This is easy to do with the search index, since every query is a stateless HTTP connection. For Jena TDB, you would need to add code that would close all open connections, flushing the memory buffers, and then open new connections. Vitro currently has no such code.
This should not be a problem. There are internal API calls that will pause and unpause the indexer.
Is the same true for the reasoner? (inferencing)