Skip to content

Commit

Permalink
avoid expensive Solr join for public dvObjects in search (experimenta…
Browse files Browse the repository at this point in the history
…l) (#10555)

* avoid expensive Solr join when guest users search (affect IP Groups) #10554

* fix copy/past error, target doc for file, not dataset #10554

* Checking a few experimental changes into the branch:
Jim's soft commit fixes from 10547;
A quick experiment, replacing join on public objects with a boolean publicObject_b:true
for logged-in users as well (with a join added for just for their own personal documents;
groups are ignored for now). #10554

* Step 3, of the performance improvement effort relying on a boolean "publicObject" flag for
published documents - now for logged-in users, AND with support for groups.
Group support experimental, but appears to be working. #10554

* Modified the implementation for the guest user, to support ip groups. #10554

* Removed the few autocommit-related changes previously borrowed from 10547, to keep things separate and clear, for testing etc. #10554

* Reorganized the optimized code in SearchServiceBean; combined the code block
for the guest and authenticated users. #10554

* updated the release note. #10554

* Removed the warning from the ip groups guide about the effect of the new
search optimization feture that was no longer true. #10554

* Updated the section of the guide describing the new Solr optimization
feature flags. #10554

* Updated the performance section of the guide. #10554

* Modified IndexServiceBean to use the new feature flag, that has been separated from the flag that
enables the search-side optimization;
Fixed the groups sub-query for the guest user. #10554

* cosmetic #10554

* doc tweaks #10554

* no-op code cleanup, correct case of publicObject_b #10554

---------

Co-authored-by: Leonid Andreev <leonid@hmdc.harvard.edu>
  • Loading branch information
pdurbin and landreev authored Jun 4, 2024
1 parent 23a4d9b commit 3c55c3f
Show file tree
Hide file tree
Showing 7 changed files with 181 additions and 47 deletions.
5 changes: 5 additions & 0 deletions doc/release-notes/10554-avoid-solr-join-guest.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Two experimental features flag called "add-publicobject-solr-field" and "avoid-expensive-solr-join" have been added to change how Solr documents are indexed for public objects and how Solr queries are constructed to accommodate access to restricted content (drafts, etc.). It is hoped that it will help with performance, especially on large instances and under load.

Before the search feature flag ("avoid-expensive...") can be turned on, the indexing flag must be enabled, and a full reindex performed. Otherwise publicly available objects are NOT going to be shown in search results.

For details see https://dataverse-guide--10555.org.readthedocs.build/en/10555/installation/config.html#feature-flags and #10555.
4 changes: 4 additions & 0 deletions doc/sphinx-guides/source/developers/performance.rst
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,10 @@ Solr

While in the past Solr performance hasn't been much of a concern, in recent years we've noticed performance problems when Harvard Dataverse is under load. Improvements were made in `PR #10050 <https://github.com/IQSS/dataverse/pull/10050>`_, for example.

We are tracking performance problems in `#10469 <https://github.com/IQSS/dataverse/issues/10469>`_.

In a meeting with a Solr expert on 2024-05-10 we were advised to avoid joins as much as possible. (It was acknowledged that many Solr users make use of joins because they have to, like we do, to keep some documents private.) Toward that end we have added two feature flags called ``avoid-expensive-solr-join`` and ``add-publicobject-solr-field`` as explained under :ref:`feature-flags`. It was confirmed experimentally that performing the join on all the public objects (published collections, datasets and files), i.e., the bulk of the content in the search index, was indeed very expensive, especially on a large instance the size of the IQSS prod. archive, especially under indexing load. We confirmed that it was in fact unnecessary and were able to replace it with a boolean field directly in the indexed documents, which is achieved by the two feature flags above. However, as of writing this, this mechanism should still be considered experimental.

Datasets with Large Numbers of Files or Versions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down
6 changes: 6 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3268,6 +3268,12 @@ please find all known feature flags below. Any of these flags can be activated u
* - api-session-auth
- Enables API authentication via session cookie (JSESSIONID). **Caution: Enabling this feature flag exposes the installation to CSRF risks!** We expect this feature flag to be temporary (only used by frontend developers, see `#9063 <https://github.com/IQSS/dataverse/issues/9063>`_) and for the feature to be removed in the future.
- ``Off``
* - avoid-expensive-solr-join
- Changes the way Solr queries are constructed for public content (published Collections, Datasets and Files). It removes a very expensive Solr join on all such documents, improving overall performance, especially for large instances under heavy load. Before this feature flag is enabled, the corresponding indexing feature (see next feature flag) must be turned on and a full reindex performed (otherwise public objects are not going to be shown in search results). See :doc:`/admin/solr-search-index`.
- ``Off``
* - add-publicobject-solr-field
- Adds an extra boolean field `PublicObject_b:true` for public content (published Collections, Datasets and Files). Once reindexed with these fields, we can rely on it to remove a very expensive Solr join on all such documents in Solr queries, significantly improving overall performance (by enabling the feature flag above, `avoid-expensive-solr-join`). These two flags are separate so that an instance can reindex their holdings before enabling the optimization in searches, thus avoiding having their public objects temporarily disappear from search results while the reindexing is in progress.
- ``Off``

**Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development </container/dev-usage>`, you can set them in the `docker compose <https://docs.docker.com/compose/environment-variables/set-environment-variables/>`_ file.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
import edu.harvard.iq.dataverse.datavariable.VariableMetadataUtil;
import edu.harvard.iq.dataverse.datavariable.VariableServiceBean;
import edu.harvard.iq.dataverse.harvest.client.HarvestingClient;
import edu.harvard.iq.dataverse.settings.FeatureFlags;
import edu.harvard.iq.dataverse.settings.JvmSettings;
import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
import edu.harvard.iq.dataverse.util.FileUtil;
Expand Down Expand Up @@ -214,6 +215,9 @@ public Future<String> indexDataverse(Dataverse dataverse, boolean processPaths)
solrInputDocument.addField(SearchFields.DATAVERSE_CATEGORY, dataverse.getIndexableCategoryName());
if (dataverse.isReleased()) {
solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
if (FeatureFlags.ADD_PUBLICOBJECT_SOLR_FIELD.enabled()) {
solrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
}
solrInputDocument.addField(SearchFields.RELEASE_OR_CREATE_DATE, dataverse.getPublicationDate());
} else {
solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, UNPUBLISHED_STRING);
Expand Down Expand Up @@ -878,6 +882,9 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long

if (state.equals(indexableDataset.getDatasetState().PUBLISHED)) {
solrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
if (FeatureFlags.ADD_PUBLICOBJECT_SOLR_FIELD.enabled()) {
solrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
}
// solrInputDocument.addField(SearchFields.RELEASE_OR_CREATE_DATE,
// dataset.getPublicationDate());
} else if (state.equals(indexableDataset.getDatasetState().WORKING_COPY)) {
Expand Down Expand Up @@ -1391,6 +1398,9 @@ public SolrInputDocuments toSolrDocs(IndexableDataset indexableDataset, Set<Long
if (indexableDataset.getDatasetState().equals(indexableDataset.getDatasetState().PUBLISHED)) {
fileSolrDocId = solrDocIdentifierFile + fileEntityId;
datafileSolrInputDocument.addField(SearchFields.PUBLICATION_STATUS, PUBLISHED_STRING);
if (FeatureFlags.ADD_PUBLICOBJECT_SOLR_FIELD.enabled()) {
datafileSolrInputDocument.addField(SearchFields.PUBLIC_OBJECT, true);
}
// datafileSolrInputDocument.addField(SearchFields.PERMS, publicGroupString);
addDatasetReleaseDateToSolrDoc(datafileSolrInputDocument, dataset);
// has this published file been deleted from the current draft version?
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -217,6 +217,15 @@ public class SearchFields {
public static final String DEFINITION_POINT_DVOBJECT_ID = "definitionPointDvObjectId";
public static final String DISCOVERABLE_BY = "discoverableBy";

/**
* publicObject_b is an experimental field tied to the
* avoid-expensive-solr-join feature flag. Rather than discoverableBy which
* is a field on permission documents, publicObject_b is a field on content
* documents (dvObjects). By indexing publicObject_b=true, we can let guests
* search on it, avoiding an expensive join for those (common) users.
*/
public static final String PUBLIC_OBJECT = "publicObject_b";

/**
* i.e. "Unpublished", "Draft" (multivalued)
*/
Expand Down
172 changes: 125 additions & 47 deletions src/main/java/edu/harvard/iq/dataverse/search/SearchServiceBean.java
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
import edu.harvard.iq.dataverse.authorization.users.PrivateUrlUser;
import edu.harvard.iq.dataverse.authorization.users.User;
import edu.harvard.iq.dataverse.engine.command.DataverseRequest;
import edu.harvard.iq.dataverse.settings.FeatureFlags;
import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
import java.io.IOException;
Expand Down Expand Up @@ -1001,14 +1002,132 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
user = GuestUser.get();
}

AuthenticatedUser au = null;
Set<Group> groups;

if (user instanceof GuestUser) {
// Yes, GuestUser may be part of one or more groups; such as IP Groups.
groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
} else {
if (!(user instanceof AuthenticatedUser)) {
logger.severe("Should never reach here. A User must be an AuthenticatedUser or a Guest");
throw new IllegalStateException("A User must be an AuthenticatedUser or a Guest");
}

au = (AuthenticatedUser) user;

// ----------------------------------------------------
// (3) Is this a Super User?
// If so, they can see everything
// ----------------------------------------------------
if (au.isSuperuser()) {
// Somewhat dangerous because this user (a superuser) will be able
// to see everything in Solr with no regard to permissions. But it's
// been this way since Dataverse 4.0. So relax. :)

return dangerZoneNoSolrJoin;
}

// ----------------------------------------------------
// (4) User is logged in AND onlyDatatRelatedToMe == true
// Yes, give back everything -> the settings will be in
// the filterqueries given to search
// ----------------------------------------------------
if (onlyDatatRelatedToMe == true) {
if (systemConfig.myDataDoesNotUsePermissionDocs()) {
logger.fine("old 4.2 behavior: MyData is not using Solr permission docs");
return dangerZoneNoSolrJoin;
} else {
// fall-through
logger.fine("new post-4.2 behavior: MyData is using Solr permission docs");
}
}

// ----------------------------------------------------
// (5) Work with Authenticated User who is not a Superuser
// ----------------------------------------------------

groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
}

if (FeatureFlags.AVOID_EXPENSIVE_SOLR_JOIN.enabled()) {
/**
* Instead of doing a super expensive join, we will rely on the
* new boolean field PublicObject:true for public objects. This field
* is indexed on the content document itself, rather than a permission
* document. An additional join will be added only for any extra,
* more restricted groups that the user may be part of.
* **Note the experimental nature of this optimization**.
*/
StringBuilder sb = new StringBuilder();
StringBuilder sbgroups = new StringBuilder();

// All users, guests and authenticated, should see all the
// documents marked as publicObject_b:true, at least:
sb.append(SearchFields.PUBLIC_OBJECT + ":" + true);

// One or more groups *may* also be available for this user. Once again,
// do note that Guest users may be part of some groups, such as
// IP groups.

int groupCounter = 0;

// An AuthenticatedUser should also be able to see all the content
// on which they have direct permissions:
if (au != null) {
groupCounter++;
sbgroups.append(IndexServiceBean.getGroupPerUserPrefix() + au.getId());
}

// In addition to the user referenced directly, we will also
// add joins on all the non-public groups that may exist for the
// user:
for (Group group : groups) {
String groupAlias = group.getAlias();
if (groupAlias != null && !groupAlias.isEmpty() && !groupAlias.startsWith("builtIn")) {
groupCounter++;
if (groupCounter > 1) {
sbgroups.append(" OR ");
}
sbgroups.append(IndexServiceBean.getGroupPrefix() + groupAlias);
}
}

if (groupCounter > 1) {
// If there is more than one group, the parentheses must be added:
sbgroups.insert(0, "(");
sbgroups.append(")");
}

if (groupCounter > 0) {
// If there are any groups for this user, an extra join must be
// added to the query, and the extra sub-query must be added to
// the combined Solr query:
sb.append(" OR {!join from=" + SearchFields.DEFINITION_POINT + " to=id v=$q1}");
// Add the subquery to the combined Solr query:
solrQuery.setParam("q1", SearchFields.DISCOVERABLE_BY + ":" + sbgroups.toString());
logger.info("The sub-query q1 set to " + SearchFields.DISCOVERABLE_BY + ":" + sbgroups.toString());
}

String ret = sb.toString();
logger.info("Returning experimental query: " + ret);
return ret;
}

// END OF EXPERIMENTAL OPTIMIZATION

// Old, un-optimized way of handling permissions.
// Largely left intact, minus the lookups that have already been performed
// above.

// ----------------------------------------------------
// (1) Is this a GuestUser?
// Yes, see if GuestUser is part of any groups such as IP Groups.
// ----------------------------------------------------
if (user instanceof GuestUser) {
String groupsFromProviders = "";
Set<Group> groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));

StringBuilder sb = new StringBuilder();

String groupsFromProviders = "";
for (Group group : groups) {
logger.fine("found group " + group.getIdentifier() + " with alias " + group.getAlias());
String groupAlias = group.getAlias();
Expand All @@ -1025,51 +1144,11 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
return guestWithGroups;
}

// ----------------------------------------------------
// (2) Retrieve Authenticated User
// ----------------------------------------------------
if (!(user instanceof AuthenticatedUser)) {
logger.severe("Should never reach here. A User must be an AuthenticatedUser or a Guest");
throw new IllegalStateException("A User must be an AuthenticatedUser or a Guest");
}

AuthenticatedUser au = (AuthenticatedUser) user;

// if (addFacets) {
// // Logged in user, has publication status facet
// //
// solrQuery.addFacetField(SearchFields.PUBLICATION_STATUS);
// }

// ----------------------------------------------------
// (3) Is this a Super User?
// Yes, give back everything
// ----------------------------------------------------
if (au.isSuperuser()) {
// Somewhat dangerous because this user (a superuser) will be able
// to see everything in Solr with no regard to permissions. But it's
// been this way since Dataverse 4.0. So relax. :)

return dangerZoneNoSolrJoin;
}

// ----------------------------------------------------
// (4) User is logged in AND onlyDatatRelatedToMe == true
// Yes, give back everything -> the settings will be in
// the filterqueries given to search
// ----------------------------------------------------
if (onlyDatatRelatedToMe == true) {
if (systemConfig.myDataDoesNotUsePermissionDocs()) {
logger.fine("old 4.2 behavior: MyData is not using Solr permission docs");
return dangerZoneNoSolrJoin;
} else {
logger.fine("new post-4.2 behavior: MyData is using Solr permission docs");
}
}

// ----------------------------------------------------
// (5) Work with Authenticated User who is not a Superuser
// ----------------------------------------------------
// ----------------------------------------------------
// It was already confirmed, that if the user is not GuestUser, we
// have an AuthenticatedUser au which is not null.
/**
* @todo all this code needs cleanup and clarification.
*/
Expand Down Expand Up @@ -1100,7 +1179,6 @@ private String getPermissionFilterQuery(DataverseRequest dataverseRequest, SolrQ
* a given "content document" (dataset version, etc) in Solr.
*/
String groupsFromProviders = "";
Set<Group> groups = groupService.collectAncestors(groupService.groupsFor(dataverseRequest));
StringBuilder sb = new StringBuilder();
for (Group group : groups) {
logger.fine("found group " + group.getIdentifier() + " with alias " + group.getAlias());
Expand Down
22 changes: 22 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,28 @@ public enum FeatureFlags {
* @since Dataverse @TODO:
*/
API_BEARER_AUTH("api-bearer-auth"),
/**
* For published (public) objects, don't use a join when searching Solr.
* Experimental! Requires a reindex with the following feature flag enabled,
* in order to add the boolean publicObject_b:true field to all the public
* Solr documents.
*
* @apiNote Raise flag by setting
* "dataverse.feature.avoid-expensive-solr-join"
* @since Dataverse 6.3
*/
AVOID_EXPENSIVE_SOLR_JOIN("avoid-expensive-solr-join"),
/**
* With this flag enabled, the boolean field publicObject_b:true will be
* added to all the indexed Solr documents for publicly-available collections,
* datasets and files. This flag makes it possible to rely on it in searches,
* instead of the very expensive join (the feature flag above).
*
* @apiNote Raise flag by setting
* "dataverse.feature.add-publicobject-solr-field"
* @since Dataverse 6.3
*/
ADD_PUBLICOBJECT_SOLR_FIELD("add-publicobject-solr-field"),
;

final String flag;
Expand Down

0 comments on commit 3c55c3f

Please sign in to comment.