Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findEntitiesByProperty returns incorrect results #462

Open
pmarius22 opened this issue Dec 29, 2020 · 3 comments
Open

findEntitiesByProperty returns incorrect results #462

pmarius22 opened this issue Dec 29, 2020 · 3 comments
Labels
bug Something isn't working

Comments

@pmarius22
Copy link
Contributor

pmarius22 commented Dec 29, 2020

While executing initial load, we noticed that not all the GlossaryTerms are retrieved from IGC to be processed. We are trying to retrieve 3000+ terms and the maxPageSize is set to 1000 results which means the results are retrieved in pages.
The search endpoint from IGC does not have the expected behaviour. The first search page is sorted by _id however when the following pages are retrieved using the next page url provided in the IGC response body, the results are no longer sorted and this causes terms to not be included in the response and others to be duplicated.
I have modified the connector code on my local environment to not use any sorting property for the first page and this results in retrieving all the GlossaryTerms.
The next page URL contains the following query parameter: sorts : SearchSort{property='_id', ascending=true} which seems to do nothing. I have manually tried the queries using instead sorts : [{"property": "_id","ascending": true}] and this sorts the results correctly.

IGC info:
Version ‎‪11.7.1SP2‬
Build ‎‪11.7.1SP2-dev_b164_cc72b37fee5a‬

If more info is needed, please let me know

@pmarius22 pmarius22 added the bug Something isn't working label Dec 29, 2020
@cmgrote
Copy link
Member

cmgrote commented Jan 25, 2021

I've not been able to reproduce this with the latest code of the connector (which I don't believe has changed for quite some time)... Can you please confirm:

  • Which version (build) of the connector you're using that gives this incorrect SearchSort serialization?
  • The debug-level output of the query that's constructed that contains this incorrect serialization (feel free to redact any environment-specific information like hostname with XXXX before posting here)

For example, when I run a query against all terms (POST to {{baseURL}}/servers/{{igc_server}}/open-metadata/repository-services/users/{{user}}/instances/entities/by-property) with the following body:

{
	"class": "EntityPropertyFindRequest",
	"typeGUID": "0db3e6ec-f5ef-4d75-ae38-b7ee6fd6ec0a",
	"pageSize": 1
}

I see the query is properly serialized, including the sort order:

2021-01-25 13:50:31.855 DEBUG 91113 --- [nio-9443-exec-2] o.o.e.c.i.i.clientlibrary.IGCRestClient  : POSTing to https://XXXX/ibm/iis/igc-rest/v1/search with:

(single-line JSON formatted for readability):

{
  "types": [
    "term"
  ],
  "properties": [
    "abbreviation",
    "additional_abbreviation",
    "assigned_to_terms",
    "created_by",
    "created_on",
    "example",
    "glossary_type",
    "is_modifier",
    "language",
    "long_description",
    "modified_by",
    "modified_on",
    "name",
    "native_id",
    "referencing_categories",
    "short_description",
    "status",
    "type",
    "usage",
    "workflow_current_state",
    "workflow_stored_state"
  ],
  "pageSize": 1,
  "where": {
    "conditions": [
      {
        "conditions": [
          {
            "property": "category_path.name",
            "operator": "<>",
            "value": "Classifications"
          }
        ],
        "operator": "and"
      }
    ],
    "operator": "and"
  },
  "sorts": [
    {
      "property": "_id",
      "ascending": true
    }
  ]
}

@pmarius22
Copy link
Contributor Author

pmarius22 commented Feb 24, 2021

Hi Chris, apologies for the late response.

The first request that is made is the same that you posted and that one is fine, it works and the results are sorted. However the response from the above post looks something like this:

{
    "paging": {
        "numTotal": 3000,
        "next": "igc_url/ibm/iis/igc-rest/v1/search?types=term&properties=abbreviation&properties=additional_abbreviation&properties=assigned_to_terms&properties=created_by&properties=created_on&properties=example&properties=glossary_type&properties=is_modifier&properties=language&properties=long_description&properties=modified_by&properties=modified_on&properties=name&properties=native_id&properties=referencing_categories&properties=short_description&properties=status&properties=type&properties=usage&properties=workflow_current_state&properties=workflow_stored_state&sorts=SearchSort{property='_id', ascending=true}&where={"conditions":[{"conditions":[{"property":"category_path.name","operator":"<>","value":"Classifications"}],"operator":"and"},{"property":"modified_on","operator":">","value":"0"}],"operator":"and"}&pageSize=1000&begin=1000",
        "pageSize": 1000,
        "end": 999,
        "begin": 0
    },
    "items": []
}

As you can see from the paging information, this page contains the first 1000 results out of 3000. From what I could get from the code, the next page of results is retrieved using a GET on the "next" url from the above response which contains the wrong serialisation for the query param sorts &sorts=SearchSort{property='_id', ascending=true}
Executing this second GET for the next page retrieves the results unordered which in the end results in some terms missing from the final response.

Edit: This was tested with the latest version of the connector code.

Hope this helps, let me know if I can help with anything else. Thanks!

@cmgrote
Copy link
Member

cmgrote commented Mar 4, 2021

So it makes sense now, and I've investigated further, and I think I can now reproduce -- the fundamental problem is that you get the same results for multiple pages, right?

To your point above, it's a bit confusing because of url-encoding...

  • There is some code in the IGCRestClient class that reconstructs this next page URL to ensure that we avoid any potential security exposure by stripping off the URL at the front (in case that could be hacked via DNS, etc)
  • As part of this, the URL must be de-structured and then re-constructed. Anytime this is done, we need to be very careful about how the URL is parsed -- in particular, ensuring that its various components remain encoded at all times (or if we are adding anything ourselves that we pre-encode it).

What you've outlined above looks indeed like a problematic URL, as I can see that everything after the sorts parameter is not encoded: characters like [, {, etc all need to be replaced with %xx.

However, looking at the code of the IGCRestClient itself, I can see that all information is encoded at all times.

I've also gone ahead and extended the tests (ClientTest) to include an explicit sort as part of the multi-page test that runs during the build process. I can see from explicit logging that I've added temporarily on my side that this retains the encoding all the way through:

Line 1230ish in IGCRestClient is where the re-structuring of the URL is done, and I've inserted logging lines just before and after the re-structuring. Before the restructuring I have:

org.odpi.egeria.connectors.ibm.igc.clientlibrary.IGCRestClient - Received nextURL: https://infosvr:9446/ibm/iis/igc-rest/v1/search?types=term&properties=created_by&properties=created_on&properties=modified_by&properties=modified_on&where=%7B%22conditions%22%3A%5B%7B%22property%22%3A%22name%22,%22operator%22%3A%22like+%25%7B0%7D%25%22,%22value%22%3A%22address%22%7D%5D,%22operator%22%3A%22and%22%7D&pageSize=2&begin=2

(Note everything is %xx-encoded, as-received directly from IGC.)

After the restructuring, I then have:

org.odpi.egeria.connectors.ibm.igc.clientlibrary.IGCRestClient - Passing on restructured nextURL: ibm/iis/igc-rest/v1/search?types=term&properties=created_by&properties=created_on&properties=modified_by&properties=modified_on&where=%7B%22conditions%22%3A%5B%7B%22property%22%3A%22name%22,%22operator%22%3A%22like+%25%7B0%7D%25%22,%22value%22%3A%22address%22%7D%5D,%22operator%22%3A%22and%22%7D&pageSize=2&begin=2

(Note that the initial https://infosvr:9446 string has been removed, but I still have the remainder %xx-encoded.)

So at least the URL itself appears to be correctly sent along each time.

With this, I can see that the sort criteria is passed along into the second page of results; however I did observe that the sorting is for some reason dropped from that second page of results onwards -- and that this happens in IGC itself.

Could you try to reproduce the behaviour you are observing using only the IGC REST API directly? I believe there is a bug in IGC itself where this sort criteria is not being passed along across all pages, and this would help confirm it for your particular scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants