Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow API returns empty stream when one of ignored fields is used in attribute filter #9655

Closed
brendan-ward opened this issue Apr 10, 2024 · 4 comments · Fixed by #9664
Closed
Assignees

Comments

@brendan-ward
Copy link

What is the bug?

First observed in pyogrio #388

When using the Arrow API, if an ignored field is used by the attribute filter, an empty stream will be returned. When the ignored field is not used by the attribute field, the correct results are returned.

When not using the Arrow API, the correct results are returned.

Steps to reproduce the issue

Using NaturalEarth lowres shapefile

drv = ogr.GetDriverByName("ESRI Shapefile")
ds = drv.Open(filename, 0)

# attribute filter includes ignored field
lyr = ds.GetLayerByIndex(0)
lyr.SetAttributeFilter(""" "iso_a3" = 'CAN' """)
lyr.SetIgnoredFields(["iso_a3"])
stream = lyr.GetArrowStream()
batch = stream.GetNextRecordBatch()
print(batch is None)
# True

# attribute filter does not include ignored field
lyr2 = ds.GetLayerByIndex(0)
lyr2.SetAttributeFilter(""" "iso_a3" = 'CAN' """)
lyr2.SetIgnoredFields(["name"])
stream = lyr2.GetArrowStream()
batch = stream.GetNextRecordBatch()
print(batch is None)
# False
print(batch.GetLength())
# 1

Versions and provenance

Detected in GDAL 3.8.3 - 3.8.5

Additional context

No response

@rouault
Copy link
Member

rouault commented Apr 15, 2024

When using the Arrow API, if an ignored field is used by the attribute filter, an empty stream will be returned

This is the expected result.

When not using the Arrow API, the correct results are returned.

I don't confirm this with the following snippet that returns nothing when SetIgnoredFields() is called, and returns one record otherwise

from osgeo import gdal, ogr
gdal.UseExceptions()

ds = ogr.Open("autotest/ogr/data/poly.shp")
lyr = ds.GetLayer(0)
lyr.SetAttributeFilter("EAS_ID = 158")
lyr.SetIgnoredFields(["EAS_ID"])
for f in lyr:
    f.DumpReadable()

Running the above snippet on a Parquet file will even trigger an error:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    lyr.SetIgnoredFields(["EAS_ID"])
  File "/home/even/gdal/gdal/build_cmake/swig/python/osgeo/ogr.py", line 2182, in SetIgnoredFields
    return _ogr.Layer_SetIgnoredFields(self, *args)
RuntimeError: Constraint on field EAS_ID cannot be applied due to it being ignored

In #9664, I'm clarifying the interaction between SetAttributeFilter() and SetIgnoredFields()

@rouault rouault self-assigned this Apr 15, 2024
@brendan-ward
Copy link
Author

Thanks @rouault for investigating and the additional clarification in the docs; that helps explain existing behavior when using the Arrow API. And your test refuting my results was very helpful because it identified a bug in pyogrio causing us to not actually set the ignored fields when we were reading in regular (non-Arrow API) mode.

@theroggy
Copy link
Contributor

theroggy commented Apr 17, 2024

@rouault I suppose it is non-trivial to either support filtering on ignored fields or to always throw an error when there is a risk for wrong data being returned, i.e. when OGRSQL is used?

As it involves wrong data being returned, an error would be even better than documentation I suppose...

@rouault
Copy link
Member

rouault commented Apr 17, 2024

or to always throw an error when there is a risk for wrong data being returned, i.e. when OGRSQL is used?

It's a bit tricky. It cannot be really done by SetAttributeFilter() or SetIgnoredFields() themselves, otherwise that means that there should be a precise ordering of them. So if that was done, that should be done later when starting to evaluate the filter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants