Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return only unique record when using includes #8

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

gracz21
Copy link

@gracz21 gracz21 commented May 30, 2023

The problem

In some cases when a query is using includes and the included tables are used in the where clauses, the returned results seems to be duplicated and in the end, it returns too little records compared to set limit. Example from specs:

User.includes(:organizations).where(organizations: { id: Organization.pluck(:id) }).limit(2).order(created_at: :asc)
[#<User id: 1, login: "mikeissocool", created_at: "2023-05-30 10:26:12", updated_at: "2023-05-30 10:26:12">, #<User id: 2, login: "iheanyi", created_at: "2023-05-30 10:26:12", updated_at: "2023-05-30 10:26:12">]

User.includes(:organizations).where(organizations: { id: Organization.pluck(:id) }).limit(2).order(created_at: :asc).fast_page
[#<User id: 1, login: "mikeissocool", created_at: "2023-05-30 10:26:12", updated_at: "2023-05-30 10:26:12">]

It's happening because at some point (when pluck is called) a SQL JOIN is executed and if the JOIN operation multiplies the results, the ids contains duplicated record IDs. This way, when the ids.first(limit_value) is called, it takes first limit_value values from a duplicated set.

The solution

I propose to call uniq after taking the ids and this way, operating on a unique set of record IDs

@mscoutermarsh
Copy link
Member

Thank you! Missed the notification on this PR, will look into soon.

@mscoutermarsh
Copy link
Member

When looking into this, did you notice this?

ActiveRecord::Base.logger = Logger.new(STDOUT)
og = User.includes(:organizations).where(organizations: { id: Organization.pluck(:id) }).limit(2).order(created_at: :asc)

[5] pry(#<FastPageTest>)> og.reload
D, [2023-07-04T12:04:49.742304 #1600] DEBUG -- :   SQL (0.5ms)  SELECT DISTINCT "users"."id" FROM "users" LEFT OUTER JOIN "user_organizations" ON "user_organizations"."user_id" = "users"."id" LEFT OUTER JOIN "organizations" ON "organizations"."id" = "user_organizations"."organization_id" WHERE "organizations"."id" IN (?, ?) ORDER BY "users"."created_at" ASC LIMIT ?  [["id", 1], ["id", 2], ["LIMIT", 2]]
D, [2023-07-04T12:04:49.743943 #1600] DEBUG -- :   SQL (0.2ms)  SELECT "users"."id" AS t0_r0, "users"."login" AS t0_r1, "users"."created_at" AS t0_r2, "users"."updated_at" AS t0_r3, "organizations"."id" AS t1_r0, "organizations"."name" AS t1_r1, "organizations"."created_at" AS t1_r2, "organizations"."updated_at" AS t1_r3 FROM "users" LEFT OUTER JOIN "user_organizations" ON "user_organizations"."user_id" = "users"."id" LEFT OUTER JOIN "organizations" ON "organizations"."id" = "user_organizations"."organization_id" WHERE "organizations"."id" IN (?, ?) AND "users"."id" IN (?, ?) ORDER BY "users"."created_at" ASC  [["id", 1], ["id", 2], ["id", 1], ["id", 2]]

Surprisingly, It seems that ActiveRecord is already using the deferred join technique on this query automatically. It grabs the distinct IDs and then passes them into the 2nd query. In this case, I don't think fast_page would benefit since it results in grabbing the unique IDs twice.

@gracz21
Copy link
Author

gracz21 commented Jul 4, 2023

@mscoutermarsh I had a similar case during my work with this gem (when there was a usage of includes + the included relation was used in the query) and with fast_page the results were incomplete (instead of returning 10 records as was set in the limit, there were 2 or 3 records).

Went more deep down into this gem implementation and found out the issue described above. Also, the spec I provide did not pass without the change proposed in this MR. Dunno, how it is possible you got a DISTINCT applied out of the box.

@mscoutermarsh
Copy link
Member

Guessing the query in your app in slightly different.

Could you turn the query logger on and try the query in your app without fast_page?

See if it runs the distinct query for you already?

I agree this is a bug - I'm thinking the solution might be a bit different though. Since as is (even with your fix), fast_page is adding an extra query that is unnecessary. Making it slower.

@gracz21
Copy link
Author

gracz21 commented Jul 4, 2023

Will try to provide more details this week

@mscoutermarsh
Copy link
Member

Thanks!

I'm curious if the fix here is detecting when ActiveRecord behaves this way and "disabling" fast_page from running the extra query.

I'll play around with it some more as well.

@gracz21
Copy link
Author

gracz21 commented Jul 4, 2023

Don't think so, the fix here is simply applying a uniq on a PORO array

@mscoutermarsh
Copy link
Member

I'll try to explain again, I wasn't clear. 😄

This example without fast_page.

ActiveRecord::Base.logger = Logger.new(STDOUT)
og = User.includes(:organizations).where(organizations: { id: Organization.pluck(:id) }).limit(2).order(created_at: :asc)

[5] pry(#<FastPageTest>)> og.reload
D, [2023-07-04T12:04:49.742304 #1600] DEBUG -- :   SQL (0.5ms)  SELECT DISTINCT "users"."id" FROM "users" LEFT OUTER JOIN "user_organizations" ON "user_organizations"."user_id" = "users"."id" LEFT OUTER JOIN "organizations" ON "organizations"."id" = "user_organizations"."organization_id" WHERE "organizations"."id" IN (?, ?) ORDER BY "users"."created_at" ASC LIMIT ?  [["id", 1], ["id", 2], ["LIMIT", 2]]
D, [2023-07-04T12:04:49.743943 #1600] DEBUG -- :   SQL (0.2ms)  SELECT "users"."id" AS t0_r0, "users"."login" AS t0_r1, "users"."created_at" AS t0_r2, "users"."updated_at" AS t0_r3, "organizations"."id" AS t1_r0, "organizations"."name" AS t1_r1, "organizations"."created_at" AS t1_r2, "organizations"."updated_at" AS t1_r3 FROM "users" LEFT OUTER JOIN "user_organizations" ON "user_organizations"."user_id" = "users"."id" LEFT OUTER JOIN "organizations" ON "organizations"."id" = "user_organizations"."organization_id" WHERE "organizations"."id" IN (?, ?) AND "users"."id" IN (?, ?) ORDER BY "users"."created_at" ASC  [["id", 1], ["id", 2], ["id", 1], ["id", 2]]

Notice how it does the DISTINCT query and then uses those results in the 2nd query. This is what fast_page normally does (the deferred join technique) to speed up pagination queries.

But for this specific query, it seems that ActiveRecord is already doing it. Which makes fast_page unnecessary for this query. If anything, fast_page is making the query performance worse by running a similar ID select again.

Yes, the wrong results are a bug. But the fix might be having fast_page recognize it can't optimize anything further & calling the original query instead of running the deferred join again.

@gracz21
Copy link
Author

gracz21 commented Jul 4, 2023

IMO it may not be ActiveRecord feature rather the DB/DB adapter thing. What DB are you using?

@mscoutermarsh
Copy link
Member

Ah could be. This was directly from the gem's test suite. Which is sqlite.

@jeremy
Copy link

jeremy commented Feb 29, 2024

That's Active Record: see ActiveRecord::FinderMethods#apply_join_dependency and ActiveRecord::ConnectionAdapters::SchemaStatements#distinct_relation_for_primary_key:

      def apply_join_dependency(eager_loading: group_values.empty?)
        join_dependency = construct_join_dependency(
          eager_load_values | includes_values, Arel::Nodes::OuterJoin
        )
        relation = except(:includes, :eager_load, :preload).joins!(join_dependency)
    
        if eager_loading && has_limit_or_offset? && !(
            using_limitable_reflections?(join_dependency.reflections) &&
            using_limitable_reflections?(
              construct_join_dependency(
                select_association_list(joins_values).concat(
                  select_association_list(left_outer_joins_values)
                ), nil
              ).reflections
            )
          )
          relation = skip_query_cache_if_necessary do
            klass.connection.distinct_relation_for_primary_key(relation)
          end
        end
    
        if block_given?
          yield relation, join_dependency
        else
          relation
        end
      end

and

      def distinct_relation_for_primary_key(relation) # :nodoc:
        primary_key_columns = Array(relation.primary_key).map do |column|
          visitor.compile(relation.table[column])
        end
        
        values = columns_for_distinct(
          primary_key_columns,
          relation.order_values
        )
      
        limited = relation.reselect(values).distinct!
        limited_ids = select_rows(limited.arel, "SQL").map do |results|
          results.last(Array(relation.primary_key).length) # ignores order values for MySQL and PostgreSQL
        end

        if limited_ids.empty?
          relation.none!
        else
          relation.where!(**Array(relation.primary_key).zip(limited_ids.transpose).to_h)
        end
        
        relation.limit_value = relation.offset_value = nil
        relation
      end

The references to eager-loaded has_many associations along with a limit/offset triggers this path since the outer joins needed for eager loading would break the logical limit/offset.

It would be interesting to be able to explicitly trigger this behavior, since it'd be pretty helpful for pagination in general!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants