-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clouddriver-ecs: regression with 1.24.0 - broken #6245
Comments
Just to clarify, the api isn't broken but rather affected by the same issue described in: #6084 To provide some context, previously, the application/server group information was being inferred in Clouddriver, but with the newly implemented moniker support (as you can see the api's are different one queries name, while the other does groupName), we're now using the cache to retrieve this information, and it's using the same logic that is described in the above issue. |
@lifeofguenter would you be able to provide some information as to what query you're trying to run/request you're making? What are you trying to do when you see this behaviour? What is the experience that you're having? Are you no longer able to use Spinnaker or are you just seeing a large number of queries? Also, in your hal config, do you have any namingStrategies defined/configured? |
In deck -> application -> clusters -> select server group -> sidebar pops up with server-group information Deck will do the following API request to retrieve information for that sidebar:
we are not using any namingStrategy currently (so it should be defaulting to something?) - Given that the amount of SQL queries are almost exactly the same like in #6084, I don't think this is an issue with the new implementation, rather the new implementation uncovered an old issue. I do however see this in our clouddriver.yml which I am not sure if its correct:
e.g. what is "taskRepository" and why is "cache.enabled" required?
We are unable to use 1.24 unfortunately and had to rollback. With every new version of Spinnaker we are losing new components due to the 4k SQL query bug. And since we have limited understanding of what other API endpoints might be affected as well in 1.24 we decided to not take a chance :) I think someone needs to look into clouddriver-ecs and how it translates lookups into sql queries :) - but I don't think it will be an easy task, nor something that will be fixed anytime soon 😞 |
Thanks for the quick response. Will continue to investigate with the given information. |
From my initial investigation, it looks to be related to a PR to support Monikers for ECS, in which we removed the applicationName (to support monikers): spinnaker/clouddriver@fc08d9a#diff-08ef4737fffbc8ef45383c0%5B%E2%80%A6%5D2d4b4b314665b4f134c0b5a4L527 (customers with a large number of serverGroups, would previously use that value to prevent the EcsServerClusterProvider from checking every serverGroup, if the service's appname didn't match the name of the application provided, it wouldn't check the rest of the logic). The check that would bypass all of the logic if the appname didn't match: spinnaker/clouddriver@fc08d9a#diff-08ef4737fffbc8ef45383c0%5B%E2%80%A6%5D2d4b4b314665b4f134c0b5a4R131 (application is being passed in as null every time so that check is always false, but anyone with a larger number of server groups is timing out now since we're checking every single server group, rather than just the ones that match serverGroupName). |
@lifeofguenter I've made a change (and it's been merged to master and the next patch release) that should hopefully fix this issue for you. A patch release should be released soon, please let us know if this is still an issue once 1.24.1 is released (hopefully within the next few hours) |
thanks @piradeepk - your changes have fixed the issue 🍾 I am now back to a more "sane" number of mysql queries for that endpoint: I must note though, that compared to a EC2 lookup its still concerning high: E.g. the sidebar opens up with a EC2 servergroup quite snappy in less than 50ms - as its only doing 3 sql queries. With ECS it does ~50 sql queries and thus takes almost 1000ms for the same endpoint. We have both equally amount of services on EC2 and ECS. My question: why is ECS so much slower? Why does it require a magnitude of 50-100x of SQL queries for the same information? |
Analogue to #6084 there is now a new endpoint that is affected from the 4k SQL bug:
the same API call on 1.23.5 only does ~100 SQL queries:
Which still seems a bit high, but much more acceptable :) (e.g. downgrading fixed the issue for us)
The text was updated successfully, but these errors were encountered: