Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clouddriver-ecs: regression with 1.24.0 - broken #6245

Closed
lifeofguenter opened this issue Dec 21, 2020 · 7 comments · Fixed by spinnaker/clouddriver#5168
Closed

clouddriver-ecs: regression with 1.24.0 - broken #6245

lifeofguenter opened this issue Dec 21, 2020 · 7 comments · Fixed by spinnaker/clouddriver#5168

Comments

@lifeofguenter
Copy link

lifeofguenter commented Dec 21, 2020

Analogue to #6084 there is now a new endpoint that is affected from the 4k SQL bug:

Screen Shot 2020-12-21 at 11 36 27

the same API call on 1.23.5 only does ~100 SQL queries:

Screen Shot 2020-12-21 at 12 08 04

Which still seems a bit high, but much more acceptable :) (e.g. downgrading fixed the issue for us)

@piradeepk
Copy link

piradeepk commented Dec 21, 2020

Just to clarify, the api isn't broken but rather affected by the same issue described in: #6084

To provide some context, previously, the application/server group information was being inferred in Clouddriver, but with the newly implemented moniker support (as you can see the api's are different one queries name, while the other does groupName), we're now using the cache to retrieve this information, and it's using the same logic that is described in the above issue.

@piradeepk
Copy link

piradeepk commented Dec 21, 2020

@lifeofguenter would you be able to provide some information as to what query you're trying to run/request you're making? What are you trying to do when you see this behaviour? What is the experience that you're having? Are you no longer able to use Spinnaker or are you just seeing a large number of queries?

Also, in your hal config, do you have any namingStrategies defined/configured?

@lifeofguenter
Copy link
Author

lifeofguenter commented Dec 21, 2020

@piradeepk

In deck -> application -> clusters -> select server group -> sidebar pops up with server-group information

Deck will do the following API request to retrieve information for that sidebar:

https://spinnaker-gate.foo/applications/fooservice/serverGroups/prod-ecs/us-east-1/fooservice-v016?includeDetails=false

Also, in your hal config, do you have any namingStrategies defined/configured?

we are not using any namingStrategy currently (so it should be defaulting to something?) - Given that the amount of SQL queries are almost exactly the same like in #6084, I don't think this is an issue with the new implementation, rather the new implementation uncovered an old issue.

I do however see this in our clouddriver.yml which I am not sure if its correct:

sql:
  enabled: true
  taskRepository:
    enabled: true
  cache:
    enabled: true

e.g. what is "taskRepository" and why is "cache.enabled" required?

Are you no longer able to use Spinnaker

We are unable to use 1.24 unfortunately and had to rollback. With every new version of Spinnaker we are losing new components due to the 4k SQL query bug. And since we have limited understanding of what other API endpoints might be affected as well in 1.24 we decided to not take a chance :)

I think someone needs to look into clouddriver-ecs and how it translates lookups into sql queries :) - but I don't think it will be an easy task, nor something that will be fixed anytime soon 😞

@piradeepk
Copy link

Thanks for the quick response. Will continue to investigate with the given information.

@piradeepk
Copy link

From my initial investigation, it looks to be related to a PR to support Monikers for ECS, in which we removed the applicationName (to support monikers): spinnaker/clouddriver@fc08d9a#diff-08ef4737fffbc8ef45383c0%5B%E2%80%A6%5D2d4b4b314665b4f134c0b5a4L527 (customers with a large number of serverGroups, would previously use that value to prevent the EcsServerClusterProvider from checking every serverGroup, if the service's appname didn't match the name of the application provided, it wouldn't check the rest of the logic).

The check that would bypass all of the logic if the appname didn't match: spinnaker/clouddriver@fc08d9a#diff-08ef4737fffbc8ef45383c0%5B%E2%80%A6%5D2d4b4b314665b4f134c0b5a4R131 (application is being passed in as null every time so that check is always false, but anyone with a larger number of server groups is timing out now since we're checking every single server group, rather than just the ones that match serverGroupName).

@piradeepk
Copy link

@lifeofguenter I've made a change (and it's been merged to master and the next patch release) that should hopefully fix this issue for you. A patch release should be released soon, please let us know if this is still an issue once 1.24.1 is released (hopefully within the next few hours)

@piradeepk piradeepk self-assigned this Dec 21, 2020
@lifeofguenter
Copy link
Author

lifeofguenter commented Dec 22, 2020

thanks @piradeepk - your changes have fixed the issue 🍾

I am now back to a more "sane" number of mysql queries for that endpoint:

Screen Shot 2020-12-22 at 10 53 54

I must note though, that compared to a EC2 lookup its still concerning high:

Screen Shot 2020-12-22 at 10 55 54

E.g. the sidebar opens up with a EC2 servergroup quite snappy in less than 50ms - as its only doing 3 sql queries. With ECS it does ~50 sql queries and thus takes almost 1000ms for the same endpoint.

We have both equally amount of services on EC2 and ECS. My question: why is ECS so much slower? Why does it require a magnitude of 50-100x of SQL queries for the same information?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants