Optimize tf template E2E deployment time #444

Closed

yiyinglovecoding
Collaborator

@yiyinglovecoding yiyinglovecoding commented Mar 26, 2024

Make terraform apply faster by creating the GKE cluster and the Cloud SQL database in parallel:

  • Make custom-network a separate module; in RAG, module infra depends on custom-network.
  • Make cloudsql-secret a separate resource that depends on the namespace and cloudsql.

This doesn't affect JupyterHub and Ray, which use the infra module, because infra still creates the network for them. In RAG, however, custom-network will create the network outside the infra module.
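
A rough sketch of the dependency layout described above, for illustration only. Terraform walks the dependency graph and creates resources that don't depend on each other concurrently, so once the network exists, the cluster and the database can proceed side by side. The input names, outputs, the create_network flag, the cloudsql module path, and the secret name below are assumptions and not the actual code in this PR; variables are assumed to be declared elsewhere, and the kubernetes provider configuration is omitted.

```hcl
# Sketch only: inputs, outputs, and flags marked "hypothetical" are assumptions.

# The network lives outside infra so both consumers can start once it exists.
module "custom-network" {
  source       = "../../modules/custom-network"
  project_id   = var.project_id   # hypothetical input
  network_name = var.network_name # hypothetical input
}

# GKE cluster: reuses the network instead of creating its own.
module "infra" {
  source         = "../../infrastructure"
  project_id     = var.project_id   # hypothetical input
  network_name   = var.network_name # hypothetical input
  create_network = false            # hypothetical flag
  depends_on     = [module.custom-network]
}

# Cloud SQL: depends only on the network, not on the cluster, so Terraform
# is free to create it in parallel with module.infra.
module "cloudsql" {
  source       = "../../modules/cloudsql" # hypothetical path
  project_id   = var.project_id           # hypothetical input
  network_name = var.network_name         # hypothetical input
  depends_on   = [module.custom-network]
}

# The namespace needs the cluster to exist first.
resource "kubernetes_namespace" "rag" {
  metadata {
    name = var.kubernetes_namespace # hypothetical variable
  }
  depends_on = [module.infra]
}

# The secret is the only piece that has to wait for both branches.
resource "kubernetes_secret" "cloudsql-secret" {
  metadata {
    name      = "db-secret" # hypothetical name
    namespace = kubernetes_namespace.rag.metadata[0].name
  }
  data = {
    username = module.cloudsql.db_user     # hypothetical output
    password = module.cloudsql.db_password # hypothetical output
  }
  depends_on = [module.cloudsql]
}
```

With a layout like this, only the secret joins the two branches, which is why it has to be split out from the module that creates the database.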

test:

module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_project_iam_member.cluster_service_account-nodeService_account[0]: Creation complete after 9s [id=yiyingzhang-gke-dev/roles/container.nodeServiceAccount/serviceAccount:tf-gke-test-refactor12-6wm8@yiyingzhang-gke-dev.iam.gserviceaccount.com]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [10s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [10s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [20s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [20s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [30s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [30s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [40s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [40s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [50s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [50s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [1m0s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [1m0s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [1m10s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [1m10s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [1m20s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [1m20s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [1m30s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [1m30s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [1m40s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [1m40s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [1m50s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [1m50s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [2m0s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [2m0s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [2m10s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [2m10s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [2m20s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still creating... [2m20s elapsed]
module.cloudsql.module.cloudsql.google_sql_database_instance.default: Still creating... [2m30s elapsed]
module.infra[0].module.public-gke-autopilot-cluster[0].module.gke.google_container_cluster.primary: Still cr

Time for terraform apply on an Autopilot (AP) cluster:
before: around 30 min
after: around 20 min

e2e: done

@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding changed the title from "[WIP][DO NOT REVIEW] Refactor" to "Optimize tf template E2E deployment time" on Mar 27, 2024
@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding
Collaborator Author

The Ray cluster test failed with a GMP engine error, which seems unrelated to my change.
https://screenshot.googleplex.com/BFBmgVbDrgAHbNi

@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding changed the title from "Optimize tf template E2E deployment time" to "[WIP][DO NOT REVIEW]Optimize tf template E2E deployment time" on Mar 28, 2024
@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding changed the title from "[WIP][DO NOT REVIEW]Optimize tf template E2E deployment time" to "Optimize tf template E2E deployment time" on Mar 29, 2024
@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding
Collaborator Author

/gcbrun

cloudbuild.yaml Outdated
set -e

cd /workspace/modules/custom-network
terraform apply \
Collaborator

I think we want to test network creation as part of infrastructure/, since that's how it'll be used for RAG, Ray, and Jupyter. Does it have to be a separate step in the tests?

Collaborator Author

Yes, you are right. I combined it into the cluster creation step.

Collaborator Author

I might have misunderstood, so I want to confirm: if we want to speed up the deployment by creating the Cloud SQL database and the GKE cluster in parallel, we have to move network creation out of infrastructure/, because the Cloud SQL database depends on the network.
@imreddy13

Collaborator

@andrewsykim this PR splits the network creation out of infra, because for RAG we need to create the Cloud SQL instance in parallel with the GKE cluster, and the Cloud SQL instance needs the network to exist first.

But for Ray and Jupyter, we don't need that. @yiyinglovecoding can we test both flows in our E2E?

  1. network created by infra for ray and jupyter applications
  2. network created by rag for rag application

Might need to restructure and/or add tests

Collaborator

I think that could add a lot of complexity. How much faster is creation due to this change? If it's not significant should we reconsider?

Collaborator Author

@yiyinglovecoding yiyinglovecoding Apr 3, 2024

It will be at least around 10 minutes faster for AP.

There have been lots of changes in the template, so I just ran it again: AP creation takes around 25 minutes before the change and around 20 minutes after it.

@imreddy13

@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding
Collaborator Author

/gcbrun

applications/rag/main.tf (resolved)
applications/rag/main.tf (outdated, resolved)
infrastructure/main.tf (outdated, resolved)
infrastructure/main.tf (outdated, resolved)
@yiyinglovecoding changed the title from "Optimize tf template E2E deployment time" to "[WIP]Optimize tf template E2E deployment time" on Apr 2, 2024
@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding
Collaborator Author

/gcbrun

@yiyinglovecoding changed the title from "[WIP]Optimize tf template E2E deployment time" to "Optimize tf template E2E deployment time" on Apr 3, 2024
@yiyinglovecoding
Collaborator Author

/gcbrun

@spencer-p
Collaborator

Closing old issues with merge conflicts. Please rebase and re-open if still relevant.

@spencer-p spencer-p closed this Oct 17, 2024
4 participants