- Infrastructure Design
- Scalability, reliability, availability & maintainability
- Hybrid & Edge cloud
- Distributed processing systems
- Migration of data warehouse from on-premises DC to GCP
Processing starts with data collection and ends with exploration & visualization.
- How to choose the appropriate compute service depending on the use case
- Compute Engine / Kubernetes Engine / App Engine / Cloud Functions
- Cloud Run & Anthos are not covered
- Scalability, reliability, availability & maintainability
- Hybrid & Edge cloud
- Distributed processing systems
- Migration of data warehouse from on-premises DC to GCP
- IaaS
- You have the greatest amount of control over your infrastructure : full acces to VMs instances
- Good option if you need max control over the conf & are willing to manage instances
- Configuration:
- OS, software, conf
- machine type: vCPU, RAM, GPU, storage
- security features : Shielded VMs & accelerators
- region & zone
- VMs can be grouped into cluster for HA & scalability : an instance group (VMs with identical conf.) is managed as a single unit.
- managed service : Google maintains the cluster, the kube's installation & conf on instance groups
- kube is deployed on a cluster of servers
- pro :
- users can precisely tune the allocation of resources to each container.
- For apps designed as a set of microservices. App components separated into microservices in their own containers : easier to allocate resources & to maintain
- allow mlutiple env : other cloud provider & on-prem
- cons :
- microservices should have the same liefcycle (maintenance can disrupt other microservices)
- apps must be containerized
- PaaS
- allow devs to focus an app development.
- 2 versions :
- Standard: serverless environnement, support Go, Java? PHP, Nodejs & Python.
- Flexible: runs Docker containers & allows devs to customize their runtime env --> avantage of a PaaS with flexibility
- serverless, managed compute service
- for running code in response to events that occur in the cloud. Writing a msg to a Cloud Pub/Sub or uploading files can triger the execution of a Cloud Function
- Funcs are coded in JavaScript, Python 3 & Go
Availability = ability of a user to access a resource at a specific time - usually measured as the percentage of time that a system is operational.
it is a function of reliability = the probability that a system'll meet service-level objectives for some duration of time - measured as the mean time between failures.
Scalability = ability of a system to handle de/increases in workload by adding/removing resources.
- You are responsible for ensuring high availability & scalability.
- Done with MIGs (MIGs are defined using a template including all specs)
- When an instance in a MIG fails, it is replaced identically
- Load balancers direct traffic only to responsive instances using health checks - they are either global or regional
- Autoscalers add & remove instances according to workload - you create a policy that specifies the criteria for adjusting the size of the group (CPU utilization & other metrics collected by Stackdriver...)
- you have the greatest level of control over your instances, but you are also responsible for configuring MIGs, load balancers, & autoscalers.
- K8bs is designed to support HA, reliability & scalability for containers & apps running in a cluster.
- When pods fail, they are replaced much like failed instances in a MIG.
- Nodes belong to a pool : the autorepair feature turned on, failed nodes'll be reprovisioned automatically.
pros of serverless & managed services = they are designed to be HA, scalable & reliable.
- App Engine : you do have the option of configuring policies for scaling (target : CPU, throughput utilizations, & max concurrent requests).
- Cloud Functions are designed so that each instance of a cloud function handles one request at a time. Additional instances can be created.
-
Memorystore : Standard Tier is automatically configured to maintain a replica in a different zone used only for H.A, not scalability when Redis detects a failure & triggers a failover.
-
Compute Engine and Kubernetes Engine : Persistent disks (built-in redundancy for HA and reliability) are used to provide network-based disk storage to VMs & containers. Users can also create snapshots of disks & store them in Cloud Storage.
-
Cloud SQL: HA mode by maintaining a primary instance in one zone and a standby instance in an other one within the same region (synchronous replication keeps the data up to date)
-
Cloud Storage stores replicas of objects within a region when using standard storage and across regions when using multi-regional storage.
- Standard Tier uses the public Internet network to transfer data between Google data centers
- Premium Tier routes traffic only over Google’s global network
Combines on-premises transaction processing systems with cloud-base analytics platforms. Data is extracted & transferred to the cloud for analytics processing. OLTP systems have predictable workloads with little variance Whereas Analytic workloads are predictable but can be highly variable
A variation of Hybrid Cloud. Used when
- connectivity / network is not reliable or
- bandwidth not sufficient.
- low-latency processing is required (IoT, manufacturing...) A combo of cloud & ege base computed may be used. CI / CD process ensures consitency accross edge devices. When the full apps is run at the edge consider using containers.
TO BE CONTINUED
TO BE CONTINUED
TO BE CONTINUED
DW = repositories of enterrprise data organized for BI reporting & analysis
- DW include extraction, tranformation & load scripts, view & enbedded UDF, reports & dataviz.
- Identity mngmt & access control are used to protect confidentialiyt, integrity & availability
- 2 scenarii
- off-loading a DW migration: copying data & scheames, when BI needs the extra storage & compute capacity of the cloud
- a full DW migration = off-loading + moving pipelines : allows to take advantage of the cloud ETL tools.
- 4 stages
- Technical requirements :
identifying existing use cases: data needed, source, update frequency
ETL jobs, access controls, metadata, reports, dataviz
- Buissness benefits of a migration
cost-savings, increased agility, flexibility
- definition of KPIs to measure the migration process effciency vs objectives
amount of data migrated, dataviz available...
- how you can take advantage of BigQuery
(no need to plan for resources/serverless, Colossus FS to ensure availability...)
Alternative ways to prioritize, depending on your needs - iterative process
- migration of analytical workloads first
- focusing on the UX first
- prioritize low-risk use case first
- Testing & validating all aspects (schema correctly defined, all data actually loaded, pipelines, queries,
dataviz working flawlessly...)