jostrm/azure-enterprise-scale-ml

Project: azure-enterprise-scale-ml (ESML) AI Factory

The Enterprise Scale AI Factory is a plug and play solution that automates the provisioning, deployment, and management of AI projects on Azure with a template way of working.

  • Plug and play accelerator for: DataOps, MLOps, LLMOps, enterprise scale environment.

Main purpose:

  1. Marry multiple best practices & accelerators: It reuses multiple existing Microsoft accelerators, landing zone architectures, and best practices such as CAF & WAF, and provides an end-to-end experience including Dev, Test, and Prod environments.
  2. Plug-and-play: Dynamically creates infrastructure resources per team, including networking and RBAC.
    • Examples of what is dynamic: Subnet/IP calculator, ACL permissions on the datalake for a project team, services "glued together"
  3. Template way of working & project way of working: The AI Factory is project based (cost control, privacy, scalability per project) and provides multiple templates besides the infrastructure template: DataLake template, DataOps templates, MLOps templates, with selectable project types.
    • Sub-purpose: Same MLOps: whether data scientists choose to work from Azure Databricks or Azure Machine Learning, the same MLOps template is used.
    • Sub-purpose: A common way of working with a common but flexible toolbox: a LAMBDA architecture with tools such as Azure Data Factory, Azure Databricks, Azure Machine Learning, Event Hubs, AKS
  4. Enterprise scale & security & battle tested: Used by customers and partners with MLOps since 2019 (see LINKS) to accelerate the development and delivery of AI solutions, with common tooling & marrying multiple best practices. Private networking (private endpoints) by default.
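
To illustrate the kind of work a subnet/IP calculator does, here is a minimal sketch using only the Python standard library. It is not the ESML IP calculator itself: the function name, the `projectNNN` labels, and the example address ranges are assumptions made for illustration.

```python
import ipaddress


def carve_project_subnets(vnet_cidr: str, new_prefix: int, project_count: int) -> dict:
    """Return one subnet of size /new_prefix per project, carved from vnet_cidr.

    Hypothetical helper for illustration; the real ESML calculator may use
    different inputs, sizes, and naming.
    """
    vnet = ipaddress.ip_network(vnet_cidr)
    candidates = vnet.subnets(new_prefix=new_prefix)  # lazy generator of subnets
    plan = {}
    for number in range(1, project_count + 1):
        try:
            plan[f"project{number:03d}"] = next(candidates)
        except StopIteration:
            raise ValueError(
                f"{vnet_cidr} has no room for {project_count} /{new_prefix} subnets"
            ) from None
    return plan


# Example: three project teams, each with its own /24 inside a /16 vNet.
plan = carve_project_subnets("10.0.0.0/16", 24, 3)
for name, subnet in plan.items():
    print(name, subnet, subnet.num_addresses)
```

Because each project gets a non-overlapping range, a new team can be added without touching the address space of existing teams.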

Public links for more info

Documentation

The Documentation is organized around ROLES via Doc series.

| Doc series | Role | Focus | Details |
|---|---|---|---|
| 10-19 | CoreTeam | Governance | Setup of AI Factory. Governance. Infrastructure, networking. Permissions |
| 20-29 | CoreTeam | Usage | User onboarding & AI Factory usage. DataOps for the CoreTeam's data ingestion team |
| 30-39 | ProjectTeam | Usage | Dashboard, Available Tools & Services, DataOps, MLOps, Access options to the private AIFactory |
| 40-49 | All | FAQ | Various frequently asked questions. Please look here before contacting an ESML AIFactory mentor. |
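
The doc series above amount to a small lookup table from a document number to a role and focus. A minimal sketch of that lookup follows; the `DocSeries` type and `series_for` function are hypothetical names used only for illustration, and the sketch assumes that a document's number decides its series.

```python
from typing import NamedTuple, Optional


class DocSeries(NamedTuple):
    first: int
    last: int
    role: str
    focus: str


# Ranges, roles, and focus areas copied from the doc series table.
DOC_SERIES = (
    DocSeries(10, 19, "CoreTeam", "Governance"),
    DocSeries(20, 29, "CoreTeam", "Usage"),
    DocSeries(30, 39, "ProjectTeam", "Usage"),
    DocSeries(40, 49, "All", "FAQ"),
)


def series_for(doc_number: int) -> Optional[DocSeries]:
    """Return the series a document number falls in, or None if outside 10-49."""
    for series in DOC_SERIES:
        if series.first <= doc_number <= series.last:
            return series
    return None


print(series_for(12))
print(series_for(45))
```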

It is also organized via the four components of the ESML AIFactory:

| Component | Role | Doc series |
|---|---|---|
| 1) Infra:AIFactory | CoreTeam | 10-19 |
| 2) Datalake template | All | 20-29, 30-39 |
| 3) Templates for: DataOps, MLOps, *LLMOps | All | 20-29, 30-39 |
| 4) Accelerators: ESML SDK (Python, PySpark), RAG Chatbot, etc | ProjectTeam | 30-39 |

LINK to Documentation

Best practices implemented & benefits

NEWS TABLE

| Date | Category | What | Link |
|---|---|---|---|
| 2024-03 | Automation | Add core team member | 26-add-esml-coreteam-member.ps1 |
| 2024-03 | Automation | Add project member | 26-add-esml-project-member.ps1 |
| 2024-03 | Tutorial | Core-team tutorial | 10-AIFactory-infra-subscription-resourceproviders.md |
| 2024-03 | Tutorial | End-user tutorial | 01-jumphost-vm-bastion-access.md |
| 2024-03 | Tutorial | End-user tutorial | 03-use_cases-where_to_start.md |
| 2024-02 | Tutorial | End-user installation Compute Instance | R01-install-azureml-sdk-v1+v2.m |
| 2024-02 | Datalake - Onboarding | Auto-ACL on PROJECT folder in lake | - |
| 2023-03 | Networking | No Public IP: Virtual private cloud - updated networking rules | https://learn.microsoft.com/en-us/azure/machine-learning/v1/how-to-secure-workspace-vnet?view=azureml-api-1&preserve-view=true&tabs=required%2Cpe%2Ccli |
| 2023-02 | ESML Pipeline templates | Azure Databricks: Training and Batch pipeline templates. 100% same support as AML pipeline templates (inner/outer loop MLOps) | - |
| 2022-08 | ESML infra (IaC) | Bicep now supports yaml as well | - |
| 2022-10 | ESML MLOps | ESML MLOps v3 advanced mode, support for Spark steps (Databricks notebooks / DatabricksStep) | - |

TEMPLATES for PIPELINES

TRAINING & INFERENCE pipelines are 1 of 5 template types in the ESML AIFactory that accelerate work for the end-user.

  • Only 0.1% of the code needs to be written to go from an R&D process to production pipelines:

THE Challenge

When innovating with AI and machine learning, multiple voices expressed the need for an enterprise-scale AI & machine learning platform with end-to-end turnkey DataOps and MLOps. Other requirements were an enterprise datalake design that can share refined data across the organization, and high security and robustness: generally available technology only, and vNet support for pipelines & data with private endpoints. In short, a secure platform with a factory approach to building models.

Even though best practices exist, setting up such an AI Factory solution can be time-consuming and complex. When designing an analytical solution, a private solution without public internet access is often desired, since it is common to work with production data from day one, e.g. already in the R&D phase. Cyber security around this is important.

  • Challenge 1: Marry multiple (4) best practices
  • Challenge 2: Dev, Test, Prod Azure environments/Azure subscriptions
  • Challenge 3: Turnkey: Datalake, DataOps, INNER & OUTER LOOP MLOps

Also, the full solution should be provisionable 100% via infrastructure-as-code, so that it can be recreated and scaled across multiple Azure subscriptions, and it should be project-based to scale up to 250 projects, each with its own set of services such as its own Azure Machine Learning workspace & compute clusters.

THE Strategy

To meet the requirements & challenges, multiple best practices needed to be married and implemented, such as: CAF/WAF, MLOps, Datalake design, AI Factory, Microsoft Intelligent Data Platform / Modern Data Architecture. An open-source initiative could help with all of these at once: this open-source accelerator, Enterprise Scale ML (ESML), to get an AI Factory on Azure.

THE Solution - TEMPLATES & Accelerator

ESML provides an AI Factory more quickly (within 4-40 hours), with 1-250 ESMLProjects. An ESMLProject is a set of Azure services glued together securely.

  • Challenge 1 solved: Marry multiple (4) best practices
  • Challenge 2 solved: Dev, Test, Prod Azure environments/Azure subscriptions
  • Challenge 3 solved: Turnkey: Datalake, DataOps, INNER & OUTER LOOP MLOps

ESML marries multiple best practices into one solution accelerator, with 100% infrastructure-as-code.

ESML 4 main components:

ESML AI Factory - 4 step process:

ESML AI Factory "Oneslider": Dev,Test,Prod environments - Enterprise Scale LandingZones

  • Easy to provision a new ESMLProject for Dev, Test, Prod with easy cost follow-up, since each project team in the ESML AI Factory has its own PROJECT resource groups:
  • Horizontally: 3 COMMON environments (Dev, Test, Prod). Vertically: ESMLProject 1-250
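
The environment-by-project layout can be pictured as a small matrix. The sketch below enumerates one resource group per project and environment; the naming pattern `esml-projectNNN-<env>-rg` and the function name are purely hypothetical and are not taken from the ESML Bicep templates.

```python
from itertools import product

ENVIRONMENTS = ("dev", "test", "prod")
MAX_PROJECTS = 250  # upper bound stated for the AI Factory


def project_resource_groups(project_numbers, environments=ENVIRONMENTS):
    """List one resource group name per (project, environment) pair.

    The name format is an assumption for illustration only.
    """
    numbers = list(project_numbers)
    for number in numbers:
        if not 1 <= number <= MAX_PROJECTS:
            raise ValueError(f"project number {number} is outside 1-{MAX_PROJECTS}")
    return [
        f"esml-project{number:03d}-{env}-rg"
        for number, env in product(numbers, environments)
    ]


# Two project teams across the three environments.
for name in project_resource_groups([1, 2]):
    print(name)
```

Keeping one resource group per project and environment is what makes the per-project cost follow-up straightforward, since costs can be read per group.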

Azure DevOps/Bicep can optionally integrate with an ITSM system as a "ticket" in ServiceNow/Remedy/JIRA Service Desk. The info below is needed for ESML provisioning:

ESML Architecture - "Modern data analytics platform"

Based on this reference architecture: https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture

Contributing to ESML AIFactory?

This repository is a push-only mirror. Ping Joakim Åström for contributions / ideas.

Because of the "mirror-only" design, pull requests are not possible, except for ESML admins. See the LICENCE file (open source, MIT license). Speaking of open source, contributors:

  • Credit to Kim Berg and Ben Kooijman for contributing! (kudos to the ESML IP calculator and Bicep additions for esml-project type)
  • Credit to Christofer Högvall for contributing! (kudos for the PowerShell script that enables resource providers if they do not exist)
    • azure-enterprise-scale-ml\environment_setup\aifactory\bicep\esml-util\26-enable-resource-providers.ps1