
Automating the Fine-Tuning and Deployment of Large Language AI Models on Amazon Web Services with CI/CD Pipelines and MLOps

Project overview:

Each day I introduce a cutting-edge AI solution built on the Amazon Web Services (AWS) cloud platform, to help readers quickly learn the AI best practices of AWS, the world's most popular cloud platform, and apply them in their own daily work.

This article shows how to use CodePipeline on AWS to automate the fine-tuning and deployment of a machine learning model. The automated workflow first creates a Step Functions state machine, which fine-tunes a large language model on the managed machine learning service SageMaker and finally exposes a public URL endpoint for inference. The design is entirely cloud-native and serverless, providing a scalable and secure AI solution. The solution architecture diagram is shown below:

Prerequisite knowledge

What is Amazon SageMaker?

Amazon SageMaker is AWS's fully managed, end-to-end machine learning service, designed to help developers and data scientists easily build, train, and deploy machine learning models. SageMaker provides tooling for the entire workflow, from data preparation through model training to model deployment, so users can run machine learning projects efficiently in the cloud.

What is AWS Step Functions?

AWS Step Functions is a fully managed workflow orchestration service from AWS that lets users visually chain multiple AWS services together into automated processes. Step Functions makes it easy to define and manage complex workflows, including branching decisions, parallel processing, error handling, and retry logic.
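A Step Functions workflow is ultimately just a JSON document written in the Amazon States Language (ASL). As a rough sketch (the state names and Lambda function name here are invented for illustration), a minimal definition using the same state types that appear later in this article (Task, Wait, Choice, Succeed, Fail) can be built as a plain Python dict:

```python
import json

# Hypothetical minimal ASL definition: call a Lambda, wait, check the result,
# then branch to a terminal Succeed or Fail state.
definition = {
    "StartAt": "DoWork",
    "States": {
        "DoWork": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "my_function"},  # placeholder name
            "Next": "WaitABit",
        },
        "WaitABit": {"Type": "Wait", "Seconds": 30, "Next": "CheckResult"},
        "CheckResult": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "Failed", "Next": "JobFailed"}
            ],
            "Default": "JobSucceeded",
        },
        "JobSucceeded": {"Type": "Succeed"},
        "JobFailed": {"Type": "Fail"},
    },
}

print(json.dumps(definition, indent=2))
```

The `stepfunctions` Python SDK used later in this article generates exactly this kind of JSON from Python objects.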

Benefits of using a Step Functions state machine to automate model creation, fine-tuning, and deployment on SageMaker

With an AWS Step Functions state machine, developers can automate the creation, fine-tuning, and deployment of large models on Amazon SageMaker. Step Functions chains these steps into a visual workflow, simplifying the management of a complex machine learning pipeline. The benefits of automation include:

Higher efficiency

Automating repetitive tasks reduces manual intervention and speeds up model development and deployment.

Lower risk of errors

A predefined workflow ensures each step runs in order, reducing the chance of human error.

Better scalability

Handle machine learning workloads of any size, from small experiments to large production deployments, with consistent workflow management.

Simpler operations

Automated pipelines simplify model monitoring and management, making it easy to adjust and optimize the machine learning pipeline at any time.

Automating SageMaker operations with Step Functions not only improves development efficiency for machine learning projects, but also keeps the entire process stable and repeatable.

What this solution covers

1. Define the AWS Step Functions state machine configuration in code via the SDK

2. Build a CI/CD pipeline with AWS CodePipeline that automatically creates the Step Functions workflow

3. Start the Step Functions workflow to automatically create, fine-tune, and deploy the large language AI model

Step-by-step build instructions:

1. First, open the AWS console and go to the CodeCommit repository service. Click "Clone URL" to copy the URL of each of the two repositories so we can clone them locally.

2. Next, open Cloud9, the AWS cloud IDE. Create a new Cloud9 environment and click "Open".

3. Run the following commands in the IDE terminal to clone the model files in "genai-repo" to the local environment:

git clone <genai-repo URL>
cd genai-repo

4. In this folder, create two new files, "buildspec.yml" and "state_machine_manager.py": the CI/CD build spec and the Step Functions configuration file, respectively. Their contents are shown below.

"buildspec.yml": the configuration file for the CI/CD build stage; its main job is to run the command "python state_machine_manager.py":

version: 0.2
phases:
  install:
    commands:
      - python --version
      - pip install --upgrade pip
      - pip install boto3
      - pip install --upgrade sagemaker
      - pip install --upgrade stepfunctions
  pre_build:
    commands:
      - cd $CODEBUILD_SRC_DIR
  build:
    commands:
      - echo Build started on `date`
      - cd $CODEBUILD_SRC_DIR
      - echo Current directory `ls -la`
      - echo Building the AWS Step-Function...
      - echo Path `pwd`
      - python state_machine_manager.py
  post_build:
    commands:
      - echo Build completed on `date`

"state_machine_manager.py": this file creates a Step Functions state machine that defines the workflow for automatically creating, fine-tuning, and deploying the model on SageMaker. The workflow consists of multiple states, defined in the workflow_definition variable:

import boto3
import datetime
import random
import uuid
import logging
import stepfunctions
import sagemaker
import io
import json
import sys
from sagemaker import djl_inference
from sagemaker import image_uris
from sagemaker import Model
from stepfunctions import steps
from stepfunctions.steps import *
from stepfunctions.workflow import Workflow

iam = boto3.client('iam')
s3 = boto3.client('s3')

stepfunctions.set_stream_logger(level=logging.INFO)

### SET UP STEP FUNCTIONS ###
unique_timestamp = f"{datetime.datetime.now():%H-%m-%S}"
state_machine_name = f'FineTuningLLM-{unique_timestamp}'
notebook_name = f'fine-tuning-llm-{unique_timestamp}'
succeed_state = Succeed("HelloWorldSuccessful")
fail_state = Fail("HelloWorldFailed")
new_model_name = f"trained-dolly-{unique_timestamp}"

try:
    # Get a list of all bucket names
    bucket_list = s3.list_buckets()
    # Filter bucket names starting with 'automate'
    bucket_names = [bucket['Name'] for bucket in bucket_list['Buckets'] if bucket['Name'].startswith('automate')]
    mybucket = bucket_names[0].strip("'[]")
except Exception as e:
    print(f"Error: {e}")

# Get the stepfunction_workflow_role
try:
    role = iam.get_role(RoleName='stepfunction_workflow_role')
    workflow_role = role['Role']['Arn']
except iam.exceptions.NoSuchEntityException:
    print("The role 'stepfunction_workflow_role' does not exist.")

# Get the sagemaker_exec_role
try:
    role2 = iam.get_role(RoleName='sagemaker_exec_role')
    sagemaker_exec_role = role2['Role']['Arn']
except iam.exceptions.NoSuchEntityException:
    print("The role 'sagemaker_exec_role' does not exist.")

# Create a SageMaker model object
model_data = "s3://{}/output/lora_model.tar.gz".format(mybucket)
image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.22.1", region="us-east-1")
trained_dolly_model = Model(
    image_uri=image_uri,
    model_data=model_data,
    predictor_cls=djl_inference.DJLPredictor,
    role=sagemaker_exec_role,
)

# Create a retry configuration for SageMaker throttling exceptions. This is attached to
# the SageMaker steps to ensure they are retried until they run.
SageMaker_throttling_retry = stepfunctions.steps.states.Retry(
    error_equals=['ThrottlingException', 'SageMaker.AmazonSageMakerException'],
    interval_seconds=5,
    max_attempts=60,
    backoff_rate=1.25,
)

# Create a state machine step to create the model
model_step = steps.ModelStep(
    'Create model',
    model=trained_dolly_model,
    model_name=new_model_name,
)
# Add a retry configuration to the model_step
model_step.add_retry(SageMaker_throttling_retry)

# Create notebook for running SageMaker training job.
create_sagemaker_notebook = LambdaStep(
    state_id="Create training job",
    parameters={
        "FunctionName": "create_notebook_function",
        "Payload": {"notebook_name": notebook_name},
    },
)

# Get notebook status
get_notebook_status = LambdaStep(
    state_id="Get training job status",
    parameters={
        "FunctionName": "get_notebook_status_function",
        "Payload": {"notebook_name": notebook_name},
    },
)

# Choice state
response_notebook_status = Choice(state_id="Response to training job status")
wait_for_training_job = Wait(state_id="Wait for training job", seconds=150)
wait_for_training_job.next(get_notebook_status)

# Retry checking notebook status
response_notebook_status.add_choice(
    rule=ChoiceRule.StringEquals(variable="$.Payload.trainningstatus", value="Failed"),
    next_step=fail_state,
)
response_notebook_status.add_choice(
    rule=ChoiceRule.StringEquals(variable="$.Payload.trainningstatus", value="Stopped"),
    next_step=fail_state,
)
response_notebook_status.add_choice(
    ChoiceRule.StringEquals(variable="$.Payload.trainningstatus", value="NotAvailable"),
    next_step=fail_state,
)
inservice_rule = ChoiceRule.StringEquals(variable="$.Payload.trainningstatus", value="InService")
response_notebook_status.add_choice(
    ChoiceRule.Not(inservice_rule),
    next_step=wait_for_training_job,
)

# Create a step to generate an Amazon SageMaker endpoint configuration
endpoint_config_step = steps.EndpointConfigStep(
    "Create endpoint configuration",
    endpoint_config_name=new_model_name,
    model_name=new_model_name,
    initial_instance_count=1,
    instance_type='ml.g4dn.2xlarge',
)
# Add a retry configuration to the endpoint_config_step
endpoint_config_step.add_retry(SageMaker_throttling_retry)

# Create a step to generate an Amazon SageMaker endpoint
endpoint_step = steps.EndpointStep(
    "Create endpoint",
    endpoint_name=f"endpoint-{new_model_name}",
    endpoint_config_name=new_model_name,
)
# Add a retry configuration to the endpoint_step
endpoint_step.add_retry(SageMaker_throttling_retry)

# Chain the steps together to generate a full AWS Step Function
workflow_definition = steps.Chain([
    create_sagemaker_notebook,
    wait_for_training_job,
    get_notebook_status,
    response_notebook_status,
    model_step,
    endpoint_config_step,
    endpoint_step,
])

# Create an AWS Step Functions workflow based on inputs
basic_workflow = Workflow(
    name=state_machine_name,
    definition=workflow_definition,
    role=workflow_role,
)

jsonDef = basic_workflow.definition.to_json(pretty=True)
print('---------')
print(jsonDef)
print('---------')
basic_workflow.create()
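Once `basic_workflow.create()` has registered the state machine, an execution can also be started programmatically with plain boto3, which is what the `stepfunctions` SDK does under the hood. A hedged sketch follows: the state machine ARN below is a placeholder, and the real call is left commented out since it needs AWS credentials.

```python
import json

def build_start_execution_request(state_machine_arn, payload):
    # start_execution() expects the execution input as a JSON string, not a dict
    return {
        "stateMachineArn": state_machine_arn,
        "input": json.dumps(payload),
    }

req = build_start_execution_request(
    # Placeholder ARN; the real one is printed in the CodeBuild logs after create()
    "arn:aws:states:us-east-1:123456789012:stateMachine:FineTuningLLM-19-08-44",
    {},  # this workflow takes no input; each step carries its own parameters
)

# With AWS credentials configured, the actual call would be:
# import boto3
# sfn = boto3.client("stepfunctions")
# execution = sfn.start_execution(**req)
print(req["stateMachineArn"])
```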

5. Next, upload all the new files in the folder back to our repository:

git add *
git commit -m "initial commit"
git push

6. Next, go to the CodeBuild build service and create a new project.

7. Name the project "genai-build" and add a source repository to the build: set the repository to genai-repo and select the master branch.

8. Grant the build the required permissions, point it at the Buildspec configuration file, and click Create.

9. Next, go to CodePipeline and create a new CI/CD pipeline.

10. Name the pipeline "genai-pipeline" and assign it the required permissions.

11. First choose the pipeline's source stage: select CodeCommit as the source type, "genai-repo" as the repository, and master as the branch.

12. In the Build stage, select the CodeBuild project "genai-build" we just created. Skip the deploy stage and click Create.

13. Once the build stage completes successfully, go to the Step Functions console.

14. On the Step Functions page we can see the new state machine created by the CodeBuild job: "FineTuningLLM-19-08-44".

15. Clicking the state machine shows the workflow definition we configured earlier:

{
  "StartAt": "Create training job",
  "States": {
    "Create training job": {
      "Parameters": {
        "FunctionName": "create_notebook_function",
        "Payload": {"notebook_name": "fine-tuning-llm-19-08-44"}
      },
      "Resource": "arn:aws:states:::lambda:invoke",
      "Type": "Task",
      "Next": "Wait for training job"
    },
    "Wait for training job": {
      "Seconds": 150,
      "Type": "Wait",
      "Next": "Get training job status"
    },
    "Get training job status": {
      "Parameters": {
        "FunctionName": "get_notebook_status_function",
        "Payload": {"notebook_name": "fine-tuning-llm-19-08-44"}
      },
      "Resource": "arn:aws:states:::lambda:invoke",
      "Type": "Task",
      "Next": "Response to training job status"
    },
    "Response to training job status": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.Payload.trainningstatus", "StringEquals": "Failed", "Next": "HelloWorldFailed"},
        {"Variable": "$.Payload.trainningstatus", "StringEquals": "Stopped", "Next": "HelloWorldFailed"},
        {"Variable": "$.Payload.trainningstatus", "StringEquals": "NotAvailable", "Next": "HelloWorldFailed"},
        {"Not": {"Variable": "$.Payload.trainningstatus", "StringEquals": "InService"}, "Next": "Wait for training job"}
      ],
      "Default": "Create model"
    },
    "Create model": {
      "Parameters": {
        "ExecutionRoleArn": "arn:aws:iam::903982278766:role/sagemaker_exec_role",
        "ModelName": "trained-dolly-19-08-44",
        "PrimaryContainer": {
          "Environment": {},
          "Image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118",
          "ModelDataUrl": "s3://automate-fine-tuning-e91ee010/output/lora_model.tar.gz"
        }
      },
      "Resource": "arn:aws:states:::sagemaker:createModel",
      "Type": "Task",
      "Next": "Create endpoint configuration",
      "Retry": [
        {
          "ErrorEquals": ["ThrottlingException", "SageMaker.AmazonSageMakerException"],
          "IntervalSeconds": 5,
          "MaxAttempts": 60,
          "BackoffRate": 1.25
        }
      ]
    },
    "Create endpoint configuration": {
      "Resource": "arn:aws:states:::sagemaker:createEndpointConfig",
      "Parameters": {
        "EndpointConfigName": "trained-dolly-19-08-44",
        "ProductionVariants": [
          {
            "InitialInstanceCount": 1,
            "InstanceType": "ml.g4dn.2xlarge",
            "ModelName": "trained-dolly-19-08-44",
            "VariantName": "AllTraffic"
          }
        ]
      },
      "Type": "Task",
      "Next": "Create endpoint",
      "Retry": [
        {
          "ErrorEquals": ["ThrottlingException", "SageMaker.AmazonSageMakerException"],
          "IntervalSeconds": 5,
          "MaxAttempts": 60,
          "BackoffRate": 1.25
        }
      ]
    },
    "Create endpoint": {
      "Resource": "arn:aws:states:::sagemaker:createEndpoint",
      "Parameters": {
        "EndpointConfigName": "trained-dolly-19-08-44",
        "EndpointName": "endpoint-trained-dolly-19-08-44"
      },
      "Type": "Task",
      "End": true,
      "Retry": [
        {
          "ErrorEquals": ["ThrottlingException", "SageMaker.AmazonSageMakerException"],
          "IntervalSeconds": 5,
          "MaxAttempts": 60,
          "BackoffRate": 1.25
        }
      ]
    },
    "HelloWorldFailed": {
      "Type": "Fail"
    }
  }
}
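The Retry blocks in this definition (IntervalSeconds 5, MaxAttempts 60, BackoffRate 1.25) make the wait between attempts grow exponentially. A small sketch of how the wait before the n-th retry is computed, assuming the standard Step Functions formula interval * backoff^(n-1):

```python
def retry_wait_seconds(attempt, interval_seconds=5, backoff_rate=1.25):
    # Wait before retry number `attempt` (1-based): interval * rate^(attempt-1)
    return interval_seconds * backoff_rate ** (attempt - 1)

# First few waits of the retry policy attached to the SageMaker steps above
waits = [retry_wait_seconds(n) for n in range(1, 5)]
print(waits)  # [5.0, 6.25, 7.8125, 9.765625]
```

With 60 attempts allowed, the workflow tolerates long stretches of SageMaker throttling before giving up.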

16. In the Step Functions execution view we can see that every step has completed. The two states "Create training job" and "Get training job status" invoke two different Lambda functions written in Python.

The Python code for "Create training job" is as follows:

import boto3
import base64
import os

def lambda_handler(event, context):
    aws_region = 'us-east-1'
    notebook_name = event["notebook_name"]
    # s3_bucket='automate-fine-tunning-gblpoc'
    notebook_file = 'lab-notebook.ipynb'
    iam = boto3.client('iam')
    # Create SageMaker and S3 clients
    sagemaker = boto3.client('sagemaker', region_name=aws_region)
    s3 = boto3.resource('s3', region_name=aws_region)
    s3_client = boto3.client('s3')
    s3_bucket = os.environ['s3_bucket']
    s3_prefix = "notebook_lifecycle"
    lifecycle_config_script = f"""#!/bin/bash
set -e
cd /home/ec2-user/SageMaker/
aws s3 cp s3://{s3_bucket}/{s3_prefix}/training_scripts.zip .
unzip training_scripts.zip
echo "Running training job..."
source /home/ec2-user/anaconda3/bin/activate pytorch_p310
chmod +x /home/ec2-user/SageMaker/converter.sh
chown ec2-user:ec2-user /home/ec2-user/SageMaker/converter.sh
nohup /home/ec2-user/SageMaker/converter.sh >> /home/ec2-user/SageMaker/nohup.out 2>&1 &
"""
    lifecycle_config_name = f'LCF-{notebook_name}'
    print(lifecycle_config_script)

    # Function to manage lifecycle configuration
    def manage_lifecycle_config(lifecycle_config_script):
        content = base64.b64encode(lifecycle_config_script.encode('utf-8')).decode('utf-8')
        try:
            # Create lifecycle configuration if not found
            sagemaker.create_notebook_instance_lifecycle_config(
                NotebookInstanceLifecycleConfigName=lifecycle_config_name,
                OnCreate=[{'Content': content}]
            )
        except sagemaker.exceptions.ClientError as e:
            print(e)

    # Get the role with the specified name
    try:
        role = iam.get_role(RoleName='sagemaker_exec_role')
        sagemaker_exec_role = role['Role']['Arn']
    except iam.exceptions.NoSuchEntityException:
        print("The role 'sagemaker_exec_role' does not exist.")

    # Try to describe the notebook instance to determine its status
    try:
        response = sagemaker.describe_notebook_instance(NotebookInstanceName=notebook_name)
    except sagemaker.exceptions.ClientError as e:
        print(e)
        if 'RecordNotFound' in str(e):
            manage_lifecycle_config(lifecycle_config_script)
            # Create a new SageMaker notebook instance if not found
            # Updated to 4xl by DWhite due to 12xl not being available. 7/18/2024
            sagemaker.create_notebook_instance(
                NotebookInstanceName=notebook_name,
                InstanceType='ml.g5.4xlarge',
                RoleArn=sagemaker_exec_role,
                LifecycleConfigName=lifecycle_config_name,
                VolumeSizeInGB=30
            )
        else:
            raise

    return {
        'statusCode': 200,
        'body': 'Notebook instance setup and lifecycle configuration applied.'
    }

The code for "Get training job status" is as follows:

import boto3
import json
import os

s3 = boto3.client('s3')
sagemaker = boto3.client('sagemaker')
s3_bucket = os.environ['s3_bucket']

def lambda_handler(event, context):
    print(event)
    notebook_name = event["notebook_name"]
    notebook_status = "NotAvailable"
    training_job_status = 'NotAvailable'
    check_status = 'NotAvailable'
    # Try to describe the notebook instance to determine its status
    try:
        response = sagemaker.describe_notebook_instance(NotebookInstanceName=notebook_name)
        notebook_status = response['NotebookInstanceStatus']
        if notebook_status == 'InService':
            find_artifact = s3.list_objects_v2(Bucket=s3_bucket, Prefix='output/lora_model.tar.gz')
            artifact_location = find_artifact.get('Contents', [])
            if not artifact_location:
                training_job_status = 'Creating'
                check_status = 'Creating'
            else:
                if 'output/lora_model.tar.gz' in str(artifact_location):
                    training_job_status = 'Completed'
                    check_status = 'InService'
        elif notebook_status == 'Failed':
            check_status = 'Failed'
        elif notebook_status == 'NotAvailable':
            check_status = 'NotAvailable'
        else:
            check_status = 'Pending'
        print(f"Notebook Status: {notebook_status}")
        print(f"Model on s3: {training_job_status}")
        print(f"Check status: {check_status}")
    except sagemaker.exceptions.ClientError as e:
        print(e)
    return {
        'statusCode': 200,
        'input': notebook_name,
        'trainningstatus': check_status
    }
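The branching logic of this Lambda can be factored into a pure function, which makes it easy to unit test without any AWS access. A sketch (the function name is mine, not part of the original code):

```python
def resolve_check_status(notebook_status, artifact_exists):
    # Map notebook status plus presence of lora_model.tar.gz on S3 to the
    # 'trainningstatus' value the Step Functions Choice state branches on
    if notebook_status == 'InService':
        return 'InService' if artifact_exists else 'Creating'
    if notebook_status in ('Failed', 'NotAvailable'):
        return notebook_status
    return 'Pending'

print(resolve_check_status('InService', False))  # Creating: Choice loops back to Wait
print(resolve_check_status('InService', True))   # InService: workflow proceeds to Create model
print(resolve_check_status('Updating', False))   # Pending: Choice loops back to Wait
```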

17. After every task in the Step Functions workflow has finished, go to the SageMaker service, create a Jupyter notebook instance, and open it.

18. Create a new Jupyter notebook file and copy in the fine-tuning code. Below is an excerpt of the fine-tuning code, which uses PEFT and LoRA to fine-tune the Dolly large language model.

EPOCHS = 10
LEARNING_RATE = 1e-4
MODEL_SAVE_FOLDER_NAME = "dolly-3b-lora"

training_args = TrainingArguments(
    output_dir=MODEL_SAVE_FOLDER_NAME,
    fp16=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=LEARNING_RATE,
    num_train_epochs=EPOCHS,
    logging_strategy="steps",
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=20000,
    save_total_limit=10,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=split_dataset['train'],
    eval_dataset=split_dataset["test"],
    data_collator=data_collator,
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

19. We also need to create a SageMaker lifecycle configuration script that triggers the fine-tuning commands as part of the automated Step Functions run. The startup script is as follows:

#!/bin/bash
set -e
cd /home/ec2-user/SageMaker/
aws s3 cp s3://automate-fine-tuning-e91ee010/notebook_lifecycle/training_scripts.zip .
unzip training_scripts.zip
echo "Running training job..."
source /home/ec2-user/anaconda3/bin/activate pytorch_p310
chmod +x /home/ec2-user/SageMaker/converter.sh
chown ec2-user:ec2-user /home/ec2-user/SageMaker/converter.sh
nohup /home/ec2-user/SageMaker/converter.sh >> /home/ec2-user/SageMaker/nohup.out 2>&1 &

20. Finally, go to the Endpoints section of SageMaker, where you can see the API endpoint URL of the successfully deployed large AI model.
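The deployed endpoint can then be called from any client through the SageMaker runtime API. Below is a hedged sketch of building the invoke_endpoint request: the endpoint name matches the one created above, but the JSON payload shape expected by the DJL serving container is an assumption, so verify it against the container's documentation. The actual network call is left commented out since it needs AWS credentials.

```python
import json

def build_invoke_request(endpoint_name, prompt):
    # invoke_endpoint() takes the endpoint name, a content type, and a body
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps({"inputs": prompt}),  # payload schema is an assumption
    }

req = build_invoke_request("endpoint-trained-dolly-19-08-44", "What is MLOps?")

# With AWS credentials configured, the actual call would be:
# import boto3
# smr = boto3.client("sagemaker-runtime")
# response = smr.invoke_endpoint(**req)
# print(response["Body"].read().decode())
print(req["EndpointName"])
```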

That covers all the steps for automating the creation, fine-tuning, and deployment of a large language AI model on AWS using the CodePipeline CI/CD service and a Step Functions workflow. I hope you'll join me again for more cutting-edge generative AI solutions.
