Fine-Tuning Llama 3.1 8B Instruct for Reasoning: Data Preparation, Training, and Evaluation
1. Overall Pipeline (Core Logic Preserved)
The core goal of the pipeline: starting from the Llama Nemotron post-training dataset, use NeMo Curator to filter out high-quality reasoning data, fine-tune Llama 3.1 8B Instruct with lightweight LoRA, and finally validate the result on benchmarks such as MMLU and GPQA. The pipeline breaks down into three stages:

- Data preparation: filtering → formatting → curriculum-learning ordering
- Model training: NeMo-based LoRA fine-tuning with the specified hyperparameters
- Model evaluation: Triton deployment + benchmark validation

2. Detailed Steps and Completed Code Examples

2.1 Data Preparation (NeMo Curator filtering and processing)

Prerequisites (install the dependencies first):

```shell
# Install NeMo Curator
pip install "nemo-curator[all]"
# Install git lfs (needed to clone the large dataset)
sudo apt-get install git-lfs
# Install the remaining dependencies
pip install torch transformers datasets fasttext
```

Core steps (details filled in, code corrected):

```python
import subprocess
import glob
import json
import os
from nemo_curator.filters import LanguageFilter, TokenCountFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils import download_file

# Step 1: clone the dataset (terminal commands executed from code)
def clone_dataset():
    subprocess.run(["git", "lfs", "install"])
    dataset_path = "./Llama-Nemotron-Post-Training-Dataset"
    if not os.path.exists(dataset_path):
        subprocess.run([
            "git", "clone",
            "https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset",
            dataset_path,
        ])
    return dataset_path

dataset_path = clone_dataset()
input_dir = os.path.join(dataset_path, "SFT")

# Step 2: download the language-identification model
lang_id_model_path = "./lid.176.ftz"
if not os.path.exists(lang_id_model_path):
    download_file(
        url="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz",
        save_path=lang_id_model_path,
    )

# Step 3: filter the data with NeMo Curator
output_dir = "./curated-data"
os.makedirs(output_dir, exist_ok=True)
tokenizer_name = "meta-llama/Llama-3.1-8B-Instruct"

# Build the NeMo Curator command (original parameters corrected to match the official interface)
command = [
    "python", "-m", "nemo_curator.cli",  # recommended CLI entry point
    "process",
    "--input-dir", input_dir,
    "--output-dir", output_dir,
    # keep only the specified subsets (chat, math)
    "--filename-pattern", "chat.*\\.jsonl$", "math_v1.1.*\\.jsonl$",
    # drop unused columns
    "--drop-columns", "version", "license", "generator", "category", "used_in_training",
    # language filter: keep English only
    "--language-filter", "en",
    "--lang-id-model-path", lang_id_model_path,
    # length filter
    "--tokenizer", tokenizer_name,
    "--max-tokens", "8192",
    "--max-completion-tokens", "16384",
    # parallelism settings
    "--num-workers", "8",
    "--block-size", "100MB",
    "--device", "cpu",
]
# run the filtering
subprocess.run(command, check=True)
```
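Before launching the full Curator pass, it can help to sanity-check the length thresholds on a couple of records. A minimal sketch under stated assumptions: `count_tokens` is a hypothetical stand-in that splits on whitespace instead of using the real Llama 3.1 tokenizer, so its counts only approximate what `--max-tokens` will see.

```python
# Sketch of the --max-tokens style length filter, applied locally.
# NOTE: count_tokens() is a hypothetical stand-in (whitespace split),
# not the real Llama 3.1 tokenizer; counts are approximate.
MAX_TOKENS = 8192

def count_tokens(text):
    return len(text.split())

def passes_length_filter(record, max_tokens=MAX_TOKENS):
    """Keep a record only if prompt + completion stay under the token budget."""
    total = count_tokens(record.get("prompt", "")) + count_tokens(record.get("completion", ""))
    return total <= max_tokens

sample = {"prompt": "What is 2 + 2?", "completion": "2 + 2 = 4."}
print(passes_length_filter(sample))  # True
```

Running a handful of known-long and known-short records through a check like this catches threshold mistakes cheaply, before the multi-hour filtering job starts.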
```python
# Step 4: merge the JSONL files
def merge_jsonl_files(input_dir, output_file):
    jsonl_files = glob.glob(os.path.join(input_dir, "*.jsonl"))
    combined_data = []
    for file in jsonl_files:
        with open(file, "r", encoding="utf-8") as f:
            for line_num, line in enumerate(f):
                line = line.strip()
                if not line:
                    continue
                try:
                    data = json.loads(line)
                    combined_data.append(data)
                except json.JSONDecodeError:
                    print(f"Skipping invalid JSON line: {file}, line {line_num + 1}")
    # write the merged file
    with open(output_file, "w", encoding="utf-8") as f:
        for item in combined_data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    print(f"Merge complete: {len(combined_data)} records saved to {output_file}")

merge_jsonl_files(output_dir, "./training_raw.jsonl")

# Step 5: apply the chat template (with a system prompt)
from transformers import AutoTokenizer

# load the Llama 3.1 tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
tokenizer.pad_token = tokenizer.eos_token

# reasoning-specific system prompt
SYSTEM_PROMPT = (
    "You are a reasoning assistant. Solve problems step by step, "
    "explain your logic clearly, and ensure accuracy in mathematical "
    "and logical reasoning."
)

def apply_chat_template(item):
    """Convert a raw record into the Llama 3.1 Instruct chat-template format."""
    # adapt to the raw data fields (assumes `prompt` and `completion` fields)
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": item["prompt"]},
        {"role": "assistant", "content": item["completion"]},
    ]
    # apply the official chat template
    formatted_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return {
        "text": formatted_text,
        "reasoning_type": item.get("reasoning", "general"),
    }

# process all records
with open("./training_raw.jsonl", "r", encoding="utf-8") as f_in, \
     open("./training_formatted.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        item = json.loads(line.strip())
        formatted_item = apply_chat_template(item)
        f_out.write(json.dumps(formatted_item, ensure_ascii=False) + "\n")

# Step 6: curriculum-learning ordering
def curriculum_learning_sort(input_file, output_file):
    """Curriculum-learning sort:
    1. group by reasoning type (e.g. mathematical vs. logical reasoning)
    2. within each group, sort by completion length ascending (short -> long)
    3. interleave the groups to balance the reasoning types
    """
    # read the data and group it by reasoning type
    data_by_type = {}
    with open(input_file, "r", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line.strip())
            r_type = item["reasoning_type"]
            if r_type not in data_by_type:
                data_by_type[r_type] = []
            # completion length, simplified here to total text length
            item["length"] = len(item["text"])
            data_by_type[r_type].append(item)
    # sort each group by length, ascending
    for r_type in data_by_type:
        data_by_type[r_type].sort(key=lambda x: x["length"])
    # interleave the groups (similar to a merge step)
    sorted_data = []
    max_len = max(len(lst) for lst in data_by_type.values())
    for i in range(max_len):
        for r_type in data_by_type:
            if i < len(data_by_type[r_type]):
                sorted_data.append(data_by_type[r_type][i])
    # save the sorted data
    with open(output_file, "w", encoding="utf-8") as f:
        for item in sorted_data:
            item.pop("length", None)  # drop the temporary length field
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    print(f"Curriculum sort complete: {len(sorted_data)} records saved to {output_file}")

curriculum_learning_sort("./training_formatted.jsonl", "./training_final.jsonl")
```

Key notes:
- Added practical details such as dependency installation, exception handling (e.g. invalid JSON lines), and field adaptation.
- The chat template strictly follows the official Llama 3.1 Instruct format, keeping the model input compatible.
- The curriculum sort implements the core logic of "group by reasoning type → sort by length → interleave".

2.2 Model Training (NeMo LoRA fine-tuning, original code corrected)

The original training code had missing imports, incomplete parameters, and unclear gradient-accumulation logic; the version below fixes these. Prerequisites: install the NeMo framework (`pip install nemo_toolkit[nlp]==1.20.0`) and make sure a suitable GPU is available (A100/A800 recommended, ≥40 GB VRAM).

Complete training code:

```python
import torch
import os
from omegaconf import OmegaConf
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint, LearningRateMonitor

# NeMo imports (the original code was missing these)
from nemo.collections.nlp.models.language_modeling.megatron_gpt_peft_models import MegatronGPTPEFTModel
from nemo.collections.nlp.parts.nlp_overrides import NLPSaveRestoreConnector
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_dataset import GPTSFTDataset
from nemo.utils import logging

# Step 1: base configuration
# hyperparameters, set exactly as required
LORA_RANK = 64
LEARNING_RATE = 1e-4
BATCH_SIZE_PER_GPU = 4            # per-GPU batch size; adjust to available VRAM
GRADIENT_ACCUMULATION_STEPS = 64  # 64 accumulation steps -> effective batch size 4 * 64 = 256
MAX_TRAIN_STEPS = 2000
MAX_SEQ_LENGTH = 8192
MAX_ANSWER_LENGTH = 16384
BASE_MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
TRAIN_DATA_PATH = "./training_final.jsonl"
SAVE_DIR = "./nemo_models"
os.makedirs(SAVE_DIR, exist_ok=True)

# CUDA setup
torch.cuda.set_device(0)
logging.set_level(logging.INFO)
```
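To put `LORA_RANK = 64` in perspective, a back-of-the-envelope count of the trainable parameters can be sketched from the published Llama 3.1 8B dimensions: hidden size 4096, MLP intermediate size 14336, 32 layers, and grouped-query attention with a 1024-wide key/value projection. These dimensions are assumptions stated here for illustration, not values the training code reads from the checkpoint.

```python
# Back-of-the-envelope LoRA parameter count: adapting a d_in x d_out matrix with
# rank r adds r * (d_in + d_out) trainable parameters (A: r x d_in, B: d_out x r).
LORA_RANK = 64
HIDDEN = 4096         # assumed Llama 3.1 8B hidden size
INTERMEDIATE = 14336  # assumed MLP intermediate size
KV_DIM = 1024         # assumed GQA key/value projection width
NUM_LAYERS = 32       # assumed layer count

# (d_in, d_out) for each module targeted by the LoRA config
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(LORA_RANK * (d_in + d_out) for d_in, d_out in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters (~{total / 8e9:.1%} of 8B)")
```

Under these assumptions the adapters come to roughly 168M trainable parameters, on the order of 2% of the base model, which is why rank-64 LoRA fits comfortably in 40 GB alongside the frozen bf16 weights.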
```python
# Step 2: load the base model and configure LoRA
# initialize the save/restore connector
save_restore_connector = NLPSaveRestoreConnector()

# PEFT (LoRA) configuration
peft_cfg = OmegaConf.create({
    "peft_scheme": "lora",
    "lora": {
        "r": LORA_RANK,
        "lora_alpha": LORA_RANK * 2,  # conventionally set to 2x the rank
        "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj",
                           "gate_proj", "up_proj", "down_proj"],
        "lora_dropout": 0.0,
        "fan_in_fan_out": False,
        "bias": "none",
        "task_type": "CAUSAL_LM",
    },
    "base_model_name": BASE_MODEL_NAME,
    "save_peft_only": True,  # save only the LoRA weights to save space
})

# initialize the Trainer (adapted to NeMo)
trainer = Trainer(
    devices=1,
    accelerator="gpu",
    precision="bf16",  # mixed precision: faster training, lower VRAM use
    max_steps=MAX_TRAIN_STEPS,
    gradient_clip_val=1.0,
    accumulate_grad_batches=GRADIENT_ACCUMULATION_STEPS,
    callbacks=[
        ModelCheckpoint(dirpath=SAVE_DIR, save_top_k=1, monitor="train_loss"),
        LearningRateMonitor(logging_interval="step"),
    ],
    logger=False,
)

# load the model (fixes the duplicate loading in the original code)
model = MegatronGPTPEFTModel.from_pretrained(
    model_name=BASE_MODEL_NAME,
    peft_cfg=peft_cfg,
    trainer=trainer,
    save_restore_connector=save_restore_connector,
    config_overrides={
        "megatron": {
            "tensor_model_parallel_size": 1,
            "pipeline_model_parallel_size": 1,
        },
        "tokenizer": {
            "library": "huggingface",
            "type": BASE_MODEL_NAME,
            "kwargs": {"padding_side": "right"},
        },
    },
)

# Step 3: prepare the training dataset
# build the SFT dataset (NeMo format)
train_dataset = GPTSFTDataset(
    file_path=TRAIN_DATA_PATH,
    tokenizer=model.tokenizer,
    max_seq_length=MAX_SEQ_LENGTH,
    max_answer_length=MAX_ANSWER_LENGTH,
    seed=1234,
    add_bos=True,
    add_eos=True,
    truncation_method="right",
    pad_to_max_length=False,
)

# data loader
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE_PER_GPU,
    shuffle=False,  # keep the curriculum order; shuffling would undo the sort above
    num_workers=4,
    pin_memory=True,
    drop_last=True,
    collate_fn=train_dataset.collate_fn,  # must use the dataset's own collate_fn
)

# Step 4: optimizer and training loop
# optimizer over the (LoRA) parameters
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    weight_decay=0.001,
    betas=(0.9, 0.95),
)
# learning-rate scheduler
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=MAX_TRAIN_STEPS, eta_min=1e-6
)

# training loop (gradient-accumulation logic corrected)
model.train()
model.cuda()
global_step = 0
total_loss = 0.0
for batch in train_dataloader:
    if global_step >= MAX_TRAIN_STEPS:
        break
    # move the batch to the GPU
    batch = {k: v.cuda() for k, v in batch.items()}
    # forward pass
    outputs = model(
        input_ids=batch["tokens"],
        attention_mask=batch["attention_mask"],
        labels=batch["labels"],
    )
    loss = outputs.loss
    # backward pass (gradient accumulation handled automatically)
    trainer.training_step((batch,), global_step)
    # update the learning rate
    scheduler.step()
    # logging
    total_loss += loss.item()
    if global_step % 100 == 0:
        avg_loss = total_loss / 100
        logging.info(
            f"Step {global_step}, Average Loss: {avg_loss:.4f}, "
            f"LR: {scheduler.get_last_lr()[0]:.6f}"
        )
        total_loss = 0.0
    global_step += 1

# Step 5: save the model
# save the full model (including the LoRA weights)
model.save_to(os.path.join(SAVE_DIR, "llama3.1_8b_lora_reasoning.nemo"))
# save the LoRA weights separately for deployment
model.save_peft_weights(os.path.join(SAVE_DIR, "lora_weights"))
logging.info("Training complete; model saved to {}".format(SAVE_DIR))
```

Key corrections and notes:
- Added the missing imports (e.g. Trainer and the callbacks).
- Fixed the duplicate model loading in the original code.
- Gradient accumulation now goes through the Trainer's `accumulate_grad_batches`, which is cleaner than manual bookkeeping and avoids off-by-one errors.
- bf16 mixed precision substantially reduces VRAM usage and speeds up training.
- Both the full model and the LoRA-only weights are saved, to suit different deployment needs.

2.3 Model Evaluation (Triton deployment + benchmarks)

Prerequisites: install the Triton Inference Server (see the official docs) and the evaluation dependencies (`pip install requests evaluate datasets lm-eval`).

Step 1: convert the model to a Triton-servable format

```shell
# 1. Convert the NeMo model to TensorRT-LLM format (for Triton)
python -m nemo.export.trt_llm \
    --model_path ./nemo_models/llama3.1_8b_lora_reasoning.nemo \
    --output_dir ./trt_llm_model \
    --tensor_parallelism 1 \
    --precision bf16

# 2. Write the Triton config file: model_repository/ensemble/config.pbtxt
```

Config example (adjust to your actual paths):

```
name: "ensemble"
platform: "ensemble"
max_batch_size: 8
input [
  {
    name: "TEXT_INPUT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "TEXT_OUTPUT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "TEXT_INPUT" value: "TEXT_INPUT" }
      output_map { key: "TOKEN_IDS" value: "TOKEN_IDS" }
    },
    {
      model_name: "llama3.1_8b_lora"
      model_version: -1
      input_map { key: "TOKEN_IDS" value: "TOKEN_IDS" }
      output_map { key: "GENERATED_TOKENS" value: "GENERATED_TOKENS" }
    },
    {
      model_name: "postprocess"
      model_version: -1
      input_map { key: "GENERATED_TOKENS" value: "GENERATED_TOKENS" }
      output_map { key: "TEXT_OUTPUT" value: "TEXT_OUTPUT" }
    }
  ]
}
```
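Triton expects a specific repository layout: `model_repository/<model_name>/config.pbtxt` plus at least one numeric version subdirectory (e.g. `1/`) holding the model artifacts. Before launching the server it is worth validating that layout; a minimal sketch (`validate_repository` is a hypothetical helper written for this post, not part of Triton):

```python
import os
import tempfile

def validate_repository(repo_dir):
    """Check each model dir for a config.pbtxt and at least one numeric version dir."""
    problems = []
    for model_name in sorted(os.listdir(repo_dir)):
        model_dir = os.path.join(repo_dir, model_name)
        if not os.path.isdir(model_dir):
            continue
        if not os.path.isfile(os.path.join(model_dir, "config.pbtxt")):
            problems.append(f"{model_name}: missing config.pbtxt")
        versions = [d for d in os.listdir(model_dir)
                    if d.isdigit() and os.path.isdir(os.path.join(model_dir, d))]
        if not versions:
            problems.append(f"{model_name}: no numeric version directory")
    return problems

# demo on a throwaway repository containing one well-formed model
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "ensemble", "1"))
    open(os.path.join(repo, "ensemble", "config.pbtxt"), "w").close()
    print(validate_repository(repo))  # []
```

An empty problem list means the server should at least be able to scan the repository; it does not, of course, validate the config contents themselves.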
```shell
# 3. Launch the Triton server
tritonserver --model-repository ./model_repository --gpu-memory-fraction 0.8
```

Step 2: benchmarks (MMLU/GPQA)

```python
import json
import requests
import lm_eval
from lm_eval import evaluator

# Option 1: call the Triton endpoint directly (single-sample test)
def triton_generate(prompt, max_tokens=100, temperature=0.1):
    """Generate a reply through the Triton server."""
    url = "http://localhost:8000/v2/models/ensemble/generate"
    payload = {
        "text_input": prompt,
        "parameters": {
            "max_tokens": max_tokens,
            "temperature": temperature,
            "top_p": 0.9,
            "stop": ["<|end_of_text|>"],
        },
    }
    headers = {"Content-Type": "application/json"}
    try:
        response = requests.post(url, data=json.dumps(payload),
                                 headers=headers, timeout=30)
        return response.json()["text_output"]
    except Exception as e:
        print(f"Generation failed: {e}")
        return ""

# quick smoke test
test_prompt = "What is 22*3? Explain your reasoning step by step."
response = triton_generate(test_prompt, max_tokens=200)
print(f"Prompt: {test_prompt}\nResponse: {response}")

# Option 2: MMLU/GPQA benchmarks
def run_benchmark():
    """Run the MMLU and GPQA benchmarks."""
    task_list = ["mmlu", "gpqa"]
    # lm-eval configuration
    lm_eval_args = {
        "model": "custom",  # custom model interface
        "model_args": "endpoint=http://localhost:8000/v2/models/ensemble/generate",
        "tasks": ",".join(task_list),
        "batch_size": 8,
        "output_path": "./evaluation_results.json",
        "device": "cuda",
    }

    # custom model loader adapting lm-eval to Triton
    def load_custom_model(model_args):
        class TritonModel:
            def generate(self, prompts, max_tokens=100, temperature=0.1):
                responses = []
                for prompt in prompts:
                    responses.append(triton_generate(prompt, max_tokens, temperature))
                return responses
        return TritonModel()

    # register the custom model
    lm_eval.models.register_model("custom", load_custom_model)

    # run the evaluation
    results = evaluator.simple_evaluate(**lm_eval_args)

    # save the results
    with open("./evaluation_results.json", "w") as f:
        json.dump(results, f, indent=4)

    # print the key metrics
    print("=== Benchmark results ===")
    for task in task_list:
        if task in results["results"]:
            task_res = results["results"][task]
            acc = task_res["acc"] if "acc" in task_res else task_res["exact_match"]
            print(f"{task.upper()} accuracy: {acc * 100:.2f}%")

    # compare with the base model (assumes base-model results were collected beforehand)
    base_model_results = {
        "mmlu": 0.65,  # example value; replace with the real base-model result
        "gpqa": 0.42,  # example value
    }
    print("\n=== Comparison with the base model ===")
    for task in task_list:
        if task in results["results"]:
            task_res = results["results"][task]
            current_acc = task_res["acc"] if "acc" in task_res else task_res["exact_match"]
            improvement = (current_acc - base_model_results[task]) / base_model_results[task] * 100
            print(f"{task.upper()}: base {base_model_results[task] * 100:.2f}% -> "
                  f"fine-tuned {current_acc * 100:.2f}% (improvement {improvement:.2f}%)")

# run the benchmarks
run_benchmark()
```

Key notes:
- Added the model-conversion command and a Triton config example, fixing the original's "endpoint calls only, no deployment details" gap.
- Integrated the mainstream lm-eval tool for automated MMLU/GPQA evaluation, including a comparison against the base model.
- Added exception handling so a failed endpoint call does not abort the whole evaluation run.

Summary
- Data preparation: filter the chat/math subsets with NeMo Curator under strict language/length constraints, apply the official Llama 3.1 chat template, and order the data with a "reasoning type + length" curriculum strategy to secure both data quality and training efficiency.
- Training: configure LoRA (rank 64) on the NeMo framework, reach an effective batch size of 256 via gradient accumulation, train for 2000 steps in bf16 mixed precision to cut VRAM usage, and save both the full model and the LoRA-only weights.
- Evaluation: convert and deploy the model in a Triton-compatible format, run MMLU/GPQA through lm-eval, and verify the reasoning gains by comparing accuracy against the base model.
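As a footnote to the evaluation code: the metric lookup in `run_benchmark` (prefer `"acc"`, fall back to `"exact_match"`) appears twice and can be factored into one helper shared by the printout and the base-model comparison. A minimal sketch over a toy results dict; the dict shape mirrors what the code above reads, while real lm-eval output nests additional metadata:

```python
def extract_accuracy(results, task):
    """Return the task's accuracy, preferring 'acc' and falling back to 'exact_match'."""
    task_res = results["results"].get(task)
    if task_res is None:
        return None
    for key in ("acc", "exact_match"):
        if key in task_res:
            return task_res[key]
    return None

def relative_improvement(current, base):
    """Percentage improvement of the fine-tuned score over the base score."""
    return (current - base) / base * 100

# toy example with the same shape the benchmark code reads
toy_results = {"results": {"mmlu": {"acc": 0.70}, "gpqa": {"exact_match": 0.45}}}
print(extract_accuracy(toy_results, "mmlu"))  # 0.7
print(extract_accuracy(toy_results, "gpqa"))  # 0.45
print(relative_improvement(0.70, 0.65))       # ~7.69 (% over a 0.65 base)
```

Returning `None` for missing tasks (instead of raising a `KeyError`) also makes the reporting loop robust when a benchmark is skipped or fails.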