LLM Notes 3: Training Longformer for Extractive Summarization

Contents

Running on GPU

Generating input token labels from text labels

Multi-sample output text

Saving training loss and the model

Deploying to the server

Building the training set


Running on GPU

  1. Check whether a GPU is available and set the device accordingly.

  2. Move the model and the input tensors to the GPU with `.to(device)`.

  3. Run the computation once all relevant tensors and the model are on the GPU.

# Check whether a GPU is available

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("run on ",device)

# Move the model to the GPU

model.to(device)
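
Step 2 also applies to the input tensors; a minimal sketch, assuming `inputs` is the dict returned by the tokenizer:

# Move every tensor returned by the tokenizer onto the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}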

My machine does not have CUDA configured, so I first try running this on Colab.

Generating input token labels from text labels

A training sample from the earlier example looks like:

labels: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

The docstring in the source states the required label format for training:

labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):

            Labels for computing the token classification loss.

            Indices should be in ``[0, ..., config.num_labels - 1]``.

Constructing token labels

1. Tokenize the training sentence.

2. Tokenize every label phrase in the sentence.

3. For each tokenized label, scan the tokenized training sentence; wherever a match is found, mark those tokens as 1 and leave the rest as 0. If a sentence contains two label phrases, the sentence is scanned twice.

Example:

sentences=["HuggingFace is a company based in Paris and New York." ]

labels_texts=[ ["HuggingFace","York"]]

tokenized_sentence=['ĠHug', 'ging', 'Face', 'Ġis', 'Ġa', 'Ġcompany', 'Ġbased', 'Ġin', 'ĠParis', 'Ġand', 'ĠNew', 'ĠYork', '.']

# labels_cls is tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

Code to generate the token labels:

from transformers import AutoTokenizer

import torch

# Initialize the tokenizer

tokenizer = AutoTokenizer.from_pretrained("tmp/Longformer-finetuned-norm")

sentences = ["HuggingFace is a company based in Paris and New York.","I am a little tiger."]

labels_texts = [["HuggingFace", "York"],["tiger"]]

def tokenize_and_align_labels(sentences, labels_texts, tokenizer):

    tokenized_inputs = tokenizer(sentences, add_special_tokens=False, return_tensors="pt", padding=True, truncation=True)

    all_labels_cls = []

    for i, sentence in enumerate(sentences):

        labels_text = labels_texts[i]

        tokenized_sentence = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"][i])

        labels_cls = [0] * len(tokenized_sentence)

       

        for label_text in labels_text:

            tokenized_label = tokenizer.tokenize(label_text)

            label_length = len(tokenized_label)

           

            for j in range(len(tokenized_sentence) - label_length + 1):  # check whether this span of sentence tokens exactly matches the label tokens

                if tokenized_sentence[j:j + label_length] == tokenized_label:

                    # print("tokenized_sentence:",tokenized_sentence[j:j + label_length])

                    # print("tokenized_label:",tokenized_label)

                    labels_cls[j:j + label_length] = [1] * label_length

       

        all_labels_cls.append(labels_cls)

   

    return tokenized_inputs, torch.tensor(all_labels_cls)

inputs_id, labels_cls = tokenize_and_align_labels(sentences, labels_texts, tokenizer)

Text extracted from a PDF contains extra \n characters.

Replace every \n with a space:

    paper_text = paper_text.replace('\n', ' ')

Strip the references section:

def remove_references(text):

    keywords = ["References", "REFERENCES"]

    for keyword in keywords:

        index = text.find(keyword)

        if index != -1:

            return text[:index].strip()

    return text

paper_text = remove_references(paper_text)

One more problem: when I fed the paper samples in from json, the tokens came out at letter granularity. Comparing the inputs shows that one outer list layer was missing:

descriptions = [df['Dataset Description'].tolist()]
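
A minimal illustration of the bug (values here are hypothetical): if `labels_texts[i]` is a bare string instead of a list of phrases, the inner loop `for label_text in labels_text` iterates character by character, which is exactly the letter-level granularity seen above.

labels_text = "HuggingFace"                # missing the outer list
print([label for label in labels_text])   # ['H', 'u', 'g', 'g', 'i', 'n', 'g', ...]

labels_text = ["HuggingFace"]              # correct: one label phrase per sentence
print([label for label in labels_text])   # ['HuggingFace']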

Multi-sample output text

Change the classification-to-text step into a version that loops over multiple samples.

The code that converts only a single element of predicted_token_class_ids looked like this:

prediction_string=get_prediction_string(predicted_token_class_ids[0])

Now a list comprehension iterates over every element of `predicted_token_class_ids`, calls `get_prediction_string` for each, and collects the results in the `prediction_strings` list.

# Call get_prediction_string for every element of predicted_token_class_ids

prediction_strings = [get_prediction_string(prediction,predicted_inputs_id) for prediction,predicted_inputs_id in zip(predicted_token_class_ids,inputs["input_ids"])]

print("Prediction Strings:", prediction_strings)

def get_prediction_string(predicted_token_class_id, predicted_inputs_id):

    predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_id]

    print("predicted_tokens_classes",predicted_tokens_classes)

    # convert the token classes back into words

    tokenized_sub_sentence = tokenizer.convert_ids_to_tokens(predicted_inputs_id)

    print("tokenized_sub_sentence:", tokenized_sub_sentence)

    # Example classification

    # predicted_tokens_classes=['Non-dataset description', 'Dataset description', 'Non-dataset description', 'Dataset description', 'Non-dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description']

    # collect the words whose tokens are predicted as 'Dataset description'

    dataset_description_words = []

    current_word = ""

    current_word_pred = False

    for token, pred_class in zip(tokenized_sub_sentence, predicted_tokens_classes):

        if token.startswith("Ġ"):

            if len(current_word) != 0 and current_word_pred:  # the previous word exists and contained a description token, so store it

                dataset_description_words.append(current_word)

            current_word = token[1:]

            current_word_pred = (pred_class == 'Dataset description')

            # print("start: ",current_word)

            # print("dataset_description_words: ",dataset_description_words)

            # print("current_word_pred: ",current_word_pred)

        else:

            current_word += token

            current_word_pred = current_word_pred or (pred_class == 'Dataset description')  # not a word start: the word counts if any of its tokens is the positive class

            # print("mid: ",current_word)

            # print("current_word_pred: ",current_word_pred)

    # the last word has no following word-start marker to trigger the check inside the loop, so handle it separately

    if len(current_word) != 0 and current_word_pred:

        dataset_description_words.append(current_word)

    # join all words containing 'Dataset description' tokens into one string

    dataset_description_string = " ".join(dataset_description_words)

    return dataset_description_string

Saving training loss and the model

Batched training: use a DataLoader to feed the data in batches.

Loss curve: save the training loss curve with matplotlib.

Model saving: save the model once training finishes.

from torch.utils.data import DataLoader, TensorDataset

import matplotlib.pyplot as plt

# Create the DataLoader

dataset = TensorDataset(inputs["input_ids"], inputs["attention_mask"], token_labels)

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Training parameters

epochs = 3

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

losses = []

# Train the model

model.train()

for epoch in range(epochs):

    epoch_loss = 0

    for batch in dataloader:

        input_ids, attention_mask, labels = batch

        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

       

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

        # outputs = model(**inputs, labels=labels)

        if isinstance(outputs, tuple):

            loss,logits = outputs

        else:

            loss = outputs.loss

        # loss = outputs.loss

        loss.backward()

       

        optimizer.step()

        optimizer.zero_grad()

       

        epoch_loss += loss.item()

   

    avg_epoch_loss = epoch_loss / len(dataloader)

    losses.append(avg_epoch_loss)

    print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_epoch_loss}")

# Save the training loss curve

plt.plot(range(1, epochs + 1), losses, marker='o')

plt.title('Training Loss')

plt.xlabel('Epoch')

plt.ylabel('Loss')

plt.savefig('output/training_loss.png')

# plt.show()

# Save the model

model.save_pretrained("output/trained_model")

tokenizer.save_pretrained("output/trained_model")

print("Model and tokenizer saved to 'trained_model'")

Saving the loss values from training:

# Write the loss values to a file

with open('output/training_loss.txt', 'w') as loss_file:

    loss_file.write("Epoch, Loss\n")

    # one line per epoch

    for epoch_idx, epoch_avg_loss in enumerate(losses, start=1):

        loss_file.write(f"{epoch_idx}, {epoch_avg_loss}\n")

Deploying to the server

We need to get tests/tkn_clsfy.py running.

The environment setup follows the library we use:

GitHub - allenai/longformer: Longformer: The Long-Document Transformer

Upload the files to the Huawei machine with the OBS tool.

Check the remaining disk space with df -TH.

Documentation for transferring data:

Initial configuration - Object Storage Service (OBS)

For the data transfer, the AK/SK shown here have already been replaced.

1. Download the package

Linux

wget https://obs-community.obs.cn-north-1.myhuaweicloud.com/obsutil/current/obsutil_linux_arm64.tar.gz

Windows download and installation guide:

https://support.huaweicloud.com/utiltg-obs/obs_11_0003.html

2. Extract the archive and set permissions

Linux

tar -zxvf obsutil_linux_arm64.tar.gz

cd obsutil_linux_arm64_5.5.12

chmod 755 obsutil

On Windows, extract manually.

3. Configure authentication (on the command line)

  • Windows

Initialize with a permanent AK/SK:

obsutil config -i=ak -k=sk -e=endpoint

Initialize with a temporary AK/SK and SecurityToken:

obsutil config -i=ak -k=sk -t=token -e=endpoint

  • macOS/Linux

Initialize with a permanent AK/SK:

./obsutil config -i=ak -k=sk -e=endpoint

Initialize with a temporary AK/SK and SecurityToken:

./obsutil config -i=ak -k=sk -t=token -e=endpoint

When configuration completes, the output is:

Config file url:

  C:\Users\laugo\.obsutilconfig

Update config file successfully!

Check connectivity:

  • Windows

obsutil ls -s

  • macOS/Linux

./obsutil ls -s

A successful call returns:

Start at 2024-07-11 08:06:51.1973799 +0000 UTC

obs://public1

Bucket number: 1

Using OBS:

Quick start - Object Storage Service (OBS)

Create a bucket named longformer:

obsutil mb obs://longformer

Start at 2024-07-11 08:19:01.5905845 +0000 UTC

Create bucket [longformer] successfully, request id [00000190A0DFD13C8104F19C0C05A0A8]

Notice: If the configured endpoint is a global domain name, you may need to wait serveral minutes before performing uploading operations on the created bucket. Therefore, configure the endpoint to a regional domain name if you want instant uploading operations on the bucket.

Upload the local D:/Projects/longformer folder to the OBS bucket:

obsutil cp D:/Projects/longformer/ obs://longformer -r -f

Start at 2024-07-11 08:19:39.2002282 +0000 UTC

Parallel:      5                   Jobs:          5

Threshold:     50.00MB             PartSize:      auto

VerifyLength:  false               VerifyMd5:     false

CheckpointDir: C:\Users\laugo\.obsutil_checkpoint

Task id: 9d39ff2a-2f08-4764-8c42-5d7b93a110b0

OutputDir: C:\Users\laugo\.obsutil_output

[-------------------] 100.00% tps:0.31 12.50MB/s 129/129 5.13GB/5.13GB 7m0.746s

Succeed count:      129       Failed count:       0

Succeed bytes:      5.13GB

Metrics [max cost:373973 ms, min cost:7 ms, average cost:11838.48 ms, average tps:0.31, transfered size:5.13GB]

Log in to the server and check the remaining disk space with df -TH:

Filesystem                Type      Size  Used Avail Use% Mounted on

/dev/nvme0n1p1            ext4      3.2T  393G  2.6T  14% /mnt/sdc

/mnt/sdc is the one normally used, and it looks like enough space.

Install obsutil on the server as well to download the files.

List my buckets from obsutil there:

obsutil ls -s

obs://longformer

Download the longformer directory into the local longformer folder:

./obsutil cp obs://longformer/longformer/ /mnt/sdc/longformer -r -f

If it complains that the directory cannot be found, wait a moment; the directory listing seen locally sometimes lags behind the server.

After the download succeeds, cd into the path:

cd /mnt/sdc/longformer/longformer

Set up the virtual environment in this directory again, following the instructions on GitHub.

(If the Huawei machine cannot reach the internet, installing packages is more troublesome.)

Building the training set

Data collection sheet:

https://docs.google.com/spreadsheets/d/1Lt8zjx-XWr9FlYDaFqFMhiJZRs5IXy1yDnsi0Br1Yr8/edit?usp=sharing

ChatGPT extraction example:

https://chatgpt.com/share/df6316b4-feca-4d59-ad8f-1dafff72566d

Parse the text content of each paper from the URLs listed in paper_with_dataset.csv.

Feed the text to the ChatGPT API and ask it to return a list of strings describing the original dataset proposed in the paper. Each sentence must be returned as one string whose characters match the original text exactly, in the format ["description1", "description2", …].

After receiving the GPT result, compare it against the original text to verify that every string is a contiguous character sequence from the paper.

Finally, save the paper texts (paper_texts) together with the corresponding dataset descriptions (dataset_descriptions) to a file, so they can be reloaded as lists next time.

The input list of papers whose descriptions need extracting is stored in paper_with_dataset.csv, with the following content:

url,title,abstract,Dataset Description

https://arxiv.org/pdf/2407.08692,FAR-Trans: An Investment Dataset for Financial Asset Recommendation,"Financial asset recommendation (FAR) is a sub-domain of recommender systems which identifies useful financial securities for investors, with the expectation that they will invest capital on the recommended assets. FAR solutions analyse and learn from multiple data sources, including time series pricing data, customer profile information and expectations, as well as past investments. However, most models have been developed over proprietary datasets, making a comparison over a common benchmark impossible. In this paper, we aim to solve this problem by introducing FAR-Trans, the first public dataset for FAR, containing pricing information and retail investor transactions acquired from a large European financial institution. We also provide a bench-marking comparison between eleven FAR algorithms over the data for use as future baselines. The dataset can be downloaded from this https URL .",

The dataset format to produce:

paper_texts = ["paper1 text", "paper2 text"]

dataset_descriptions = [["description1 in paper1", "description2 in paper1"],["description1 in paper2"]]

Libraries to install:

PyMuPDF to parse the PDF content: pip install pymupdf

requests to download the PDF files

ChatGPT API: pip install openai

References for calling the GPT API:

https://juejin.cn/post/7199293850494091301

https://platform.openai.com/docs/guides/text-generation/chat-completions-api

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(

  model="gpt-3.5-turbo",

  messages=[

    {"role": "system", "content": "You are a helpful assistant."},

    {"role": "user", "content": "Who won the world series in 2020?"},

    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},

    {"role": "user", "content": "Where was it played?"}

  ]

)

Apply for an API key:

https://platform.openai.com/api-keys

Importing openai failed with an error caused by a typing_extensions version conflict:

import openai File , line 6, in <module> from typing_extensions import override ImportError: cannot import name 'override' from 'typing_extensions'

pip uninstall typing_extensions

pip install typing_extensions

Read the CSV file: read paper_with_dataset.csv to obtain each paper's URL and related fields.

import pandas as pd

csv_file = 'paper_with_dataset.csv'

df = pd.read_csv(csv_file)

urls = df['url'].tolist()

titles = df['title'].tolist()

abstracts = df['abstract'].tolist()

Parse the paper text: use requests and PyMuPDF to fetch and parse the PDF content.

import requests
import fitz  # PyMuPDF

# Parse the PDF content

def extract_text_from_pdf(url):

    response = requests.get(url)

    response.raise_for_status()

    pdf_document = 'paper.pdf'

   

    with open(pdf_document, 'wb') as f:

        f.write(response.content)

   

    doc = fitz.open(pdf_document)

    paper_text = ""

    for page_num in range(len(doc)):

        page = doc.load_page(page_num)

        paper_text += page.get_text()

    return paper_text

# Collect the paper texts

paper_texts = []

for url in urls:

    text = extract_text_from_pdf(url)

    paper_texts.append(text)

Call the ChatGPT API: send the paper text to ChatGPT and ask it to return the descriptions of the original dataset in each paper.

def get_dataset_descriptions(paper_text):

    response = openai.Completion.create(

        engine="gpt-4",

        prompt=(

            "Extract all the sentences from the following paper text that describe the dataset proposed by the authorss. "

            "Each sentence should be preserved exactly as it appears in the text. Return the sentences as a list of strings: [\"description1\", \"description2\", …]\n\n"

            f"{paper_text}"

        ),

        max_tokens=max_tokens,

        # n=1,

        # stop=None,

        # temperature=0.5

    )

    return response.choices[0].text.strip().split('\n')

papers_dataset_descriptions = []

for text in paper_texts:

    descriptions = get_dataset_descriptions(text)

    papers_dataset_descriptions.append(descriptions)

This errors out due to the library version:

, in __call__

    raise APIRemovedInV1(symbol=self._symbol)

openai.lib._old_api.APIRemovedInV1:

You tried to access openai.Completion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface.

Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`

A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742

Example for the new API:

https://github.com/openai/openai-python/discussions/742

import openai

# optional; defaults to `os.environ['OPENAI_API_KEY']`

openai.api_key = '...'

# all client options can be configured just like the `OpenAI` instantiation counterpart

openai.base_url = "https://..."

openai.default_headers = {"x-foo": "true"}

completion = openai.chat.completions.create(

    model="gpt-4",

    messages=[

        {

            "role": "user",

            "content": "How do I output all files in a directory using Python?",

        },

    ],

)

print(completion.choices[0].message.content)

Replace openai.Completion.create with openai.chat.completions.create.

The engine parameter becomes model.

Turn the original prompt into a messages list, as the new interface requires.

Read the response content from completion.choices[0].message.content instead of response.choices[0].text.

def get_dataset_descriptions(paper_text):

    completion = openai.chat.completions.create(

        model="gpt-3.5-turbo",

        messages=[

            {

                "role": "user",

                "content": (

                    "Extract all the sentences from the following paper text that describe the dataset proposed by the authors. "

                    "Each sentence should be preserved exactly as it appears in the text. Return the sentences as a list of strings: [\"description1\", \"description2\", …]\n\n"

                    f"{paper_text}"

                ),

            },

        ],

        max_tokens=max_tokens,

    )

    return completion.choices[0].message.content.strip().split('\n')
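
Note that the prompt asks for a bracketed list while the code above simply splits the reply on newlines; a hedged parsing sketch (the helper name is mine) that tries JSON first and falls back to line splitting:

import json

def parse_description_list(raw_reply):
    # Try to interpret the reply as a JSON list of strings first
    try:
        parsed = json.loads(raw_reply)
        if isinstance(parsed, list):
            return [str(item).strip() for item in parsed]
    except json.JSONDecodeError:
        pass
    # Fall back to one description per line, stripping stray quotes
    return [line.strip().strip('"') for line in raw_reply.splitlines() if line.strip()]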

Validate that the descriptions are contiguous: make sure every description returned by ChatGPT appears as a contiguous character sequence in the original text.

def validate_descriptions(paper_texts, dataset_descriptions):

    validated_descriptions = []

    for paper_text, descriptions in zip(paper_texts, dataset_descriptions):

        valid_descriptions = []

        for description in descriptions:

            if description in paper_text:

                valid_descriptions.append(description)

        validated_descriptions.append(valid_descriptions)

    return validated_descriptions

validated_descriptions = validate_descriptions(paper_texts, papers_dataset_descriptions)

Later this step could be switched to fuzzy matching: if a sentence in paper_text is similar to a returned description, return that sentence from paper_text instead (SequenceMatcher from the difflib library).

from difflib import SequenceMatcher

# Fuzzy matching to validate the descriptions

def validate_descriptions(paper_texts, dataset_descriptions, threshold=0.8):

    validated_descriptions = []

    for paper_text, descriptions in zip(paper_texts, dataset_descriptions):

        valid_descriptions = []

        for description in descriptions:

            best_match = find_best_match(description, paper_text, threshold)

            if best_match:

                valid_descriptions.append(best_match)

        validated_descriptions.append(valid_descriptions)

    return validated_descriptions
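
`find_best_match` is referenced above but not defined; a minimal sketch, assuming a rough sentence split of paper_text and SequenceMatcher similarity:

from difflib import SequenceMatcher  # already imported above

def find_best_match(description, paper_text, threshold=0.8):
    # Compare the description against each roughly split sentence of the paper
    best_sentence, best_ratio = None, 0.0
    for sentence in paper_text.split('. '):
        ratio = SequenceMatcher(None, description, sentence).ratio()
        if ratio > best_ratio:
            best_sentence, best_ratio = sentence, ratio
    # Accept the match only if it clears the similarity threshold
    return best_sentence if best_ratio >= threshold else None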

Save the results: store each paper's full text and its dataset descriptions in a file.

Since the number of description passages varies, writing lists of unequal length to CSV and reading them back is awkward, so JSON is used instead.

import json

result = {

    "paper_texts": paper_texts,

    "dataset_descriptions": validated_descriptions

}

with open('papers_and_datasets.json', 'w') as f:

    json.dump(result, f)
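
Since the point of the file is to reload these lists next time, a quick read-back sketch (same file name assumed):

import json

# Reload the saved texts and descriptions for the next run
with open('papers_and_datasets.json', 'r') as f:
    data = json.load(f)
paper_texts = data["paper_texts"]
dataset_descriptions = data["dataset_descriptions"]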
