LLM Notes 3: Training Longformer for Extractive Summarization

Contents

Running on GPU

Generating input token labels from text labels

Multi-sample output text

Saving training loss and the model

Deploying to the server

Building the training set


Running on GPU

  1. Check whether a GPU is available and set the device accordingly.

  2. Move the model and the input tensors to the GPU with `.to(device)`.

  3. Run the computation once all relevant tensors and the model are on the GPU.

# Check whether a GPU is available

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("run on ",device)

# Move the model to the GPU

model.to(device)
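
Step 2 also applies to the input tensors; a minimal sketch, assuming `inputs` is the dict returned by the tokenizer:

# Move every tensor returned by the tokenizer onto the same device as the model
inputs = {k: v.to(device) for k, v in inputs.items()}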

My machine does not have CUDA configured, so I first try running this on Colab.

Generating input token labels from text labels

A training sample from the earlier example looks like:

labels: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

The docstring in the source states the required label format for training:

labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):

            Labels for computing the token classification loss.

            Indices should be in ``[0, ..., config.num_labels - 1]``.

Constructing token labels

1. Tokenize the training sentence.

2. Tokenize every label phrase in the sentence.

3. For each tokenized label, scan the tokenized training sentence; wherever a match is found, mark those tokens as 1 and leave the rest as 0. If a sentence contains two label phrases, the sentence is scanned twice.

Example:

sentences=["HuggingFace is a company based in Paris and New York." ]

labels_texts=[ ["HuggingFace","York"]]

tokenized_sentence=['ĠHug', 'ging', 'Face', 'Ġis', 'Ġa', 'Ġcompany', 'Ġbased', 'Ġin', 'ĠParis', 'Ġand', 'ĠNew', 'ĠYork', '.']

# labels_cls is tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

Code to generate the token labels:

from transformers import AutoTokenizer

import torch

# Initialize the tokenizer

tokenizer = AutoTokenizer.from_pretrained("tmp/Longformer-finetuned-norm")

sentences = ["HuggingFace is a company based in Paris and New York.","I am a little tiger."]

labels_texts = [["HuggingFace", "York"],["tiger"]]

def tokenize_and_align_labels(sentences, labels_texts, tokenizer):

    tokenized_inputs = tokenizer(sentences, add_special_tokens=False, return_tensors="pt", padding=True, truncation=True)

    all_labels_cls = []

    for i, sentence in enumerate(sentences):

        labels_text = labels_texts[i]

        tokenized_sentence = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"][i])

        labels_cls = [0] * len(tokenized_sentence)

       

        for label_text in labels_text:

            tokenized_label = tokenizer.tokenize(label_text)

            label_length = len(tokenized_label)

           

            for j in range(len(tokenized_sentence) - label_length + 1):  # check whether this span of sentence tokens exactly matches the label tokens

                if tokenized_sentence[j:j + label_length] == tokenized_label:

                    # print("tokenized_sentence:",tokenized_sentence[j:j + label_length])

                    # print("tokenized_label:",tokenized_label)

                    labels_cls[j:j + label_length] = [1] * label_length

       

        all_labels_cls.append(labels_cls)

   

    return tokenized_inputs, torch.tensor(all_labels_cls)

inputs_id, labels_cls = tokenize_and_align_labels(sentences, labels_texts, tokenizer)

Text extracted from a PDF contains extra \n characters.

Replace every \n with a space:

    paper_text = paper_text.replace('\n', ' ')

Strip the references section:

def remove_references(text):

    keywords = ["References", "REFERENCES"]

    for keyword in keywords:

        index = text.find(keyword)

        if index != -1:

            return text[:index].strip()

    return text

paper_text = remove_references(paper_text)

One more problem: when I fed the paper samples in from json, the tokens came out at letter granularity. Comparing the inputs shows that one outer list layer was missing:

descriptions = [df['Dataset Description'].tolist()]
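
A minimal illustration of the bug (values here are hypothetical): if `labels_texts[i]` is a bare string instead of a list of phrases, the inner loop `for label_text in labels_text` iterates character by character, which is exactly the letter-level granularity seen above.

labels_text = "HuggingFace"                # missing the outer list
print([label for label in labels_text])   # ['H', 'u', 'g', 'g', 'i', 'n', 'g', ...]

labels_text = ["HuggingFace"]              # correct: one label phrase per sentence
print([label for label in labels_text])   # ['HuggingFace']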

Multi-sample output text

Change the classification-to-text step into a version that loops over multiple samples.

The code that converts only a single element of predicted_token_class_ids looked like this:

prediction_string=get_prediction_string(predicted_token_class_ids[0])

Now a list comprehension iterates over every element of `predicted_token_class_ids`, calls `get_prediction_string` for each, and collects the results in the `prediction_strings` list.

# Call get_prediction_string for every element of predicted_token_class_ids

prediction_strings = [get_prediction_string(prediction,predicted_inputs_id) for prediction,predicted_inputs_id in zip(predicted_token_class_ids,inputs["input_ids"])]

print("Prediction Strings:", prediction_strings)

def get_prediction_string(predicted_token_class_id, predicted_inputs_id):

    predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_id]

    print("predicted_tokens_classes",predicted_tokens_classes)

    # convert the token classes back into words

    tokenized_sub_sentence = tokenizer.convert_ids_to_tokens(predicted_inputs_id)

    print("tokenized_sub_sentence:", tokenized_sub_sentence)

    # Example classification

    # predicted_tokens_classes=['Non-dataset description', 'Dataset description', 'Non-dataset description', 'Dataset description', 'Non-dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description', 'Dataset description']

    # collect the words whose tokens are predicted as 'Dataset description'

    dataset_description_words = []

    current_word = ""

    current_word_pred = False

    for token, pred_class in zip(tokenized_sub_sentence, predicted_tokens_classes):

        if token.startswith("Ġ"):

            if len(current_word) != 0 and current_word_pred:  # the previous word exists and contained a description token, so store it

                dataset_description_words.append(current_word)

            current_word = token[1:]

            current_word_pred = (pred_class == 'Dataset description')

            # print("start: ",current_word)

            # print("dataset_description_words: ",dataset_description_words)

            # print("current_word_pred: ",current_word_pred)

        else:

            current_word += token

            current_word_pred = current_word_pred or (pred_class == 'Dataset description')  # not a word start: the word counts if any of its tokens is the positive class

            # print("mid: ",current_word)

            # print("current_word_pred: ",current_word_pred)

    # the last word has no following word-start marker to trigger the check inside the loop, so handle it separately

    if len(current_word) != 0 and current_word_pred:

        dataset_description_words.append(current_word)

    # join all words containing 'Dataset description' tokens into one string

    dataset_description_string = " ".join(dataset_description_words)

    return dataset_description_string

Saving training loss and the model

Batched training: use a DataLoader to feed the data in batches.

Loss curve: save the training loss curve with matplotlib.

Model saving: save the model once training finishes.

from torch.utils.data import DataLoader, TensorDataset

import matplotlib.pyplot as plt

# Create the DataLoader

dataset = TensorDataset(inputs["input_ids"], inputs["attention_mask"], token_labels)

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Training parameters

epochs = 3

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

losses = []

# Train the model

model.train()

for epoch in range(epochs):

    epoch_loss = 0

    for batch in dataloader:

        input_ids, attention_mask, labels = batch

        input_ids, attention_mask, labels = input_ids.to(device), attention_mask.to(device), labels.to(device)

       

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

        # outputs = model(**inputs, labels=labels)

        if isinstance(outputs, tuple):

            loss,logits = outputs

        else:

            loss = outputs.loss

        # loss = outputs.loss

        loss.backward()

       

        optimizer.step()

        optimizer.zero_grad()

       

        epoch_loss += loss.item()

   

    avg_epoch_loss = epoch_loss / len(dataloader)

    losses.append(avg_epoch_loss)

    print(f"Epoch {epoch + 1}/{epochs}, Loss: {avg_epoch_loss}")

# Save the training loss curve

plt.plot(range(1, epochs + 1), losses, marker='o')

plt.title('Training Loss')

plt.xlabel('Epoch')

plt.ylabel('Loss')

plt.savefig('output/training_loss.png')

# plt.show()

# Save the model

model.save_pretrained("output/trained_model")

tokenizer.save_pretrained("output/trained_model")

print("Model and tokenizer saved to 'trained_model'")

Saving the loss values from training:

# Write the loss values to a file

with open('output/training_loss.txt', 'w') as loss_file:

    loss_file.write("Epoch, Loss\n")

    # one line per epoch

    for epoch_idx, epoch_avg_loss in enumerate(losses, start=1):

        loss_file.write(f"{epoch_idx}, {epoch_avg_loss}\n")

Deploying to the server

We need to get tests/tkn_clsfy.py running.

The environment setup follows the library we use:

GitHub - allenai/longformer: Longformer: The Long-Document Transformer

Upload the files to the Huawei machine with the OBS tool.

Check the remaining disk space with df -TH.

Documentation for transferring data:

Initial configuration - Object Storage Service (OBS)

For the data transfer, the AK/SK shown here have already been replaced.

1. Download the package

Linux

wget https://obs-community.obs.cn-north-1.myhuaweicloud.com/obsutil/current/obsutil_linux_arm64.tar.gz

Windows download and installation guide:

https://support.huaweicloud.com/utiltg-obs/obs_11_0003.html

2. Extract the archive and set permissions

Linux

tar -zxvf obsutil_linux_arm64.tar.gz

cd obsutil_linux_arm64_5.5.12

chmod 755 obsutil

On Windows, extract manually.

3. Configure authentication (on the command line)

  • Windows

Initialize with a permanent AK/SK:

obsutil config -i=ak -k=sk -e=endpoint

Initialize with a temporary AK/SK and SecurityToken:

obsutil config -i=ak -k=sk -t=token -e=endpoint

  • macOS/Linux

Initialize with a permanent AK/SK:

./obsutil config -i=ak -k=sk -e=endpoint

Initialize with a temporary AK/SK and SecurityToken:

./obsutil config -i=ak -k=sk -t=token -e=endpoint

When configuration completes, the output is:

Config file url:

  C:\Users\laugo\.obsutilconfig

Update config file successfully!

Check connectivity:

  • Windows

obsutil ls -s

  • macOS/Linux

./obsutil ls -s

A successful call returns:

Start at 2024-07-11 08:06:51.1973799 +0000 UTC

obs://public1

Bucket number: 1

Using OBS:

Quick start - Object Storage Service (OBS)

Create a bucket named longformer:

obsutil mb obs://longformer

Start at 2024-07-11 08:19:01.5905845 +0000 UTC

Create bucket [longformer] successfully, request id [00000190A0DFD13C8104F19C0C05A0A8]

Notice: If the configured endpoint is a global domain name, you may need to wait serveral minutes before performing uploading operations on the created bucket. Therefore, configure the endpoint to a regional domain name if you want instant uploading operations on the bucket.

Upload the local D:/Projects/longformer folder to the OBS bucket:

obsutil cp D:/Projects/longformer/ obs://longformer -r -f

Start at 2024-07-11 08:19:39.2002282 +0000 UTC

Parallel:      5                   Jobs:          5

Threshold:     50.00MB             PartSize:      auto

VerifyLength:  false               VerifyMd5:     false

CheckpointDir: C:\Users\laugo\.obsutil_checkpoint

Task id: 9d39ff2a-2f08-4764-8c42-5d7b93a110b0

OutputDir: C:\Users\laugo\.obsutil_output

[-------------------] 100.00% tps:0.31 12.50MB/s 129/129 5.13GB/5.13GB 7m0.746s

Succeed count:      129       Failed count:       0

Succeed bytes:      5.13GB

Metrics [max cost:373973 ms, min cost:7 ms, average cost:11838.48 ms, average tps:0.31, transfered size:5.13GB]

Log in to the server and check the remaining disk space with df -TH:

Filesystem                Type      Size  Used Avail Use% Mounted on

/dev/nvme0n1p1            ext4      3.2T  393G  2.6T  14% /mnt/sdc

/mnt/sdc is the one normally used, and it looks like enough space.

Install obsutil on the server as well to download the files.

List my buckets from obsutil there:

obsutil ls -s

obs://longformer

Download the longformer directory into the local longformer folder:

./obsutil cp obs://longformer/longformer/ /mnt/sdc/longformer -r -f

If it complains that the directory cannot be found, wait a moment; the directory listing seen locally sometimes lags behind the server.

After the download succeeds, cd into the path:

cd /mnt/sdc/longformer/longformer

Set up the virtual environment in this directory again, following the instructions on GitHub.

(If the Huawei machine cannot reach the internet, installing packages is more troublesome.)

Building the training set

Data collection sheet:

https://docs.google.com/spreadsheets/d/1Lt8zjx-XWr9FlYDaFqFMhiJZRs5IXy1yDnsi0Br1Yr8/edit?usp=sharing

ChatGPT extraction example:

https://chatgpt.com/share/df6316b4-feca-4d59-ad8f-1dafff72566d

Parse the text content of each paper from the URLs listed in paper_with_dataset.csv.

Feed the text to the ChatGPT API and ask it to return a list of strings describing the original dataset proposed in the paper. Each sentence must be returned as one string whose characters match the original text exactly, in the format ["description1", "description2", …].

After receiving the GPT result, compare it against the original text to verify that every string is a contiguous character sequence from the paper.

Finally, save the paper texts (paper_texts) together with the corresponding dataset descriptions (dataset_descriptions) to a file, so they can be reloaded as lists next time.

The input list of papers whose descriptions need extracting is stored in paper_with_dataset.csv, with the following content:

url,title,abstract,Dataset Description

https://arxiv.org/pdf/2407.08692,FAR-Trans: An Investment Dataset for Financial Asset Recommendation,"Financial asset recommendation (FAR) is a sub-domain of recommender systems which identifies useful financial securities for investors, with the expectation that they will invest capital on the recommended assets. FAR solutions analyse and learn from multiple data sources, including time series pricing data, customer profile information and expectations, as well as past investments. However, most models have been developed over proprietary datasets, making a comparison over a common benchmark impossible. In this paper, we aim to solve this problem by introducing FAR-Trans, the first public dataset for FAR, containing pricing information and retail investor transactions acquired from a large European financial institution. We also provide a bench-marking comparison between eleven FAR algorithms over the data for use as future baselines. The dataset can be downloaded from this https URL .",

The dataset format to produce:

paper_texts = ["paper1 text", "paper2 text"]

dataset_descriptions = [["description1 in paper1", "description2 in paper1"],["description1 in paper2"]]

Libraries to install:

PyMuPDF to parse the PDF content: pip install pymupdf

requests to download the PDF files

ChatGPT API: pip install openai

References for calling the GPT API:

https://juejin.cn/post/7199293850494091301

https://platform.openai.com/docs/guides/text-generation/chat-completions-api

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(

  model="gpt-3.5-turbo",

  messages=[

    {"role": "system", "content": "You are a helpful assistant."},

    {"role": "user", "content": "Who won the world series in 2020?"},

    {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},

    {"role": "user", "content": "Where was it played?"}

  ]

)

Apply for an API key:

https://platform.openai.com/api-keys

Importing openai failed with an error caused by a typing_extensions version conflict:

import openai File , line 6, in <module> from typing_extensions import override ImportError: cannot import name 'override' from 'typing_extensions'

pip uninstall typing_extensions

pip install typing_extensions

Read the CSV file: read paper_with_dataset.csv to obtain each paper's URL and related fields.

import pandas as pd

csv_file = 'paper_with_dataset.csv'

df = pd.read_csv(csv_file)

urls = df['url'].tolist()

titles = df['title'].tolist()

abstracts = df['abstract'].tolist()

Parse the paper text: use requests and PyMuPDF to fetch and parse the PDF content.

import requests
import fitz  # PyMuPDF

# Parse the PDF content

def extract_text_from_pdf(url):

    response = requests.get(url)

    response.raise_for_status()

    pdf_document = 'paper.pdf'

   

    with open(pdf_document, 'wb') as f:

        f.write(response.content)

   

    doc = fitz.open(pdf_document)

    paper_text = ""

    for page_num in range(len(doc)):

        page = doc.load_page(page_num)

        paper_text += page.get_text()

    return paper_text

# Collect the paper texts

paper_texts = []

for url in urls:

    text = extract_text_from_pdf(url)

    paper_texts.append(text)

Call the ChatGPT API: send the paper text to ChatGPT and ask it to return the descriptions of the original dataset in each paper.

def get_dataset_descriptions(paper_text):

    response = openai.Completion.create(

        engine="gpt-4",

        prompt=(

            "Extract all the sentences from the following paper text that describe the dataset proposed by the authorss. "

            "Each sentence should be preserved exactly as it appears in the text. Return the sentences as a list of strings: [\"description1\", \"description2\", …]\n\n"

            f"{paper_text}"

        ),

        max_tokens=max_tokens,

        # n=1,

        # stop=None,

        # temperature=0.5

    )

    return response.choices[0].text.strip().split('\n')

papers_dataset_descriptions = []

for text in paper_texts:

    descriptions = get_dataset_descriptions(text)

    papers_dataset_descriptions.append(descriptions)

This errors out due to the library version:

, in __call__

    raise APIRemovedInV1(symbol=self._symbol)

openai.lib._old_api.APIRemovedInV1:

You tried to access openai.Completion, but this is no longer supported in openai>=1.0.0 - see the README at https://github.com/openai/openai-python for the API.

You can run `openai migrate` to automatically upgrade your codebase to use the 1.0.0 interface.

Alternatively, you can pin your installation to the old version, e.g. `pip install openai==0.28`

A detailed migration guide is available here: https://github.com/openai/openai-python/discussions/742

Example for the new API:

https://github.com/openai/openai-python/discussions/742

import openai

# optional; defaults to `os.environ['OPENAI_API_KEY']`

openai.api_key = '...'

# all client options can be configured just like the `OpenAI` instantiation counterpart

openai.base_url = "https://..."

openai.default_headers = {"x-foo": "true"}

completion = openai.chat.completions.create(

    model="gpt-4",

    messages=[

        {

            "role": "user",

            "content": "How do I output all files in a directory using Python?",

        },

    ],

)

print(completion.choices[0].message.content)

Replace openai.Completion.create with openai.chat.completions.create.

The engine parameter becomes model.

Turn the original prompt into a messages list, as the new interface requires.

Read the response content from completion.choices[0].message.content instead of response.choices[0].text.

def get_dataset_descriptions(paper_text):

    completion = openai.chat.completions.create(

        model="gpt-3.5-turbo",

        messages=[

            {

                "role": "user",

                "content": (

                    "Extract all the sentences from the following paper text that describe the dataset proposed by the authors. "

                    "Each sentence should be preserved exactly as it appears in the text. Return the sentences as a list of strings: [\"description1\", \"description2\", …]\n\n"

                    f"{paper_text}"

                ),

            },

        ],

        max_tokens=max_tokens,

    )

    return completion.choices[0].message.content.strip().split('\n')
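
Note that the prompt asks for a bracketed list while the code above simply splits the reply on newlines; a hedged parsing sketch (the helper name is mine) that tries JSON first and falls back to line splitting:

import json

def parse_description_list(raw_reply):
    # Try to interpret the reply as a JSON list of strings first
    try:
        parsed = json.loads(raw_reply)
        if isinstance(parsed, list):
            return [str(item).strip() for item in parsed]
    except json.JSONDecodeError:
        pass
    # Fall back to one description per line, stripping stray quotes
    return [line.strip().strip('"') for line in raw_reply.splitlines() if line.strip()]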

Validate that the descriptions are contiguous: make sure every description returned by ChatGPT appears as a contiguous character sequence in the original text.

def validate_descriptions(paper_texts, dataset_descriptions):

    validated_descriptions = []

    for paper_text, descriptions in zip(paper_texts, dataset_descriptions):

        valid_descriptions = []

        for description in descriptions:

            if description in paper_text:

                valid_descriptions.append(description)

        validated_descriptions.append(valid_descriptions)

    return validated_descriptions

validated_descriptions = validate_descriptions(paper_texts, papers_dataset_descriptions)

Later this step could be switched to fuzzy matching: if a sentence in paper_text is similar to a returned description, return that sentence from paper_text instead (SequenceMatcher from the difflib library).

from difflib import SequenceMatcher

# Fuzzy matching to validate the descriptions

def validate_descriptions(paper_texts, dataset_descriptions, threshold=0.8):

    validated_descriptions = []

    for paper_text, descriptions in zip(paper_texts, dataset_descriptions):

        valid_descriptions = []

        for description in descriptions:

            best_match = find_best_match(description, paper_text, threshold)

            if best_match:

                valid_descriptions.append(best_match)

        validated_descriptions.append(valid_descriptions)

    return validated_descriptions
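
`find_best_match` is referenced above but not defined; a minimal sketch, assuming a rough sentence split of paper_text and SequenceMatcher similarity:

from difflib import SequenceMatcher  # already imported above

def find_best_match(description, paper_text, threshold=0.8):
    # Compare the description against each roughly split sentence of the paper
    best_sentence, best_ratio = None, 0.0
    for sentence in paper_text.split('. '):
        ratio = SequenceMatcher(None, description, sentence).ratio()
        if ratio > best_ratio:
            best_sentence, best_ratio = sentence, ratio
    # Accept the match only if it clears the similarity threshold
    return best_sentence if best_ratio >= threshold else None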

Save the results: store each paper's full text and its dataset descriptions in a file.

Since the number of description passages varies, writing lists of unequal length to CSV and reading them back is awkward, so JSON is used instead.

import json

result = {

    "paper_texts": paper_texts,

    "dataset_descriptions": validated_descriptions

}

with open('papers_and_datasets.json', 'w') as f:

    json.dump(result, f)
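
Since the point of the file is to reload these lists next time, a quick read-back sketch (same file name assumed):

import json

# Reload the saved texts and descriptions for the next run
with open('papers_and_datasets.json', 'r') as f:
    data = json.load(f)
paper_texts = data["paper_texts"]
dataset_descriptions = data["dataset_descriptions"]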
