TorchProtein Tutorial -- Protein Data Structure (2)

Source: original, 2024/5/6 4:26:30

TorchProtein Tutorial – Protein Data Structure (1)

This tutorial comes from TorchProtein, the open-source framework developed by Jian Tang's team.

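Protein Data Structure

In this tutorial, we will learn the basic protein data structures used in TorchProtein. In TorchProtein, a protein is treated as a special case of the general graph in TorchDrug, because both the primary structure of a protein (its amino acid sequence) and its tertiary structure (its 3D folded structure) can be viewed as graphs that take atoms or residues as nodes and build edges with different construction methods.

Before starting, we recommend reading the notes on the graph data structure in TorchDrug.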

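Protein Data Structure I/O

Protein structure information is usually obtained from PDB files, a standard data format for describing protein structures. In this tutorial, we use single-chain insulin (PDB id: 2LWZ) as an example. Let us first visualize it with NGLView.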

import nglview
view = nglview.show_pdbid("2lwz")  
view
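
For readers who have not looked inside a PDB file, the sketch below shows how a single ATOM record is laid out in fixed-width columns. It uses only the Python standard library. The helper names `format_atom_line` and `parse_atom_line` are hypothetical, and the coordinates are made-up illustrative values, not data taken from 2LWZ.

```python
# Minimal sketch of the fixed-column ATOM record used by PDB files.
# Helper names and coordinate values are illustrative assumptions.

def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, element):
    """Write one ATOM record using the fixed column widths of the PDB format."""
    # Atom names of one-letter elements conventionally start in column 14.
    name_field = name if len(name) == 4 else " " + name.ljust(3)
    return (
        "ATOM  "                      # columns 1-6: record name
        f"{serial:>5} "               # columns 7-11: atom serial number
        f"{name_field}"               # columns 13-16: atom name
        " "                           # column 17: alternate location
        f"{res_name:>3} "             # columns 18-20: residue name
        f"{chain}"                    # column 22: chain identifier
        f"{res_seq:>4}"               # columns 23-26: residue sequence number
        "    "                        # column 27 (insertion code) plus 3 blanks
        f"{x:8.3f}{y:8.3f}{z:8.3f}"   # columns 31-54: x, y, z in angstroms
        f"{1.0:6.2f}{0.0:6.2f}"       # columns 55-66: occupancy, B-factor
        "          "                  # columns 67-76: unused in this sketch
        f"{element:>2}"               # columns 77-78: element symbol
    )


def parse_atom_line(line):
    """Read the fields back by slicing the same fixed columns."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "position": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),
    }


line = format_atom_line(1, "CA", "PHE", "A", 1, -8.512, 5.23, -2.31, "C")
atom = parse_atom_line(line)
print(line)
print(atom)
```

A reader such as `Protein.from_pdb` has to handle many more record types and edge cases. The point here is only that residue names, chain IDs, and coordinates all come from fixed positions in each line.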

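Building a Protein Data Structure from a PDB File

In TorchProtein, we can use Protein.from_pdb to read a PDB file and build the data structure. Atom, bond, and residue features can serve as inputs to machine learning models. We can select different features by changing the arguments of Protein.from_pdb.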

import torchdrug as td
from torchdrug import data, utils

pdb_file = utils.download("https://files.rcsb.org/download/2LWZ.pdb", "./")
protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
print(protein)
print(protein.residue_feature.shape)
print(protein.atom_feature.shape)
print(protein.bond_feature.shape)

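The constructed data structure contains rich information about the protein. For example, you can get the residue type and chain ID of the first 10 residues and the 3D coordinates of the first 10 atoms as follows.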

for residue_id, chain_id in zip(protein.residue_type.tolist()[:10], protein.chain_id.tolist()[:10]):
    print("%s: %s" % (data.Protein.id2residue[residue_id], chain_id))

for atom, position in zip(protein.atom_name.tolist()[:10], protein.node_position.tolist()[:10]):
    print("%s: %s" % (data.Protein.id2atom_name[atom], position))

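The protein data structure stores all the information needed to recover the protein, and it provides a to_pdb() method to save the protein in PDB format. Below, we write single-chain insulin to a new PDB file, load it back, and visualize the recovered structure.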

from rdkit import Chem

protein.to_pdb("new_2LWZ.pdb")
mol = Chem.MolFromPDBFile("new_2LWZ.pdb")
view = nglview.show_rdkit(mol)
view

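Building a Protein Data Structure from a Protein Sequence

In some applications we may only have access to the amino acid sequence of a protein. For these cases, TorchProtein provides two methods, Protein.from_sequence and Protein.from_sequence_fast, to build the protein data structure from a sequence.

The former uses RDKit to build the protein object and computes atom, residue, and bond features, so it is slower. The latter directly builds a protein data structure that contains only residue types and residue features, so it is faster.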

import time

aa_seq = protein.to_sequence()
print(aa_seq)
start_time = time.time()
seq_protein = data.Protein.from_sequence(aa_seq, atom_feature="symbol", bond_feature="length", residue_feature="symbol")
end_time = time.time()
print("Duration of construction: ", end_time - start_time)
print(seq_protein)

start_time = time.time()
# Fast path: skip RDKit and build residue-level information only.
seq_protein = data.Protein.from_sequence_fast(aa_seq, atom_feature=None, bond_feature=None, residue_feature="default")
end_time = time.time()
print("Duration of construction: ", end_time - start_time)
print(seq_protein)
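
To see why a residue-only construction can be cheap, here is a minimal sketch in plain Python of what it conceptually needs to do: look up an integer type for each letter and build a one-hot feature row. The alphabet ordering, the `encode_sequence` helper, and the example sequence are assumptions made for illustration; TorchProtein's own vocabulary and feature functions may differ.

```python
# Sketch: turn a one-letter amino acid sequence into residue ids and one-hot features.
# The ordering of AMINO_ACIDS is an assumption, not TorchProtein's vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SYMBOL2ID = {symbol: index for index, symbol in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence):
    """Return (residue_type, one_hot) for a sequence of standard residues."""
    residue_type = [SYMBOL2ID[symbol] for symbol in sequence]
    one_hot = [
        [1 if column == type_id else 0 for column in range(len(AMINO_ACIDS))]
        for type_id in residue_type
    ]
    return residue_type, one_hot


residue_type, residue_feature = encode_sequence("FVNQHLCGSH")
print(residue_type)
print(len(residue_feature), len(residue_feature[0]))
```

No atoms or bonds are generated here, which is the part that makes the RDKit-based path slower.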

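Protein Operations

Protein Batch

To make full use of the hardware, TorchProtein inherits the data.Graph structure from TorchDrug and supports processing multiple proteins as a batch. A batch can be moved between CPU and GPU with the cpu() and cuda() methods. Given several proteins, we can build a protein batch with data.Protein.pack and transfer it from CPU to GPU with cuda(). We can also extract specific proteins from the batch with ordinary indexing.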

proteins = [protein] * 3
proteins = data.Protein.pack(proteins)
print(proteins)
proteins = proteins.cuda()  # requires a CUDA-capable GPU
print(proteins)
proteins_ = proteins[[0, 2]]
print(proteins_)
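
The idea behind packing can be illustrated without any library: the graphs are concatenated into one large graph, and each graph's edge indices are shifted by the number of nodes that precede it. The following is a simplified sketch with hypothetical `pack` and `unpack` helpers on toy dictionaries; it is not how TorchDrug implements its packed graphs internally.

```python
from itertools import accumulate


def pack(graphs):
    """Concatenate graphs, shifting edge indices by the running node count."""
    num_nodes = [graph["num_node"] for graph in graphs]
    offsets = [0] + list(accumulate(num_nodes))[:-1]
    edges = [
        (u + offset, v + offset)
        for graph, offset in zip(graphs, offsets)
        for u, v in graph["edges"]
    ]
    return {"num_nodes": num_nodes, "offsets": offsets, "edges": edges}


def unpack(batch, index):
    """Recover one graph from the batch by undoing its offset."""
    offset = batch["offsets"][index]
    num_node = batch["num_nodes"][index]
    edges = [
        (u - offset, v - offset)
        for u, v in batch["edges"]
        if offset <= u < offset + num_node
    ]
    return {"num_node": num_node, "edges": edges}


toy = {"num_node": 3, "edges": [(0, 1), (1, 2)]}
batch = pack([toy] * 3)
print(batch["edges"])
print(unpack(batch, 2))
```

Because the batch is a single graph with disconnected components, one forward pass of a graph model can process all proteins at once.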

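References between Atoms and Residues

TorchProtein provides the atom2residue attribute to look up the residue that each atom belongs to, and the residue2atom method to retrieve the atoms associated with a given residue. Typical usage of both is shown below.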

print(protein.atom2residue.tolist()[:20])
print(protein.atom_name.tolist()[:20])

for atom_id, (atom, residue_id) in enumerate(zip(protein.atom_name.tolist()[:20], protein.atom2residue.tolist()[:20])):
    print("[atom %s] %s: %s" % (atom_id, data.Protein.id2atom_name[atom], data.Protein.id2residue[residue_id]))

for residue_id in [0, 1]:
    atom_ids = protein.residue2atom(residue_id).sort()[0]
    for atom, position in zip(protein.atom_name[atom_ids].tolist(), protein.node_position[atom_ids].tolist()):
        print("[residue %s] %s: %s" % (residue_id, data.Protein.id2atom_name[atom], position))
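
The two lookups are inverses of each other. As a plain-Python sketch (toy atom names and a hypothetical `build_residue2atom` helper, not TorchProtein code), the residue-to-atoms direction can be derived from the atom-to-residue array by grouping atom indices:

```python
from collections import defaultdict

# Toy data: two residues, the second one carrying an extra side-chain atom.
atom_names = ["N", "CA", "C", "O", "N", "CA", "C", "O", "CB"]
atom2residue = [0, 0, 0, 0, 1, 1, 1, 1, 1]


def build_residue2atom(atom2residue):
    """Group atom indices by the residue they belong to."""
    residue2atom = defaultdict(list)
    for atom_id, residue_id in enumerate(atom2residue):
        residue2atom[residue_id].append(atom_id)
    return dict(residue2atom)


residue2atom = build_residue2atom(atom2residue)
for residue_id, atom_ids in residue2atom.items():
    print(residue_id, [atom_names[i] for i in atom_ids])
```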

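Sub-protein and Masking

In protein research, we sometimes need to extract specific residues from a protein and analyze them. With TorchProtein, this is easy to do with indexing. The example below extracts the first two residues of the protein. Note that bonds between atoms of the extracted residues are preserved during extraction.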

first_two = protein[:2]
first_two.visualize()

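TorchProtein also provides the residue_mask method to extract specific residues from a protein and the node_mask method to extract specific atoms. With either method we can again extract the first two residues, as shown below.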

is_first_two_ = (protein.residue_number == 1) | (protein.residue_number == 2)
first_two_ = protein.residue_mask(is_first_two_, compact=True)
assert first_two == first_two_

is_first_two_ = (protein.atom2residue == 0) | (protein.atom2residue == 1)
first_two_ = protein.node_mask(is_first_two_, compact=True)
assert first_two == first_two_
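
What a compact mask has to do can be sketched in plain Python: drop the unselected atoms, renumber the survivors from zero, keep only bonds whose two endpoints both survive, and renumber the residues as well. The function name `compact_node_mask` and the toy data are assumptions for illustration; this is not TorchProtein's implementation.

```python
def compact_node_mask(atom2residue, edges, mask):
    """Keep atoms where mask is True and renumber atoms and residues from 0."""
    # Map each kept atom's old index to its new, compact index.
    new_index = {}
    for old, keep in enumerate(mask):
        if keep:
            new_index[old] = len(new_index)
    # A bond survives only if both of its atoms survive.
    kept_edges = [
        (new_index[u], new_index[v])
        for u, v in edges
        if u in new_index and v in new_index
    ]
    # Renumber residues in order of first appearance among the kept atoms.
    residue_index = {}
    new_atom2residue = []
    for old in new_index:
        residue = atom2residue[old]
        residue_index.setdefault(residue, len(residue_index))
        new_atom2residue.append(residue_index[residue])
    return new_atom2residue, kept_edges


atom2residue = [0, 0, 1, 1, 2, 2]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
mask = [residue in (1, 2) for residue in atom2residue]
sub_atom2residue, sub_edges = compact_node_mask(atom2residue, edges, mask)
print(sub_atom2residue)
print(sub_edges)
```

The bond (1, 2) crosses the boundary of the selection, so it is dropped, while bonds inside the selection are kept with shifted indices.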

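Atom and Residue Views

For sequence-based protein encoders, we usually treat residues as the nodes of the protein graph, while for structure-based protein encoders we sometimes want atom features to serve as node features. To support flexible switching between atom and residue features, TorchProtein defines the view attribute, which selects which features are used as node features.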

protein.view = "atom"
print(protein.node_feature.shape)
protein.view = "residue"
print(protein.node_feature.shape)
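
One simple way to think about the view switch is as a property that returns a different feature table depending on a flag. The `ToyProtein` class below is a hypothetical stand-in written for this explanation, not the actual TorchProtein class.

```python
class ToyProtein:
    """Toy stand-in whose node_feature depends on the current view."""

    def __init__(self, atom_feature, residue_feature):
        self.atom_feature = atom_feature
        self.residue_feature = residue_feature
        self.view = "atom"

    @property
    def node_feature(self):
        if self.view == "atom":
            return self.atom_feature
        if self.view == "residue":
            return self.residue_feature
        raise ValueError("view must be 'atom' or 'residue', got %r" % self.view)


# Four atoms with 3 features each, two residues with 2 features each.
toy_protein = ToyProtein(
    atom_feature=[[0.0, 0.0, 0.0]] * 4,
    residue_feature=[[1, 0], [0, 1]],
)
toy_protein.view = "atom"
print(len(toy_protein.node_feature))
toy_protein.view = "residue"
print(len(toy_protein.node_feature))
```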

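Registering Your Own Attributes

Although the Protein class comes with several atom-level and residue-level attributes, we may also want to define our own. This only requires wrapping the attribute assignment in a context manager. We can use protein.atom(), protein.residue(), and protein.graph() for atom-level, residue-level, and graph-level attributes respectively.

Registering Residue and Atom Attributes

Here we give two examples of registering residue and atom attributes. The first defines a custom residue attribute that encodes whether each residue is followed by a "GLY" residue. The second defines a custom atom attribute that encodes whether each atom is attached to a nitrogen.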

from torch_scatter import scatter_add
import torch 
next_residue_type = torch.cat([protein.residue_type[1:], torch.full((1,), -1, dtype=protein.residue_type.dtype)])
followed_by_GLY = next_residue_type == data.Protein.residue2id["GLY"]
with protein.residue():
    protein.followed_by_GLY = followed_by_GLY

atom_in, atom_out = protein.edge_list.t()[:2]
# Sum over incoming bonds; the boolean mask is cast to integers so it can be accumulated.
attached_to_N = scatter_add((protein.atom_type[atom_in] == td.NITROGEN).long(), atom_out, dim_size=protein.num_node)
with protein.atom():
    protein.attached_to_N = attached_to_N
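
The context-manager pattern itself can be sketched in a few lines of plain Python: while a scope is active, every attribute assignment is recorded together with the level it belongs to. The `ToyGraph` class and its `meta` dictionary are hypothetical names chosen for this illustration; TorchDrug's real bookkeeping may be organized differently.

```python
from contextlib import contextmanager


class ToyGraph:
    """Toy object that remembers the level of attributes set inside a scope."""

    def __init__(self):
        # Bypass our own __setattr__ while setting up internal state.
        object.__setattr__(self, "_scope", None)
        object.__setattr__(self, "meta", {})

    @contextmanager
    def _scoped(self, scope):
        previous = self._scope
        object.__setattr__(self, "_scope", scope)
        try:
            yield self
        finally:
            # Restore the previous scope even if the body raised.
            object.__setattr__(self, "_scope", previous)

    def atom(self):
        return self._scoped("atom")

    def residue(self):
        return self._scoped("residue")

    def __setattr__(self, name, value):
        if self._scope is not None:
            self.meta[name] = self._scope  # record the attribute's level
        object.__setattr__(self, name, value)


graph = ToyGraph()
with graph.residue():
    graph.followed_by_GLY = [False, True, False]
with graph.atom():
    graph.attached_to_N = [1, 0, 2, 0]
graph.name = "toy"  # outside any scope, so no level is recorded
print(graph.meta)
```

Knowing the level of each attribute is what lets operations such as indexing or masking slice residue-level and atom-level data consistently.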

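Registering References between Residues and Atoms

In some cases we want to link one residue or atom to another residue or atom. We can do this by registering the attribute within the context of protein.residue_reference() or protein.atom_reference(). For example, we can register the index of each residue's alpha carbon within the combined contexts of protein.residue() and protein.atom_reference(). Note that under any operation that extracts part of the protein, indices registered in this way are automatically converted to the corresponding indices in the newly extracted protein.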

from torch_scatter import scatter_max

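# Index of every atom; named atom_index to avoid shadowing the built-in range.
atom_index = torch.arange(protein.num_node, device=protein.atom_name.device)
# Keep the index where the atom is an alpha carbon, and -1 elsewhere.
calpha = torch.where(protein.atom_name == protein.atom_name2id["CA"], atom_index, torch.full_like(atom_index, -1))
# The maximum over each residue's atoms picks out its alpha carbon index.
residue2calpha = scatter_max(calpha, protein.atom2residue, dim_size=protein.num_residue)[0]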
with protein.residue(), protein.atom_reference():
    protein.residue2calpha = residue2calpha

sub_protein = protein[3:10]
for calpha_index in sub_protein.residue2calpha.tolist():
    atom_name = data.Protein.id2atom_name[sub_protein.atom_name[calpha_index].item()]
    print("New index %d: %s" % (calpha_index, atom_name))
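
The automatic index conversion described above amounts to translating old atom indices into new ones after extraction. The following plain-Python sketch shows that bookkeeping on toy data; `extract_residues` is a hypothetical helper written for this explanation and does not reflect TorchProtein's internals.

```python
def extract_residues(atom2residue, residue2calpha, start, stop):
    """Extract residues [start, stop) and remap the alpha carbon references."""
    kept_atoms = [atom for atom, residue in enumerate(atom2residue) if start <= residue < stop]
    # Old atom index -> index inside the extracted sub-protein.
    old2new = {old: new for new, old in enumerate(kept_atoms)}
    new_atom2residue = [atom2residue[atom] - start for atom in kept_atoms]
    new_residue2calpha = [old2new[residue2calpha[residue]] for residue in range(start, stop)]
    return kept_atoms, new_atom2residue, new_residue2calpha


# Toy backbone: three residues with atoms N, CA, C each.
atom_names = ["N", "CA", "C", "N", "CA", "C", "N", "CA", "C"]
atom2residue = [0, 0, 0, 1, 1, 1, 2, 2, 2]
residue2calpha = [1, 4, 7]

kept_atoms, sub_atom2residue, sub_residue2calpha = extract_residues(
    atom2residue, residue2calpha, 1, 3
)
sub_atom_names = [atom_names[atom] for atom in kept_atoms]
for calpha_index in sub_residue2calpha:
    print("New index %d: %s" % (calpha_index, sub_atom_names[calpha_index]))
```

Without the remapping, the stored indices 4 and 7 would point at the wrong atoms (or past the end) in the six-atom sub-protein.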