I. Contents

1. Architecture
2. Representation learning
3. Different data formats call for different loss functions, which derive different architectures
4. Why different loss functions are chosen
5. Base model selection
6. Model training
7. Evaluation

II. Implementation

1. Architecture
References: https://www.sbert.net/docs/training/overview.html
https://huggingface.co/blog/zh/train-sentence-transformers
1. sentence-transformers
Architecture: derived from BERT. Choosing a different loss function means preparing a different kind of dataset, and thereby training for a different task.
2. Representation learning: representation learning means learning an effective representation of the observed samples X. BERT is essentially a text-representation (context-representation) model: it maps text to a matrix (max_length × hidden_size) or to a vector (1 × hidden_size). word2vec can produce vectors too, but word2vec is static while BERT is dynamic, because BERT's output comes from the Transformer after the input's word order has been taken into account.

  1. Different dataset types call for different loss functions (see https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss):
    1. If you choose cosine similarity as the loss (CosineSimilarityLoss), you train a similarity model: a bi-encoder whose two sentence embeddings are scored by cosine similarity.
A training example:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Define the model. Either from scratch or by loading a pre-trained model
model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Define your train examples: sentence pairs with a similarity score in [0, 1].
# You need more than just two examples...
train_examples = [
    InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
    InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
]

# Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)        # choose the loss function

# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)  # train
Evaluation:
from sentence_transformers import evaluation

sentences1 = [
    "This list contains the first column",
    "With your sentences",
    "You want your model to evaluate on",
]
sentences2 = [
    "Sentences contains the other column",
    "The evaluator matches sentences1[i] with sentences2[i]",
    "Compute the cosine similarity and compares it to scores[i]",
]
scores = [0.3, 0.6, 0.2]

evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)

# ... Your other code to load training data

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    evaluator=evaluator,
    evaluation_steps=500,
)
  2. If you choose the contrastive loss function (ContrastiveLoss), you likewise train a similarity model.
Contrastive loss is typically used to train contrastive-learning models, which learn the similarities and differences within the data. The goal of contrastive learning is to map similar sample pairs close together and dissimilar pairs far apart; contrastive loss drives the model toward exactly that.
Contrastive loss can be viewed as a ranking loss. In everyday ML tasks we mostly use cross-entropy or MSE loss, which require an explicit target such as a concrete value or class label. Ranking losses are instead a form of metric learning: they learn relative distances and relationships, and do not care much about exact values.

Training: labels are 0 or 1.

from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')
train_examples = [
    InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1),
    InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(
    [(train_dataloader, train_loss)],
    epochs=10,
)
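The quantity minimized above can be sketched directly in PyTorch. Label 1 pulls a pair together; label 0 pushes it apart until the distance reaches `margin`. The cosine distance and margin of 0.5 mirror sentence-transformers' defaults, but this function is an illustrative re-implementation, not the library's own code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def contrastive_loss(emb1, emb2, labels, margin=0.5):
    # Cosine distance between the two embeddings of each pair
    dist = 1 - F.cosine_similarity(emb1, emb2, dim=-1)
    # label 1: penalize any distance; label 0: penalize only distances below the margin
    positive = labels * dist.pow(2)
    negative = (1 - labels) * F.relu(margin - dist).pow(2)
    return 0.5 * (positive + negative).mean()

emb1 = torch.randn(4, 32)
emb2 = torch.randn(4, 32)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(emb1, emb2, labels))
```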

3. If you choose MultipleNegativesRankingLoss, you likewise train a similarity model.
Input: sentence pairs (anchor, positive); no labels are needed.

from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-uncased')
train_examples = [
    InputExample(texts=['Anchor 1', 'Positive 1']),
    InputExample(texts=['Anchor 2', 'Positive 2']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
model.fit(
    [(train_dataloader, train_loss)],
    epochs=10,
)
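The idea behind MultipleNegativesRankingLoss: within a batch, each anchor's own positive is the correct "class", and every other positive in the batch serves as a negative, so no labels are required. A minimal re-implementation sketch (the scale factor of 20 matches the library's default; this is illustrative, not the library's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def mnr_loss(anchors, positives, scale=20.0):
    # scores[i][j] = scaled cosine similarity between anchor i and positive j
    scores = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1) * scale
    # Row i should rank its own positive (column i) highest
    targets = torch.arange(anchors.size(0))
    return F.cross_entropy(scores, targets)

anchors = torch.randn(8, 32)
positives = torch.randn(8, 32)
print(mnr_loss(anchors, positives))
```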

4. Why different loss functions are chosen
The choice follows from the data at hand: pairs with continuous similarity scores suit CosineSimilarityLoss; pairs with binary similar/dissimilar labels suit ContrastiveLoss; and datasets with only positive (anchor, positive) pairs suit MultipleNegativesRankingLoss, which draws its negatives from the batch.

  5. Base model selection
    Evaluation metrics: Spearman correlation coefficient and the distribution of similarity scores.
"""Test: score the test pairs with the base model and save predictions"""
import ast

import pandas as pd
from tqdm import tqdm
from sentence_transformers import SentenceTransformer, util

# Load the candidate base model (here from a local path)
model = SentenceTransformer("C:/Users/86188/Downloads/hge")

dfs = pd.read_csv("./resource/test.csv")
dfs = dfs.loc[0:200, :]
records = []
for i in tqdm(range(len(dfs))):
    data = dfs.loc[i, "data"]
    data = ast.literal_eval(data)  # safer than eval() for parsing the stored dict

    sent1 = data["sen1"]
    sent2 = data["sen2"]
    label = data["label"]
    vec1 = model.encode(sent1)
    vec2 = model.encode(sent2)
    pred = util.cos_sim(vec1, vec2)[0, 0].item()
    data["pred"] = pred
    records.append(data)

df = pd.DataFrame(records)
df.to_csv("./resource/test_pre.csv")


def cal_spearman(pred, label):
    from scipy.stats import spearmanr
    return spearmanr(pred, label).correlation

labels, preds = [], []
dfs = pd.read_csv("./resource/test_pre.csv")
pred1 = {}  # histogram of predictions for label == 0
pred2 = {}  # histogram of predictions for label == 1
for i in range(len(dfs)):
    label = dfs.loc[i, "label"]
    pred = dfs.loc[i, "pred"]
    p = int(pred * 10)  # bucket the score into 0..10
    if label == 0:
        pred1[p] = pred1.get(p, 0) + 1
    else:
        pred2[p] = pred2.get(p, 0) + 1
    labels.append(label)
    preds.append(pred)

res = cal_spearman(preds, labels)
print(res)
print(pred1, pred2)
# Similarity-score distribution
from matplotlib import pyplot as plt

x = [pred1.get(i, 0) for i in range(11)]   # .get avoids KeyError for empty buckets
x1 = [pred2.get(i, 0) for i in range(11)]
# weights give the bar heights; the first argument gives the x positions
plt.hist([i for i in range(11)], weights=x, bins=11, color='skyblue', alpha=0.3, edgecolor='black', linewidth=1)
plt.hist([i for i in range(11)], weights=x1, bins=11, color='green', alpha=0.3, edgecolor='black', linewidth=1)
plt.show()
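As a sanity check on the metric used above: Spearman correlation depends only on the rank order of the predictions, not their magnitudes, which is why it suits embedding evaluation where absolute cosine values vary by model. A small illustration with made-up scores (assuming `scipy` is installed):

```python
from scipy.stats import spearmanr

preds = [0.10, 0.35, 0.40, 0.80]
labels = [0.0, 1.0, 2.0, 3.0]

# Same rank order as labels gives perfect correlation
print(spearmanr(preds, labels).correlation)  # 1.0

# Any monotone transform of preds leaves the Spearman score unchanged
print(spearmanr([p ** 3 for p in preds], labels).correlation)  # 1.0
```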