sentence-transformers
I. Outline
1. Architecture
2. Representation learning
3. Different data formats call for different loss functions and derive different architectures
4. Why choose different loss functions
5. Base model selection
6. Model training
7. Evaluation
II. Implementation
1. Architecture
Docs: https://www.sbert.net/docs/training/overview.html
https://huggingface.co/blog/zh/train-sentence-transformers
sentence-transformers is derived from BERT: choosing a different loss function means preparing a different dataset format, which in turn trains the model for a different task.
2. Representation learning: representation learning means learning an effective representation of observed samples X. At its core, BERT is a text-representation (context-representation) model: it maps a text to a matrix (max_length × hidden_size) or to a vector (1 × hidden_size). word2vec can produce the same shapes, but word2vec is static while BERT is dynamic, because BERT produces its output by passing the input, word order included, through Transformer layers.
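To make the two output shapes concrete, here is a minimal sketch using the encode API of sentence-transformers (the model name is just an example with hidden_size 384; output_value="token_embeddings" is the library's switch for per-token vectors instead of the pooled sentence vector):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, hidden_size = 384

# Text -> vector: one pooled embedding of shape (hidden_size,)
sent_vec = model.encode("An example sentence")
print(sent_vec.shape)  # (384,)

# Text -> matrix: one embedding per token, shape (num_tokens, hidden_size)
tok_mat = model.encode("An example sentence", output_value="token_embeddings")
print(tok_mat.shape)  # (num_tokens, 384)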
- Different dataset types call for different loss functions; see: https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss
- If cosine similarity is chosen as the loss (CosineSimilarityLoss), training produces a similarity-scoring model. A training example follows:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Define the model, either from scratch or by loading a pre-trained model
model = SentenceTransformer("distilbert-base-nli-mean-tokens")

# Define your train examples: sentence pairs with a float similarity label in [0, 1].
# You need more than just two examples...
train_examples = [
    InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
    InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
]

# Define your train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)  # choose the loss function

# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
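After (or before) fine-tuning, a quick sanity check is to embed a pair and score it directly; util.cos_sim is part of sentence-transformers, and the sentences below are just placeholders:

from sentence_transformers import util

emb1 = model.encode("My first sentence")
emb2 = model.encode("My second sentence")
print(util.cos_sim(emb1, emb2).item())  # cosine similarity in [-1, 1]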
Evaluation:
from sentence_transformers import evaluation

sentences1 = [
    "This list contains the first column",
    "With your sentences",
    "You want your model to evaluate on",
]
sentences2 = [
    "Sentences contains the other column",
    "The evaluator matches sentences1[i] with sentences2[i]",
    "Compute the cosine similarity and compares it to scores[i]",
]
scores = [0.3, 0.6, 0.2]
evaluator = evaluation.EmbeddingSimilarityEvaluator(sentences1, sentences2, scores)

# ... Your other code to load training data
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    evaluator=evaluator,
    evaluation_steps=500,
)
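Here EmbeddingSimilarityEvaluator embeds both lists, computes the cosine similarity of each pair sentences1[i] / sentences2[i], and reports how well those similarities correlate (Spearman/Pearson) with the gold scores; evaluation_steps=500 makes it run every 500 training steps.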
- If contrastive loss (ContrastiveLoss) is chosen, training likewise produces a similarity model.
Contrastive loss is a loss function commonly used to train contrastive-learning models, which aim to learn the similarities and differences within data. The main goal of contrastive learning is to map similar sample pairs to nearby positions and dissimilar pairs to distant positions; contrastive loss is what drives this.
Contrastive loss can be viewed as a kind of ranking loss. In everyday ML tasks we mostly use cross-entropy or MSE loss, which need an explicit target such as a concrete value or class label. Ranking losses are instead a form of metric learning: they learn relative distances and relationships, and care little about exact values. A sketch of the formula follows.
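Concretely, for a pair with binary label y (1 = similar, 0 = dissimilar), embedding distance d, and margin m, a standard formulation is:

L = y * d^2 + (1 - y) * max(0, m - d)^2

so similar pairs are pulled together while dissimilar pairs are pushed apart until they clear the margin; sentence-transformers' ContrastiveLoss follows this shape with a configurable distance metric and margin.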
Training: labels are 0 or 1
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')
train_examples = [
    InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1),
    InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model=model)
model.fit(
    [(train_dataloader, train_loss)],
    epochs=10,
)
- If MultipleNegativesRankingLoss is chosen, training again produces a similarity model.
Input: sentence pairs (anchor, positive); no labels are needed (see the sketch after the code for what the loss computes).
from sentence_transformers import SentenceTransformer, losses, InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-uncased')
train_examples = [
    InputExample(texts=['Anchor 1', 'Positive 1']),
    InputExample(texts=['Anchor 2', 'Positive 2']),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
model.fit(
    [(train_dataloader, train_loss)],
    epochs=10,
)
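Why no labels? For anchor i, MultipleNegativesRankingLoss treats positive i as the target and every other positive in the batch as an in-batch negative, so larger batch sizes generally give a stronger training signal. A minimal sketch of the computation (illustrative code, not the library's implementation; the scale factor of 20 mirrors the library's default):

import torch
import torch.nn.functional as F

def mnrl_sketch(anchor_emb: torch.Tensor, positive_emb: torch.Tensor, scale: float = 20.0):
    # anchor_emb, positive_emb: (batch, dim) sentence embeddings
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    scores = a @ p.T * scale               # scores[i, j] = scaled cosine sim(anchor_i, positive_j)
    labels = torch.arange(scores.size(0))  # for row i, the "correct class" is column i
    return F.cross_entropy(scores, labels)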
4. Why choose different loss functions
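A rough rule of thumb, following the three setups above and the losses documentation linked earlier: use CosineSimilarityLoss when your pairs carry graded float scores in [0, 1]; ContrastiveLoss when pairs carry binary 0/1 labels; MultipleNegativesRankingLoss when you only have (anchor, positive) pairs and no labels, since it manufactures negatives from the rest of the batch.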
- Base model selection
Evaluation metrics: Spearman correlation coefficient and the similarity-score distribution (both computed by the test script below).
"""Test: score sentence pairs with the model, then compute Spearman correlation."""
from sentence_transformers import SentenceTransformer, util
import pandas as pd
from tqdm import tqdm

# Load the model under evaluation (local path to the downloaded base model)
model = SentenceTransformer("C:/Users/86188/Downloads/hge")

dfs = pd.read_csv("./resource/test.csv")
dfs = dfs.loc[0:200, :]
all_rows = []
for i in tqdm(range(len(dfs))):
    data = dfs.loc[i, "data"]
    data = eval(data)  # the "data" column stores a dict literal: {"sen1": ..., "sen2": ..., "label": ...}
    sent1 = data["sen1"]
    sent2 = data["sen2"]
    label = data["label"]
    vec1 = model.encode(sent1)
    vec2 = model.encode(sent2)
    pred = util.cos_sim(vec1, vec2)[0, 0].item()
    data["pred"] = pred
    all_rows.append(data)
df = pd.DataFrame(all_rows)
df.to_csv("./resource/test_pre.csv")
def cal_spearman(pred, label):
    from scipy.stats import spearmanr
    return spearmanr(pred, label).correlation

labels, preds = [], []
dfs = pd.read_csv("./resource/test_pre.csv")
pred1 = {}  # histogram buckets for label == 0 pairs
pred2 = {}  # histogram buckets for label == 1 pairs
for i in range(len(dfs)):
    label = dfs.loc[i, "label"]
    pred = dfs.loc[i, "pred"]
    p = int(pred * 10)  # bucket the score into 0..10
    if label == 0:
        pred1[p] = pred1.get(p, 0) + 1
    else:
        pred2[p] = pred2.get(p, 0) + 1
    labels.append(label)
    preds.append(pred)
res = cal_spearman(preds, labels)
print(res)
print(pred1, pred2)
# Similarity-score distribution: overlay the histograms of the two label groups
from matplotlib import pyplot as plt

x = [pred1.get(i, 0) for i in range(11)]   # .get(i, 0) so empty buckets don't raise KeyError
x1 = [pred2.get(i, 0) for i in range(11)]
print(x)  # weights gives the bar heights; the first argument gives the x positions
plt.hist([i for i in range(11)], weights=x, bins=11, color='skyblue', alpha=0.3, edgecolor='black', linewidth=1)
plt.hist([i for i in range(11)], weights=x1, bins=11, color='green', alpha=0.3, edgecolor='black', linewidth=1)
plt.show()
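Reading the plot: if the model separates the domain well, the label-0 histogram should concentrate in the low similarity buckets and the label-1 histogram in the high ones; heavy overlap between the two suggests the base model needs fine-tuning on this data.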