AWD-LSTM

Basic NLP modules

On top of the PyTorch and fastai layers, the language models use a few custom layers specific to NLP.

dropout_mask

 dropout_mask (x:torch.Tensor, sz:list, p:float)

Return a dropout mask of the same type as x and of size sz, in which each element is zeroed with probability p.

	Type	Details
x	Tensor	Source tensor; the output has the same type as `x`
sz	list	Size of the dropout mask, as a list of `int`s
p	float	Dropout probability
Returns	Tensor	Multiplicative dropout mask

t = dropout_mask(torch.randn(3,4), [4,3], 0.25)
test_eq(t.shape, [4,3])
assert ((t == 4/3) + (t==0)).all()
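
The behaviour tested above (entries equal to either 0 or 1/(1-p), which is 4/3 for p=0.25) can be reproduced with a few lines of plain PyTorch. The following is a minimal sketch, not necessarily the library's own code; the name `dropout_mask_sketch` is hypothetical and introduced only for illustration.

```python
import torch

def dropout_mask_sketch(x: torch.Tensor, sz: list, p: float) -> torch.Tensor:
    "Hypothetical helper: build a multiplicative dropout mask of size `sz`."
    # new_empty inherits dtype and device from x
    # bernoulli_(1 - p) fills the tensor with 1 (keep) or 0 (drop)
    # div_(1 - p) rescales the kept entries so the expected value stays 1
    return x.new_empty(*sz).bernoulli_(1 - p).div_(1 - p)

mask = dropout_mask_sketch(torch.randn(3, 4), [4, 3], 0.25)
```

Multiplying an activation by such a mask zeroes some entries and scales the others up, which is why no rescaling is needed at inference time.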

RNNDropout

 RNNDropout (p:float=0.5)

Dropout with probability p that is consistent along the seq_len dimension.

dp = RNNDropout(0.3)
tst_inp = torch.randn(4,3,7)
tst_out = dp(tst_inp)
for i in range(4):
    for j in range(7):
        if tst_out[i,0,j] == 0: assert (tst_out[i,:,j] == 0).all()
        else: test_close(tst_out[i,:,j], tst_inp[i,:,j]/(1-0.3))

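It also supports dropout over a sequence of images, where the time dimension is axis 1 (right after the batch axis), e.g. sequences of 10 images with 3 channels and 32x32 pixels.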

_ = dp(torch.rand(4,10,3,32,32))
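
One way to obtain a mask that is consistent along the sequence axis is to sample it with size 1 on that axis and let broadcasting repeat it for every time step. The sketch below assumes this approach; the class name `RNNDropoutSketch` is hypothetical and the code is an illustration rather than the library's implementation.

```python
import torch
import torch.nn as nn

class RNNDropoutSketch(nn.Module):
    "Hypothetical module: dropout whose mask is shared across axis 1 (time)."
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No-op at inference time or when dropout is disabled
        if not self.training or self.p == 0.0:
            return x
        # Size 1 on the time axis: the same mask is broadcast to every step
        mask_shape = (x.size(0), 1, *x.shape[2:])
        mask = x.new_empty(mask_shape).bernoulli_(1 - self.p).div_(1 - self.p)
        return x * mask

dp_sketch = RNNDropoutSketch(0.3)
inp = torch.randn(4, 3, 7)
out = dp_sketch(inp)                              # (batch, seq_len, features)
img_out = dp_sketch(torch.rand(4, 10, 3, 32, 32)) # (batch, time, C, H, W)
```

Because the mask shape is derived from `x.shape[2:]`, the same code handles both feature vectors and image sequences.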

WeightDropout

 WeightDropout (module:nn.Module, weight_p:float,
                layer_names:str|MutableSequence='weight_hh_l0')

A module that wraps another layer; during training, some of that layer's weights are replaced by 0.

	Type	Default	Details
module	Module		Wrapped module
weight_p	float		Weight dropout probability
layer_names	str \| collections.abc.MutableSequence	weight_hh_l0	Name(s) of the parameters to apply dropout to

module = nn.LSTM(5,7)
dp_module = WeightDropout(module, 0.4)
wgts = dp_module.module.weight_hh_l0
tst_inp = torch.randn(10,20,5)
h = torch.zeros(1,20,7), torch.zeros(1,20,7)
dp_module.reset()
x,h = dp_module(tst_inp,h)
loss = x.sum()
loss.backward()
new_wgts = getattr(dp_module.module, 'weight_hh_l0')
test_eq(wgts, getattr(dp_module, 'weight_hh_l0_raw'))
assert 0.2 <= (new_wgts==0).sum().float()/new_wgts.numel() <= 0.6
assert dp_module.weight_hh_l0_raw.requires_grad
assert dp_module.weight_hh_l0_raw.grad is not None
assert ((dp_module.weight_hh_l0_raw.grad == 0.) & (new_wgts == 0.)).any()
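
The last assertion above relies on a general property of weight dropout: the raw parameter still receives gradients, but the gradient is zero wherever the weight was dropped. The sketch below illustrates this on a standalone parameter with `torch.nn.functional.dropout`; it assumes that this stand-in captures the idea and does not reproduce how the wrapper patches `weight_hh_l0`. All variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for a hidden-to-hidden weight matrix
raw_w = torch.nn.Parameter(torch.randn(7, 7))

# Dropout applied to the weights themselves, not to the activations
dropped_w = F.dropout(raw_w, p=0.4, training=True)

# Use the dropped weights in a forward pass and backpropagate
h = torch.randn(20, 7)
out = h @ dropped_w.t()
out.sum().backward()

# Positions that were dropped in this forward pass
zeroed = dropped_w == 0
```

Since the optimizer only ever updates `raw_w`, a weight that is dropped in one forward pass keeps its value and can be active again in the next one.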

EmbeddingDropout

 EmbeddingDropout (emb:nn.Embedding, embed_p:float)

Apply dropout with probability embed_p to the embedding layer emb.

	Type	Details
emb	Embedding	Wrapped embedding layer
embed_p	float	Embedding layer dropout probability

enc = nn.Embedding(10, 7, padding_idx=1)
enc_dp = EmbeddingDropout(enc, 0.5)
tst_inp = torch.randint(0,10,(8,))
tst_out = enc_dp(tst_inp)
for i in range(8):
    assert (tst_out[i]==0).all() or torch.allclose(tst_out[i], 2*enc.weight[tst_inp[i]])
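
The test above shows that dropout acts on whole embedding vectors: each output row is either all zeros or the original vector scaled by 1/(1-embed_p). A minimal sketch of that idea, assuming the mask is drawn once per vocabulary row, could look like this; `embedding_dropout_sketch` is a hypothetical name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_dropout_sketch(emb: nn.Embedding, words: torch.Tensor, p: float) -> torch.Tensor:
    "Hypothetical helper: drop entire rows of the embedding matrix."
    # One mask entry per vocabulary item, broadcast across the embedding size
    mask = emb.weight.new_empty(emb.weight.size(0), 1).bernoulli_(1 - p).div_(1 - p)
    masked_weight = emb.weight * mask
    # Look up the tokens in the masked matrix
    return F.embedding(words, masked_weight, emb.padding_idx)

emb_layer = nn.Embedding(10, 7, padding_idx=1)
words = torch.randint(0, 10, (8,))
out = embedding_dropout_sketch(emb_layer, words, 0.5)
```

Masking the matrix rather than the looked-up vectors means a dropped word is dropped at every position where it occurs in the batch.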

AWD_LSTM

 AWD_LSTM (vocab_sz:int, emb_sz:int, n_hid:int, n_layers:int,
           pad_token:int=1, hidden_p:float=0.2, input_p:float=0.6,
           embed_p:float=0.1, weight_p:float=0.5, bidir:bool=False)

AWD-LSTM inspired by https://arxiv.org/abs/1708.02182

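	Type	Default	Details
vocab_sz	int		Size of the vocabulary
emb_sz	int		Size of the embedding vectors
n_hid	int		Number of features in the hidden state
n_layers	int		Number of LSTM layers
pad_token	int	1	Padding token ID
hidden_p	float	0.2	Dropout probability for the hidden state between layers
input_p	float	0.6	Dropout probability for the input to the LSTM stack
embed_p	float	0.1	Embedding layer dropout probability
weight_p	float	0.5	Hidden-to-hidden weight dropout probability for the LSTM layers
bidir	bool	False	If set to `True`, use bidirectional LSTM layers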

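This is the core of an AWD-LSTM model: an embedding layer built from vocab_sz and emb_sz, followed by a stack of n_layers LSTMs that are bidirectional if bidir is set. The first LSTM goes from emb_sz to n_hid, the last one from n_hid to emb_sz, and all the inner ones from n_hid to n_hid. pad_token is passed to the PyTorch embedding layer. The dropouts are applied as follows: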

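the embeddings are wrapped in EmbeddingDropout with probability embed_p;
the result of the embedding layer goes through an RNNDropout with probability input_p;
each LSTM has WeightDropout applied with probability weight_p;
between two inner LSTMs, an RNNDropout is applied with probability hidden_p.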
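
The layer sizes and the placement of the input and hidden dropouts described above can be sketched in plain PyTorch. This is a simplified illustration under stated assumptions: embedding dropout and weight dropout are left out, the layers are unidirectional, and `seq_dropout`, `sizes`, and the other names are hypothetical.

```python
import torch
import torch.nn as nn

def seq_dropout(x: torch.Tensor, p: float) -> torch.Tensor:
    "Hypothetical helper: dropout with a mask shared across the time axis."
    mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p).div_(1 - p)
    return x * mask

vocab_sz, emb_sz, n_hid, n_layers = 100, 20, 10, 2
input_p, hidden_p = 0.6, 0.2

emb = nn.Embedding(vocab_sz, emb_sz, padding_idx=1)

# First layer: emb_sz -> n_hid, last layer: n_hid -> emb_sz, inner: n_hid -> n_hid
sizes = [(emb_sz if l == 0 else n_hid, n_hid if l != n_layers - 1 else emb_sz)
         for l in range(n_layers)]
rnns = nn.ModuleList([nn.LSTM(n_in, n_out, 1, batch_first=True) for n_in, n_out in sizes])

x = torch.randint(0, vocab_sz, (10, 5))     # (batch, seq_len)
out = seq_dropout(emb(x), input_p)          # input dropout on the embeddings
for l, rnn in enumerate(rnns):
    out, _ = rnn(out)
    if l != n_layers - 1:                   # no dropout after the last LSTM
        out = seq_dropout(out, hidden_p)    # hidden dropout between layers
```

With these sizes the final output has emb_sz features per time step, which is what allows a language-model decoder to share weights with the embedding layer.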

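The module returns two lists: the raw outputs of each inner LSTM (without the hidden_p dropout applied) and the outputs with dropout applied. Since no dropout is applied to the last output, the two lists have the same last element, which is the output that should be fed to a decoder (in the case of a language model).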

tst = AWD_LSTM(100, 20, 10, 2, hidden_p=0.2, embed_p=0.02, input_p=0.1, weight_p=0.2)
x = torch.randint(0, 100, (10,5))
r = tst(x)
test_eq(tst.bs, 10)
test_eq(len(tst.hidden), 2)
test_eq([h_.shape for h_ in tst.hidden[0]], [[1,10,10], [1,10,10]])
test_eq([h_.shape for h_ in tst.hidden[1]], [[1,10,20], [1,10,20]])

test_eq(r.shape, [10,5,20])
test_eq(r[:,-1], tst.hidden[-1][0][0]) #hidden state is the last timestep in raw outputs

tst.eval()
tst.reset()
tst(x);
tst(x);

awd_lstm_lm_split

 awd_lstm_lm_split (model)

Split an RNN model into groups for differential learning rates.

awd_lstm_clas_split

 awd_lstm_clas_split (model)

Split an RNN model into groups for differential learning rates.
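
To make the idea of differential learning rates concrete, the sketch below groups the parameters of a toy model and passes each group to a plain PyTorch optimizer with its own learning rate. It does not reproduce the grouping that the two splitter functions actually return; the toy modules, the `groups` list, and the learning-rate values are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical toy model pieces standing in for encoder, RNN stack, and decoder
encoder = nn.Embedding(100, 20)
rnn = nn.LSTM(20, 20, batch_first=True)
decoder = nn.Linear(20, 100)

# One parameter group per piece, ordered from earliest to latest layer
groups = [list(m.parameters()) for m in (encoder, rnn, decoder)]

# Smaller learning rates for earlier layers, larger ones for later layers
lrs = [1e-4, 1e-3, 1e-2]
opt = torch.optim.SGD(
    [{"params": g, "lr": lr} for g, lr in zip(groups, lrs)],
    lr=1e-3,  # default, overridden by the per-group values
)
```

A splitter function plays the role of the `groups` line here: it decides which parameters belong together so that each group can be trained or frozen as a unit.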