MultiHeadAttention module¶
- class multiHeadAttention.MultiHeadAttention(d_model, q, v, h, attention_size=None)¶
Bases: Module
Multi Head Attention block from Attention is All You Need.
Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model).
- Parameters
d_model (int) – Dimension of the input vector.
q (int) – Dimension of all query matrices.
v (int) – Dimension of all value matrices.
h (int) – Number of heads.
attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.
- property attention_map: Tensor¶
Attention map after a forward propagation; the score variable in the original paper.
- forward(query, key, value, mask=None)¶
Propagate the input forward through the Multi-Head Block (MHB).
We compute for each head the queries, keys and values matrices, followed by the Scaled Dot-Product. The result is concatenated and returned with shape (batch_size, K, d_model).
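A minimal sketch of what a single head computes, with illustrative names and dimensions; this is not the module's exact implementation, and the mask construction is an assumption about how the 'subsequent' option behaves:

```python
import torch

K, d_q = 24, 8
Q = torch.rand(K, d_q)     # per-head queries (illustrative)
Kmat = torch.rand(K, d_q)  # per-head keys
V = torch.rand(K, d_q)     # per-head values

# Scaled Dot-Product: scores have shape (K, K).
scores = Q @ Kmat.transpose(-2, -1) / d_q ** 0.5

# Assumed 'subsequent' mask: position i may only attend to positions j <= i.
subsequent = torch.triu(torch.ones(K, K, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(subsequent, float('-inf'))

attention = torch.softmax(scores, dim=-1) @ V  # shape (K, d_q)
```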
- Parameters
query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.
key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.
value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.
mask (Optional[str]) – Mask to apply on scores before computing attention. One of 'subsequent', None. Default is None.
- Return type
Tensor
- Returns
Self attention tensor with shape (batch_size, K, d_model).
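A minimal usage sketch, assuming the module is importable as multiHeadAttention and matches the signature above; all dimensions are illustrative:

```python
import torch
from multiHeadAttention import MultiHeadAttention

batch_size, K, d_model = 8, 24, 32
mha = MultiHeadAttention(d_model, q=8, v=8, h=4, attention_size=None)

x = torch.rand(batch_size, K, d_model)
# Self-attention: the same tensor provides queries, keys and values.
out = mha(x, x, x, mask='subsequent')
assert out.shape == (batch_size, K, d_model)

# Attention map from the last forward pass (the score variable in the paper).
scores = mha.attention_map
```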
- class multiHeadAttention.MultiHeadAttentionChunk(d_model, q, v, h, attention_size=None, chunk_size=168, **kwargs)¶
Bases: MultiHeadAttention
Multi Head Attention block with chunk.
Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model). Queries, keys and values are divided into chunks of constant size. A usage sketch follows this class's forward description.
- Parameters
d_model (int) – Dimension of the input vector.
q (int) – Dimension of all query matrices.
v (int) – Dimension of all value matrices.
h (int) – Number of heads.
attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.
chunk_size (Optional[int]) – Size of chunks to apply attention on. The last chunk may be smaller (see torch.Tensor.chunk). Default is 168.
- forward(query, key, value, mask=None)¶
Propagate the input forward through the MHB.
We compute for each head the queries, keys and values matrices, followed by the Scaled Dot-Product. The result is concatenated and returned with shape (batch_size, K, d_model).
- Parameters
query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.
key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.
value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.
mask (Optional[str]) – Mask to apply on scores before computing attention. One of 'subsequent', None. Default is None.
- Return type
Tensor
- Returns
Self attention tensor with shape (batch_size, K, d_model).
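A hedged sketch of the chunked variant, under the same import assumption; K is chosen here as a multiple of chunk_size, since torch.Tensor.chunk otherwise leaves a smaller final chunk:

```python
import torch
from multiHeadAttention import MultiHeadAttentionChunk

batch_size, K, d_model = 8, 336, 32  # K = 2 * chunk_size
mha_chunk = MultiHeadAttentionChunk(d_model, q=8, v=8, h=4, chunk_size=168)

x = torch.rand(batch_size, K, d_model)
# Attention is computed within each 168-step chunk of the sequence.
out = mha_chunk(x, x, x)
assert out.shape == (batch_size, K, d_model)
```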
- class multiHeadAttention.MultiHeadAttentionWindow(d_model, q, v, h, attention_size=None, window_size=168, padding=42, **kwargs)¶
Bases: MultiHeadAttention
Multi Head Attention block with moving window.
Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model). Queries, keys and values are divided into chunks using a moving window. A usage sketch follows this class's forward description.
- Parameters
d_model (int) – Dimension of the input vector.
q (int) – Dimension of all query matrices.
v (int) – Dimension of all value matrices.
h (int) – Number of heads.
attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.
window_size (Optional[int]) – Size of the window used to extract chunks. Default is 168.
padding (Optional[int]) – Padding around each window, applied to the input sequence. Default is 168 // 4 = 42.
- forward(query, key, value, mask=None)¶
Propagate the input forward through the MHB.
We compute for each head the queries, keys and values matrices, followed by the Scaled Dot-Product. The result is concatenated and returned with shape (batch_size, K, d_model).
- Parameters
query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.
key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.
value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.
mask (Optional[str]) – Mask to apply on scores before computing attention. One of 'subsequent', None. Default is None.
- Return type
Tensor
- Returns
Self attention tensor with shape (batch_size, K, d_model).
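A hedged sketch of the windowed variant, under the same import assumption; the comment on how padding is applied is an inference from the parameter description, not a confirmed detail of the implementation:

```python
import torch
from multiHeadAttention import MultiHeadAttentionWindow

batch_size, K, d_model = 8, 336, 32  # K = 2 * window_size
mha_window = MultiHeadAttentionWindow(d_model, q=8, v=8, h=4,
                                      window_size=168, padding=42)

x = torch.rand(batch_size, K, d_model)
# Each 168-step window is assumed to carry 42 extra steps of padded context
# from the input sequence when attention is computed.
out = mha_window(x, x, x)
assert out.shape == (batch_size, K, d_model)
```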