MultiHeadAttention module

class multiHeadAttention.MultiHeadAttention(d_model, q, v, h, attention_size=None)

Bases: Module

Multi Head Attention block from the paper "Attention Is All You Need".

Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model).

Parameters
  • d_model (int) – Dimension of the input vector.

  • q (int) – Dimension of all query matrices.

  • v (int) – Dimension of all value matrices.

  • h (int) – Number of heads.

  • attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.
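
A minimal usage sketch is given below; the import path and the dimensions are assumptions for illustration. Self-attention is obtained by passing the same tensor as query, key and value (the three inputs may also differ, e.g. when attending over encoder outputs).

    import torch
    from multiHeadAttention import MultiHeadAttention   # assumed import path

    batch_size, K, d_model = 8, 100, 64                  # illustrative dimensions
    mha = MultiHeadAttention(d_model=d_model, q=8, v=8, h=4, attention_size=None)

    x = torch.randn(batch_size, K, d_model)
    out = mha(query=x, key=x, value=x)                   # self-attention: same tensor three times
    print(out.shape)                                     # torch.Size([8, 100, 64])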

property attention_map: Tensor

Attention map from the last forward pass; the score variable in the original paper.
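
A short sketch of reading the map back after a forward pass (import path assumed; the exact shape of the stored scores depends on the implementation):

    import torch
    from multiHeadAttention import MultiHeadAttention   # assumed import path

    mha = MultiHeadAttention(d_model=64, q=8, v=8, h=4)
    x = torch.randn(2, 50, 64)
    _ = mha(x, x, x)
    scores = mha.attention_map                           # scores computed during the last forward pass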

forward(query, key, value, mask=None)

Propagate the input forward through the Multi Head Attention block.

For each head, we compute the query, key and value matrices, then apply the Scaled Dot-Product attention. The per-head results are concatenated and returned with shape (batch_size, K, d_model).

Parameters
  • query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.

  • key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.

  • value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.

  • mask (Optional[str]) – Mask to apply to the scores before computing attention. One of 'subsequent' or None. Default is None.

Return type

Tensor

Returns

Self-attention tensor with shape (batch_size, K, d_model).
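
The per-head computation described above is the standard Scaled Dot-Product attention. The sketch below reproduces it with plain PyTorch operations for a single head; the function name and masking details are illustrative, not the module's internals.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product(queries, keys, values, mask=None):
        # queries, keys: (batch_size, K, q); values: (batch_size, K, v)
        q_dim = queries.shape[-1]
        scores = queries.bmm(keys.transpose(1, 2)) / q_dim ** 0.5   # (batch_size, K, K)
        if mask == 'subsequent':
            # Forbid each position from attending to later positions.
            future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(future, float('-inf'))
        attention = F.softmax(scores, dim=-1)
        return attention.bmm(values)                                # (batch_size, K, v)

    # Example call on random data with the causal mask:
    # out = scaled_dot_product(torch.randn(2, 10, 8), torch.randn(2, 10, 8), torch.randn(2, 10, 8), mask='subsequent')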

class multiHeadAttention.MultiHeadAttentionChunk(d_model, q, v, h, attention_size=None, chunk_size=168, **kwargs)

Bases: MultiHeadAttention

Multi Head Attention block with chunking.

Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model). Queries, keys and values are divided into chunks of constant size.

Parameters
  • d_model (int) – Dimension of the input vector.

  • q (int) – Dimension of all query matrices.

  • v (int) – Dimension of all value matrices.

  • h (int) – Number of heads.

  • attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.

  • chunk_size (Optional[int]) – Size of the chunks on which attention is applied. The last one may be smaller (see torch.Tensor.chunk). Default is 168.
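
The note on the last chunk follows from how torch.Tensor.chunk balances sizes. The toy sketch below only illustrates that splitting step (the way the number of chunks is derived is an assumption), not the module's internals:

    import math
    import torch

    K, chunk_size = 500, 168
    x = torch.randn(8, K, 64)                    # (batch_size, K, d_model)

    # Enough chunks to cover the sequence; torch.chunk balances the sizes,
    # so the last chunk may end up smaller than the others.
    n_chunks = math.ceil(K / chunk_size)
    chunks = torch.chunk(x, n_chunks, dim=1)
    print([c.shape[1] for c in chunks])          # [167, 167, 166]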

forward(query, key, value, mask=None)

Propagate the input forward through the Multi Head Attention block.

For each head, we compute the query, key and value matrices, then apply the Scaled Dot-Product attention. The per-head results are concatenated and returned with shape (batch_size, K, d_model).

Parameters
  • query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.

  • key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.

  • value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.

  • mask (Optional[str]) – Mask to apply to the scores before computing attention. One of 'subsequent' or None. Default is None.

Return type

Tensor

Returns

Self-attention tensor with shape (batch_size, K, d_model).

class multiHeadAttention.MultiHeadAttentionWindow(d_model, q, v, h, attention_size=None, window_size=168, padding=42, **kwargs)

Bases: MultiHeadAttention

Multi Head Attention block with a moving window.

Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model). Queries, keys and values are divided into chunks using a moving window.

Parameters
  • d_model (int) – Dimension of the input vector.

  • q (int) – Dimension of all query matrices.

  • v (int) – Dimension of all value matrices.

  • h (int) – Number of heads.

  • attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.

  • window_size (Optional[int]) – Size of the window used to extract chunks. Default is 168.

  • padding (Optional[int]) – Padding around each window, applied to the input sequence. Default is 168 // 4 = 42.
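
The moving-window idea can be pictured as padding the time dimension and extracting overlapping windows before attending within each one. The sketch below is only an illustration of that idea, not the module's internals; the zero padding, the use of torch.Tensor.unfold and the choice of K as a multiple of window_size are assumptions made for the example.

    import torch
    import torch.nn.functional as F

    window_size, padding = 168, 42
    batch_size, K, d_model = 8, 504, 64              # K assumed to be a multiple of window_size
    x = torch.randn(batch_size, K, d_model)

    # Zero-pad the time dimension so border windows also get surrounding context.
    x_padded = F.pad(x, (0, 0, padding, padding))    # (batch_size, K + 2 * padding, d_model)

    # Extract windows of window_size + 2 * padding steps, moving window_size steps at a time.
    windows = x_padded.unfold(1, window_size + 2 * padding, window_size)
    print(windows.shape)   # torch.Size([8, 3, 64, 252]): (batch, n_windows, d_model, window span)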

forward(query, key, value, mask=None)

Propagate the input forward through the Multi Head Attention block.

For each head, we compute the query, key and value matrices, then apply the Scaled Dot-Product attention. The per-head results are concatenated and returned with shape (batch_size, K, d_model).

Parameters
  • query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.

  • key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.

  • value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.

  • mask (Optional[str]) – Mask to apply to the scores before computing attention. One of 'subsequent' or None. Default is None.

Return type

Tensor

Returns

Self-attention tensor with shape (batch_size, K, d_model).
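
Since all three classes share the same call signature, they can be swapped by changing only the constructor. A hedged sketch, with the import path assumed:

    from multiHeadAttention import (MultiHeadAttention,
                                    MultiHeadAttentionChunk,
                                    MultiHeadAttentionWindow)

    # Same base arguments, different attention locality trade-offs.
    mha_full   = MultiHeadAttention(d_model=64, q=8, v=8, h=4)
    mha_chunk  = MultiHeadAttentionChunk(d_model=64, q=8, v=8, h=4, chunk_size=168)
    mha_window = MultiHeadAttentionWindow(d_model=64, q=8, v=8, h=4, window_size=168, padding=42)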