MultiHeadAttention module

class multiHeadAttention.MultiHeadAttention(d_model, q, v, h, attention_size=None)

Bases: Module

Multi Head Attention block from the paper "Attention Is All You Need".

Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model).

Parameters
  • d_model (int) – Dimension of the input vector.

  • q (int) – Dimension of all query matrices.

  • v (int) – Dimension of all value matrices.

  • h (int) – Number of heads.

  • attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.
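
A minimal usage sketch is given below; the import path and the dimensions are assumptions for illustration. Self-attention is obtained by passing the same tensor as query, key and value (the three inputs may also differ, e.g. when attending over encoder outputs).

    import torch
    from multiHeadAttention import MultiHeadAttention   # assumed import path

    batch_size, K, d_model = 8, 100, 64                  # illustrative dimensions
    mha = MultiHeadAttention(d_model=d_model, q=8, v=8, h=4, attention_size=None)

    x = torch.randn(batch_size, K, d_model)
    out = mha(query=x, key=x, value=x)                   # self-attention: same tensor three times
    print(out.shape)                                     # torch.Size([8, 100, 64])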

property attention_map: Tensor

Attention map from the last forward pass; the score variable in the original paper.
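
A short sketch of reading the map back after a forward pass (import path assumed; the exact shape of the stored scores depends on the implementation):

    import torch
    from multiHeadAttention import MultiHeadAttention   # assumed import path

    mha = MultiHeadAttention(d_model=64, q=8, v=8, h=4)
    x = torch.randn(2, 50, 64)
    _ = mha(x, x, x)
    scores = mha.attention_map                           # scores computed during the last forward pass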

forward(query, key, value, mask=None)

Propagate the input forward through the Multi Head Attention block.

For each head, we compute the query, key and value matrices, then apply the Scaled Dot-Product attention. The per-head results are concatenated and returned with shape (batch_size, K, d_model).

Parameters
  • query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.

  • key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.

  • value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.

  • mask (Optional[str]) – Mask to apply to the scores before computing attention. One of 'subsequent' or None. Default is None.

Return type

Tensor

Returns

Self-attention tensor with shape (batch_size, K, d_model).
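
The per-head computation described above is the standard Scaled Dot-Product attention. The sketch below reproduces it with plain PyTorch operations for a single head; the function name and masking details are illustrative, not the module's internals.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product(queries, keys, values, mask=None):
        # queries, keys: (batch_size, K, q); values: (batch_size, K, v)
        q_dim = queries.shape[-1]
        scores = queries.bmm(keys.transpose(1, 2)) / q_dim ** 0.5   # (batch_size, K, K)
        if mask == 'subsequent':
            # Forbid each position from attending to later positions.
            future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(future, float('-inf'))
        attention = F.softmax(scores, dim=-1)
        return attention.bmm(values)                                # (batch_size, K, v)

    # Example call on random data with the causal mask:
    # out = scaled_dot_product(torch.randn(2, 10, 8), torch.randn(2, 10, 8), torch.randn(2, 10, 8), mask='subsequent')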

class multiHeadAttention.MultiHeadAttentionChunk(d_model, q, v, h, attention_size=None, chunk_size=168, **kwargs)

Bases: MultiHeadAttention

Multi Head Attention block with chunking.

Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model). Queries, keys and values are divided into chunks of constant size.

Parameters
  • d_model (int) – Dimension of the input vector.

  • q (int) – Dimension of all query matrices.

  • v (int) – Dimension of all value matrices.

  • h (int) – Number of heads.

  • attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.

  • chunk_size (Optional[int]) – Size of the chunks on which attention is applied. The last one may be smaller (see torch.Tensor.chunk). Default is 168.
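
The note on the last chunk follows from how torch.Tensor.chunk balances sizes. The toy sketch below only illustrates that splitting step (the way the number of chunks is derived is an assumption), not the module's internals:

    import math
    import torch

    K, chunk_size = 500, 168
    x = torch.randn(8, K, 64)                    # (batch_size, K, d_model)

    # Enough chunks to cover the sequence; torch.chunk balances the sizes,
    # so the last chunk may end up smaller than the others.
    n_chunks = math.ceil(K / chunk_size)
    chunks = torch.chunk(x, n_chunks, dim=1)
    print([c.shape[1] for c in chunks])          # [167, 167, 166]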

forward(query, key, value, mask=None)

Propagate the input forward through the Multi Head Attention block.

For each head, we compute the query, key and value matrices, then apply the Scaled Dot-Product attention. The per-head results are concatenated and returned with shape (batch_size, K, d_model).

Parameters
  • query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.

  • key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.

  • value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.

  • mask (Optional[str]) – Mask to apply to the scores before computing attention. One of 'subsequent' or None. Default is None.

Return type

Tensor

Returns

Self-attention tensor with shape (batch_size, K, d_model).

class multiHeadAttention.MultiHeadAttentionWindow(d_model, q, v, h, attention_size=None, window_size=168, padding=42, **kwargs)

Bases: MultiHeadAttention

Multi Head Attention block with a moving window.

Given three inputs of shape (batch_size, K, d_model), used to compute queries, keys and values, this block outputs a self-attention tensor of shape (batch_size, K, d_model). Queries, keys and values are divided into chunks using a moving window.

Parameters
  • d_model (int) – Dimension of the input vector.

  • q (int) – Dimension of all query matrices.

  • v (int) – Dimension of all value matrices.

  • h (int) – Number of heads.

  • attention_size (Optional[int]) – Number of backward elements to apply attention to. Deactivated if None. Default is None.

  • window_size (Optional[int]) – Size of the window used to extract chunks. Default is 168.

  • padding (Optional[int]) – Padding around each window, applied to the input sequence. Default is 168 // 4 = 42.
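
The moving-window idea can be pictured as padding the time dimension and extracting overlapping windows before attending within each one. The sketch below is only an illustration of that idea, not the module's internals; the zero padding, the use of torch.Tensor.unfold and the choice of K as a multiple of window_size are assumptions made for the example.

    import torch
    import torch.nn.functional as F

    window_size, padding = 168, 42
    batch_size, K, d_model = 8, 504, 64              # K assumed to be a multiple of window_size
    x = torch.randn(batch_size, K, d_model)

    # Zero-pad the time dimension so border windows also get surrounding context.
    x_padded = F.pad(x, (0, 0, padding, padding))    # (batch_size, K + 2 * padding, d_model)

    # Extract windows of window_size + 2 * padding steps, moving window_size steps at a time.
    windows = x_padded.unfold(1, window_size + 2 * padding, window_size)
    print(windows.shape)   # torch.Size([8, 3, 64, 252]): (batch, n_windows, d_model, window span)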

forward(query, key, value, mask=None)

Propagate the input forward through the Multi Head Attention block.

For each head, we compute the query, key and value matrices, then apply the Scaled Dot-Product attention. The per-head results are concatenated and returned with shape (batch_size, K, d_model).

Parameters
  • query (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute queries.

  • key (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute keys.

  • value (Tensor) – Input tensor with shape (batch_size, K, d_model) used to compute values.

  • mask (Optional[str]) – Mask to apply to the scores before computing attention. One of 'subsequent' or None. Default is None.

Return type

Tensor

Returns

Self-attention tensor with shape (batch_size, K, d_model).
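
Since all three classes share the same call signature, they can be swapped by changing only the constructor. A hedged sketch, with the import path assumed:

    from multiHeadAttention import (MultiHeadAttention,
                                    MultiHeadAttentionChunk,
                                    MultiHeadAttentionWindow)

    # Same base arguments, different attention locality trade-offs.
    mha_full   = MultiHeadAttention(d_model=64, q=8, v=8, h=4)
    mha_chunk  = MultiHeadAttentionChunk(d_model=64, q=8, v=8, h=4, chunk_size=168)
    mha_window = MultiHeadAttentionWindow(d_model=64, q=8, v=8, h=4, window_size=168, padding=42)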