deepmatcher.word_contextualizers
SelfAttention
class deepmatcher.word_contextualizers.SelfAttention(heads=1, hidden_size=None, input_dropout=0, alignment_network='decomposable', scale=False, score_dropout=0, value_transform_network=None, value_merge='concat', transform_dropout=0, output_transform_network=None, output_dropout=0, bypass_network='highway', input_size=None)

Self-attention based word contextualizer.
Supports vanilla self attention and multi-head self attention.
Parameters:

- heads (int) – Number of attention heads to use. Defaults to 1.
- hidden_size (int) – The default hidden size of the alignment_network and transform networks, if they are not disabled.
- input_dropout (float) – If non-zero, applies dropout to the input to this module. Dropout probability must be between 0 and 1.
- alignment_network (string or deepmatcher.modules.AlignmentNetwork or callable) – The neural network that takes the input sequence, aligns each word in the sequence with the other words in the sequence, and returns the corresponding alignment score matrix. Argument must specify an Align operation.
- scale (bool) – Whether to scale the alignment scores by the square root of the hidden_size parameter. Based on scaled dot-product attention.
- score_dropout (float) – If non-zero, applies dropout to the alignment score matrix. Dropout probability must be between 0 and 1.
- value_transform_network (string or Transform or callable) – For each word embedding in the input sequence, SelfAttention takes a weighted average of the aligning values, i.e., the aligning word embeddings, based on the alignment scores. This parameter specifies the neural network to transform the values (word embeddings) before taking the weighted average. Argument must be None or specify a Transform operation. If the argument is a string, the hidden size of the transform operation is computed as hidden_size // heads. If the argument is None and heads is 1, the values are not transformed. If the argument is None and heads is greater than 1, a 1-layer highway network without any non-linearity is used; its hidden size is computed as described above.
- value_merge (string or Merge or callable) – For each word embedding in the input sequence, each SelfAttention head produces one corresponding vector as output. This parameter specifies how to merge the outputs of all attention heads for each word embedding. Concatenates the outputs of all heads by default. Argument must specify a Merge operation.
- transform_dropout (float) – If non-zero, applies dropout to the output of the value_transform_network, if applicable. Dropout probability must be between 0 and 1.
- output_transform_network (string or Transform or callable) – For each word embedding in the input sequence, SelfAttention produces one corresponding vector as output. This neural network specifies how to transform each of these output vectors to obtain a hidden representation of size hidden_size. Argument must be None or specify a Transform operation. If the argument is None and heads is 1, the output vectors are not transformed. If the argument is None and heads is greater than 1, a 1-layer highway network without any non-linearity is used.
- output_dropout (float) – If non-zero, applies dropout to the output of the output_transform_network, if applicable. Dropout probability must be between 0 and 1.
- bypass_network (string or Bypass or callable) – The bypass network (e.g., residual or highway network) to use. The input word embedding sequence to this module is treated as the raw input to the bypass network, and the final output vector sequence (the output of value_merge, or of output_transform_network if applicable) is treated as the transformed input. Argument must specify a Bypass operation. If None, no bypass network is used.
- input_size (int) – The number of features in the input to the module. This parameter will be automatically specified by LazyModule.
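To make the parameter semantics concrete, here is a minimal NumPy sketch of what this module computes. This is an illustration, not deepmatcher's implementation: the value transform is a plain random linear map standing in for value_transform_network, and the bypass, dropout, and output transform stages are omitted. It shows the roles of heads, scale, the per-head value size of hidden_size // heads, and value_merge='concat'.

```python
import numpy as np

def self_attention_sketch(x, heads=2, scale=True, seed=0):
    """x: (seq_len, hidden_size) word embeddings -> (seq_len, hidden_size)."""
    rng = np.random.default_rng(seed)
    seq_len, hidden_size = x.shape
    head_size = hidden_size // heads           # per-head value size
    outputs = []
    for _ in range(heads):
        # Stand-in for value_transform_network: hidden_size -> hidden_size // heads.
        w_v = rng.standard_normal((hidden_size, head_size)) / np.sqrt(hidden_size)
        values = x @ w_v
        # Alignment scores: each word scored against every word in the sequence.
        scores = x @ x.T
        if scale:
            scores = scores / np.sqrt(hidden_size)   # scaled dot-product attention
        # Softmax over the aligned words (numerically stable form).
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights = weights / weights.sum(axis=1, keepdims=True)
        # Weighted average of the (transformed) values for each word.
        outputs.append(weights @ values)
    # value_merge='concat': concatenate the head outputs per word,
    # recovering heads * (hidden_size // heads) = hidden_size features.
    return np.concatenate(outputs, axis=1)

x = np.random.default_rng(1).standard_normal((5, 8))   # 5 words, size-8 embeddings
out = self_attention_sketch(x, heads=2)
print(out.shape)   # (5, 8)
```

In deepmatcher itself, a SelfAttention instance (or the string 'self-attention') is typically passed as the word_contextualizer of an attribute summarizer rather than called directly; the full model then handles the bypass network and dropout stages described above.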