deepmatcher

The deepmatcher package contains the high-level modules used to construct deep learning models for entity matching.

Main Modules

MatchingModel

class deepmatcher.MatchingModel(attr_summarizer='hybrid', attr_condense_factor='auto', attr_comparator=None, attr_merge='concat', classifier='2-layer-highway', hidden_size=300)

A neural network model for entity matching.

Refer to the Matching Models tutorial for details on how to customize a MatchingModel. A brief intro is below:

This network consists of the following components:

  1. Attribute Summarizers
  2. Attribute Comparators
  3. A Classifier

Creating a MatchingModel instance does not immediately construct the neural network. The network will be constructed just before training based on metadata from the training set:

  1. For each attribute (e.g., Product Name, Address, etc.), an Attribute Summarizer is constructed using the specified attr_summarizer template.
  2. For each attribute, an Attribute Comparator is constructed using the specified attr_comparator template.
  3. A Classifier is constructed based on the specified classifier template.
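
The deferred construction described above can be illustrated with a toy class. This is only a sketch of the pattern, not deepmatcher's implementation: the class name LazyMatcherSketch and its attributes are hypothetical, and the "components" are plain tuples rather than neural network modules.

```python
class LazyMatcherSketch:
    """Toy illustration of deferred construction (hypothetical, not deepmatcher code)."""

    def __init__(self, attr_summarizer="hybrid", attr_comparator=None,
                 classifier="2-layer-highway"):
        # Only the templates are recorded here; no component is built yet.
        self.templates = {
            "attr_summarizer": attr_summarizer,
            "attr_comparator": attr_comparator,
            "classifier": classifier,
        }
        self.components = None

    def initialize(self, attribute_names):
        # Build one summarizer and one comparator per attribute, plus one classifier.
        t = self.templates
        self.components = {
            "summarizers": {a: ("summarizer", t["attr_summarizer"]) for a in attribute_names},
            "comparators": {a: ("comparator", t["attr_comparator"]) for a in attribute_names},
            "classifier": ("classifier", t["classifier"]),
        }


model = LazyMatcherSketch()
built_at_creation = model.components is not None  # False: nothing is built yet
model.initialize(["Product Name", "Address"])     # attribute names come from the training set
```

In this sketch the attribute names are passed in by hand; in the library they are described as coming from the training set metadata.
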
Parameters:
  • attr_summarizer (string or AttrSummarizer or callable) – The attribute summarizer. Takes in two word embedding sequences and summarizes the information in them to produce two summary vectors as output. Options listed here. Defaults to ‘hybrid’, i.e., the Hybrid model.
  • attr_comparator (string or Merge or callable) – The attribute comparator. Takes as input the two summary vectors and applies a comparison function over those summaries to obtain the final similarity representation of the two attribute values. Argument must specify a Merge operation. Default is selected based on attr_summarizer choice.
  • attr_condense_factor (string or int) – The factor by which to condense each attribute similarity vector. E.g., if attr_condense_factor is set to 3 and the attribute similarity vector size is 300, then each attribute similarity vector is transformed to a 100-dimensional vector using a linear transformation. The purpose of condensing is to reduce the number of parameters in the classifier module. This parameter can be set to a number or ‘auto’. If ‘auto’, the condensing factor is set equal to the number of attributes, capped at 6 when there are more than 6 attributes. Defaults to ‘auto’.
  • attr_merge (string or Merge or callable) – The operation used to merge the (optionally condensed) attribute similarity vectors to obtain the input to the classifier. Argument must specify a Merge operation. Defaults to ‘concat’, i.e., concatenate all attribute similarity vectors to form the classifier input.
  • classifier (string or Classifier or callable) – The neural network to perform match / mismatch classification based on attribute similarity representations. Options listed here. Defaults to ‘2-layer-highway’, i.e., use a two layer highway network followed by a softmax layer for classification.
  • hidden_size (int) – The hidden size to use for the attr_summarizer and the classifier modules, if they are string arguments. If a module or callable input is specified for a component, this argument is ignored for that component.
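
The arithmetic behind attr_condense_factor and the ‘concat’ merge can be sketched as follows. The helper names are hypothetical, and the use of integer division for the condensed size is an assumption; the documentation only gives the 300 / 3 = 100 example.

```python
def resolve_condense_factor(attr_condense_factor, num_attributes, cap=6):
    # 'auto' means: use the number of attributes, but never more than 6.
    if attr_condense_factor == "auto":
        return min(num_attributes, cap)
    return int(attr_condense_factor)


def classifier_input_size(sim_size, num_attributes, attr_condense_factor="auto"):
    factor = resolve_condense_factor(attr_condense_factor, num_attributes)
    condensed = sim_size // factor           # assumption: integer division
    merged = condensed * num_attributes      # 'concat' merge: vectors laid end to end
    return factor, condensed, merged


documented_example = classifier_input_size(300, 3, attr_condense_factor=3)
many_attributes = classifier_input_size(300, 8)  # 'auto' with more than 6 attributes
```

With 8 attributes and ‘auto’, the factor is capped at 6, so each 300-dimensional similarity vector is condensed to 50 dimensions before concatenation.
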
forward(input)

Performs a forward pass through the model.

Overrides torch.nn.Module.forward().

Parameters: input (MatchingBatch) – A batch of tuple pairs processed into tensors.
initialize(train_dataset, init_batch=None)

Eagerly initialize the matching model using the actual training data.

Instantiates all sub-components and their trainable parameters.

Parameters:
  • train_dataset (MatchingDataset) – The training dataset obtained using deepmatcher.data.process().
  • init_batch (MatchingBatch) – A batch of data to forward propagate through the model. If None, a batch is drawn from the training dataset.
load_state(path)

Load the model state from the file at the given path.

Parameters: path (string) – The path to load the model state from.
run_eval(dataset, batch_size=32, device=None, progress_style='bar', log_freq=5, sort_in_buckets=None)

Evaluate the model on the specified dataset.

Parameters:
  • dataset (MatchingDataset) – The evaluation dataset obtained using deepmatcher.data.process().
  • batch_size (int) – Mini-batch size used during evaluation. For details on what this is see this video. Defaults to 32. This is a keyword only param.
  • device (int) – The device index of the GPU on which to run the model. Set to -1 to use CPU only, even if GPU is available. If None, will use first available GPU, or use CPU if no GPUs are available. Defaults to None. This is a keyword only param.
  • progress_style (string) – Sets the progress update style. One of ‘bar’ or ‘log’. If ‘bar’, uses a progress bar, updated every N batches. If ‘log’, prints evaluation stats every N batches. N is determined by the log_freq param. This is a keyword only param.
  • log_freq (int) – Number of batches between progress updates. Defaults to 5. This is a keyword only param.
  • sort_in_buckets (bool) – Whether to batch examples of similar lengths together. If True, minimizes amount of padding needed while producing freshly shuffled batches for each new epoch. Implemented using torchtext.data.pool(). Defaults to True. This is a keyword only param.
Returns:

The F1 score obtained by the model on the dataset.

Return type:

float
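
The effect of sort_in_buckets on padding can be illustrated with a small pure-Python sketch. It is not the torchtext implementation: the function names are hypothetical, and the bucket size of 100 batches is an assumption made for the illustration. The idea is to shuffle the examples, sort by length inside each large bucket, cut the bucket into batches, and then shuffle the batches.

```python
import random


def plain_batches(lengths, batch_size):
    # Cut the sequence lengths into consecutive batches, in the given order.
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def pooled_batches(lengths, batch_size, bucket_factor=100, seed=0):
    rng = random.Random(seed)
    shuffled = list(lengths)
    rng.shuffle(shuffled)                      # fresh order for each epoch
    bucket_size = batch_size * bucket_factor   # assumption: buckets of 100 batches
    batches = []
    for start in range(0, len(shuffled), bucket_size):
        bucket = sorted(shuffled[start:start + bucket_size])  # similar lengths end up adjacent
        batches.extend(plain_batches(bucket, batch_size))
    rng.shuffle(batches)                       # batch order stays random
    return batches


def padding_needed(batches):
    # Every example is padded up to the longest example in its batch.
    return sum(max(batch) - n for batch in batches for n in batch)


lengths = [3, 40, 5, 38, 4, 41, 6, 39]
unsorted_padding = padding_needed(plain_batches(lengths, batch_size=2))
pooled_padding = padding_needed(pooled_batches(lengths, batch_size=2))
```

On this toy input, batching in the original order pairs short sequences with long ones, while bucketing pairs sequences of similar length and needs far less padding.
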

run_prediction(dataset, output_attributes=False, batch_size=32, device=None, progress_style='bar', log_freq=5, sort_in_buckets=None)

Use the model to obtain predictions, i.e., match scores on the specified dataset.

Parameters:
  • dataset (MatchingDataset) – The dataset (labeled or not) obtained using deepmatcher.data.process() or deepmatcher.data.process_unlabeled().
  • output_attributes (bool) – Whether to include all attributes in the original CSV file of the dataset in the returned pandas table.
  • batch_size (int) – Mini-batch size used during prediction. For details on what this is see this video. Defaults to 32. This is a keyword only param.
  • device (int) – The device index of the GPU on which to run the model. Set to -1 to use CPU only, even if GPU is available. If None, will use first available GPU, or use CPU if no GPUs are available. Defaults to None. This is a keyword only param.
  • progress_style (string) – Sets the progress update style. One of ‘bar’ or ‘log’. If ‘bar’, uses a progress bar, updated every N batches. If ‘log’, prints prediction stats every N batches. N is determined by the log_freq param. This is a keyword only param.
  • log_freq (int) – Number of batches between progress updates. Defaults to 5. This is a keyword only param.
  • sort_in_buckets (bool) – Whether to batch examples of similar lengths together. If True, minimizes amount of padding needed while producing freshly shuffled batches for each new epoch. Implemented using torchtext.data.pool(). Defaults to True. This is a keyword only param.
Returns:

A pandas DataFrame containing tuple pair IDs (in the “id” column) and the corresponding match score predictions (in the “match_score” column). Will also include all attributes in the original CSV file of the dataset if output_attributes is True.

Return type:

pandas.DataFrame
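
Match scores are usually turned into match decisions by thresholding. The sketch below uses a plain list of dicts as a stand-in for the returned table so that it runs without pandas; the helper name and the 0.5 threshold are assumptions, not part of deepmatcher.

```python
def select_matches(rows, threshold=0.5):
    # Keep the IDs of tuple pairs whose score reaches the (assumed) threshold.
    return [row["id"] for row in rows if row["match_score"] >= threshold]


# Stand-in rows mirroring the "id" and "match_score" columns.
predictions = [
    {"id": 0, "match_score": 0.93},
    {"id": 1, "match_score": 0.12},
    {"id": 2, "match_score": 0.50},
]
matched_ids = select_matches(predictions)
```

With an actual DataFrame, the same selection would be a boolean filter on the “match_score” column.
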

run_train(train_dataset, validation_dataset, best_save_path, epochs=30, criterion=None, optimizer=None, pos_neg_ratio=None, pos_weight=None, label_smoothing=0.05, save_every_prefix=None, save_every_freq=None, batch_size=32, device=None, progress_style='bar', log_freq=5, sort_in_buckets=None)

Train the model using the specified training set.

Parameters:
  • train_dataset (MatchingDataset) – The training dataset obtained using deepmatcher.data.process().
  • validation_dataset (MatchingDataset) – The validation dataset obtained using deepmatcher.data.process(). This is used for early stopping.
  • best_save_path (string) – The path to save the best model to. At the end of each epoch, if the new model accuracy (F1) is better than all previous epochs, then it is saved to this location.
  • epochs (int) – Number of training epochs, i.e., number of times to cycle through the entire training set. Defaults to 30.
  • criterion (torch.nn.Module) – The loss function to use. Refer to the losses section of the PyTorch API for options. By default, deepmatcher will output a 2d tensor of shape (N, C) where N is the batch size and C is 2 - the number of classes. Keep this in mind when picking the loss. Defaults to SoftNLLLoss with label smoothing.
  • optimizer (Optimizer) – The optimizer to use for updating the trainable parameters of the MatchingModel neural network after each iteration. If not specified, an Optimizer using Adam is constructed.
  • pos_neg_ratio (int) – The weight of the positive class (match) with respect to the negative class (non-match). This parameter must be specified if there is a significant class imbalance in the dataset.
  • label_smoothing (float) – The label_smoothing parameter to constructor of SoftNLLLoss criterion. Only used when criterion param is None. Defaults to 0.05.
  • save_every_prefix (string) – Prefix of the path to save model to, after end of every N epochs, where N is determined by save_every_freq param. E.g. setting this to “/path/to/saved/model” will save models to “/path/to/saved/model_ep1.pth”, “/path/to/saved/model_ep2.pth”, etc. Models will not be saved periodically if this is None. Defaults to None.
  • save_every_freq (int) – Determines the frequency (number of epochs) for saving models periodically to the path specified by the save_every_prefix param (has no effect if that param is not set). Defaults to 1.
  • batch_size (int) – Mini-batch size for SGD. For details on what this is see this video. Defaults to 32. This is a keyword only param.
  • device (int) – The device index of the GPU on which to train the model. Set to -1 to use CPU only, even if GPU is available. If None, will use first available GPU, or use CPU if no GPUs are available. Defaults to None. This is a keyword only param.
  • progress_style (string) – Sets the progress update style. One of ‘bar’ or ‘log’. If ‘bar’, uses a progress bar, updated every N batches. If ‘log’, prints training stats every N batches. N is determined by the log_freq param. This is a keyword only param.
  • log_freq (int) – Number of batches between progress updates. Defaults to 5. This is a keyword only param.
  • sort_in_buckets (bool) – Whether to batch examples of similar lengths together. If True, minimizes amount of padding needed while producing freshly shuffled batches for each new epoch. Implemented using torchtext.data.pool(). Defaults to True. This is a keyword only param.
Returns:

The best F1 score obtained by the model on the validation dataset.

Return type:

float
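
The checkpointing behavior described for best_save_path, save_every_prefix and save_every_freq can be sketched as plain bookkeeping. This is an illustration under assumptions, not deepmatcher's training loop: the function name is hypothetical, epochs are numbered from 1 to match the “_ep1.pth” example, and a periodic save is assumed to happen whenever the epoch number is a multiple of save_every_freq.

```python
def plan_saves(val_f1_per_epoch, best_save_path, save_every_prefix=None, save_every_freq=1):
    """Return the best validation F1 and the (epoch, path) pairs that would be written."""
    best_f1 = float("-inf")
    saves = []
    for epoch, f1 in enumerate(val_f1_per_epoch, start=1):
        if f1 > best_f1:
            # New best validation F1: overwrite the best-model file.
            best_f1 = f1
            saves.append((epoch, best_save_path))
        if save_every_prefix is not None and epoch % save_every_freq == 0:
            # Periodic checkpoint, named after the documented "_ep<N>.pth" pattern.
            saves.append((epoch, f"{save_every_prefix}_ep{epoch}.pth"))
    return best_f1, saves


best_f1, saves = plan_saves(
    [0.6, 0.8, 0.7],               # made-up validation F1 scores, one per epoch
    best_save_path="best.pth",
    save_every_prefix="/path/to/saved/model",
    save_every_freq=1,
)
```

In this run the best model is written after epochs 1 and 2 but not after epoch 3, while a periodic checkpoint is written after every epoch.
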

save_state(path, include_meta=True)

Save the model state to the given path.

Parameters:
  • path (string) – The path to save the model state to.
  • include_meta (bool) – Whether to include training dataset metadata along with the model parameters when saving. If False, the model will not be automatically initialized upon loading; you will need to initialize it manually using initialize().

AttrSummarizer

class deepmatcher.AttrSummarizer(word_contextualizer, word_comparator, word_aggregator, hidden_size=None)

The Attribute Summarizer.

Summarizes the two word embedding sequences of an attribute. Refer to this tutorial for details. Sub-classes that implement various built-in options for this module are defined in deepmatcher.attr_summarizers.

Parameters:
  • word_contextualizer (string or WordContextualizer or callable) – Module that takes a word embedding sequence and produces a context-aware word embedding sequence as output. Options listed here.
  • word_comparator (string or WordComparator or callable) – Module that takes two word embedding sequences, aligns words in the two sequences, and performs a word by word comparison. Options listed here.
  • word_aggregator (string or WordAggregator or callable) – Module that takes a sequence of vectors and aggregates it into a single vector. Options listed here.
  • hidden_size (int) – The hidden size used for the three sub-modules (i.e., word_contextualizer, word_comparator, and word_aggregator). If None, uses the input size, i.e., the size of the last dimension of the input to this module as the hidden size. Defaults to None.
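
How the three sub-modules fit together can be sketched with plain Python callables. Everything here is a toy under stated assumptions: the function names are hypothetical, word embeddings are lists of floats, the contextualizer is the identity, the comparator aligns each word with its nearest word in the other sequence and takes an element-wise absolute difference, and the aggregator is a mean. The built-in deepmatcher modules are neural networks and behave differently.

```python
def summarize_attribute(left, right, contextualize, compare, aggregate):
    # 1. Contextualize each sequence, 2. compare each against the other, 3. aggregate.
    left_ctx, right_ctx = contextualize(left), contextualize(right)
    left_cmp = compare(left_ctx, right_ctx)
    right_cmp = compare(right_ctx, left_ctx)
    return aggregate(left_cmp), aggregate(right_cmp)


def identity(seq):
    # Toy contextualizer: leaves the embeddings unchanged.
    return seq


def nearest_abs_diff(seq, other):
    # Toy comparator: align each vector with its closest vector in `other`.
    compared = []
    for v in seq:
        match = min(other, key=lambda w: sum((a - b) ** 2 for a, b in zip(v, w)))
        compared.append([abs(a - b) for a, b in zip(v, match)])
    return compared


def mean_pool(seq):
    # Toy aggregator: average the vectors dimension by dimension.
    return [sum(column) / len(seq) for column in zip(*seq)]


left = [[1.0, 0.0], [0.0, 1.0]]
right = [[1.0, 0.0], [0.0, 0.5]]
left_summary, right_summary = summarize_attribute(
    left, right, identity, nearest_abs_diff, mean_pool
)
```

Each sequence yields one summary vector, which matches the description of the attribute summarizer producing two summary vectors as output.
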

Classifier

class deepmatcher.Classifier(transform_network, hidden_size=None)

The Classifier Network.

Predicts whether a tuple pair matches or not given a representation of all the attribute summarizations. Refer to this tutorial for details.

Parameters:
  • transform_network (string or Transform or callable) – The neural network to transform the input vector of the classifier to a hidden representation of size hidden_size. Argument must specify a Transform operation.
  • hidden_size (int) – The size of the hidden representation generated by the transformation network. If None, uses the size of the input vector to this module as the hidden size.
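
The two stages of the classifier, a transformation to a hidden representation followed by a two-class softmax, can be sketched in pure Python. This is a toy under assumptions: the function names, the ReLU-style transform, the hand-written output weights, and the convention that index 1 is the match class are all hypothetical choices for illustration.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a short list of logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, transform, output_weights):
    hidden = transform(features)  # stands in for the transform network
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(logits)        # [P(non-match), P(match)] under the assumed convention


def relu(vector):
    # Toy transform: clip negative values to zero.
    return [max(0.0, v) for v in vector]


weights = [[1.0, 0.0], [0.0, 1.0]]  # one row per class
balanced = classify([2.0, 2.0], relu, weights)
leaning_match = classify([0.0, 3.0], relu, weights)
```

The second element of the output plays the role of a match score in this sketch.
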

Components of Attribute Summarizer

WordContextualizer

class deepmatcher.WordContextualizer

The Word Contextualizer.

Takes a word embedding sequence and produces a context-aware word embedding sequence. Refer to this tutorial for details. Sub-classes that implement various options for this module are defined in deepmatcher.word_contextualizers.

WordComparator

class deepmatcher.WordComparator

The Word Comparator.

Takes two word embedding sequences, aligns words in the two sequences, and performs a word-by-word comparison. Refer to this tutorial for details. Sub-classes that implement various options for this module are defined in deepmatcher.word_comparators.

WordAggregator

class deepmatcher.WordAggregator

The Word Aggregator.

Takes a sequence of vectors and aggregates it into a single vector. Refer to this tutorial for details. Sub-classes that implement various options for this module are defined in deepmatcher.word_aggregators.