deepmatcher.data package

Submodules

deepmatcher.data.dataset module

class deepmatcher.data.dataset.MatchingDataset(fields, column_naming, path=None, format='csv', examples=None, metadata=None, **kwargs)[source]

Bases: torchtext.data.dataset.TabularDataset

Represents a dataset with associated metadata.

Holds all information about one split of a dataset (e.g. training set).

Variables:
  • fields (dict) – A mapping from attribute names (e.g. “left_address”) to corresponding MatchingField objects that specify how to process the field.
  • examples (list) – A list containing all the examples (labeled tuple pairs) in this dataset.
  • metadata (dict) – Metadata about the dataset (e.g. word probabilities). See compute_metadata() for details.
  • corresponding_field (dict) – A mapping from left table attribute names (e.g. “left_address”) to corresponding right table attribute names (e.g. “right_address”) and vice versa.
  • text_fields (dict) – A mapping from canonical attribute names (e.g. “address”) to tuples of the corresponding left and right attribute names (e.g. (“left_address”, “right_address”)).
  • all_left_fields (list) – A list of all left table attribute names.
  • all_right_fields (list) – A list of all right table attribute names.
  • canonical_text_fields (list) – A list of all canonical attribute names.
  • label_field (str) – Name of the column containing labels.
  • id_field (str) – Name of the column containing tuple pair ids.
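As a quick illustration, these attributes can be inspected on a loaded dataset. A minimal sketch, assuming train is a MatchingDataset returned by deepmatcher.data.process() and using hypothetical attribute names:

    # `train` is assumed to come from deepmatcher.data.process(); the
    # attribute names below ("left_address", etc.) are hypothetical.
    print(train.all_left_fields)        # ['left_name', 'left_address']
    print(train.all_right_fields)       # ['right_name', 'right_address']
    print(train.canonical_text_fields)  # ['name', 'address']
    print(train.corresponding_field['left_address'])   # 'right_address'
    print(train.text_fields['address'])  # ('left_address', 'right_address')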
exception CacheStaleException[source]

Bases: Exception

Raised when the dataset cache is stale and no fallback behavior is specified.

compute_metadata(pca=False)[source]

Computes metadata about the dataset.

Computes the following metadata about the dataset:

  • word_probs: For each attribute in the dataset, a mapping from words to
    word (token) probabilities.
  • totals: For each attribute in the dataset, a count of the total number of
    words present in all attribute examples.
  • pc: For each attribute in the dataset, the first principal component of the sequence embeddings for all attribute examples. The sequence embedding of an attribute value is computed by taking the weighted average of its word embeddings, where the weight is the soft inverse word probability. Refer to “A simple but tough-to-beat baseline for sentence embeddings” by Arora et al. (2017) for details.
Parameters:
  • pca (bool) – Whether to compute the pc metadata.
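As a rough illustration of the pc computation (a sketch of the SIF scheme from Arora et al. (2017), not the library's exact implementation; the smoothing constant a is an assumption):

    import numpy as np

    def sif_embedding(tokens, word_vecs, word_probs, a=1e-3):
        # Weighted average of word embeddings, where rarer words (lower
        # probability) receive higher weight -- the "soft inverse word
        # probability" mentioned above.
        weights = [a / (a + word_probs[t]) for t in tokens]
        return np.average([word_vecs[t] for t in tokens], axis=0,
                          weights=weights)

    def first_principal_component(seq_embeddings):
        # The top right-singular vector of the stacked embedding matrix is
        # the first principal component used as the pc metadata.
        _, _, vt = np.linalg.svd(np.stack(seq_embeddings),
                                 full_matrices=False)
        return vt[0]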
finalize_metadata()[source]

Perform final touches to dataset metadata.

This allows performing modifications to metadata that cannot be serialized into the cache.

get_raw_table()[source]

Create a raw pandas table containing all examples (tuple pairs) in the dataset.

To reconstruct the raw values of tokenized attributes, this method currently joins the tokens naively using a single whitespace delimiter.
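A sketch of the join described above, with hypothetical tokenized examples:

    import pandas as pd

    # Hypothetical examples with tokenized attribute values.
    examples = [{'left_address': ['12', 'main', 'st'],
                 'right_address': ['12', 'main', 'street']}]
    raw_table = pd.DataFrame(
        [{col: ' '.join(tokens) for col, tokens in ex.items()}
         for ex in examples])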

static load_cache(fields, datafiles, cachefile, column_naming, state_args)[source]

Load datasets and corresponding metadata from cache.

This method also checks whether any of the data loading arguments have changed in ways that make the cache contents invalid. The following kinds of changes are currently detected automatically:

  • Data filename changes (e.g. different train filename)
  • Data file modifications (e.g. train data modified)
  • Column changes (e.g. using a different subset of columns in CSV file)
  • Column specification changes (e.g. changing lowercasing behavior)
  • Column naming convention changes (e.g. different labeled data column)
Parameters:
  • fields (dict) – Mapping from attribute names (e.g. “left_address”) to corresponding MatchingField objects that specify how to process the field.
  • datafiles (list) – A list of the data files.
  • cachefile (str) – The cache file path.
  • column_naming (dict) – A dict containing column naming conventions. See __init__ for details.
  • state_args (dict) – A dict containing other information about the state under which the cache was created.
Returns:

A tuple containing the unprocessed cache data dict and a list of cache staleness causes, if any.

Warning

Note that if a column specification, i.e., the arguments to MatchingField, includes callable arguments (e.g. lambdas or functions), these arguments cannot be serialized and hence will not be checked for modifications.
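To illustrate the kind of staleness detection described above (a sketch, not the library's exact logic), one can compare recorded file modification times and argument dicts:

    import os

    def staleness_causes(datafiles, cached_mtimes, cached_args, current_args):
        # Hypothetical helper: returns a list of reasons the cache is stale.
        causes = []
        if [os.path.getmtime(f) for f in datafiles] != cached_mtimes:
            causes.append('data file modifications')
        if current_args != cached_args:
            causes.append('data loading argument changes')
        return causes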

static restore_data(fields, cached_data)[source]

Recreate datasets and related data from cache.

This restores all datasets, metadata and attribute information (including the vocabulary and word embeddings for all tokens in each attribute).

static save_cache(datasets, fields, datafiles, cachefile, column_naming, state_args)[source]

Save datasets and corresponding metadata to cache.

This method also saves as many data loading arguments as possible to help ensure that the cache contents are still relevant for future data loading calls. Refer to load_cache() for more details.

Parameters:
  • datasets (list) – List of datasets to cache.
  • fields (dict) – Mapping from attribute names (e.g. “left_address”) to corresponding MatchingField objects that specify how to process the field.
  • datafiles (list) – A list of the data files.
  • cachefile (str) – The cache file path.
  • column_naming (dict) – A dict containing column naming conventions. See __init__ for details.
  • state_args (dict) – A dict containing other information about the state under which the cache was created.
sort_key(ex)[source]

Sort key for dataset examples.

A key used to sort dataset examples so that examples with similar lengths are batched together, minimizing padding.

classmethod splits(path, train, validation=None, test=None, unlabeled=None, fields=None, embeddings=None, embeddings_cache=None, column_naming=None, cache=None, check_cached_data=True, auto_rebuild_cache=False, train_pca=False, **kwargs)[source]

Create Dataset objects for multiple splits of a dataset.

Parameters:
  • path (str) – Common prefix of the splits’ file paths.
  • train (str) – Suffix to add to path for the train set.
  • validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
  • test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
  • unlabeled (str) – Suffix to add to path for an unlabeled dataset (e.g. for prediction). Default is None.
  • fields (list(tuple(str, MatchingField))) – A list of tuples containing column name (e.g. “left_address”) and corresponding MatchingField pairs, in the same order that the columns occur in the CSV file. Tuples of (name, None) represent columns that will be ignored.
  • embeddings (str or list) – Same as embeddings parameter of process().
  • embeddings_cache (str) – Directory to store downloaded word vector data.
  • column_naming (dict) – Same as column_naming parameter of __init__.
  • cache (str) – Suffix to add to path for cache file. If None, disables caching.
  • check_cached_data (bool) – Verify that data files haven’t changed since the cache was constructed and that relevant field options haven’t changed.
  • auto_rebuild_cache (bool) – Automatically rebuild the cache if the data files are modified or if the field options change. Defaults to False.
  • train_pca (bool) – Whether to compute PCA for each attribute as part of dataset metadata computation. Defaults to False.
  • filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None. This is a keyword-only parameter.
Returns:

Datasets for (train, validation, and test) splits in that order, if provided, or dataset for unlabeled, if provided.

Return type:

Tuple[MatchingDataset]
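A minimal usage sketch (file names are placeholders; fields and column_naming are assumed to have been constructed as described above):

    # 'data/' + suffix gives each split's path; caching is enabled.
    datasets = MatchingDataset.splits(
        path='data/', train='train.csv', validation='valid.csv',
        test='test.csv', fields=fields, column_naming=column_naming,
        cache='cache.pth')
    train, valid, test = datasets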

static state_args_compatibility(cur_state, old_state)[source]
deepmatcher.data.dataset.interleave_keys(keys)[source]

Interleave bits from two sort keys to form a joint sort key.

Examples that are similar in both of the provided keys will have similar values for the key defined by this function. Useful for tasks with two text fields like machine translation or natural language inference.
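A minimal sketch of the bit interleaving idea for two 16-bit sort keys (e.g. the lengths of the left and right attribute sequences):

    def interleave_two_keys(a, b):
        # Alternate the bits of the two keys so that examples close in
        # either key map to nearby joint sort keys.
        bits_a, bits_b = format(a, '016b'), format(b, '016b')
        return int(''.join(x + y for x, y in zip(bits_a, bits_b)), 2)

    interleave_two_keys(3, 5)  # -> 27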

deepmatcher.data.field module

class deepmatcher.data.field.FastText(suffix='wiki-news-300d-1M.vec.zip', url_base='https://s3-us-west-1.amazonaws.com/fasttext-vectors/', **kwargs)[source]

Bases: torchtext.vocab.Vectors

class deepmatcher.data.field.FastTextBinary(language='en', url_base=None, cache=None)[source]

Bases: torchtext.vocab.Vectors

cache(name, cache, url=None)[source]
name_base = 'wiki.{}.bin'
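name_base is formatted with the language code to obtain the binary model file name:

    FastTextBinary.name_base.format('en')  # -> 'wiki.en.bin'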
class deepmatcher.data.field.MatchingField(tokenize='moses', id=False, **kwargs)[source]

Bases: torchtext.data.field.Field

build_vocab(*args, vectors=None, cache=None, **kwargs)[source]
numericalize(arr, *args, **kwargs)[source]
preprocess_args()[source]
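Since MatchingField subclasses torchtext’s Field, remaining keyword arguments are forwarded to it. A sketch using the defaults from the signature (lower is a standard torchtext Field argument and is assumed to pass through **kwargs):

    # 'moses' tokenization is the documented default.
    address_field = MatchingField(tokenize='moses', lower=True)
    id_field = MatchingField(id=True)  # marks the field holding tuple pair ids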

deepmatcher.data.iterator module

class deepmatcher.data.iterator.MatchingIterator(dataset, train_dataset, batch_size, sort_in_buckets=True, **kwargs)[source]

Bases: torchtext.data.iterator.BucketIterator

create_batches()[source]
classmethod splits(datasets, batch_sizes=None, **kwargs)[source]

Create Iterator objects for multiple splits of a dataset.

Parameters:
  • datasets – Tuple of Dataset objects corresponding to the splits. The first such object should be the train set.
  • batch_sizes – Tuple of batch sizes to use for the different splits, or None to use the same batch_size for all splits.
  • Remaining keyword arguments – Passed to the constructor of the iterator class being used.
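A usage sketch, assuming train, valid and test were returned by MatchingDataset.splits():

    # Larger batches for validation and test, since no gradients are kept.
    train_iter, valid_iter, test_iter = MatchingIterator.splits(
        (train, valid, test), batch_sizes=(16, 32, 32))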

deepmatcher.data.process module

deepmatcher.data.process.process(path, train=None, validation=None, test=None, unlabeled=None, cache='cacheddata.pth', check_cached_data=True, auto_rebuild_cache=False, lowercase=True, embeddings='fasttext.en.bin', embeddings_cache_path='~/.vector_cache', ignore_columns=(), include_lengths=True, id_attr='id', label_attr='label', left_prefix='left_', right_prefix='right_', pca=True)[source]

Creates dataset objects for multiple splits of a dataset.

This involves the following steps (if data cannot be retrieved from the cache):

  1. Read the CSV header of a data file and verify that the header is sane.
  2. Create fields, i.e., column processing specifications (e.g. tokenization, label conversion to integers, etc.).
  3. Load each data file:
    1. Read each example (tuple pair) in the specified CSV file.
    2. Preprocess the example. This involves lowercasing and tokenization (unless disabled).
    3. Compute metadata if this is the training data file. See MatchingDataset.compute_metadata() for details.
  4. Create a vocabulary consisting of all tokens in all attributes in all datasets.
  5. Download word embedding data if necessary.
  6. Create a mapping from each word in the vocabulary to its word embedding.
  7. Compute metadata.
  8. Write to cache.
Parameters:
  • path (str) – Common prefix of the splits’ file paths.
  • train (str) – Suffix to add to path for the train set.
  • validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
  • test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
  • unlabeled (str) – Suffix to add to path for an unlabeled dataset (e.g. for prediction). Default is None.
  • cache (str) – Suffix to add to path for cache file. If None, disables caching.
  • check_cached_data (bool) – Verify that data files haven’t changed since the cache was constructed and that relevant field options haven’t changed.
  • auto_rebuild_cache (bool) – Automatically rebuild the cache if the data files are modified or if the field options change. Defaults to False.
  • lowercase (bool) – Whether to lowercase all words in all attributes.
  • embeddings (str or list) – One or more of the following strings:
    • fasttext.{lang}.bin: Uses binary models from “wiki word vectors” released by FastText. {lang} is ‘en’ or any other 2 letter ISO 639-1 Language Code, or 3 letter ISO 639-2 Code if the language does not have a 2 letter code. 300d vectors. fasttext.en.bin is the default.
    • fasttext.wiki.vec: Uses wiki news word vectors released as part of “Advances in Pre-Training Distributed Word Representations” by Mikolov et al. (2018). 300d vectors.
    • fasttext.crawl.vec: Uses Common Crawl word vectors released as part of “Advances in Pre-Training Distributed Word Representations” by Mikolov et al. (2018). 300d vectors.
    • glove.6B.{dims}: Uses uncased GloVe trained on Wiki + Gigaword. {dims} is one of (50d, 100d, 200d, or 300d).
    • glove.42B.300d: Uses uncased GloVe trained on Common Crawl. 300d vectors.
    • glove.840B.300d: Uses cased GloVe trained on Common Crawl. 300d vectors.
    • glove.twitter.27B.{dims}: Uses cased GloVe trained on Twitter. {dims} is one of (25d, 50d, 100d, or 200d).
  • embeddings_cache_path (str) – Directory to store downloaded word vector data.
  • ignore_columns (list) – A list of columns to ignore in the CSV files.
  • include_lengths (bool) – Whether to provide the model with the lengths of each attribute sequence in each batch. If True, length information can be used by the neural network, e.g. when picking the last RNN output of each attribute sequence.
  • id_attr (str) – The name of the tuple pair ID column in the CSV file.
  • label_attr (str) – The name of the tuple pair match label column in the CSV file.
  • left_prefix (str) – The prefix for attribute names belonging to the left table.
  • right_prefix (str) – The prefix for attribute names belonging to the right table.
  • pca (bool) – Whether to compute PCA for each attribute (needed for SIF model). Defaults to True.
Returns:

Datasets for (train, validation, and test) splits in that order, if provided, or dataset for unlabeled, if provided.

Return type:

Tuple[MatchingDataset]
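Putting it all together, a typical invocation looks like the following (file names are placeholders; ‘data’ is the common path prefix):

    import deepmatcher as dm

    train, validation, test = dm.data.process(
        path='data',
        train='train.csv',
        validation='validation.csv',
        test='test.csv')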

Module contents