deepmatcher.data

process

deepmatcher.data.process(path, train=None, validation=None, test=None, unlabeled=None, cache='cacheddata.pth', check_cached_data=True, auto_rebuild_cache=True, tokenize='nltk', lowercase=True, embeddings='fasttext.en.bin', embeddings_cache_path='~/.vector_cache', ignore_columns=(), include_lengths=True, id_attr='id', label_attr='label', left_prefix='left_', right_prefix='right_', pca=True)

Creates dataset objects for multiple splits of a dataset.

This involves the following steps (if the data cannot be retrieved from the cache):
  1. Read the CSV header of a data file and verify that the header is sane.
  2. Create fields, i.e., column processing specifications (e.g., tokenization, conversion of labels to integers, etc.).
  3. Load each data file:
    1. Read each example (tuple pair) in the specified CSV file.
    2. Preprocess the example. This involves lowercasing and tokenization (unless disabled).
    3. Compute metadata if this is the training data file. See MatchingDataset.compute_metadata() for details.
  4. Create a vocabulary consisting of all tokens in all attributes in all datasets.
  5. Download word embedding data if necessary.
  6. Create a mapping from each word in the vocabulary to its word embedding.
  7. Compute metadata.
  8. Write to the cache.
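The early steps above amount to straightforward CSV handling. The following is a minimal stdlib-only sketch of the header check and per-value preprocessing, not deepmatcher's actual implementation; the column names follow the default id_attr/label_attr/prefix settings, and whitespace splitting stands in for the real 'nltk' tokenizer:

```python
import csv
import io

def check_header(header, id_attr='id', label_attr='label',
                 left_prefix='left_', right_prefix='right_'):
    """Verify a CSV header is sane: the ID and label columns must be
    present, every other column must carry a left/right prefix, and the
    left and right attributes must line up pairwise."""
    assert id_attr in header, 'missing ID column'
    assert label_attr in header, 'missing label column'
    rest = [c for c in header if c not in (id_attr, label_attr)]
    assert all(c.startswith((left_prefix, right_prefix)) for c in rest), \
        'attribute columns must be prefixed'
    left = sorted(c[len(left_prefix):] for c in rest if c.startswith(left_prefix))
    right = sorted(c[len(right_prefix):] for c in rest if c.startswith(right_prefix))
    assert left == right, 'left/right attributes do not match'
    return left

def preprocess(value, lowercase=True):
    """Lowercase and tokenize one attribute value (whitespace tokenizer
    as a stand-in for the real one)."""
    if lowercase:
        value = value.lower()
    return value.split()

raw = 'id,label,left_name,right_name\n0,1,Apple iPhone 11,APPLE iphone 11\n'
reader = csv.reader(io.StringIO(raw))
header = next(reader)
attrs = check_header(header)   # ['name']
row = next(reader)
tokens = preprocess(row[2])    # ['apple', 'iphone', '11']
```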
Parameters:
  • path (str) – Common prefix of the splits’ file paths.
  • train (str) – Suffix to add to path for the train set.
  • validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
  • test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
  • unlabeled (str) – Suffix to add to path for an unlabeled data file, or None if there is none. Default is None.
  • cache (str) – Suffix to add to path for the cache file. If None, caching is disabled.
  • check_cached_data (bool) – Verify that the data files haven’t changed since the cache was constructed and that relevant field options haven’t changed.
  • auto_rebuild_cache (bool) – Automatically rebuild the cache if the data files are modified or if the field options change. Defaults to True.
  • tokenize (str) – The tokenizer to use to split attribute strings into tokens. Defaults to 'nltk'.
  • lowercase (bool) – Whether to lowercase all words in all attributes.
  • embeddings (str or list) –

    One or more of the following strings:

    • fasttext.{lang}.bin:
This uses sub-word level word embeddings based on the binary models from “wiki word vectors” released by FastText. {lang} is ‘en’ or any other two-letter ISO 639-1 language code, or a three-letter ISO 639-2 code if the language does not have a two-letter code. 300d vectors. fasttext.en.bin is the default.
    • fasttext.wiki.vec:
      Uses wiki news word vectors released as part of “Advances in Pre-Training Distributed Word Representations” by Mikolov et al. (2018). 300d vectors.
    • fasttext.crawl.vec:
      Uses Common Crawl word vectors released as part of “Advances in Pre-Training Distributed Word Representations” by Mikolov et al. (2018). 300d vectors.
    • glove.6B.{dims}:
      Uses uncased Glove trained on Wiki + Gigaword. {dims} is one of (50d, 100d, 200d, or 300d).
    • glove.42B.300d:
      Uses uncased Glove trained on Common Crawl. 300d vectors.
    • glove.840B.300d:
      Uses cased Glove trained on Common Crawl. 300d vectors.
    • glove.twitter.27B.{dims}:
Uses uncased Glove trained on Twitter. {dims} is one of (25d, 50d, 100d, or 200d).
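The specifier strings above follow a simple dotted scheme (family, then variant parts). A small illustrative parser of that scheme, assuming only the strings listed here; parse_embeddings_spec is a hypothetical helper, not a deepmatcher API:

```python
def parse_embeddings_spec(spec):
    """Split an embeddings specifier such as 'glove.6B.300d' or
    'fasttext.en.bin' into its family and variant parts (illustrative
    helper only)."""
    family, *rest = spec.split('.')
    if family not in ('fasttext', 'glove'):
        raise ValueError('unknown embeddings family: ' + family)
    return family, rest

# The default, sub-word fastText English model:
print(parse_embeddings_spec('fasttext.en.bin'))  # ('fasttext', ['en', 'bin'])
# Uncased GloVe trained on Wiki + Gigaword, 300d:
print(parse_embeddings_spec('glove.6B.300d'))    # ('glove', ['6B', '300d'])
```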
  • embeddings_cache_path (str) – Directory to store downloaded word vector data.
  • ignore_columns (list) – A list of columns to ignore in the CSV files.
  • include_lengths (bool) – Whether to provide the model with the lengths of each attribute sequence in each batch. If True, length information can be used by the neural network, e.g. when picking the last RNN output of each attribute sequence.
  • id_attr (str) – The name of the tuple pair ID column in the CSV file.
  • label_attr (str) – The name of the tuple pair match label column in the CSV file.
  • left_prefix (str) – The prefix for attribute names belonging to the left table.
  • right_prefix (str) – The prefix for attribute names belonging to the right table.
  • pca (bool) – Whether to compute PCA for each attribute (needed for the SIF model). Defaults to True.
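To see why include_lengths matters: attribute sequences in a batch are padded to a common length, and without length information a model would read outputs at padded positions. A stdlib-only sketch (illustrative data, not deepmatcher internals):

```python
# Padded token IDs for a batch of 3 attribute sequences (0 = padding).
batch = [
    [4, 9, 7, 0, 0],
    [5, 2, 0, 0, 0],
    [3, 8, 6, 1, 2],
]
lengths = [3, 2, 5]

# With length information, the last *valid* token of each sequence can be
# selected (as when picking the last RNN output); without it, one would
# wrongly take the padded final column.
last_tokens = [seq[n - 1] for seq, n in zip(batch, lengths)]
print(last_tokens)  # [7, 2, 2]
```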
Returns:
  Datasets for the (train, validation, test) splits, in that order, if provided, or a dataset for the unlabeled data, if provided.
Return type:
  Tuple[MatchingDataset]
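The return contract can be sketched as follows: one dataset per provided labeled split, in (train, validation, test) order. This is a hypothetical stand-in (make_dataset is not a real function) to illustrate the shape of the result, not the loading logic:

```python
def process_sketch(train=None, validation=None, test=None, unlabeled=None):
    """Mimic the return contract of deepmatcher.data.process: a tuple with
    one entry per labeled split that was provided, in (train, validation,
    test) order, or a single dataset for unlabeled data."""
    make_dataset = lambda suffix: 'dataset:' + suffix  # hypothetical loader
    if unlabeled is not None:
        return make_dataset(unlabeled)
    return tuple(make_dataset(s)
                 for s in (train, validation, test) if s is not None)

print(process_sketch(train='train.csv', test='test.csv'))
# ('dataset:train.csv', 'dataset:test.csv')
```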

process_unlabeled

deepmatcher.data.process_unlabeled(path, trained_model, ignore_columns=None)

Creates a dataset object for an unlabeled dataset.

Parameters:
  • path (str) – The full path to the unlabeled data file (not just the directory).
  • trained_model (MatchingModel) – The trained model. The model is aware of the configuration of the training data on which it was trained, and so this method reuses the same configuration for the unlabeled data.
  • ignore_columns (list) – A list of columns to ignore in the unlabeled CSV file.
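Because the trained model fixes the expected columns, processing unlabeled data largely amounts to checking that the new file matches the training configuration minus the label column. A stdlib-only sketch of that check, with hypothetical column lists; this is not deepmatcher's actual validation code:

```python
def check_unlabeled_header(unlabeled_header, train_header,
                           label_attr='label', ignore_columns=None):
    """Verify that an unlabeled CSV exposes the same columns the model was
    trained on, except for the match label (illustrative check only)."""
    ignore = set(ignore_columns or ())
    expected = [c for c in train_header if c != label_attr and c not in ignore]
    got = [c for c in unlabeled_header if c not in ignore]
    if sorted(got) != sorted(expected):
        raise ValueError('unlabeled columns do not match training columns')
    return expected

train_header = ['id', 'label', 'left_name', 'right_name']
unlabeled_header = ['id', 'left_name', 'right_name']
print(check_unlabeled_header(unlabeled_header, train_header))
# ['id', 'left_name', 'right_name']
```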