deepmatcher.data package¶
Submodules¶
deepmatcher.data.dataset module¶
- class deepmatcher.data.dataset.MatchingDataset(fields, column_naming, path=None, format='csv', examples=None, metadata=None, **kwargs)[source]¶
  Bases: torchtext.data.dataset.TabularDataset
Represents a dataset with associated metadata.
Holds all information about one split of a dataset (e.g. training set).
Variables:
- fields (dict) – A mapping from attribute names (e.g. “left_address”) to corresponding MatchingField objects that specify how to process the field.
- examples (list) – A list containing all the examples (labeled tuple pairs) in this dataset.
- metadata (dict) – Metadata about the dataset (e.g. word probabilities). See compute_metadata() for details.
- corresponding_field (dict) – A mapping from left table attribute names (e.g. “left_address”) to corresponding right table attribute names (e.g. “right_address”) and vice versa.
- text_fields (dict) – A mapping from canonical attribute names (e.g. “address”) to tuples of the corresponding left and right attribute names (e.g. (“left_address”, “right_address”)).
- all_left_fields (list) – A list of all left table attribute names.
- all_right_fields (list) – A list of all right table attribute names.
- canonical_text_fields (list) – A list of all canonical attribute names.
- label_field (str) – Name of the column containing labels.
- id_field (str) – Name of the column containing tuple pair ids.
- exception CacheStaleException[source]¶
  Bases: Exception
Raised when the dataset cache is stale and no fallback behavior is specified.
- compute_metadata(pca=False)[source]¶
  Computes metadata about the dataset.
  Computes the following metadata:
  - word_probs: For each attribute in the dataset, a mapping from words to word (token) probabilities.
  - totals: For each attribute in the dataset, a count of the total number of words present across all attribute examples.
  - pc: For each attribute in the dataset, the first principal component of the sequence embeddings of all attribute examples. The sequence embedding of an attribute value is computed by taking the weighted average of its word embeddings, where the weight is the soft inverse word probability. Refer to “A Simple but Tough-to-Beat Baseline for Sentence Embeddings” by Arora et al. (2017) for details.
  Parameters: pca (bool) – Whether to compute the pc metadata.
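The SIF-style computation behind the pc metadata can be sketched roughly as follows. This is an illustrative stand-in, not deepmatcher's internal implementation; the function names and the smoothing constant `a` are assumptions.

```python
import numpy as np

def sequence_embeddings(token_lists, word_vecs, a=1e-3):
    """Weighted-average sequence embeddings with soft inverse word probability weights."""
    # Word probabilities over all examples of the attribute.
    counts, total = {}, 0
    for tokens in token_lists:
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
            total += 1
    probs = {tok: c / total for tok, c in counts.items()}

    dim = len(next(iter(word_vecs.values())))
    embs = np.zeros((len(token_lists), dim))
    for i, tokens in enumerate(token_lists):
        if not tokens:
            continue
        # Rare words get weights near 1, frequent words near a / p(w).
        weights = np.array([a / (a + probs[t]) for t in tokens])
        vecs = np.array([word_vecs[t] for t in tokens])
        embs[i] = (weights[:, None] * vecs).sum(0) / len(tokens)
    return embs

def first_principal_component(embs):
    # First right singular vector of the centered embedding matrix.
    _, _, vt = np.linalg.svd(embs - embs.mean(0), full_matrices=False)
    return vt[0]
```

In the Arora et al. scheme, this principal component is later subtracted out of each sequence embedding to remove the dominant shared direction.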
- finalize_metadata()[source]¶
  Perform final touches to dataset metadata.
This allows performing modifications to metadata that cannot be serialized into the cache.
- get_raw_table()[source]¶
  Create a raw pandas table containing all examples (tuple pairs) in the dataset.
  To resurrect tokenized attributes, this method currently joins the tokens naively, using whitespace as the delimiter.
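A minimal sketch of this whitespace re-joining, with plain dicts standing in for the pandas table (names here are illustrative, not deepmatcher's API):

```python
def raw_rows(examples):
    """Rebuild raw attribute strings from tokenized examples by joining on whitespace."""
    rows = []
    for ex in examples:
        row = {}
        for attr, value in ex.items():
            # Tokenized attributes are lists of tokens; join them back into one string.
            row[attr] = " ".join(value) if isinstance(value, list) else value
        rows.append(row)
    return rows
```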
- static load_cache(fields, datafiles, cachefile, column_naming, state_args)[source]¶
  Load datasets and corresponding metadata from cache.
  This method also checks whether any of the data loading arguments have changed in ways that make the cache contents invalid. The following kinds of changes are currently detected automatically:
- Data filename changes (e.g. different train filename)
- Data file modifications (e.g. train data modified)
- Column changes (e.g. using a different subset of columns in CSV file)
- Column specification changes (e.g. changing lowercasing behavior)
- Column naming convention changes (e.g. different labeled data column)
Parameters:
- fields (dict) – Mapping from attribute names (e.g. “left_address”) to corresponding MatchingField objects that specify how to process the field.
- datafiles (list) – A list of the data files.
- cachefile (str) – The cache file path.
- column_naming (dict) – A dict containing column naming conventions. See __init__ for details.
- state_args (dict) – A dict containing other information about the state under which the cache was created.
Returns: Tuple containing the unprocessed cache data dict and a list of cache staleness causes, if any.
Warning
If a column specification, i.e., the arguments to MatchingField, includes callable arguments (e.g. lambdas or functions), these arguments cannot be serialized and hence will not be checked for modifications.
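The kinds of staleness checks described above can be sketched as follows. This is a simplified stand-in, not deepmatcher's actual cache format; all names and the content-hash approach are assumptions for illustration.

```python
import hashlib

def staleness_causes(cached_state, datafiles, column_naming, state_args):
    """Compare current data-loading arguments against those stored with the cache."""
    causes = []
    if cached_state["datafiles"] != list(datafiles):
        # Different train/validation/test filenames invalidate the cache.
        causes.append("Data file list has changed.")
    else:
        for path, old_digest in zip(datafiles, cached_state["digests"]):
            with open(path, "rb") as f:
                if hashlib.sha256(f.read()).hexdigest() != old_digest:
                    causes.append(f"Data file {path} was modified.")
    if cached_state["column_naming"] != column_naming:
        causes.append("Column naming conventions have changed.")
    if cached_state["state_args"] != state_args:
        causes.append("Other data loading arguments have changed.")
    return causes
```

An empty list means the cache can be reused; a non-empty list corresponds to the staleness causes returned by load_cache().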
- static restore_data(fields, cached_data)[source]¶
  Recreate datasets and related data from cache.
This restores all datasets, metadata and attribute information (including the vocabulary and word embeddings for all tokens in each attribute).
- static save_cache(datasets, fields, datafiles, cachefile, column_naming, state_args)[source]¶
  Save datasets and corresponding metadata to cache.
  This method also saves as many data loading arguments as possible to help ensure that the cache contents are still relevant for future data loading calls. Refer to load_cache() for more details.
  Parameters:
  - datasets (list) – List of datasets to cache.
  - fields (dict) – Mapping from attribute names (e.g. “left_address”) to corresponding MatchingField objects that specify how to process the field.
  - datafiles (list) – A list of the data files.
  - cachefile (str) – The cache file path.
  - column_naming (dict) – A dict containing column naming conventions. See __init__ for details.
  - state_args (dict) – A dict containing other information about the state under which the cache was created.
- sort_key(ex)[source]¶
  Sort key for dataset examples.
A key to use for sorting dataset examples for batching together examples with similar lengths to minimize padding.
- classmethod splits(path, train, validation=None, test=None, unlabeled=None, fields=None, embeddings=None, embeddings_cache=None, column_naming=None, cache=None, check_cached_data=True, auto_rebuild_cache=False, train_pca=False, **kwargs)[source]¶
  Create Dataset objects for multiple splits of a dataset.
  Parameters:
  - path (str) – Common prefix of the splits’ file paths.
  - train (str) – Suffix to add to path for the train set.
  - validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
  - test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
  - unlabeled (str) – Suffix to add to path for an unlabeled dataset (e.g. for prediction). Default is None.
  - fields (list(tuple(str, MatchingField))) – A list of tuples containing column name (e.g. “left_address”) and corresponding MatchingField pairs, in the same order that the columns occur in the CSV file. Tuples of (name, None) represent columns that will be ignored.
  - embeddings (str or list) – Same as the embeddings parameter of process().
  - embeddings_cache (str) – Directory to store downloaded word vector data.
  - column_naming (dict) – Same as the column_naming parameter of __init__.
  - cache (str) – Suffix to add to path for the cache file. If None, disables caching.
  - check_cached_data (bool) – Verify that data files haven’t changed since the cache was constructed and that relevant field options haven’t changed.
  - auto_rebuild_cache (bool) – Automatically rebuild the cache if the data files are modified or if the field options change. Defaults to False.
  - train_pca (bool) – Whether to compute PCA for each attribute as part of dataset metadata computation. Defaults to False.
  - filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None. This is a keyword-only parameter.
  Returns: Datasets for the (train, validation, test) splits, in that order, if provided, or the dataset for unlabeled, if provided.
  Return type: Tuple[MatchingDataset]
- deepmatcher.data.dataset.interleave_keys(keys)[source]¶
  Interleave bits from two sort keys to form a joint sort key.
Examples that are similar in both of the provided keys will have similar values for the key defined by this function. Useful for tasks with two text fields like machine translation or natural language inference.
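A minimal sketch of bit interleaving for two integer sort keys; this illustrates the idea described above but is not deepmatcher's exact implementation (the function name and bit width are assumptions):

```python
def interleave_bits(a, b, width=16):
    """Interleave the low `width` bits of a and b; a's bits land in even positions."""
    result = 0
    for i in range(width):
        result |= ((a >> i) & 1) << (2 * i)      # bit i of a -> position 2i
        result |= ((b >> i) & 1) << (2 * i + 1)  # bit i of b -> position 2i + 1
    return result
```

Sorting by the interleaved key keeps examples that are close in both component keys (e.g. left and right sequence lengths) near each other, which helps minimize padding within a batch.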
deepmatcher.data.field module¶
- class deepmatcher.data.field.FastText(suffix='wiki-news-300d-1M.vec.zip', url_base='https://s3-us-west-1.amazonaws.com/fasttext-vectors/', **kwargs)[source]¶
  Bases: torchtext.vocab.Vectors
deepmatcher.data.iterator module¶
- class deepmatcher.data.iterator.MatchingIterator(dataset, train_dataset, batch_size, sort_in_buckets=True, **kwargs)[source]¶
  Bases: torchtext.data.iterator.BucketIterator
- classmethod splits(datasets, batch_sizes=None, **kwargs)[source]¶
  Create Iterator objects for multiple splits of a dataset.
  Parameters:
  - datasets – Tuple of Dataset objects corresponding to the splits. The first such object should be the train set.
  - batch_sizes – Tuple of batch sizes to use for the different splits, or None to use the same batch_size for all splits.
  - Remaining keyword arguments – Passed to the constructor of the iterator class being used.
deepmatcher.data.process module¶
- deepmatcher.data.process.process(path, train=None, validation=None, test=None, unlabeled=None, cache='cacheddata.pth', check_cached_data=True, auto_rebuild_cache=False, lowercase=True, embeddings='fasttext.en.bin', embeddings_cache_path='~/.vector_cache', ignore_columns=(), include_lengths=True, id_attr='id', label_attr='label', left_prefix='left_', right_prefix='right_', pca=True)[source]¶
  Creates dataset objects for multiple splits of a dataset.
  This involves the following steps (if data cannot be retrieved from the cache):
  1. Read the CSV header of a data file and verify that the header is sane.
  2. Create fields, i.e., column processing specifications (e.g. tokenization, conversion of labels to integers, etc.).
  3. Load each data file:
     - Read each example (tuple pair) in the specified CSV file.
     - Preprocess the example. Involves lowercasing and tokenization (unless disabled).
     - Compute metadata if it is the training data file. See MatchingDataset.compute_metadata() for details.
  4. Create a vocabulary consisting of all tokens in all attributes in all datasets.
  5. Download word embedding data if necessary.
  6. Create a mapping from each word in the vocabulary to its word embedding.
  7. Compute metadata.
  8. Write to cache.
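The header sanity check in step 1 can be sketched roughly as follows. This is a simplified illustration under assumed conventions (id/label columns present, every left_ attribute paired with a right_ attribute), not deepmatcher's actual validation logic:

```python
import csv

def check_header(csv_path, id_attr="id", label_attr="label",
                 left_prefix="left_", right_prefix="right_"):
    """Verify the CSV header has id/label columns and matching left_/right_ attribute pairs."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    assert id_attr in header, f"missing id column {id_attr!r}"
    assert label_attr in header, f"missing label column {label_attr!r}"
    # Every left attribute should have a corresponding right attribute, and vice versa.
    left = {c[len(left_prefix):] for c in header if c.startswith(left_prefix)}
    right = {c[len(right_prefix):] for c in header if c.startswith(right_prefix)}
    assert left == right, f"unpaired left/right attributes: {left ^ right}"
    return header
```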
Parameters: - path (str) – Common prefix of the splits’ file paths.
- train (str) – Suffix to add to path for the train set.
- validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
- test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
- unlabeled (str) – Suffix to add to path for an unlabeled dataset (e.g. for prediction). Default is None.
- cache (str) – Suffix to add to path for cache file. If None disables caching.
- check_cached_data (bool) – Verify that data files haven’t changed since the cache was constructed and that relevant field options haven’t changed.
- auto_rebuild_cache (bool) – Automatically rebuild the cache if the data files are modified or if the field options change. Defaults to False.
- lowercase (bool) – Whether to lowercase all words in all attributes.
- embeddings (str or list) – One or more of the following strings:
  - fasttext.{lang}.bin: Uses binary models from the “wiki word vectors” released by fastText. {lang} is ‘en’ or any other 2-letter ISO 639-1 language code, or a 3-letter ISO 639-2 code if the language does not have a 2-letter code. 300d vectors. fasttext.en.bin is the default.
  - fasttext.wiki.vec: Uses wiki news word vectors released as part of “Advances in Pre-Training Distributed Word Representations” by Mikolov et al. (2018). 300d vectors.
  - fasttext.crawl.vec: Uses Common Crawl word vectors released as part of “Advances in Pre-Training Distributed Word Representations” by Mikolov et al. (2018). 300d vectors.
  - glove.6B.{dims}: Uses uncased GloVe vectors trained on Wiki + Gigaword. {dims} is one of (50d, 100d, 200d, or 300d).
  - glove.42B.300d: Uses uncased GloVe vectors trained on Common Crawl. 300d vectors.
  - glove.840B.300d: Uses cased GloVe vectors trained on Common Crawl. 300d vectors.
  - glove.twitter.27B.{dims}: Uses cased GloVe vectors trained on Twitter. {dims} is one of (25d, 50d, 100d, or 200d).
- embeddings_cache_path (str) – Directory to store downloaded word vector data.
- ignore_columns (list) – A list of columns to ignore in the CSV files.
- include_lengths (bool) – Whether to provide the model with the lengths of each attribute sequence in each batch. If True, length information can be used by the neural network, e.g. when picking the last RNN output of each attribute sequence.
- id_attr (str) – The name of the tuple pair ID column in the CSV file.
- label_attr (str) – The name of the tuple pair match label column in the CSV file.
- left_prefix (str) – The prefix for attribute names belonging to the left table.
- right_prefix (str) – The prefix for attribute names belonging to the right table.
- pca (bool) – Whether to compute PCA for each attribute (needed for the SIF model). Defaults to True.
Returns: Datasets for the (train, validation, test) splits, in that order, if provided, or the dataset for unlabeled, if provided.
Return type: Tuple[MatchingDataset]