mangoes.dataset module¶

Module to access the available datasets and create new ones.

Datasets available in this module :

WS353 for the WordSim353 dataset (Finkelstein et al., 2002) [1].
Also partitioned by [2] into :
- WS_SIM : WordSim Similarity
- WS_REL : WordSim Relatedness
RG65 for the Rubenstein and Goodenough (1965) dataset [3]
RAREWORD for Luong et al.’s (2013) Rare Word (RW) Similarity Dataset [4]
MEN for Bruni et al.’s (2012) MEN dataset [5]
MTURK for Radinsky et al.’s (2011) Mechanical Turk dataset [6]
SIMLEX for Hill et al.’s (2016) SimLex-999 dataset [7]
GOOGLE for Mikolov et al.’s (2013) Google dataset [8].
Also partitioned into :
- GOOGLE_SEMANTIC for semantic analogies
- GOOGLE_SYNTACTIC for syntactic analogies
MSR for Mikolov et al.’s (2013) Microsoft Research dataset [9]
OD_8_8_8 [10]
WIKI_SEM_500 [11]

Warnings¶

The SimLex-999 dataset (SIMLEX) is not compatible with this version of mangoes.

References¶

1: Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001, April). Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web (pp. 406-414). ACM.
2: Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa, A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches, In Proceedings of NAACL-HLT 2009.
3: Rubenstein, Herbert, and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.
4: Luong, T., Socher, R., & Manning, C. D. (2013, August). Better word representations with recursive neural networks for morphology. In CoNLL (pp. 104-113).
5: Bruni, E., Boleda, G., Baroni, M., & Tran, N. K. (2012, July). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 (pp. 136-145). Association for Computational Linguistics.
6: Radinsky, K., Agichtein, E., Gabrilovich, E., & Markovitch, S. (2011, March). A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web (pp. 337-346). ACM.
7: Hill, F., Reichart, R., & Korhonen, A. (2016). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
8: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
9: Mikolov, T., Yih, W. T., & Zweig, G. (2013, June). Linguistic regularities in continuous space word representations. In HLT-NAACL (Vol. 13, pp. 746-751).
10: José Camacho-Collados and Roberto Navigli. Find the word that does not belong: A Framework for an Intrinsic Evaluation of Word Vector Representations. In Proceedings of the ACL Workshop on Evaluating Vector Space Representations for NLP, Berlin, Germany, August 12, 2016.
11

class mangoes.dataset.Dataset(content, name='', language='en', lower=True, question_parser=<function _default_question_parser>)¶

Bases: object

Base class to create an evaluation dataset.

Parameters

content: str or list or dict

Content of the dataset. Can be :

- a path to a file containing the questions
- a path to a directory containing such files : each subdirectory and each file will be considered as a subset
- a list of questions (each question is a string). Ex with analogies : ['king queen man woman', 'boy girl man woman', …]
- a dictionary containing lists of questions structured in subsets. Ex : {"root": {"subset1": […], "subset2": […]}}

name: str

name of the dataset

lower: boolean

True if the questions have to be lowercased

Examples

>>> dataset = mangoes.dataset.Dataset(['a b c d', 'e f g h'], name="My dataset")
>>> dataset.subsets_to_questions
{'My dataset': ['a b c d', 'e f g h']}
>>> dataset.questions_to_subsets
{'a b c d': ['/My dataset'], 'e f g h': ['/My dataset']}

>>> dataset = mangoes.dataset.Dataset({"subset1": ['a b c d','e f g h'], "subset2": ['a b c d']}, name="My dataset")
>>> dataset.subsets_to_questions
{'My dataset': {'subset1': ['a b c d', 'e f g h'], 'subset2': ['a b c d']}}
>>> dataset.questions_to_subsets
{'a b c d': {'/My dataset', '/My dataset/subset1', '/My dataset/subset2'},
 'e f g h': {'/My dataset', '/My dataset/subset1'}}

Attributes

name
lower
subsets_to_questions: dict: dictionary where keys are the names of the subsets and values are lists of questions or nested subsets
questions_to_subsets: dict: dictionary where keys are the questions and values are the list of the subsets they belong to
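
The relationship between these two attributes can be sketched in plain Python. The helper below, `invert_subsets`, is hypothetical and not part of mangoes. It only illustrates, assuming subset paths are built by joining subset names with "/", how a nested `subsets_to_questions` dictionary might be flattened into a mapping shaped like `questions_to_subsets` in the second example above.

```python
def invert_subsets(subsets, prefix=""):
    """Map each question to the set of subset paths that contain it.

    Hypothetical helper, not part of mangoes. `subsets` is a nested dict
    whose leaves are lists of questions.
    """
    questions_to_subsets = {}
    for name, content in subsets.items():
        path = prefix + "/" + name
        if isinstance(content, dict):
            # Nested subsets: recurse, then also register this level's path.
            nested = invert_subsets(content, prefix=path)
            for question, paths in nested.items():
                questions_to_subsets.setdefault(question, set()).update(paths)
                questions_to_subsets[question].add(path)
        else:
            # Leaf: a list of questions belonging directly to this subset.
            for question in content:
                questions_to_subsets.setdefault(question, set()).add(path)
    return questions_to_subsets


subsets_to_questions = {
    "My dataset": {"subset1": ["a b c d", "e f g h"], "subset2": ["a b c d"]}
}
mapping = invert_subsets(subsets_to_questions)
```

In this sketch a question that appears in several subsets is stored once, with every enclosing path attached, which is why `'a b c d'` ends up with three paths.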

Methods

get_questions_and_gold([subset])

Returns a list of tuples with questions and expected answers

property questions_to_subsets¶

property subsets_to_questions¶

get_questions_and_gold(subset=None)¶

Returns a list of tuples with questions and expected answers

Parameters

subset: str or None: if None, return all the questions in the dataset, else, only the questions of the given subset.

Returns

list of tuples (question, gold)

Examples

>>> dataset = Dataset({"My dataset": {"subset1": ['a b c d', 'e f g h'],
...                                   "subset2": ['a b c d']}})
>>> dataset.get_questions_and_gold()
[Question(question='e f g', gold='h'), Question(question='a b c', gold='d')]
>>> dataset.get_questions_and_gold("/My dataset/subset2")
[Question(question='a b c', gold='d')]
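
The `Question(question=..., gold=...)` tuples in the first output above suggest that, for analogy questions, the last term is treated as the expected answer. The sketch below reproduces that split in plain Python. Both the `Question` named tuple and `parse_analogy_question` are hypothetical stand-ins written for illustration, and are not the parser that mangoes actually uses.

```python
from collections import namedtuple

# Hypothetical stand-in for the tuple type shown in the output above.
Question = namedtuple("Question", ["question", "gold"])


def parse_analogy_question(raw):
    """Split 'a b c d' into the question 'a b c' and the gold answer 'd'."""
    # Everything but the last whitespace-separated term forms the question.
    *terms, gold = raw.split()
    return Question(question=" ".join(terms), gold=gold)


parsed = [parse_analogy_question(q) for q in ["e f g h", "a b c d"]]
print(parsed)
```

Other tasks, such as word similarity, would need a different split; the `question_parser` argument of `Dataset` appears to exist for that purpose.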

class mangoes.dataset.OutlierDetectionDataset(*args, **kwargs)¶

Bases: mangoes.dataset.Dataset

Base class for datasets used in the Outlier Detection task.

Attributes

questions_to_subsets
subsets_to_questions

Methods

get_questions_and_gold([subset])

Returns a list of tuples with questions and expected answers

mangoes.dataset.nb_questions(subset)¶

Returns the total number of questions in a subset of a Dataset

Parameters

subset: dict: subset of a Dataset

Returns

int
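
Counting over a nested subset can be sketched as a short recursion. The function `count_questions` below is a hypothetical illustration, not the mangoes implementation. It assumes that a subset is either a list of questions or a dictionary of nested subsets, and that a question repeated in several subsets is counted once per occurrence; the real `nb_questions` may behave differently.

```python
def count_questions(subset):
    """Count the questions in a nested subset (hypothetical helper)."""
    if isinstance(subset, dict):
        # Inner node: sum the counts of all nested subsets.
        return sum(count_questions(child) for child in subset.values())
    # Leaf: a list of questions.
    return len(subset)


subset = {"subset1": ["a b c d", "e f g h"], "subset2": ["a b c d"]}
total = count_questions(subset)
```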

mangoes.dataset.load(dataset_name, language='en', lower=True)¶

Loads a dataset from the AVAILABLE_DATASETS

Parameters

dataset_name: str: the name of the dataset, must be in AVAILABLE_DATASETS
language: {‘en’, ‘fr’}: Code of the language of the dataset (default = ‘en’)
lower: boolean: whether the questions of the dataset should be lowercased

Returns

mangoes.Dataset