mangoes.dataset module

Module to access the available datasets and to create new ones.

Datasets available in this module:

  • WS353 for the WordSim353 dataset (Finkelstein et al., 2001) [1].

    Also partitioned by [2] into:
    • WS_SIM : WordSim Similarity

    • WS_REL : WordSim Relatedness

  • RG65 for the Rubenstein and Goodenough (1965) dataset [3]

  • RAREWORD for Luong et al.’s (2013) Rare Word (RW) Similarity Dataset [4]

  • MEN for Bruni et al.’s (2012) MEN dataset [5]

  • MTURK for Radinsky et al.’s (2011) Mechanical Turk dataset [6]

  • SIMLEX for Hill et al.’s (2016) SimLex-999 dataset [7]

  • GOOGLE for Mikolov et al.’s (2013) Google dataset [8].

    Also partitioned into:
    • GOOGLE_SEMANTIC for semantic analogies

    • GOOGLE_SYNTACTIC for syntactic analogies

  • MSR for Mikolov et al.’s (2013) Microsoft Research dataset [9]

  • OD_8_8_8 [10]

  • WIKI_SEM_500 [11]

Warnings

The SimLex dataset is not compatible with this version of mangoes.

References

1

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001, April). Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web (pp. 406-414). ACM.

2

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa, A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches, In Proceedings of NAACL-HLT 2009.

3

Rubenstein, Herbert, and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.

4

Luong, T., Socher, R., & Manning, C. D. (2013, August). Better word representations with recursive neural networks for morphology. In CoNLL (pp. 104-113).

5

Bruni, E., Boleda, G., Baroni, M., & Tran, N. K. (2012, July). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 (pp. 136-145). Association for Computational Linguistics.

6

Radinsky, K., Agichtein, E., Gabrilovich, E., & Markovitch, S. (2011, March). A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World Wide Web (pp. 337-346). ACM.

7

Hill, F., Reichart, R., & Korhonen, A. (2016). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.

8

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

9

Mikolov, T., Yih, W. T., & Zweig, G. (2013, June). Linguistic regularities in continuous space word representations. In HLT-NAACL (Vol. 13, pp. 746-751).

10

José Camacho-Collados and Roberto Navigli. Find the word that does not belong: A Framework for an Intrinsic Evaluation of Word Vector Representations. In Proceedings of the ACL Workshop on Evaluating Vector Space Representations for NLP, Berlin, Germany, August 12, 2016.

11
class mangoes.dataset.Dataset(content, name='', language='en', lower=True, question_parser=<function _default_question_parser>)

Bases: object

Base class to create an evaluation dataset.

Parameters
content: str or list or dict
Content of the dataset. Can be:
  • a path to a file containing the questions

  • a path to a directory containing such files: each subdirectory and each file will be considered as a subset

  • a list of questions (each question is a string). Example with analogies: ['king queen man woman', 'boy girl man woman', …]

  • a dictionary containing lists of questions structured in subsets. Example: {"root": {"subset1": […], "subset2": […]}}

name: str

name of the dataset

lower: boolean

True if the questions have to be lowercased

Examples

>>> dataset = mangoes.dataset.Dataset(['a b c d', 'e f g h'], name="My dataset")
>>> dataset.subsets_to_questions
{'My dataset': ['a b c d', 'e f g h']}
>>> dataset.questions_to_subsets
{'a b c d': ['/My dataset'], 'e f g h': ['/My dataset']}
>>> dataset = mangoes.dataset.Dataset({"subset1": ['a b c d','e f g h'], "subset2": ['a b c d']}, name="My dataset")
>>> dataset.subsets_to_questions
{'My dataset': {'subset1': ['a b c d', 'e f g h'], 'subset2': ['a b c d']}}
>>> dataset.questions_to_subsets
{'a b c d': {'/My dataset', '/My dataset/subset1', '/My dataset/subset2'},
 'e f g h': {'/My dataset', '/My dataset/subset1'}}
Attributes
name
lower
subsets_to_questions: dict

Dictionary whose keys are the names of the subsets and whose values are lists of questions or nested subsets.

questions_to_subsets: dict

Dictionary whose keys are the questions and whose values are the subsets they belong to.
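
The relationship between the two attributes can be illustrated in plain Python. The following is a minimal sketch, not the library's implementation: `index_questions` is a hypothetical helper, and the path format (`/parent/child`) is assumed from the example output shown earlier.

```python
def index_questions(subsets, prefix=""):
    """Map each question to the set of subset paths that contain it.

    Hypothetical helper for illustration; not part of mangoes.
    """
    index = {}
    for name, content in subsets.items():
        path = f"{prefix}/{name}"
        if isinstance(content, dict):
            # Nested subsets: recurse, then also record the parent path,
            # since a question in a child subset belongs to its parent too.
            for question, paths in index_questions(content, path).items():
                index.setdefault(question, set()).update(paths | {path})
        else:
            # Leaf subset: a plain list of questions.
            for question in content:
                index.setdefault(question, set()).add(path)
    return index


subsets_to_questions = {
    "My dataset": {"subset1": ["a b c d", "e f g h"], "subset2": ["a b c d"]}
}
questions_to_subsets = index_questions(subsets_to_questions)
```

With this input, 'a b c d' maps to the root and both subsets, while 'e f g h' maps to the root and subset1 only.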

Methods

get_questions_and_gold([subset])

Returns a list of tuples with questions and expected answers

property questions_to_subsets
property subsets_to_questions
get_questions_and_gold(subset=None)

Returns a list of tuples with questions and expected answers

Parameters
subset: str or None

If None, return all the questions in the dataset; otherwise, return only the questions of the given subset.

Returns
list of tuples (question, gold)

Examples

>>> dataset = Dataset({"My dataset": {"subset1": ['a b c d', 'e f g h'],
...                                   "subset2": ['a b c d']}})
>>> dataset.get_questions_and_gold()
[Question(question='e f g', gold='h'), Question(question='a b c', gold='d')]
>>> dataset.get_questions_and_gold("/My dataset/subset2")
[Question(question='a b c', gold='d')]
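
The `Question(question='a b c', gold='d')` entries above suggest that, for analogy questions, the last token is treated as the expected answer. The following is a minimal sketch under that assumption; the `Question` namedtuple is defined locally here and `split_last_token` is a hypothetical helper, not the library's `_default_question_parser`.

```python
from collections import namedtuple

# Locally defined stand-in for the (question, gold) tuples returned by the method.
Question = namedtuple("Question", ["question", "gold"])


def split_last_token(raw_question):
    """Hypothetical parser: treat the last whitespace-separated token as gold."""
    *terms, gold = raw_question.split()
    return Question(question=" ".join(terms), gold=gold)


raw_questions = ["a b c d", "e f g h"]
parsed = [split_last_token(q) for q in raw_questions]
print(parsed)
# [Question(question='a b c', gold='d'), Question(question='e f g', gold='h')]
```

Other tasks may need a different split between question and gold, which is presumably why the constructor accepts a `question_parser` argument.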
class mangoes.dataset.OutlierDetectionDataset(*args, **kwargs)

Bases: mangoes.dataset.Dataset

Base class for datasets used in the outlier detection task

Attributes
questions_to_subsets
subsets_to_questions

Methods

get_questions_and_gold([subset])

Returns a list of tuples with questions and expected answers

mangoes.dataset.nb_questions(subset)

Returns the total number of questions in a subset of a Dataset

Parameters
subset: dict

subset of a Dataset

Returns
int
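
As an illustration of counting over nested subsets, here is a minimal sketch. `count_questions` is a hypothetical stand-in, not the library function, and it assumes that a question appearing in two subsets is counted twice; whether `nb_questions` de-duplicates is not stated here.

```python
def count_questions(subset):
    """Count the questions in a subset given as a list or a nested dict.

    Hypothetical stand-in for illustration; duplicates are counted each time.
    """
    if isinstance(subset, dict):
        # Nested subsets: sum the counts of every child.
        return sum(count_questions(child) for child in subset.values())
    # Leaf subset: a plain list of questions.
    return len(subset)


subset = {"subset1": ["a b c d", "e f g h"], "subset2": ["a b c d"]}
total = count_questions(subset)
print(total)  # 3
```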
mangoes.dataset.load(dataset_name, language='en', lower=True)

Loads a dataset from the AVAILABLE_DATASETS

Parameters
dataset_name: str

the name of the dataset, must be in AVAILABLE_DATASETS

language: {'en', 'fr'}

Code of the language of the dataset (default = 'en')

lower: boolean

whether the questions of the dataset should be lowercased

Returns
mangoes.Dataset