mangoes.dataset module
Module to access available datasets and create new ones.
Datasets available in this module:
WS353 for the WordSim353 dataset (Finkelstein et al., 2002) [1].
- Also partitioned by [2] into:
WS_SIM : WordSim Similarity
WS_REL : WordSim Relatedness
RG65 for the Rubenstein and Goodenough (1965) dataset [3]
RAREWORD for Luong et al.’s (2013) Rare Word (RW) Similarity Dataset [4]
MEN for Bruni et al.’s (2012) MEN dataset [5]
MTURK for Radinsky et al.’s (2011) Mechanical Turk dataset [6]
SIMLEX for Hill et al.’s (2016) SimLex-999 dataset [7]
GOOGLE for Mikolov et al.’s (2013) Google dataset [8].
- Also partitioned into:
GOOGLE_SEMANTIC for semantic analogies
GOOGLE_SYNTACTIC for syntactic analogies
MSR for Mikolov et al.’s (2013) Microsoft Research dataset [9]
OD_8_8_8 [10]
WIKI_SEM_500 [11]
Warnings
The SimLex dataset is not compatible with this version of mangoes.
References
- 1
Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2001, April). Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web (pp. 406-414). ACM.
- 2
Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, Aitor Soroa, A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches, In Proceedings of NAACL-HLT 2009.
- 3
Rubenstein, Herbert, and John B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, 1965.
- 4
Luong, T., Socher, R., & Manning, C. D. (2013, August). Better word representations with recursive neural networks for morphology. In CoNLL (pp. 104-113).
- 5
Bruni, E., Boleda, G., Baroni, M., & Tran, N. K. (2012, July). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 (pp. 136-145). Association for Computational Linguistics.
- 6
Radinsky, K., Agichtein, E., Gabrilovich, E., & Markovitch, S. (2011, March). A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web (pp. 337-346). ACM.
- 7
Hill, F., Reichart, R., & Korhonen, A. (2016). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
- 8
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- 9
Mikolov, T., Yih, W. T., & Zweig, G. (2013, June). Linguistic regularities in continuous space word representations. In hlt-Naacl (Vol. 13, pp. 746-751).
- 10
José Camacho-Collados and Roberto Navigli. Find the word that does not belong: A Framework for an Intrinsic Evaluation of Word Vector Representations. In Proceedings of the ACL Workshop on Evaluating Vector Space Representations for NLP, Berlin, Germany, August 12, 2016.
- 11
- class mangoes.dataset.Dataset(content, name='', language='en', lower=True, question_parser=<function _default_question_parser>)
Bases: object
Base class to create an evaluation dataset.
- Parameters
- content: str or list or dict
- Content of the dataset. Can be:
- a path to a file containing the questions
- a path to a directory containing such files: each subdirectory and each file will be considered a subset
- a list of questions (each question is a string). Example with analogies: ['king queen man woman', 'boy girl man woman', …]
- a dictionary containing lists of questions structured in subsets. Example: {"root": {"subset1": […], "subset2": […]}}
- name: str
name of the dataset
- lower: boolean
True if the questions have to be lowercased
Examples
>>> dataset = mangoes.dataset.Dataset(['a b c d', 'e f g h'], name="My dataset")
>>> dataset.subsets_to_questions
{'My dataset': ['a b c d', 'e f g h']}
>>> dataset.questions_to_subsets
{'a b c d': ['/My dataset'], 'e f g h': ['/My dataset']}

>>> dataset = mangoes.dataset.Dataset({"subset1": ['a b c d', 'e f g h'], "subset2": ['a b c d']}, name="My dataset")
>>> dataset.subsets_to_questions
{'My dataset': {'subset1': ['a b c d', 'e f g h'], 'subset2': ['a b c d']}}
>>> dataset.questions_to_subsets
{'a b c d': {'/My dataset', '/My dataset/subset1', '/My dataset/subset2'}, 'e f g h': {'/My dataset', '/My dataset/subset1'}}
- Attributes
- name
- lower
- subsets_to_questions: dict
dictionary where keys are the names of the subsets and values are lists of questions or nested subsets
- questions_to_subsets: dict
dictionary where keys are the questions and values are the subsets they belong to
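The relationship between the two attributes can be illustrated with a small standalone sketch. It does not use mangoes and is not the library's implementation; `invert_subsets` is a hypothetical helper written only to reproduce the path-style keys shown in the examples above.

```python
def invert_subsets(subsets, path=""):
    """Map each question to the set of subset paths that contain it.

    Hypothetical helper, not part of mangoes: it only mirrors the
    '/name/subset' paths shown in the documented examples.
    """
    index = {}
    for name, content in subsets.items():
        current = f"{path}/{name}"
        if isinstance(content, dict):
            # Nested subsets: recurse, then also register each question
            # under the enclosing level.
            for question, paths in invert_subsets(content, current).items():
                index.setdefault(question, set()).update(paths | {current})
        else:
            # Leaf: a plain list of questions.
            for question in content:
                index.setdefault(question, set()).add(current)
    return index


subsets_to_questions = {
    "My dataset": {"subset1": ["a b c d", "e f g h"], "subset2": ["a b c d"]}
}
questions_to_subsets = invert_subsets(subsets_to_questions)
```

Under these assumptions, a question that appears in several subsets is listed once, with every path that leads to it.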
Methods
get_questions_and_gold([subset]): Returns a list of tuples with questions and expected answers
- property questions_to_subsets
- property subsets_to_questions
- get_questions_and_gold(subset=None)
Returns a list of tuples with questions and expected answers
- Parameters
- subset: str or None
if None, return all the questions in the dataset; otherwise, return only the questions of the given subset.
- Returns
- list of tuples (question, gold)
Examples
>>> dataset = Dataset({"My dataset": {"subset1": ['a b c d', 'e f g h'],
...                                   "subset2": ['a b c d']}})
>>> dataset.get_questions_and_gold()
[Question(question='e f g', gold='h'), Question(question='a b c', gold='d')]
>>> dataset.get_questions_and_gold("/My dataset/subset2")
['a b c d']
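To make the (question, gold) pairs concrete, here is a minimal standalone sketch of how an analogy string might be split into a question and its expected answer. It assumes the last token is the gold answer, as the `Question(question='a b c', gold='d')` output above suggests. `split_analogy` and this `Question` tuple are hypothetical names and do not reproduce the library's actual parser.

```python
from collections import namedtuple

# Hypothetical stand-in for the tuple type shown in the documented output.
Question = namedtuple("Question", ["question", "gold"])


def split_analogy(raw):
    """Split 'a b c d' into the question 'a b c' and the gold answer 'd'.

    Assumption: the expected answer is the last whitespace-separated token.
    """
    *terms, gold = raw.split()
    return Question(question=" ".join(terms), gold=gold)


raw_questions = ["a b c d", "e f g h", "a b c d"]
# dict.fromkeys drops repeated questions while keeping first-seen order.
parsed = [split_analogy(q) for q in dict.fromkeys(raw_questions)]
```

The order of the pairs returned by the real method may differ from this sketch, as the documented example lists 'e f g' first.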
- class mangoes.dataset.OutlierDetectionDataset(*args, **kwargs)
Bases: mangoes.dataset.Dataset
Base class for datasets for the Outlier Detection task
- Attributes
- questions_to_subsets
- subsets_to_questions
Methods
get_questions_and_gold([subset]): Returns a list of tuples with questions and expected answers
- mangoes.dataset.nb_questions(subset)
Returns the total number of questions in a subset of a Dataset
- Parameters
- subset: dict
subset of a Dataset
- Returns
- int
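As an illustration of what counting over a nested subset involves, the following standalone sketch walks a dictionary of subsets recursively. `count_questions` is a hypothetical name, and the sketch assumes every occurrence of a question is counted, which may not match how `nb_questions` treats questions shared between subsets.

```python
def count_questions(subset):
    """Count the questions in a list, or in a (possibly nested) dict of lists.

    Hypothetical helper, not the mangoes implementation. Assumption: a
    question that appears in two subsets is counted twice.
    """
    if isinstance(subset, dict):
        # Sum the counts of every nested subset.
        return sum(count_questions(child) for child in subset.values())
    return len(subset)


example_subset = {"subset1": ["a b c d", "e f g h"], "subset2": ["a b c d"]}
total = count_questions(example_subset)
```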
- mangoes.dataset.load(dataset_name, language='en', lower=True)
Loads a dataset from the AVAILABLE_DATASETS
- Parameters
- dataset_name: str
the name of the dataset, must be in AVAILABLE_DATASETS
- language: {'en', 'fr'}
Code of the language of the dataset (default = 'en')
- lower: boolean
whether the questions of the dataset should be lowercased
- Returns
- mangoes.Dataset