Lexical databases

On this page I post lexical databases of interest for educational or research purposes. If you know of any other good resources, please send me a message!

Children’s Printed Word Database

Computerized database of printed word frequencies as read by children aged between 5 and 9. The database may be used to develop stimuli for experimental work investigating the literacy acquisition of young children.

Link: http://www.essex.ac.uk/psychology/cpwd/


SUBTLEX-US

Frequencies of over 75,000 English words, based on the subtitles of American films and television series (51 million words in total). These word frequencies correlate substantially more strongly with the word processing times from the English Lexicon Project than the frequencies from Kucera and Francis (1967) or CELEX (1993). See also: SUBTLEX-CH (frequencies of Chinese words) and SUBTLEX-NL (frequencies of Dutch words).

Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977-990.

Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 1-7. Retrieved from http://dx.doi.org/10.3758/s13428-012-0190-4

Link: http://expsy.ugent.be/subtlexus/
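
A minimal sketch of how frequency norms like these are typically compared against word processing times: convert raw counts to log frequency per million words, then correlate them with mean reaction times. The counts and reaction times below are invented for illustration and are not taken from SUBTLEX-US or the English Lexicon Project; only the 51-million-word corpus size comes from the description above. The helper names `log_frequency` and `pearson` are hypothetical.

```python
import math

CORPUS_SIZE = 51_000_000  # corpus size stated in the description above


def log_frequency(count, corpus_size=CORPUS_SIZE):
    """Return log10 of (frequency per million words + 1)."""
    per_million = count * 1_000_000 / corpus_size
    # The +1 keeps the value finite for words with a count of zero.
    return math.log10(per_million + 1)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy data: word -> (raw count, mean lexical decision time in ms).
toy_items = {
    "house": (25_000, 560.0),
    "river": (5_000, 600.0),
    "candle": (1_000, 640.0),
    "gargle": (50, 720.0),
    "quince": (10, 760.0),
}

log_freqs = [log_frequency(count) for count, _ in toy_items.values()]
mean_rts = [rt for _, rt in toy_items.values()]

# More frequent words are usually recognized faster, so r should be negative.
r = pearson(log_freqs, mean_rts)
print(f"r = {r:.3f}")
```

To compare two sets of norms, one would run the same computation once per frequency measure on the same words and see which measure yields the stronger (more negative) correlation.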

Google Ngram

The Google Ngram corpus was originally based on 5.2 million books, published between 1500 and 2008, containing 500 billion words in different languages (American English, British English, French, German, Spanish, Russian, and Chinese).

Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., & Petrov, S. (2012). Syntactic Annotations for the Google Books Ngram Corpus. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 169-174. Retrieved from http://www.aclweb.org/anthology/P12-3029

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, et al. (2011). Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014), 176-182.

Link: http://books.google.com/ngrams

The English Lexicon Project

The English Lexicon Project provides access to a large set of lexical characteristics, along with behavioral data from visual lexical decision and naming studies of 40,481 words and 40,481 nonwords. The goal of the English Lexicon Project is to collect normative data for speeded naming and lexical decision for over 40,000 words across 1,200 subjects at six different universities. These data will be integrated into a database along with descriptive characteristics of the words used in the study. Researchers interested in psycholinguistics, human memory, computational modeling, and other fields will find these data useful.

Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., et al. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445-459.

Link: http://elexicon.wustl.edu/

The British Lexicon Project

The British Lexicon Project contains lexical decision data for over 28,000 monosyllabic and disyllabic English words. Average reaction times, accuracy, and other stimulus characteristics are available (word frequency measures, neighborhood measures, and morphological and syntactic [part-of-speech] information).

Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287-304.

Link: http://crr.ugent.be/programs-data/lexicon-projects

Age-of-acquisition ratings for 30,000 English words

Kuperman, Stadthagen-Gonzalez, and Brysbaert collected age-of-acquisition (AoA) ratings for 30,121 English content words (nouns, verbs, and adjectives). Collecting these new AoA norms was made possible by the web-based crowdsourcing technology offered by Amazon Mechanical Turk. Correlations with existing AoA measures suggest that these estimates are as good as the existing ones.

Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 1-13. Retrieved from http://dx.doi.org/10.3758/s13428-012-0210-4

Link: http://crr.ugent.be/archives/806

Age of acquisition ratings for 3,000 monosyllabic words

Cortese and Khanna obtained age-of-acquisition (AoA) ratings for 3,000 monosyllabic words from 32 participants, who rated each word on a 1–7 scale.

Cortese, M. J., & Khanna, M. M. (2008). Age of acquisition ratings for 3,000 monosyllabic words. Behavior Research Methods, 40(3), 791-794.

Link: http://dx.doi.org/10.3758/BRM.40.3.791

Age of acquisition estimates for 3,000 disyllabic words

Schock, Cortese, Khanna, and Toppi obtained age-of-acquisition (AoA) ratings for 3,000 disyllabic words from 32 participants, who rated each word on a 1–7 scale.

Schock, J., Cortese, M., Khanna, M., & Toppi, S. (2012). Age of acquisition estimates for 3,000 disyllabic words. Behavior Research Methods, 1-7.

Link: http://dx.doi.org/10.3758/s13428-012-0209-x