As administrator of this blog, I can see the keywords people use to find this site. One that comes up often is “word frequency databases.” I’m always curious about what people are looking for and how they might be using such databases. One corpus I’ve used several times is the one by Mark Davies at BYU. What I’m interested in is which words are high or low frequency in both of the languages a bilingual individual speaks, and which ones occur more often in one language than in the other. I think that controlling for these cross-language frequency differences in research might help us understand why people learn one word in one language and a different word in another language. I think part of the answer is that different words are needed in different contexts (I’ve written about that with colleagues, e.g., Peña, Bedore & Rappazzo), and part of it is cultural (e.g., Peña & Quinn). But there are likely to be frequency effects as well.
With access to these corpora, we can look up how often a word occurs in written or spoken language (usually expressed as occurrences per million words). Some corpora are even divided by time period. This is because languages change over time, and words that occurred at a high frequency 200 years ago might not be as common today. Having the time periods separated helps to make the right comparison.
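To make the “per million” unit concrete, here is a minimal Python sketch of the conversion. The function name and all of the counts are made up for illustration; they do not come from any actual corpus. The point is only that raw counts from corpora of different sizes are not comparable until they are put on the same scale.

```python
def per_million(raw_count, corpus_size):
    """Convert a raw count into occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical counts: the same concept looked up in two corpora of
# very different sizes.
english_rate = per_million(4_500, 100_000_000)
spanish_rate = per_million(600, 20_000_000)

print(f"English: {english_rate:.1f} per million")
print(f"Spanish: {spanish_rate:.1f} per million")
```

In this invented example the English raw count is much larger, but once both are expressed per million words the two rates are much closer together than the raw counts suggest.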
One thing I’ve learned is to pay attention to the form of the word and its part of speech. This is especially true when comparing word frequencies across languages. Since I do research in Spanish-English bilingualism, something to keep in mind is that Spanish is highly inflected. If I only look up the infinitive form of a Spanish verb, I may end up underestimating its frequency relative to English, because its occurrences are spread across many conjugated forms. Also, the same word can be used as a noun or a verb, etc., so it’s important to compare verbs to verbs and nouns to nouns. Overall, I find these corpora very helpful and I very much appreciate having them available. I’m also grateful that I don’t have to build my own!
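For anyone who wants to see the point about inflected forms in practice, here is a rough Python sketch. It assumes you have exported a list of word forms with a lemma, a part-of-speech tag, and a per-million frequency for each; the row layout, the function name, and every number below are hypothetical, not the format or data of any particular corpus.

```python
from collections import defaultdict

# Hypothetical rows: (word form, lemma, part of speech, frequency per million)
spanish_rows = [
    ("comer", "comer", "verb", 20.0),
    ("comen", "comer", "verb", 15.0),
    ("comí", "comer", "verb", 25.0),
    ("comida", "comida", "noun", 40.0),
]
english_rows = [
    ("walk", "walk", "verb", 50.0),
    ("walked", "walk", "verb", 30.0),
    ("walk", "walk", "noun", 10.0),
]


def lemma_frequencies(rows):
    """Sum frequencies over all forms that share a lemma and part of speech."""
    totals = defaultdict(float)
    for _form, lemma, pos, freq in rows:
        totals[(lemma, pos)] += freq
    return dict(totals)


spanish = lemma_frequencies(spanish_rows)
english = lemma_frequencies(english_rows)

# Looking up only the infinitive misses the conjugated forms.
print("comer, infinitive only:", 20.0)
print("comer, all verb forms:", spanish[("comer", "verb")])

# Keying on part of speech keeps the noun "walk" apart from the verb "walk".
print("walk as verb:", english[("walk", "verb")])
print("walk as noun:", english[("walk", "noun")])
```

Grouping by lemma and part of speech together is a design choice: it lets the verb total in one language be compared with the verb total in the other, without noun uses of the same spelling leaking in.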