Vol. 69 No. 3 (2021)

Corpus-Based Methods for Recognizing the Gender of Anthroponyms

Published 2021-08-16


  • anthroponymy,
  • co-occurrence statistics,
  • corpus linguistics,
  • gender recognition,
  • given names,
  • Spanish


This paper presents a series of methods for automatically determining the gender of proper names, based on their co-occurrence with words and grammatical features in a large corpus. Although the results reported here are for Spanish given names, the method can be readily replicated for names in other languages. Most methods reported in the literature rely on pre-existing lists of first names, which require costly manual processing and quickly become outdated. Instead, we propose using corpora, which makes it possible to obtain real and up-to-date name-gender links. To test the effectiveness of our approach, we explored various machine-learning methods as well as a method based on simple frequency of co-occurrence. The latter produced the best results: 93% precision and 88% recall on a database of ca. 10,000 mixed names. Our method can be applied to a variety of natural language processing tasks, such as information extraction, machine translation, anaphora resolution, and large-scale delivery or email correspondence, among others.
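The frequency-of-co-occurrence idea can be illustrated with a minimal sketch. It is not the authors' implementation: the cue words (gendered Spanish titles such as "don" and "doña"), the `min_ratio` threshold, the function names, and the toy sentences are all assumptions made for illustration, and the abstract does not specify which words or grammatical features the paper actually uses. The sketch counts how often each candidate name follows a masculine or feminine cue, assigns a gender only when one side clearly dominates, and then scores the result with precision and recall.

```python
from collections import Counter, defaultdict

# Hypothetical cue words (an assumption, not the paper's feature set):
# gendered titles that often directly precede a given name in Spanish.
MASC_CUES = {"señor", "don", "sr"}
FEM_CUES = {"señora", "doña", "sra"}


def count_cooccurrences(sentences, names):
    """For each known name, count masculine and feminine cues that precede it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Naive tokenization: lowercase, split on whitespace, trim punctuation.
        tokens = [t.strip(".,;:!?") for t in sentence.lower().split()]
        for prev, tok in zip(tokens, tokens[1:]):
            if tok in names:
                if prev in MASC_CUES:
                    counts[tok]["m"] += 1
                elif prev in FEM_CUES:
                    counts[tok]["f"] += 1
    return counts


def classify(counts, min_ratio=2.0):
    """Label a name only if one gender's count is at least min_ratio times the other."""
    labels = {}
    for name, c in counts.items():
        m, f = c["m"], c["f"]
        if m > 0 and m >= min_ratio * f:
            labels[name] = "m"
        elif f > 0 and f >= min_ratio * m:
            labels[name] = "f"
        # Otherwise the evidence is mixed and the name stays unlabeled.
    return labels


def precision_recall(predicted, gold):
    """Precision over labeled names; recall over all names in the gold list."""
    correct = sum(1 for name, g in predicted.items() if gold.get(name) == g)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for this example.
sentences = [
    "Don Carlos llegó tarde.",
    "La señora Lucía habló con don Carlos.",
    "Doña Lucía y el señor Andrea.",
    "La señora Andrea saludó.",
]
names = {"carlos", "lucía", "andrea", "marta"}
gold = {"carlos": "m", "lucía": "f", "andrea": "f", "marta": "f"}

labels = classify(count_cooccurrences(sentences, names))
precision, recall = precision_recall(labels, gold)
print(labels, precision, recall)
```

In this toy run, "andrea" is left unlabeled because its cues are evenly split, and "marta" never appears in the corpus, so both lower recall without affecting precision. A threshold of this kind is one simple way to trade precision against recall, which is the kind of balance the reported 93% / 88% figures describe.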

