Vol. 72 No. 1 (2024): NAMES: A Journal of Onomastics

Navigating Linguistic Similarities Among Countries Using Fuzzy Sets of Proper Names

Davor Lauc
University of Zagreb

Published 2024-03-12


  • first name,
  • language similarity,
  • language distance,
  • phonetic similarity,
  • socioonomastics,
  • proper name,
  • anthroponomastics


This paper examines the commonalities among several countries and languages through the lens of proper names, especially forenames. It posits that investigating these names offers a fresh perspective on language similarity, because cross-cultural interactions and language contact influence proper names differently than they influence regular vocabulary. The study introduces a novel measure that generalizes the similarity between sets by taking the distances between their elements into account, and employs this measure to assess phonetic commonalities in forenames. The results show a notable correlation between the commonality of proper names across languages and the overall commonality of the languages themselves. The forename commonalities also yielded insights beyond this correlation. As the investigation shows, proper names can serve as a potentially powerful indicator of language similarity and may be used to reveal further cultural commonalities and disparities among nations. The paper concludes by addressing the limitations of this research and discussing prospects for subsequent studies.
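
To make the idea of a set similarity that accounts for distances between elements more concrete, the following is a minimal sketch and not the measure defined in the paper, whose exact formula the abstract does not state. It assumes a "soft" Jaccard index in which each name's degree of membership in the other set is its highest similarity to any name there. The names `soft_jaccard` and `string_similarity` are hypothetical, and normalized edit distance on spellings stands in for the phonetic distance the study actually uses.

```python
from typing import Callable, Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; an orthographic stand-in for phonetic similarity."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def soft_jaccard(xs: Iterable[str], ys: Iterable[str],
                 sim: Callable[[str, str], float] = string_similarity) -> float:
    """Assumed generalization of the Jaccard index (illustrative only).

    Each element contributes its best similarity to the other set, so
    near matches count partially instead of counting as zero.
    """
    xs, ys = set(xs), set(ys)
    if not xs and not ys:
        return 1.0
    if not xs or not ys:
        return 0.0
    # Average the two directions so the measure is symmetric.
    soft_intersection = (sum(max(sim(x, y) for y in ys) for x in xs)
                         + sum(max(sim(y, x) for x in xs) for y in ys)) / 2
    soft_union = len(xs) + len(ys) - soft_intersection
    return soft_intersection / soft_union


# The two sets share no identical name, so the classic Jaccard index is 0,
# while the soft version credits the near matches ivan/iwan and ana/anna.
print(soft_jaccard({"ivan", "ana"}, {"iwan", "anna"}))
```

With a similarity function that returns 1 for identical names and 0 otherwise, this sketch reduces to the ordinary Jaccard index, which is the sense in which it generalizes set similarity.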

