Text classification: an approach using machine learning

Authors

DOI:

https://doi.org/10.62758/re.v3i3.212

Keywords:

Classification, Machine Learning, Algorithms, Information, Information Science

Abstract

Text classification underpins knowledge organization in a wide range of fields, because grouping documents into categories guides how those domains are segmented. In the digital information age, with abundant data spread across cloud computing environments, information technologies are essential for classifying that data. Within this framework, Information Science plays a pivotal role in the production, organization, transmission, and use of information across diverse fields, including computer science, mathematics, and artificial intelligence. With the help of technology, appropriately classified information can be made available to society more effectively. This article examines text classification using Machine Learning. The research is exploratory, follows an experimental method, and analyzes its data quantitatively. Using Euclidean distance, the study produced a distance matrix and a hierarchical clustering of the documents, along with a word cloud highlighting their most significant terms.
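The steps named in the abstract (Euclidean distance, a distance matrix, hierarchical grouping, and term weighting for a word cloud) can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy corpus, the whitespace tokenizer, the raw term counts, the single-linkage merge rule, and every function name below are assumptions chosen for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the article's documents are not reproduced here.
docs = {
    "d1": "machine learning classifies text documents",
    "d2": "text classification with machine learning",
    "d3": "libraries organize knowledge with categories",
}

def term_counts(text):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())

def euclidean(a, b, vocab):
    """Euclidean distance between two term-count vectors over a shared vocabulary."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in vocab))

def distance_matrix(docs):
    """Return document names and the symmetric matrix of pairwise distances."""
    counts = {name: term_counts(text) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    names = list(docs)
    matrix = [[euclidean(counts[i], counts[j], vocab) for j in names] for i in names]
    return names, matrix

def single_linkage(names, dist):
    """Agglomerative grouping: repeatedly merge the two closest clusters.

    Cluster distance is the minimum pairwise document distance (single
    linkage, an assumption; the article does not state its linkage rule).
    """
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append(([names[i] for i in clusters[x]],
                       [names[i] for i in clusters[y]], d))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return merges

def top_terms(docs, n):
    """Most frequent terms across the corpus, usable as word-cloud weights."""
    total = Counter()
    for text in docs.values():
        total.update(term_counts(text))
    return total.most_common(n)

names, dist = distance_matrix(docs)
merges = single_linkage(names, dist)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 3))
print(top_terms(docs, 4))
```

The merge list is the information a dendrogram draws: documents that share more vocabulary join at smaller distances. A real study would normally add preprocessing such as stop-word removal and a weighting scheme such as TF-IDF before computing distances.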

Author Biographies

Edberto Ferneda, Universidade Estadual Paulista (Unesp)

Full Professor in Information Retrieval (2016). Postdoctorate from the Federal University of Paraíba (2013). PhD in Communication Sciences (Information Science) from the University of São Paulo (2003). Master's in Informatics from the Federal University of Paraíba (1997). Holds a degree in Data Processing from the former Educational Foundation of Bauru (1985). Currently an Associate Professor in the Department of Information Science at São Paulo State University "Júlio de Mesquita Filho" (UNESP), Marília Campus. Works in Information Science, mainly in the areas of Automatic Indexing and Information Retrieval. CNPq Research Productivity Fellow, Level 2.

Leonardo Botega, Universidade Estadual Paulista (Unesp)

PhD in Computer Science from the Federal University of São Carlos (UFSCar), with a postdoctorate from the University of São Paulo (USP). Permanent Member of the Postgraduate Program in Information Science at UNESP-Marília. Collaborating Member of the Postgraduate Program in Computer Science at UNESP-Bauru/Prudente. Collaborating Researcher at the Institute of Computing at UNICAMP. Data Product Manager at PISMO. Leader of the Human-Computer Interaction Group (GIHC) at UNESP. Reviewer for journals in the areas of data fusion, critical decision-making systems, semantic web, and information systems. Has academic and professional experience in Data and Information Fusion, Data Mining, Data and Information Quality, Semantic Web, Management of Critical Data, and Critical Decision-Making Systems. Has published in national and international events and journals, and has supervised undergraduate, master's, and doctoral work supported by scholarships from CAPES, CNPq, and FAPESP.

Published

2023-12-21

How to Cite

Cardoso, F. E., Ferneda, E., & Botega, L. (2023). Text classification: an approach using machine learning. Revista EDICIC, 3(3), 1–17. https://doi.org/10.62758/re.v3i3.212