Text clustering applied to imbalanced legal data.
2030 UN Agenda; Machine Learning; Clustering and Text Classification.
The Federal Supreme Court (STF), the highest court in the Brazilian judicial system, produces, like courts at other levels, an immense amount of textual data in the form of decisions, petitions, injunctions, appeals and other legal documents. These documents are classified and grouped by public employees who specialize in cataloging court cases and who, in specific situations, rely on technological support tools. Some cases in the STF, for example, are classified under one or more Sustainable Development Goals (SDGs) of the United Nations (UN) 2030 Agenda. Because this is a repetitive task based on pattern recognition, machine learning tools can be developed for it. In this work, Natural Language Processing (NLP) models are proposed for clustering cases in order to enlarge the set of labeled examples for SDGs that naturally have few entries. Clustering, which is important in its own right, can also gather unlabeled entries around cases already classified by court officials, allowing new labels to be assigned to similar cases. Preliminary results show that cluster-augmented sets can be used in supervised learning pipelines to aid in the classification of legal texts, especially in contexts with imbalanced data.
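The idea of gathering unlabeled entries around already-classified cases can be illustrated with a minimal sketch. It is not the method of this work: it assumes a seeded variant of k-means over bag-of-words vectors with cosine similarity, and the function names, the `min_sim` threshold, the label strings and the toy texts are all hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts, scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def centroid(vectors):
    """Unit-length mean direction of a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for t, v in vec.items():
            total[t] += v
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {t: v / norm for t, v in total.items()} if norm else {}

def seeded_clusters(labeled, unlabeled, min_sim=0.2, n_iter=10):
    """Grow one cluster per known label, seeded by the labeled texts.

    labeled: list of (text, label); unlabeled: list of text.
    Returns (text, proposed_label) pairs; the label is None when the
    text is not similar enough to any cluster (min_sim is an assumption).
    """
    lab_vecs = [(vectorize(t), y) for t, y in labeled]
    unl_vecs = [vectorize(t) for t in unlabeled]
    labels = sorted({y for _, y in lab_vecs})
    # Initial centroids come only from the manually labeled cases.
    cents = {y: centroid([v for v, yy in lab_vecs if yy == y]) for y in labels}
    assign = [None] * len(unl_vecs)
    for _ in range(n_iter):
        new_assign = []
        for v in unl_vecs:
            best = max(labels, key=lambda y: cosine(v, cents[y]))
            new_assign.append(best if cosine(v, cents[best]) >= min_sim else None)
        if new_assign == assign:  # assignments stable: stop early
            break
        assign = new_assign
        # Recompute centroids from the seeds plus the newly attached texts.
        cents = {
            y: centroid(
                [v for v, yy in lab_vecs if yy == y]
                + [v for v, a in zip(unl_vecs, assign) if a == y]
            )
            for y in labels
        }
    return list(zip(unlabeled, assign))

# Toy data, invented for illustration only.
labeled = [
    ("public health hospital treatment medication", "SDG3"),
    ("deforestation emissions climate environmental license", "SDG13"),
]
unlabeled = [
    "appeal concerning hospital medication supply",
    "injunction against deforestation and emissions",
    "tax dispute over import duties",
]
proposed = seeded_clusters(labeled, unlabeled)
for text, label in proposed:
    print(label, "|", text)
```

Texts that receive a proposed label could then be added to the training set of a supervised classifier, ideally after review, while texts left as `None` stay out of the augmented set.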