Maestría en Ciencias de los Datos y Analítica (tesis)
Recent submissions
Publication: Integración de modelos estadísticos y de aprendizaje automático para predecir y mitigar la rotación voluntaria de empleados (Universidad EAFIT, 2025) González Ruiz, John Jairo; Almonacid Hurtado, Paula María

Publication: Discriminación étnica en préstamos hipotecarios en Estados Unidos : un análisis predictivo con métodos causales y de aprendizaje automático (Universidad EAFIT, 2025) Galeano Naranjo, Juan Pablo; Almonacid Hurtado, Paula María; Álvarez Franco, Pilar Beatriz; Cruz Castañeda, Vivian

Publication: Identificación de riesgos emergentes por medio del análisis de sentimientos con técnicas de aprendizaje automático (Universidad EAFIT, 2025) Merizalde Maya, Pablo; Peña Palacio, Juan Alejandro

Publication: Ajuste fino de un modelo LLM para realizar reportes resumidos de expertos en trading, con integración de datos desde redes sociales (Universidad EAFIT, 2025) Restrepo Acevedo, Andrés Felipe; Martínez Vargas, Juan David
The contemporary financial market is characterized by its high complexity and the massive volume of structured and unstructured data generated daily, posing significant challenges for individual investors in terms of analysis and informed decision-making. This project proposes the fine-tuning of a Small Language Model (SLM) integrated into a tool capable of generating financial analysis reports similar to those produced by experts. For the proof of concept (PoC), transcripts of financial analysis videos published by experts on their YouTube channels are used. The SLM is fine-tuned using instruction-based techniques and the LoRA (Low-Rank Adaptation) method, with the aim of extracting and summarizing key information relevant to individual investors.
The main objective of this tool is to assist individual investors by generating efficient and accessible reports, facilitating access to valuable information in natural language, and enhancing their ability to make data-driven decisions from unstructured data, all with minimal investment of time and resources. Experimental results demonstrate the viability of using fine-tuned SLMs to generate high-quality financial reports. Specifically, the selected model, finetune qlora unsloth llama 3.1 8B Instruct bnb 4bit v2 Q8 0, achieved an average score of 5.67 out of 10 in the evaluation conducted by a judge LLM, with an average cosine distance of 0.159 from the reference summaries generated by the foundational pretrained model GPT-4.1. This represents a 97.5% improvement over the same base model, Llama 3.1 8B Instruct, without fine-tuning. Qualitatively, the model exhibits high fidelity and coherence when extracting and synthesizing key information in moderately long contexts, although it struggles with thematic interpretation in considerably lengthy transcripts. Additionally, the tool is projected to save individual investors 560 hours annually, along with an estimated annual reduction in API costs ranging from 7.52 to 25 for the channels analyzed in the proof of concept.

Publication: Estrategias de optimización para canales transaccionales físicos en el sector bancario colombiano (Universidad EAFIT, 2025) Zapata Jiménez, John Fredy; López Moreno, Ana María
Digital transformation is rapidly reshaping the landscape of traditional banking, creating a dilemma for financial institutions: integrate new digital channels or improve the distribution of existing physical ones.
This thesis explores how multi-objective optimization techniques, such as integer linear programming and discrete-event stochastic simulation, can help address this dilemma within the context of the Colombian financial system. In an environment where customer habits and distribution models are constantly evolving, decision-makers must consider the impact of technology, implementation costs, and the adaptability of channels. This research addresses these challenges by developing a theoretical framework based on heuristic modeling and advanced techniques such as clustering and NLP algorithms. The aim is to provide recommendations for optimizing the distribution of transactional channels to enhance operational efficiency and customer experience. The thesis focuses on four specific objectives: retrieving and storing transaction data from distribution channels; preparing this information for clustering modeling; developing an optimization model for the distribution of physical channels; and analyzing the information to segment channels according to the optimization model. Optimizing distribution channels is essential to maintaining a competitive advantage in an increasingly digital environment. By effectively combining digital and physical channels, the banking system can improve operational efficiency, broaden its reach, and respond with greater agility to market demands. This study offers a comprehensive perspective and practical solutions to address current challenges in the distribution of physical transactional channels in the Colombian banking sector.

Publication: Modelo de predicción de venta en una compañía textil con técnicas de Machine Learning (Universidad EAFIT, 2025) Lezcano Echeverri, Jhon Wilder; Puerta Puerta, Henry Daniel
This study explores the implementation of sales forecasting models in a Colombian textile company, combining traditional techniques with Machine Learning-based approaches. Daily sales data from 187 stores between 2021 and 2025 were analyzed.
The methodology followed five stages: (1) exploratory analysis, (2) feature engineering, (3) model implementation, (4) model optimization and fine-tuning, and (5) comparative validation. The models implemented were Prophet, XGBoost, Random Forest, and regularized Linear Regression. Prophet achieved the best overall performance for units sold (R² = 0.7121), standing out for its ability to capture complex seasonal patterns and adapt to store-level variability. XGBoost demonstrated high accuracy in non-linear scenarios, Random Forest showed robustness to noise, and Linear Regression provided greater interpretability. Feature engineering resulted in 83 variables, including temporal components, trends, volatility, and special effects. A cross-sectional analysis revealed common patterns such as peak underestimation, higher error in smaller stores and on weekends, and lower accuracy in predicting monetary values compared to units. The findings confirm that sales forecasting using Machine Learning offers substantial improvements over traditional methods, enhancing operational efficiency, inventory optimization, and financial planning. Prophet is recommended as the primary model, along with the establishment of monthly recalibration cycles to maintain accuracy.

Publication: Detección de tópicos con aprendizaje automático para la identificación de riesgos emergentes (Universidad EAFIT, 2025) Hernández Martínez, Felipe; Peña Palacio, Juan Alejandro

Publication: On a Combination of Skewness and Kurtosis Matrices for Projection Pursuit Exploratory Cluster Analysis (Universidad EAFIT, 2025) Jaramillo Osorio, Esteban; Ortiz Arias, Santiago
Skewness and kurtosis are statistical measures critical for understanding distribution characteristics, particularly in normality testing, clustering, and outlier detection. While kurtosis has been widely explored in the literature, skewness remains underutilized despite its potential for identifying asymmetrical patterns in data.
Combining these measures could create a robust tool for exploratory data analysis (EDA). This research proposes a novel approach by developing a convex combination of skewness and kurtosis matrices. Using iterative procedures to maximize or minimize this combination, we aim to construct a matrix serving as a projection index for a projection pursuit algorithm. This matrix can identify clusters and outliers more effectively than either measure alone. To validate the methodology, experiments on artificial datasets and real-world data demonstrate the benefits of this combined approach in detecting non-normal features, evaluating clustering performance, and enhancing outlier detection.

Publication: Optimización de lotes de fabricación en una industria cosmética para maximizar el GMROI : un enfoque integrado de algoritmos de aprendizaje automático y ARIMA (Universidad EAFIT, 2025) Idárraga Ojeda, Leidy Viviana; Almonacid Hurtado, Paula María

Publication: Respuestas a preguntas en contratos de arrendamiento bajo la normativa ASC (Accounting Standards Codification) 842 utilizando grandes modelos de lenguaje (Universidad EAFIT, 2025) Armendáriz Peña, David Adrián; Olarte Hernández, Tomás
The ASC 842 standard, part of GAAP (Generally Accepted Accounting Principles) in the United States, establishes rules for recording leases in financial statements, enhancing transparency and comparability. However, its implementation poses significant challenges, such as interpreting complex contracts and extracting key information, tasks often performed manually, leading to high costs and errors. This thesis develops an automated system to address relevant questions about lease contracts using Natural Language Processing, Large Language Models, and Retrieval Augmented Generation. The goal is to reduce reliance on external consultants by identifying the information needed to draft technical accounting memos automatically.
The GenAI Lifecycle methodology was employed, including text vectorization using embedding models and data storage in vector databases like Pinecone. Using lease contracts obtained from the Securities and Exchange Commission, the system was developed to answer key questions about items such as dates, purchase options, or renewal terms, achieving at least 70% accuracy. The results demonstrate that the system significantly reduces the time and costs associated with contract analysis, improving accuracy in ASC 842 compliance. This approach has practical implications for the accounting industry, offering a scalable solution that democratizes access to advanced artificial intelligence tools, enabling companies to efficiently manage their regulatory processes. This work represents a significant step forward in integrating artificial intelligence to solve real-world accounting problems, fostering innovation in the extraction and analysis of regulatory information.

Publication: Characterization of Phytosanitary Risks in Agricultural Crops using Multispectral Images (Universidad EAFIT, 2025) García Montenegro, Michell; Peña Palacio, Juan Alejandro; Martínez Vargas, Juan David; Royal Academy of Engineering, Distinguished International Associates Program and RiSE group.

Publication: Modelo predictivo para optimizar el proceso de selección de aspirantes a becas talento en la Universidad EAFIT (Universidad EAFIT, 2025) Acosta Ospina, Juan Pablo; Tabares Betancur, Marta Silvia del Socorro

Publication: Comparación de métodos de aprendizaje de máquina en el análisis de series temporales para la predicción de tasas de cambio (Universidad EAFIT, 2025) Restrepo Vallejo, Stevens; Almonacid Hurtado, Paula María
The study of global financial markets represents a complex field of research, characterized by high competitiveness and volatility. The analysis of exchange rates serves as a focal point for investors and firms aiming to maximize profitability while minimizing risks.
Although various techniques currently exist for estimating exchange rate price changes, the inherent stochastic nature of the market, coupled with the influence of political-economic factors, continues to pose significant challenges for precise and reliable data analysis. This study addresses the prediction of the prices of some of the most significant exchange rates in this market. Machine learning methods, which have demonstrated outstanding performance in the literature on time series forecasting, are compared and evaluated against a baseline linear model. The study primarily employs Random Forest models, Long Short-Term Memory (LSTM) neural networks, and a hybrid model combining Convolutional Neural Networks (CNNs) with LSTMs. Additionally, the robustness of these models is explored in the presence of outliers, with the aim of mitigating the risks associated with predictions involving highly variable data behaviors. The goal is to develop an adaptable analytical framework that enables investors and financial analysts to anticipate market movements, thereby enhancing their ability to make data-driven, informed decisions.

Publication: Estimación del crecimiento poblacional de Leptopharsa Gibbicarina en palma de aceite (caso de estudio) (Universidad EAFIT, 2025) Salazar Hoyos, Alejandro; Restrepo Arias, Juan Felipe

Publication: Detección temprana de melanoma : aplicación de técnicas de procesamiento de imágenes y aprendizaje profundo (Universidad EAFIT, 2025) Lacouture Fierro, Juan David; Álvarez Barrera, Claudia Patricia
Skin cancer is the most common type of cancer worldwide, with melanoma accounting for only 1% of cases but causing most deaths associated with this disease. In the United States, 97,610 new cases of melanoma were diagnosed in 2023, with 7,990 deaths. In Colombia, the incidence of melanoma has increased significantly in recent years.
According to the Cuenta de Alto Costo, 7,881 new cases were reported in 2024, with 11.94% of diagnoses concentrated in Bogotá and the Central region. Additionally, the total number of cases treated in the country increased from 53,622 in 2017 to more than 105,000 in 2021. These figures rank Colombia fourth in the Americas for melanoma incidence, highlighting the urgent need to implement innovative tools for early diagnosis. This project develops a deep learning model to diagnose melanoma through medical imaging, using convolutional neural networks and advanced image processing techniques. The work covers data collection, training, and validation, aiming to deliver rapid and accurate diagnoses. The research advocates for the integration of artificial intelligence into medical practice, enabling early diagnosis in regions with limited access to specialists and alleviating the burden on the healthcare system. In conclusion, this initiative represents a milestone in dermatological care in Colombia, benefiting both high-incidence areas and rural communities.

Publication: Medellín seguro : predicción inteligente del número de hurtos a personas con algoritmos basados en series temporales (Universidad EAFIT, 2025) Guerra Medina, Cindy Paola; Moreno Reyes, Nicolas Alberto
Today, we are immersed in the data revolution, an era in which understanding past events is key to predicting the future and to supporting strategies that enable decisions to be made in advance. In this context, Colombia faces important challenges in security and coexistence, which can be addressed or estimated through data analysis. In Medellín, the open data portal Medata (medata.gov.co) provides access to historical and descriptive statistics on the incidence of crimes against persons such as theft, a recurring crime that affects citizens' security, quality of life, and economy.
This project proposes the use of time series algorithms implemented in the IBM SPSS Modeler platform, a robust and flexible tool that facilitates setting up competing predictive models (IBM, 2023, SPSS Modeler). By identifying patterns, trends, and seasonality in historical data, the project seeks to estimate the future incidence of theft from persons in the city of Medellín, disaggregating the analysis at the level of communes and neighborhoods. The projections will be made monthly for October, November, and December 2024. They will serve as input for planning preventive security strategies that prioritize the areas requiring greater attention and optimize available resources, minimizing the negative impacts of crime and generating a greater sense of tranquility and confidence among citizens.

Publication: Optimización de portafolios de inversión mediante pronósticos de volatilidad de «commodities» y acciones, utilizando modelos GARCH y «deep learning» (Universidad EAFIT, 2024) Villa Cardona, Jairo Alonso; Cruz Castañeda, Vivian

Publication: Estimación del efecto de las variables ambientales en la producción agrícola exportable en Antioquia usando modelos de ML (Universidad EAFIT, 2025) Páez Bermúdez, Johan Stiven; García Vargas, Johan Felipe

Publication: Análisis de patrones de violencia armada en la frontera de Colombia con Venezuela usando algoritmos de aprendizaje automático (Universidad EAFIT, 2025) Lopera Pai, Daniela; Aguilar Castro, José Lisandro