Maestría en Ciencias de los Datos y Analítica [Master's in Data Science and Analytics] (theses)
Recent submissions
Modelo de predicción de venta en una compañía textil con técnicas de Machine Learning [Sales forecasting model for a textile company using Machine Learning techniques] (Universidad EAFIT, 2025) Lezcano Echeverri, Jhon Wilder; Puerta Puerta, Henry Daniel
This study explores the implementation of sales forecasting models in a Colombian textile company, combining traditional techniques with Machine Learning-based approaches. Daily sales data from 187 stores between 2021 and 2025 were analyzed. The methodology followed five stages: (1) exploratory analysis, (2) feature engineering, (3) model implementation, (4) model optimization and fine-tuning, and (5) comparative validation. The models implemented were Prophet, XGBoost, Random Forest, and regularized Linear Regression. Prophet achieved the best overall performance for units sold (R² = 0.7121), standing out for its ability to capture complex seasonal patterns and adapt to store-level variability. XGBoost demonstrated high accuracy in non-linear scenarios, Random Forest showed robustness to noise, and Linear Regression provided greater interpretability. Feature engineering resulted in 83 variables, including temporal components, trends, volatility, and special effects. A cross-sectional analysis revealed common patterns such as peak underestimation, higher error in smaller stores and on weekends, and lower accuracy in predicting monetary values than in predicting units. The findings confirm that sales forecasting using Machine Learning offers substantial improvements over traditional methods, enhancing operational efficiency, inventory optimization, and financial planning.
Prophet is recommended as the primary model, along with monthly recalibration cycles to maintain accuracy.

Detección de tópicos con aprendizaje automático para la identificación de riesgos emergentes [Topic detection with machine learning for identifying emerging risks] (Universidad EAFIT, 2025) Hernández Martínez, Felipe; Peña Palacio, Juan Alejandro

On a Combination of Skewness and Kurtosis Matrices for Projection Pursuit Exploratory Cluster Analysis (Universidad EAFIT, 2025) Jaramillo Osorio, Esteban; Ortiz Arias, Santiago
Skewness and kurtosis are statistical measures critical for understanding distribution characteristics, particularly in normality testing, clustering, and outlier detection. While kurtosis has been widely explored in the literature, skewness remains underutilized despite its potential for identifying asymmetrical patterns in data. Combining these measures could create a robust tool for exploratory data analysis (EDA). This research proposes a novel approach by developing a convex combination of skewness and kurtosis matrices. Using iterative procedures to maximize or minimize this combination, we aim to construct a matrix serving as a projection index for a projection pursuit algorithm. This matrix can identify clusters and outliers more effectively than either measure alone.
Experiments on artificial datasets and real-world data validate the methodology, demonstrating the benefits of this combined approach in detecting non-normal features, evaluating clustering performance, and enhancing outlier detection.

Optimización de lotes de fabricación en una industria cosmética para maximizar el GMROI : un enfoque integrado de algoritmos de aprendizaje automático y ARIMA [Optimizing manufacturing batches in a cosmetics company to maximize GMROI: an integrated approach using machine learning algorithms and ARIMA] (Universidad EAFIT, 2025) Idárraga Ojeda, Leidy Viviana; Almonacid Hurtado, Paula María

Respuestas a preguntas en contratos de arrendamiento bajo la normativa ASC (Accounting Standards Codification) 842 utilizando grandes modelos de lenguaje [Question answering on lease contracts under ASC (Accounting Standards Codification) 842 using large language models] (Universidad EAFIT, 2025) Armendáriz Peña, David Adrián; Olarte Hernández, Tomás
The ASC 842 standard, part of GAAP (Generally Accepted Accounting Principles) in the United States, establishes rules for recording leases in financial statements, enhancing transparency and comparability. However, its implementation poses significant challenges, such as interpreting complex contracts and extracting key information, tasks that are often performed manually and lead to high costs and errors. This thesis develops an automated system that answers relevant questions about lease contracts using Natural Language Processing, Large Language Models, and Retrieval-Augmented Generation. The goal is to reduce reliance on external consultants by automatically identifying the information needed to draft technical accounting memos. The GenAI Lifecycle methodology was employed, including text vectorization using embedding models and data storage in vector databases such as Pinecone. Using lease contracts obtained from the Securities and Exchange Commission, the system was developed to answer key questions about items such as dates, purchase options, and renewal terms, achieving at least 70% accuracy. The results demonstrate that the system significantly reduces the time and costs associated with contract analysis and improves accuracy in ASC 842 compliance.
This approach has practical implications for the accounting industry, offering a scalable solution that democratizes access to advanced artificial intelligence tools and enables companies to manage their regulatory processes efficiently. This work represents a significant step forward in integrating artificial intelligence to solve real-world accounting problems, fostering innovation in the extraction and analysis of regulatory information.

Characterization of Phytosanitary Risks in Agricultural Crops using Multispectral Images (Universidad EAFIT, 2025) García Montenegro, Michell; Peña Palacio, Juan Alejandro; Martínez Vargas, Juan David; Royal Academy of Engineering, Distinguished International Associates Program and RiSE group.

Modelo predictivo para optimizar el proceso de selección de aspirantes a becas talento en la Universidad EAFIT [Predictive model to optimize the selection process for talent scholarship applicants at Universidad EAFIT] (Universidad EAFIT, 2025) Acosta Ospina, Juan Pablo; Tabares Betancur, Marta Silvia del Socorro

Comparación de métodos de aprendizaje de máquina en el análisis de series temporales para la predicción de tasas de cambio [Comparison of machine learning methods in time series analysis for exchange rate prediction] (Universidad EAFIT, 2025) Restrepo Vallejo, Stevens; Almonacid Hurtado, Paula María
The study of global financial markets is a complex field of research, characterized by high competitiveness and volatility. The analysis of exchange rates serves as a focal point for investors and firms aiming to maximize profitability while minimizing risks. Although various techniques exist for estimating exchange rate price changes, the inherently stochastic nature of the market, coupled with the influence of political and economic factors, continues to pose significant challenges for precise and reliable data analysis. This study addresses the prediction of the prices of some of the most significant exchange rates in this market.
Machine learning methods that have demonstrated outstanding performance in the time series forecasting literature are compared and evaluated against a baseline linear model. The study primarily employs Random Forest models, Long Short-Term Memory (LSTM) neural networks, and a hybrid model combining Convolutional Neural Networks (CNNs) with LSTMs. Additionally, the robustness of these models is explored in the presence of outliers, with the aim of mitigating the risks associated with predictions on highly variable data. The goal is to develop an adaptable analytical framework that enables investors and financial analysts to anticipate market movements, thereby enhancing their ability to make informed, data-driven decisions.

Estimación del crecimiento poblacional de Leptopharsa Gibbicarina en palma de aceite (caso de estudio) [Estimating the population growth of Leptopharsa gibbicarina in oil palm (case study)] (Universidad EAFIT, 2025) Salazar Hoyos, Alejandro; Restrepo Arias, Juan Felipe

Detección temprana de melanoma : aplicación de técnicas de procesamiento de imágenes y aprendizaje profundo [Early detection of melanoma: applying image processing and deep learning techniques] (Universidad EAFIT, 2025) Lacouture Fierro, Juan David; Álvarez Barrera, Claudia Patricia
Skin cancer is the most common type of cancer worldwide, with melanoma accounting for only 1% of cases but causing most deaths associated with this disease. In the United States, 97,610 new cases of melanoma were diagnosed in 2023, with 7,990 deaths. In Colombia, the incidence of melanoma has increased significantly in recent years. According to the Cuenta de Alto Costo, 7,881 new cases were reported in 2024, with 11.94% of diagnoses concentrated in Bogotá and the Central region. Additionally, the total number of cases treated in the country increased from 53,622 in 2017 to more than 105,000 in 2021. These figures rank Colombia fourth in the Americas for melanoma incidence, highlighting the urgent need to implement innovative tools for early diagnosis.
This project develops a deep learning model to diagnose melanoma from medical images, using convolutional neural networks and advanced image processing techniques. The work covers data collection, training, and validation, aiming to deliver rapid and accurate diagnoses. The research advocates the integration of artificial intelligence into medical practice, enabling early diagnosis in regions with limited access to specialists and alleviating the burden on the healthcare system. In conclusion, this initiative represents a milestone in dermatological care in Colombia, benefiting both high-incidence areas and rural communities.

Medellín seguro : predicción inteligente del número de hurtos a personas con algoritmos basados en series temporales [Safe Medellín: intelligent prediction of the number of thefts from persons using time series algorithms] (Universidad EAFIT, 2025) Guerra Medina, Cindy Paola; Moreno Reyes, Nicolas Alberto
Today we are immersed in the data revolution, an era in which understanding past events makes it possible to predict future ones and to support strategies for making decisions in advance. In this context, Colombia faces significant challenges in security and coexistence that can be addressed or estimated through data analysis. In Medellín, the open data portal Medata (medata.gov.co) provides access to historical and descriptive statistics on the incidence of crimes against persons such as theft, a recurring crime that affects the security, quality of life, and economy of citizens. This project proposes the use of time series algorithms implemented in the IBM SPSS Modeler platform, a robust and flexible tool that facilitates setting up competing predictive models (IBM, 2023, SPSS Modeler). By identifying patterns, trends, and seasonality in historical data, the project seeks to estimate the future incidence of theft from persons in the city of Medellín, disaggregating the analysis at the level of communes and neighborhoods.
The projections will be made monthly for October, November, and December 2024. They will serve as input for planning preventive security strategies that help prioritize the areas requiring the most attention, optimize available resources, minimize the negative impacts of crime, and give citizens a greater sense of tranquility and confidence.

Optimización de portafolios de inversión mediante pronósticos de volatilidad de «commodities» y acciones, utilizando modelos GARCH y «deep learning» [Investment portfolio optimization through volatility forecasts for commodities and stocks, using GARCH models and deep learning] (Universidad EAFIT, 2024) Villa Cardona, Jairo Alonso; Cruz Castañeda, Vivian

Estimación del efecto de las variables ambientales en la producción agrícola exportable en Antioquia usando modelos de ML [Estimating the effect of environmental variables on exportable agricultural production in Antioquia using ML models] (Universidad EAFIT, 2025) Páez Bermúdez, Johan Stiven; García Vargas, Johan Felipe

Análisis de patrones de violencia armada en la frontera de Colombia con Venezuela usando algoritmos de aprendizaje automático [Analysis of armed violence patterns on Colombia's border with Venezuela using machine learning algorithms] (Universidad EAFIT, 2025) Lopera Pai, Daniela; Aguilar Castro, José Lisandro

Financial well-being and credit behavior in Mexico (Universidad EAFIT, 2025) Patiño Hurtado, Germán Alonso; Hernández Zuluaga, Juan Felipe

Aplicación de redes neuronales convolucionales y técnicas de procesamiento de lenguaje natural para el análisis de sentimiento en datos financieros [Applying convolutional neural networks and natural language processing techniques to sentiment analysis of financial data] (Universidad EAFIT, 2025) Fernández Ceballos, Juan Manuel; Almonacid Hurtado, Paula María

Clasificación ABC de inventarios mediante modelos de aprendizaje por refuerzo [ABC inventory classification using reinforcement learning models] (Universidad EAFIT, 2025) Arrieta Salgado, Karolina; Almonacid Hurtado, Paula María

Predicción de ventas para una empresa de Hardware Business-to-Business [Sales prediction for a Business-to-Business hardware company] (Universidad EAFIT, 2025) Sánchez Cárdenas, Hernán Felipe; Almonacid Hurtado, Paula María

Análisis comparativo de modelos predictivos para la estimación de PM2.5 : un enfoque basado en aprendizaje automático y predicción conformal [Comparative analysis of predictive models for PM2.5 estimation: an approach based on machine learning and conformal prediction] (Universidad EAFIT, 2024) Camelo Valera, Matías; Martínez Vargas, Juan David; Sepúlveda Cano, Lina María
Fine particulate matter (PM2.5) pollution poses a significant environmental and public health challenge, requiring accurate predictive models for its monitoring and control. This study compares different machine learning approaches, including Linear Regression, Random Forest, and XGBoost, with and without the inclusion of mobility variables, to estimate PM2.5 levels. Additionally, inductive conformal prediction is implemented to quantify uncertainty in the estimates and provide confidence intervals with α = 0.05. The results show that while XGBoost experiences performance deterioration during training when mobility variables are included, it achieves the best validation performance, with the lowest mean absolute error and the highest coefficient of determination. Conformal prediction enabled the establishment of confidence intervals with 89.26% coverage, close to the expected 95%, ensuring model reliability across different spatial and temporal scenarios. In conclusion, the use of machine learning models combined with advanced validation and calibration techniques, such as conformal prediction, enhances the accuracy and reliability of PM2.5 estimation. However, the quality of input variables, particularly mobility-related data, remains a challenge, highlighting the need to incorporate meteorological information and improve data resolution. These findings contribute to the development of more reliable predictive tools for environmental management and air quality policy decision-making.

Detección automática de acordes empleando técnicas de caracterización de audio y machine learning [Automatic chord detection using audio characterization techniques and machine learning] (Universidad EAFIT, 2025) Gil Urrego, Rafael Alejandro; Martínez Vargas, Juan David; Sepúlveda Cano, Lina María
Automatic chord detection in audio tracks is essential for developing various musical applications, such as music transcription and score generation.
For this reason, there has been a growing interest in the field of data science to explore different strategies to address this need. The main approach studied in recent years is based on extracting features from audio files that contain chord information. Transforming the audio signal using different frequency analysis tools has generated data with a greater ability to describe the musical components present in the processed audio track. The Mel spectrogram and the Chromagram are some of the methods used for these tasks. Additionally, classical supervised analytical models such as Support Vector Machines (SVM), Random Forest, and Convolutional Neural Networks (CNN) have been employed in several studies. These models have demonstrated a high level of accuracy in chord identification. However, in most cases, they have been limited by the number of chord classes to estimate, as an increase in the number of classes can confuse the system, typically allowing a maximum of 24. In this thesis, a system for automatic chord identification was developed by implementing different classical and modern analytical models. For audio feature extraction, the pre-trained models HuBERT and VGGish were used. These extracted features were then fed into three classical models—SVM, Random Forest, and Gradient Boosting—to compare their results with those obtained by a modern model. The HuBERT architecture was chosen as the modern baseline model since it can function both as a feature extractor and a classifier. The experiments were conducted using recordings of 48 different chord classes, all played on a digital piano, providing a solid dataset for training and evaluating the proposed system’s performance. The study confirmed previous research findings: to obtain accurate chord class estimations, it is crucial to improve the characterization techniques of the input audio recordings. 
A recurring issue identified was the lack of a detailed description of the musical components in the recordings, which affected the models’ ability to deliver optimal results. Our findings highlight that precise feature extraction is key to reducing model generalization error, enabling better chord class identification in both classical supervised approaches and modern architectures such as HuBERT. Finally, it is concluded that modern models, including those based on Transformers, have a high dependency on the quantity and diversity of the data. To achieve effective adaptability, the training data must exhibit sufficient variations within the same class. When data lack intra-class variability, these systems struggle to adapt to new recordings, especially those with background noise or distortions.