Binning application in low-dimensional metagenomic sequences: performance of Barnes-Hut t-Stochastic Neighbor Embeddings, assessment of internal cluster validity indices
dc.contributor.advisor | Pinel Peláez, Nicolás | spa |
dc.contributor.advisor | Ariza Jiménez, Leandro Fabio | spa |
dc.contributor.author | Ceballos Cano, Julián | |
dc.contributor.author | Quintero Montoya, Olga Lucía | |
dc.coverage.spatial | Medellín: Lat: 06 15 00 N (degrees minutes), 6.2500 (decimal degrees); Long: 075 36 00 W (degrees minutes), -75.6000 (decimal degrees) | eng |
dc.creator.degree | Magíster en Matemáticas Aplicadas | spa |
dc.creator.email | Jceballosc@eafit.edu.co | spa |
dc.creator.email | oquintero1@eafit.edu.co | spa |
dc.date.accessioned | 2019-09-25T15:11:23Z | |
dc.date.available | 2019-09-25T15:11:23Z | |
dc.date.issued | 2019 | |
dc.description.abstract | Metagenomic studies aim to reconstruct the structure of microbial communities from DNA sequence data of complex composition. To this end, they generally embed multidimensional data into low-dimensional spaces and then apply a binning process. The performance of the dimensionality reduction techniques, the clustering methods, and the internal cluster validity indices varies with the biological, statistical, and computational features of the metagenomic analysis, yet it is seldom evaluated systematically. This problem was explored through an unsupervised binning of metagenomic DNA sequences, based on the Subtractive and Fuzzy c-means algorithms applied to the two- and three-dimensional representations of metagenomic sequences obtained via the Barnes-Hut t-Stochastic Neighbor Embedding (BH-SNE) algorithm in conjunction with Principal Component Analysis (PCA). The aim was to assess the performance of BH-SNE with and without a preliminary PCA, and to assess four Internal Cluster Validity Indices (ICVIs) that conditioned the clustering procedure. The assessment of the ICVIs showed that the Silhouette index performed best, based on the median values of the F-measure. The Silhouette index was also the most consistent index in obtaining the highest median F values in both two- and three-dimensional treatments. For high AAI ranges, the Silhouette index matched the Calinski-Harabasz index in terms of the highest median F values in the three-dimensional treatment, although their performance differed in the two-dimensional treatments. In particular, the Dunn index performed worst at low AAI percentages, while the Davies-Bouldin index performed worst at high AAI percentages. Additionally, the Dunn and Davies-Bouldin indices were the most consistent in producing the lowest median F values.
Moreover, the results of this research suggest that the biology of the metagenomic sequences may influence which ICVIs perform best. Finally, the highest median F values were obtained by the ICVIs in 3D embeddings, with equal results for BH-SNE with and without a preliminary PCA. Furthermore, there was no significant difference between the results obtained with and without a preliminary PCA. In terms of consistency, it was not possible to determine which treatment (2D or 3D embedding with BH-SNE, with or without a preliminary PCA) most consistently led the ICVIs to the best and worst median F results. | eng |
dc.identifier.ddc | 660.6 C387 | |
dc.identifier.uri | http://hdl.handle.net/10784/13874 | |
dc.language.iso | spa | spa |
dc.publisher | Universidad EAFIT | spa |
dc.publisher.department | Escuela de Ciencias. Departamento de Ciencias Básicas | spa |
dc.publisher.place | Medellín | spa |
dc.publisher.program | Maestría en Matemáticas Aplicadas | spa |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | eng |
dc.rights.local | Acceso abierto | spa |
dc.subject | Metagenómica | spa |
dc.subject.keyword | Metagenomics | eng |
dc.subject.keyword | Clustering | eng |
dc.subject.keyword | Cluster validity index | eng |
dc.subject.keyword | Fuzzy clustering | eng |
dc.subject.keyword | Embedding | eng |
dc.subject.keyword | BH-SNE | spa |
dc.subject.lemb | BIOTECNOLOGÍA | spa |
dc.subject.lemb | ADN - GENÉTICA | spa |
dc.subject.lemb | BIOLOGÍA MOLECULAR - TÉCNICAS | spa |
dc.title | Binning application in low-dimensional metagenomic sequences: performance of Barnes-Hut t-Stochastic Neighbor Embeddings, assessment of internal cluster validity indices | eng |
dc.type | masterThesis | eng |
dc.type | info:eu-repo/semantics/masterThesis | eng |
dc.type.hasVersion | acceptedVersion | eng |
dc.type.local | Tesis de Maestría | spa |
Files
Original bundle
- Name:
- Julian_CeballosCano_Olga_Quintero_2019.pdf
- Size:
- 575.54 KB
- Format:
- Adobe Portable Document Format
- Description:
- Degree thesis
License bundle
- Name:
- license.txt
- Size:
- 1.71 KB
- Format:
- Item-specific license agreed upon to submission
- Description: