CON-05: Computational methods for large databases

Computational power and approaches based on bioinformatics tools and algorithms, machine learning (ML), or artificial intelligence (AI) are increasingly being adopted in health care systems. As a result, the possibilities for evidence generation are growing as well.

ML techniques can be categorized into supervised and unsupervised learning approaches.

Machine Learning Main Techniques:

Supervised Techniques
- Classification
- Regression
Unsupervised Techniques
- Clustering
- Association Rules

Supervised and Unsupervised Learning Algorithms

The main difference between the two approaches is whether the outcome of interest is available. In supervised learning, the outcome is given to the algorithm, which uses it as a gold standard for learning how to convert the input features into the outcome. In unsupervised learning, the algorithm must learn to model the underlying distribution of the data from the input features alone, with no outcome variable. Supervised methods can be costly and resource-intensive, as they may require human expert input to define and prepare the gold standard (i.e. the output label). In contrast, unsupervised methods rely purely on the quantity and quality of the data used for training. They avoid the manual cost and effort of developing a gold standard, but the absence of one can lead to weaker performance.

For supervised approaches, the major categories are Classification models (which predict a discrete or categorical output variable) and Regression models (which predict a continuous numerical output variable). Some classification models are:

(1) Simple logistic – uses a logistic function to predict the outcome variable;

(2) Support vector machines – identify an optimal hyperplane (a subspace whose dimension is one less than that of its ambient space) capable of separating the data by outcome;

(3) Decision trees – predict the value of an outcome by learning decision rules inferred from the training dataset.

Examples of Regression algorithms include simple linear regression and random forest regression.
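To make the contrast between the two output types concrete, the following is a minimal sketch in plain Python. It pairs a logistic function used for classification (a discrete label) with an ordinary least-squares line fit used for regression (a continuous value). The function names, coefficients and toy data are assumptions chosen purely for illustration and do not come from the course material.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, weight, bias, threshold=0.5):
    """Classification: turn a probability into a discrete label (0 or 1)."""
    probability = logistic(weight * x + bias)
    return 1 if probability >= threshold else 0

def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical coefficients (not learned from data), for illustration only.
label = classify(x=3.0, weight=2.0, bias=-4.0)  # score 2.0, probability about 0.88

# Toy data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
predicted = slope * 4.0 + intercept  # continuous prediction for x = 4
```

In practice the coefficients of the logistic model would be estimated from labelled training data rather than fixed by hand.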

Overall, the process in supervised ML encompasses the following steps:

(1) During training, the model is given both the features and the labels and learns how to map the former to the latter.
(2) The trained model is evaluated on a testing set, where it is given only the features and makes predictions.
(3) The predictions are then compared with the known labels of the testing set to calculate accuracy.
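The steps above can be sketched in a few lines of plain Python. This is an illustrative example only: the synthetic data, the 80/20 split and the nearest-centroid classifier are assumptions chosen to keep the code short, and they stand in for whatever model and dataset a real study would use.

```python
import random

def make_data(n_per_class, rng):
    """Generate toy 2-D points: class 0 near (0, 0), class 1 near (10, 10)."""
    data = []
    for label, centre in ((0, 0.0), (1, 10.0)):
        for _ in range(n_per_class):
            features = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
            data.append((features, label))
    return data

def train(examples):
    """Training: learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for (x1, x2), label in examples:
        s1, s2 = sums.get(label, (0.0, 0.0))
        sums[label] = (s1 + x1, s2 + x2)
        counts[label] = counts.get(label, 0) + 1
    return {label: (s1 / counts[label], s2 / counts[label])
            for label, (s1, s2) in sums.items()}

def predict(model, features):
    """Prediction: return the label whose centroid is closest."""
    x1, x2 = features
    return min(model, key=lambda label: (x1 - model[label][0]) ** 2
                                        + (x2 - model[label][1]) ** 2)

rng = random.Random(0)          # fixed seed so the run is reproducible
data = make_data(50, rng)
rng.shuffle(data)

# Hold out 20% of the labelled data as the testing set.
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]

# Step 1: the model sees both features and labels.
model = train(train_set)

# Step 2: the model sees only the features of the testing set.
predictions = [predict(model, features) for features, _ in test_set]

# Step 3: compare predictions with the known labels to compute accuracy.
true_labels = [label for _, label in test_set]
accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(test_set)
```

Keeping the testing set separate from the training data is what lets the accuracy reflect how the model behaves on observations it has not seen.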

K-means clustering and hierarchical clustering are the most widely known unsupervised learning algorithms. K-means partitions the observations into a chosen number of clusters (k), with each observation assigned to the cluster with the nearest mean. Hierarchical clustering instead builds a hierarchy of clusters, which can be agglomerative (each observation starts as its own cluster, and pairs of clusters are merged moving up the hierarchy) or divisive (all observations start in a single cluster, which is split recursively moving down the hierarchy). Please see example 6.
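As a rough illustration of the k-means idea, the sketch below alternates between assigning each observation to its nearest cluster mean and recomputing the means, stopping once the assignments no longer change. It is a simplified plain-Python version written under stated assumptions (random initial centroids drawn from the data, squared Euclidean distance, hypothetical toy points) and is not a reference implementation.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) for a basic k-means run."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # assumption: start from k data points
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break                        # converged: nothing changed
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignments

# Hypothetical toy data: two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
```

Note that no outcome labels are passed to `k_means`: the grouping emerges from the input features alone, which is the defining property of an unsupervised method.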

Last modified: Friday, 30 September 2022, 11:58 AM