What's Machine Learning
Machine learning is a type of artificial intelligence that gives computers the ability to learn from data without being explicitly programmed. A machine learning algorithm builds a model from example inputs so that it can make predictions or decisions on new, unseen data.
A core objective of machine learning is generalization — the ability of a learned model to perform accurately on new examples after training on a finite sample. The training examples are drawn from some generally unknown probability distribution; the learner must build a model that produces accurate predictions even for cases it has never seen before.
Learning Paradigms
Machine learning tasks are classified into broad categories depending on the nature of the feedback available to the learning system.
| Paradigm | Feedback | Typical Problems | SMILE Pages |
|---|---|---|---|
| Supervised | Labelled input–output pairs | Classification, Regression, Sequence labelling | Classification, Regression, Deep Learning |
| Unsupervised | No labels — find structure | Clustering, Dimensionality reduction, Density estimation, Association rules | Clustering, Manifold Learning, Association Rules |
| Semi-supervised | A few labels + many unlabelled samples | Label propagation, Self-training, Generative models | Classification |
| Self-supervised | Pseudo-labels derived from data itself | Language modelling, Contrastive learning, Masked autoencoders | LLM, Deep Learning |
| Reinforcement | Scalar reward from environment | Game playing, Robotics, Control | — |
Features & Feature Engineering
A feature (also called explanatory variable, predictor, or covariate) is an individual measurable property of the phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective machine learning. Features are usually numeric; a set of numeric features is conveniently described by a feature vector. Structural features such as strings, sequences, and graphs are also used in NLP and computational biology.
Feature engineering is the process of using domain knowledge to transform raw data into features that make algorithms work well. It encompasses:
- Encoding — converting categorical variables to numbers (one-hot, ordinal, target encoding)
- Scaling — normalizing or standardizing numeric ranges
- Imputation — filling in missing values
- Selection — removing irrelevant or redundant features (SNR, genetic algorithm, TreeSHAP)
- Construction — deriving new features from existing ones (polynomial, interaction terms, embeddings)
- Dimensionality reduction — compressing high-dimensional inputs (PCA, ICA, random projection)
See Feature Engineering for SMILE's full API.
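To make the encoding and scaling steps concrete, here is a minimal plain-Python sketch of one-hot encoding and standardization (illustrative only; SMILE's actual feature engineering API is Java-based):

```python
# Illustrative one-hot encoding and standardization in plain Python
# (conceptual sketch, not the SMILE API).

def one_hot(values):
    """Map each categorical value to a binary indicator vector."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

def standardize(xs):
    """Scale a numeric column to zero mean and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = var ** 0.5
    return [(x - mean) / std for x in xs]

colors = ["red", "green", "red", "blue"]
print(one_hot(colors))               # indicator vectors over {blue, green, red}
print(standardize([1.0, 2.0, 3.0]))  # zero mean, unit variance
```

Note that the encoder learns its category set from the data it sees; in practice the same fitted encoder must be reused on test data so that unseen categories are handled consistently.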
Supervised Learning
In supervised learning each example is a pair: an input object (typically a feature vector) and a desired output (the label or response variable). The algorithm learns a function from inputs to outputs by analysing a labelled training set, then uses that function to predict outputs for new inputs.
Learning is typically framed as empirical risk minimisation: choose the hypothesis that minimises the average loss on the training set.
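Empirical risk minimisation can be made concrete with a toy hypothesis class of threshold classifiers (an illustrative sketch, not the SMILE API):

```python
# Empirical risk minimisation over a tiny hypothesis class (illustrative).
# Each hypothesis is a threshold classifier: predict 1 if x >= t, else 0.

def empirical_risk(threshold, data):
    """Average 0-1 loss of the threshold classifier on the sample."""
    errors = sum(1 for x, y in data if (1 if x >= threshold else 0) != y)
    return errors / len(data)

data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]

# ERM: pick the hypothesis with the lowest average training loss.
best = min(candidates, key=lambda t: empirical_risk(t, data))
print(best, empirical_risk(best, data))  # 0.5 0.0
```

Real learners search vastly larger hypothesis spaces with gradient-based or combinatorial optimisation, but the objective is the same: minimise average loss on the training sample.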
| Task | Output type | Common algorithms in SMILE |
|---|---|---|
| Classification | Discrete class label | Random Forest, Gradient Boosted Trees, SVM, KNN, Logistic Regression, Naïve Bayes, LDA/QDA/RDA, AdaBoost, Neural Networks, RBF Networks |
| Regression | Continuous real value | GBDT, Random Forest, SVR, Gaussian Process, OLS, Ridge, LASSO, ElasticNet, RBF Networks |
| Sequence Labelling | Sequence of labels | Hidden Markov Model (Viterbi), Conditional Random Field |
See Classification and Regression for detailed API guides.
Overfitting & the Bias–Variance Trade-off
A model that captures random noise rather than the true underlying pattern is said to overfit. An overfit model has low training error but high test error. Conversely, a model that is too simple underfits: it has high bias and cannot capture the signal.
The bias–variance decomposition breaks generalisation error into:
- Bias — error from wrong assumptions in the model family.
- Variance — sensitivity to fluctuations in the training set.
- Irreducible noise — inherent noise in the data that no model can remove.
Ensemble methods such as Random Forest reduce variance via bagging; boosting methods such as Gradient Boosted Trees reduce bias iteratively.
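The variance-reduction effect of averaging, which underlies bagging, can be simulated directly (an illustrative sketch with a synthetic noisy learner, not a SMILE example):

```python
# Variance reduction by averaging, the idea behind bagging (illustrative).
import random

random.seed(42)

def noisy_predict():
    """A high-variance base learner: true value 1.0 plus Gaussian noise."""
    return 1.0 + random.gauss(0, 1)

def variance(samples):
    m = sum(samples) / len(samples)
    return sum((s - m) ** 2 for s in samples) / len(samples)

# A single learner vs. an ensemble averaging 25 independent learners.
single = [noisy_predict() for _ in range(2000)]
ensemble = [sum(noisy_predict() for _ in range(25)) / 25 for _ in range(2000)]
print(variance(single), variance(ensemble))  # ensemble variance is ~25x lower
```

Averaging m independent predictors divides the variance by m; bootstrapped trees are only partially independent, so Random Forest achieves a smaller, but still substantial, reduction.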
Model Validation
To estimate how well a model generalises, it must be evaluated on data it has not seen during training. SMILE provides:
- Hold-out — reserve a fixed percentage of data for testing.
- k-fold cross-validation — partition data into k equal folds; each fold serves as the test set exactly once.
- Leave-one-out (LOO) — k-fold where k equals the dataset size.
- Bootstrap — resample with replacement; use out-of-bag samples for testing.
See Model Validation for the full API, including confusion matrices, AUC, F1 score, RMSE, and more.
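The fold bookkeeping behind k-fold cross-validation can be sketched in a few lines (conceptual only; SMILE's validation API is Java-based and typically shuffles before splitting, whereas this sketch assigns indices round-robin):

```python
# A minimal k-fold split (conceptual sketch, not the SMILE API).

def kfold(n, k):
    """Yield (train_indices, test_indices) for k folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every sample appears in exactly one test fold.
for train, test in kfold(6, 3):
    print(train, test)
```

Leave-one-out is simply `kfold(n, n)`: each fold contains a single sample.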
Regularization
Regularization introduces additional constraints or penalties to prevent overfitting. Common forms:
- L2 / Ridge — penalises the squared norm of weights; shrinks all weights smoothly towards zero.
- L1 / LASSO — penalises the absolute norm; produces sparse solutions by driving irrelevant weights to zero.
- Elastic Net — convex combination of L1 and L2; balances sparsity and coefficient stability.
- Dropout — randomly deactivates neurons during neural network training, acting as implicit ensemble averaging.
- Early stopping — halts training when validation error stops improving.
From a Bayesian perspective, L2 regularization corresponds to a Gaussian prior on weights and L1 to a Laplace prior.
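Why L1 yields sparsity while L2 only shrinks can be seen from the closed-form penalised solution for a single weight (a standard textbook result, sketched here in plain Python):

```python
# One-weight penalised solutions (illustrative).
# Minimising (w - w0)^2 / 2 + lam * |w|      gives soft-thresholding (L1).
# Minimising (w - w0)^2 / 2 + lam * w^2 / 2  gives smooth shrinkage (L2).

def l1_prox(w0, lam):
    """Soft-thresholding: small weights are driven exactly to zero."""
    if w0 > lam:
        return w0 - lam
    if w0 < -lam:
        return w0 + lam
    return 0.0

def l2_shrink(w0, lam):
    """Ridge shrinkage: every weight shrinks but never reaches zero."""
    return w0 / (1 + lam)

for w0 in [0.05, 0.5, 2.0]:
    print(w0, l1_prox(w0, 0.1), l2_shrink(w0, 0.1))
```

The small weight 0.05 is zeroed out by L1 but merely shrunk by L2, which is exactly why LASSO performs implicit feature selection.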
Hyperparameter Optimisation
Beyond model parameters learned from data, most algorithms have hyperparameters set before training (e.g., number of trees, learning rate, regularization strength). SMILE supports:
- Grid search — exhaustive search over a discrete hyperparameter grid.
- Random search — sample configurations uniformly from a search space.
- Bayesian optimisation — use a Gaussian Process surrogate to select the next most promising configuration.
See Validation & HPO.
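Grid and random search differ only in how candidate configurations are enumerated, as this sketch over a hypothetical two-parameter objective shows (the `score` function, parameter names, and ranges are invented for illustration):

```python
# Grid search vs. random search over a toy 2-parameter objective (illustrative).
import itertools
import random

def score(lr, depth):
    """Hypothetical validation score, peaking at lr=0.1, depth=6."""
    return -((lr - 0.1) ** 2 * 100 + (depth - 6) ** 2 * 0.1)

# Grid search: exhaustive evaluation over a discrete grid.
grid = list(itertools.product([0.01, 0.1, 1.0], [2, 4, 6, 8]))
best_grid = max(grid, key=lambda p: score(*p))

# Random search: sample configurations from the space
# (log-uniform for the learning rate, uniform for depth).
random.seed(0)
samples = [(10 ** random.uniform(-2, 0), random.randint(2, 8))
           for _ in range(12)]
best_rand = max(samples, key=lambda p: score(*p))

print(best_grid, score(*best_grid))
print(best_rand, score(*best_rand))
```

Random search often beats grid search at equal budget when only a few hyperparameters actually matter, because it does not waste evaluations on redundant grid coordinates; Bayesian optimisation improves on both by modelling the objective.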
Unsupervised Learning
Unsupervised learning discovers hidden structure in unlabelled data. Because there is no explicit error signal to optimise, evaluating unsupervised models requires different criteria: intra-cluster cohesion, log-likelihood under a density model, reconstruction error, etc.
Clustering
Clustering groups objects so that items in the same cluster are more similar to each other than to items in other clusters. SMILE provides a wide range of algorithms:
| Category | Algorithms |
|---|---|
| Partitional | K-Means, X-Means, G-Means, Deterministic Annealing |
| Hierarchical | Agglomerative (single, complete, average, Ward linkage) |
| Density-based | DBSCAN, DENCLUE |
| Grid / scalable | BIRCH, CLARANS |
| Spectral / graph | Spectral Clustering, SIB |
| Neural / SOM | Self-Organizing Map (SOM), Neural Gas, Growing Neural Gas |
| Information-theoretic | Min-Entropy Clustering |
See Clustering.
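The assignment/update loop shared by partitional methods is easiest to see in a one-dimensional K-Means (an illustrative sketch of Lloyd's algorithm, not SMILE's implementation):

```python
# One-dimensional k-means (Lloyd's algorithm), a conceptual sketch.

def kmeans_1d(xs, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in xs:
            i = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[i].append(x)
        # Update step: move each centroid to its cluster mean.
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
print(kmeans_1d(data, [0.0, 6.0]))  # converges near [1.0, 5.0]
```

K-Means is sensitive to initialisation and assumes roughly spherical clusters; density-based methods such as DBSCAN avoid both assumptions at the cost of extra parameters.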
Dimensionality Reduction & Manifold Learning
High-dimensional data often lies near a low-dimensional manifold. Dimensionality reduction makes data easier to visualise, speeds up downstream algorithms, and can remove noise.
- Linear methods: PCA, Kernel PCA, Probabilistic PCA, GHA, Random Projection, ICA
- Non-linear / manifold: IsoMap, LLE, Laplacian Eigenmap, t-SNE, UMAP
- Multi-dimensional Scaling: Classical MDS, Sammon Mapping
See Manifold Learning and Multi-Dimensional Scaling.
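The core idea of PCA, finding the direction of maximum variance, can be sketched with power iteration on a 2x2 covariance matrix (illustrative; real PCA implementations use a full eigendecomposition or SVD):

```python
# First principal component via power iteration on the covariance matrix
# (illustrative sketch for 2-D data).

def top_component(data, iters=100):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    # Power iteration converges to the dominant eigenvector.
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread along the diagonal y ~ x: the first PC is ~(0.707, 0.707).
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
print(top_component(points))
```

Projecting each point onto this direction gives its one-dimensional PCA representation; non-linear methods like IsoMap and t-SNE generalise this to curved manifolds.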
Association Rule Mining
Association rule mining discovers interesting co-occurrence patterns among variables in large
databases. The classic application is market basket analysis: given supermarket
transaction records, find rules of the form
{onions, potatoes} ⇒ {burger meat}.
Rules are evaluated with three metrics:
- Support — fraction of transactions containing all items in the rule.
- Confidence — P(consequent | antecedent): how often the rule is correct.
- Lift — confidence divided by the baseline frequency of the consequent; lift > 1 means the antecedent is positively correlated with the consequent.
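The three metrics above are straightforward to compute directly (an illustrative sketch over a made-up transaction set, not the SMILE API):

```python
# Computing support, confidence, and lift for a candidate rule (illustrative).

transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes"},
    {"potatoes", "milk"},
    {"onions", "potatoes", "burger", "milk"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Evaluate the rule {onions, potatoes} => {burger}.
antecedent, consequent = {"onions", "potatoes"}, {"burger"}
sup = support(antecedent | consequent)
conf = sup / support(antecedent)
lift = conf / support(consequent)
print(sup, conf, lift)  # 0.4, 0.666..., 1.666...
```

A lift of about 1.67 means customers who buy onions and potatoes are roughly 1.67 times more likely to buy burgers than the average customer; mining algorithms like FP-growth make finding all such rules tractable at scale.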
SMILE implements the FP-growth algorithm, which mines frequent itemsets without candidate generation, making it efficient on large datasets. See Association Rule Mining.
Semi-supervised Learning
Labelled data is expensive to acquire; unlabelled data is cheap. Semi-supervised learning combines a small labelled set with a large unlabelled set to improve model accuracy beyond what either alone could achieve.
Assumptions that make semi-supervised learning effective:
- Continuity (smoothness) assumption — points that are close together are likely to share a label, yielding a preference for decision boundaries in low-density regions.
- Cluster assumption — data tend to form discrete clusters; points in the same cluster are likely to share a label.
- Manifold assumption — data lie near a low-dimensional manifold; learning the manifold structure from unlabelled data helps avoid the curse of dimensionality.
Self-Supervised Learning
A self-supervised model is trained on a pretext task whose labels are derived automatically from the data — no human annotation required. The pretext task forces the model to learn rich internal representations that transfer well to downstream tasks.
Examples include:
- Masked language modelling (BERT, RoBERTa) — predict randomly masked tokens.
- Next-token prediction (GPT) — predict the next token in a sequence.
- Contrastive learning (SimCLR, MoCo) — pull augmented views of the same image together and push different images apart in embedding space.
- Masked autoencoders (MAE) — reconstruct randomly masked patches of an image.
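The defining trick, deriving labels automatically from raw data, is simple to illustrate for masked-token pretraining (a conceptual sketch; real models mask random positions over sub-word tokens rather than every position over words):

```python
# Deriving pseudo-labels from raw text for masked-token pretraining
# (conceptual sketch).

def make_masked_examples(tokens, mask="[MASK]"):
    """Turn each position into a training example:
    input = the sequence with that token masked out,
    label = the original token. No human annotation needed."""
    examples = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        examples.append((masked, tok))
    return examples

sentence = ["the", "cat", "sat"]
for inp, label in make_masked_examples(sentence):
    print(inp, "->", label)
```

Because the labels come for free, pretraining can consume web-scale corpora; the resulting representations are then fine-tuned on small labelled datasets for downstream tasks.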
Generative AI
Generative AI models learn to produce new data samples — text, images, audio, video, 3D — that are statistically indistinguishable from real data. The three dominant approaches are Transformers, diffusion models, and GANs.
Transformer & Large Language Models
The Transformer is built on multi-head scaled dot-product attention. Text is tokenized into sub-word units and converted to dense vectors via an embedding table. At each layer, every token is contextualised against all other tokens in the context window via parallel attention heads, allowing important signals to be amplified. GPT-family models use a decoder-only stack trained with next-token prediction. LLaMA-3 extends this with grouped-query attention (GQA), rotary positional encoding (RoPE), SwiGLU feed-forward networks, and RMS normalisation.
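Scaled dot-product attention itself is compact enough to sketch in plain Python for a single head (illustrative; production implementations batch this over heads and tokens with tensor libraries):

```python
# Scaled dot-product attention for one head, in plain Python (illustrative).
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Each query attends over all keys; the output is a softmax-weighted
    sum of the values, with scores scaled by sqrt(key dimension)."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                      # one query
K = [[1.0, 0.0], [0.0, 1.0]]          # two keys
V = [[10.0, 0.0], [0.0, 10.0]]        # two values
print(attention(Q, K, V))  # weights favour the first key/value
```

The query aligned with the first key receives most of the attention weight; stacking many such heads in parallel, each with its own learned projections, gives multi-head attention.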
Each generation of GPT models is significantly more capable than the previous due to increased model size (number of trainable parameters) and larger training data.
SMILE ships a complete LLaMA-3 inference stack — see Large Language Models.
Diffusion Models
Diffusion models (Stable Diffusion, DALL·E 3) learn to reverse a gradual Gaussian-noise corruption process. Training: iteratively add noise to real images. Inference: start from pure noise and denoise step-by-step, guided by a text prompt via cross-attention. Key components:
- VAE encoder/decoder — compresses images to/from a compact latent space.
- U-Net / DiT denoiser — predicts the noise component at each step.
- Text encoder — conditions the denoiser on the input prompt.
Generative Adversarial Networks (GANs)
A GAN pits two networks against each other in a minimax game:
- Generator — maps random latent vectors to synthetic data samples, trying to fool the discriminator.
- Discriminator — classifies samples as real or generated, trying to detect fakes.
At convergence the generator produces samples that the discriminator cannot distinguish from real data. Known challenges include mode collapse and training instability.
Deep Learning
Deep learning uses multi-layer neural networks to learn hierarchical representations directly from raw data (pixels, waveforms, tokens). Key architectural families:
| Architecture | Typical use |
|---|---|
| MLP (fully-connected) | Tabular data, embeddings |
| CNN (convolutional) | Images, audio spectrograms |
| RNN / LSTM / GRU | Sequences, time series (largely superseded by Transformers) |
| Transformer | Text, images (ViT), multi-modal |
| GNN (graph neural network) | Molecular property prediction, social networks |
| Diffusion model | Image and audio synthesis |
| VAE (variational autoencoder) | Representation learning, generation |
SMILE's smile-deep module wraps LibTorch (PyTorch C++) with GPU acceleration,
pre-built layer primitives, and pretrained models (EfficientNet-V2 image classification,
LLaMA-3 language models). See Deep Learning.
Reinforcement Learning
A reinforcement learning (RL) agent interacts with an environment to maximise cumulative reward. Unlike supervised learning, no labelled examples are provided; the agent must discover which actions yield reward on its own. Trial-and-error search and delayed reward are the two most important distinguishing features of reinforcement learning.
There are four main components of an RL system:
- Policy (π) — the agent's strategy: maps observed states to actions.
- Reward signal (R) — scalar feedback from the environment after each action; the primary basis for changing the policy.
- Value function (V / Q) — expected cumulative reward from a state (or state–action pair); guides action selection towards long-term payoff rather than immediate reward.
- Model (optional) — the agent's internal model of environment dynamics, used for planning.
Markov Decision Processes (MDPs) provide the standard mathematical framework. Algorithms range from tabular methods (Q-Learning, SARSA) to deep RL (DQN, PPO, SAC) for large or continuous state/action spaces.
The fundamental challenge is the exploration–exploitation trade-off: the agent must exploit known good actions to accumulate reward, while also exploring new actions that might yield higher long-term payoffs. Selecting actions purely at random performs poorly, so principled exploration mechanisms (ε-greedy, UCB, Thompson sampling) are essential.
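ε-greedy exploration can be demonstrated on a two-armed bandit, the simplest RL setting (an illustrative sketch with made-up reward probabilities):

```python
# Epsilon-greedy action selection on a two-armed bandit (illustrative).
import random

random.seed(1)
true_means = [0.3, 0.7]   # arm 1 pays off more often (unknown to the agent)
Q = [0.0, 0.0]            # estimated action values
counts = [0, 0]
epsilon = 0.1

for _ in range(5000):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if random.random() < epsilon:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda i: Q[i])
    reward = 1.0 if random.random() < true_means[a] else 0.0
    counts[a] += 1
    # Incremental mean update: Q <- Q + (r - Q) / n.
    Q[a] += (reward - Q[a]) / counts[a]

print(Q, counts)  # Q approaches the true means; arm 1 dominates the pulls
```

With ε = 0 the agent can lock onto the inferior arm forever; the small exploration probability guarantees every action keeps being sampled, so the value estimates converge and exploitation concentrates on the truly better arm.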
Typical ML Workflow
Most machine learning projects follow a common lifecycle:
- Problem definition — frame the task (classification, regression, …), define success metrics, identify data sources.
- Data collection & exploration — gather raw data, compute summary statistics, visualise distributions and correlations. See Data Processing and Data Visualization.
- Data preparation — clean, encode, scale, impute, split into train/validation/test sets. See Data Processing and Missing Value Imputation.
- Feature engineering — construct informative features, select relevant ones, reduce dimensionality. See Feature Engineering.
- Model selection & training — choose candidate algorithms, train, and tune hyperparameters.
- Evaluation — measure performance on held-out data with appropriate metrics (accuracy, AUC, RMSE, …). See Model Validation.
- Deployment & monitoring — serve predictions via SMILE's Quarkus-based inference server, and track data/model drift over time.