What's Machine Learning
Machine learning is a type of artificial intelligence that provides computers with the ability to learn without being explicitly programmed. Machine learning algorithms can make predictions on data by building a model from example inputs.
A core objective of machine learning is to generalize from its experience. Generalization is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a training data set. The training examples come from some generally unknown probability distribution and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.
Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning "signal" or "feedback" available to a learning system.
- Supervised learning
The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
- Unsupervised learning
No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
- Reinforcement learning
A computer program interacts with a dynamic environment in which it must achieve a certain goal, without a teacher explicitly telling it whether it has come close to that goal.
Between supervised and unsupervised learning is semi-supervised learning, where the teacher gives an incomplete training signal: a training set with some (often many) of the target outputs missing.
Features
A feature is an individual measurable property of a phenomenon being observed. Features are also called explanatory variables, independent variables, predictors, regressors, etc. Any attribute could be a feature, but choosing informative, discriminating and independent features is a crucial step for effective algorithms in machine learning. Features are usually numeric and a set of numeric features can be conveniently described by a feature vector. Structural features such as strings, sequences and graphs are also used in areas such as natural language processing, computational biology, etc.
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. It requires experimenting with multiple possibilities and combining automated techniques with the intuition and knowledge of the domain expert.
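As a minimal sketch of what this looks like in practice, the snippet below derives new features from hypothetical raw transaction data; the column names and the choice of derived features are illustrative, not from any particular dataset.

```python
import pandas as pd

# Hypothetical raw transaction records; column names are illustrative.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:30", "2024-01-06 18:45"]),
    "price": [19.99, 5.49],
    "quantity": [2, 10],
})

# Domain knowledge: purchase timing and total order value often matter
# more than the raw columns, so derive them as explicit features.
df["hour"] = df["timestamp"].dt.hour               # time-of-day effect
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
df["order_value"] = df["price"] * df["quantity"]   # interaction feature
print(df)
```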
The initial set of raw features can be redundant and too large to be managed. Therefore, a preliminary step in many applications consists of selecting a subset of features, or constructing a new and reduced set of features to facilitate learning, and to improve generalization and interpretability.
Supervised Learning
In supervised learning, each example is a pair consisting of an input object (typically a feature vector) and a desired output value (also called the response variable or dependent variable). Supervised learning algorithms try to learn a function (often called the hypothesis) from the input object to the output value. By analyzing the training data, the algorithm produces an inferred function (referred to as a model), which can be used to map new examples.
Supervised learning problems are often solved by optimizing a loss function that represents the price paid for inaccurate predictions. The risk associated with a hypothesis is then defined as the expectation of the loss function. In general, the risk cannot be computed because the underlying distribution is unknown. However, we can compute an approximation, called the empirical risk, by averaging the loss function over the training set.
The empirical risk minimization (ERM) principle states that the learning algorithm should choose a hypothesis that minimizes the empirical risk.
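The sketch below makes this concrete for squared loss and a deliberately tiny hypothesis class (lines through the origin); the data and the grid search over candidate hypotheses are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))           # inputs
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 100)     # noisy targets, true slope 3.0

def empirical_risk(w, X, y):
    """Average squared loss of the linear hypothesis h_w(x) = w * x."""
    predictions = w * X[:, 0]
    return np.mean((predictions - y) ** 2)

# ERM over the hypothesis class {h_w(x) = w * x}: choose the w with
# the smallest average loss on the training set.
candidates = np.linspace(-5, 5, 201)
risks = [empirical_risk(w, X, y) for w in candidates]
w_best = candidates[int(np.argmin(risks))]
print(w_best)   # should be close to the true slope 3.0
```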
Batch learning algorithms generate the model by learning on the entire training data set at once. In contrast, online learning methods update the model with new data in sequential order. Online learning is a common technique for big data, where it is computationally infeasible to train over the entire dataset. It is also used when the data itself is generated over time.
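As a sketch of the online setting, the loop below feeds mini-batches from a simulated stream into scikit-learn's SGDRegressor via partial_fit; the stream and its true coefficients are synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.01)

# Simulate a data stream: the full dataset never sits in memory at once.
for _ in range(200):
    X_batch = rng.normal(size=(32, 3))
    y_batch = X_batch @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 32)
    model.partial_fit(X_batch, y_batch)   # sequential model update

print(model.coef_)   # drifts toward the true coefficients [1.0, -2.0, 0.5]
```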
If the response variable takes categorical values, the supervised learning problem is called classification; if it takes real values, it is referred to as regression.
Overfitting
When a model describes random error or noise instead of the underlying relationship, it is called overfitting. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. An overfit model will generally have poor generalization performance, as it can exaggerate minor fluctuations in the data.
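A classic way to see this is to fit polynomials of increasing degree to a few noisy points; the data and degrees below are illustrative. The high-degree fit chases the noise, so its training error is tiny while its error on fresh points explodes.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 15)   # noisy samples
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                              # true curve

for degree in (3, 12):   # modest vs. excessive model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```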
Model Validation
To assess whether a model is overfit and whether it can generalize to an independent data set, out-of-sample evaluation is generally employed. If the model has been estimated over some, but not all, of the available data, then the model with the estimated parameters can be used to predict the held-back data.
A popular model validation technique is cross-validation. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
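The snippet below runs 5-fold cross-validation with scikit-learn; the choice of dataset and classifier is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Five rounds of cross-validation, each holding out a different fifth
# of the data as the testing set; results are averaged over the rounds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```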
Regularization
Regularization refers to a process of introducing additional information in order to prevent overfitting (or to solve an ill-posed problem). In general, a regularization term, typically a penalty on the complexity of the hypothesis, is added to the loss function, with a parameter controlling the importance of the regularization term. For example, the regularization term may impose smoothness restrictions or bounds on the vector space norm.
Regularization can be used to learn simpler models, induce models to be sparse, introduce group structure into the learning problem, and more.
A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
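To illustrate the shrinkage and sparsity effects mentioned above, the sketch below compares an unregularized linear fit against L2 (ridge) and L1 (lasso) penalties on synthetic data where only two of ten features matter; the penalty strengths are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
# Only features 0 and 1 carry signal; the other eight are pure noise.
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.5, 50)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=0.1))]:
    coef = model.fit(X, y).coef_
    print(f"{name:11s} {np.round(coef, 2)}")
# Ridge shrinks all coefficients toward zero; Lasso drives the
# irrelevant ones to exactly zero, yielding a sparse model.
```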
Unsupervised Learning
Unsupervised learning tries to infer a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.
Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data.
Clustering
Cluster analysis or clustering is the task of grouping a set of objects such that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
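A minimal example with k-means, one of the most common clustering algorithms; the data and the number of clusters are synthetic and illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic groups of points around different centers.
X = np.vstack([rng.normal(center, 0.5, size=(50, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # close to the three true centers
print(kmeans.labels_[:10])       # cluster assignment of each point
```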
Latent Variable Models
In statistics, latent variables are variables that are not directly observed but are rather inferred from other observed variables. Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.
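A Gaussian mixture model is a simple latent variable model: each observation is assumed to come from one of several hidden components, and the component membership is inferred rather than observed. The sketch below uses synthetic one-dimensional data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Observed data secretly generated by two hidden (latent) components.
X = np.vstack([rng.normal(-2, 0.5, size=(100, 1)),
               rng.normal(3, 1.0, size=(100, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())        # recovered component means, near -2 and 3
print(gmm.predict_proba(X[:3]))  # posterior over the latent component
```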
Association Rules
Association rule mining identifies strong and interesting relations between variables in large databases. Introduced by Rakesh Agrawal et al., a typical use case is to discover regularities between products in large-scale transaction data recorded by point-of-sale systems in supermarkets. For example, the rule {onions, potatoes} => {burger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities (e.g., promotional pricing or product placements).
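Two standard measures of rule strength are support (how often the items occur together) and confidence (how often the consequent follows the antecedent). The toy transactions below are illustrative.

```python
# Toy point-of-sale transactions.
transactions = [
    {"onions", "potatoes", "burger meat"},
    {"onions", "potatoes", "burger meat", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "beer"},
]

antecedent = {"onions", "potatoes"}
consequent = {"burger meat"}

n_antecedent = sum(antecedent <= t for t in transactions)
n_both = sum((antecedent | consequent) <= t for t in transactions)

support = n_both / len(transactions)   # rule applies in 2 of 5 baskets
confidence = n_both / n_antecedent     # consequent follows 2 of 3 times
print(f"support={support:.2f}, confidence={confidence:.2f}")
```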
Semi-supervised Learning
The acquisition of labeled data for a learning problem is usually labor-intensive, time-consuming, and costly. On the other hand, the acquisition of unlabeled data is relatively inexpensive. Researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in model accuracy. Semi-supervised learning is a class of supervised learning tasks and techniques that make use of both a large amount of unlabeled data and a small amount of labeled data.
In order to make any use of unlabeled data, some relationship to the underlying distribution of the data must exist. Semi-supervised learning algorithms make use of at least one of the following assumptions (a minimal code sketch follows the list):
- Continuity assumption
Points that are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. In the case of semi-supervised learning, the smoothness assumption additionally yields a preference for decision boundaries in low-density regions, so that few points are close to each other yet belong to different classes.
- Cluster assumption
The data tend to form discrete clusters, and points in the same cluster are more likely to share a label (although data that shares a label may spread across multiple clusters). This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
- Manifold assumption
The data lie approximately on a manifold of much lower dimension than the input space. In this case learning the manifold using both the labeled and unlabeled data can avoid the curse of dimensionality. Then learning can proceed using distances and densities defined on the manifold. The manifold assumption is practical when high-dimensional data are generated by some process that may be hard to model directly, but which has only a few degrees of freedom.
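A common semi-supervised approach is self-training: fit a classifier on the few labeled points, pseudo-label the most confidently predicted unlabeled points, and repeat. A minimal sketch with scikit-learn, on synthetic data with only 10 labels out of 300:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_moons(n_samples=300, noise=0.1, random_state=0)

# Keep labels for only 10 points; mark the rest unlabeled with -1.
rng = np.random.default_rng(0)
y_partial = np.full_like(y, -1)
labeled = rng.choice(len(y), size=10, replace=False)
y_partial[labeled] = y[labeled]

model = SelfTrainingClassifier(KNeighborsClassifier())
model.fit(X, y_partial)          # labeled + pseudo-labeled training
print(model.score(X, y))         # accuracy against the true labels
```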
Self-Supervised Learning
A self-supervised learning model is trained on a task that uses the data itself to generate supervisory signals, rather than relying on external labels provided by humans. Like supervised learning, the goal of self-supervised learning is to produce a target output from the input, yet it does not require the explicit use of labeled input-output pairs. Instead, correlations, metadata embedded in the data, or domain knowledge present in the input are implicitly and autonomously extracted from the data for training. For example, Transformer-based masked language models essentially learn to "fill in the blanks."
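The pretext task can be as simple as hiding a token and predicting it from context. The sketch below manufactures (masked input, target) pairs from plain unlabeled text; the corpus, window size, and [MASK] token are illustrative.

```python
import random

random.seed(0)
corpus = "the cat sat on the mat and the dog slept on the rug".split()

# Derive supervised (input, target) pairs from unlabeled text alone:
# hide one token and ask a model to predict it from its neighbors.
window = 2
examples = []
for i in range(window, len(corpus) - window):
    context = corpus[i - window:i] + ["[MASK]"] + corpus[i + 1:i + 1 + window]
    examples.append((" ".join(context), corpus[i]))

for masked, target in random.sample(examples, 3):
    print(f"{masked!r} -> {target!r}")
```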
Generative AI
Generative AI (GenAI) can produce a wide variety of highly realistic and complex content, such as images, videos, audio, text, and 3D models, by learning patterns from training data. The Transformer is the state-of-the-art GenAI model architecture in natural language generation. It is based on the multi-head attention mechanism. Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head softmax-based attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. GPTs (generative pre-trained transformers) are based on the decoder-only Transformer architecture. Each generation of GPT models is significantly more capable than the previous one, due to increased model size (number of trainable parameters) and larger training data sets.
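At the heart of the mechanism just described is scaled dot-product attention. The sketch below implements a single head in NumPy (a real Transformer runs many such heads in parallel per layer); the dimensions and random weights are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each token mixes the others' values,
    weighted by query-key similarity."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq, seq) attention map
    return weights @ V                          # contextualized tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))      # 5 token embeddings of dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(tokens @ Wq, tokens @ Wk, tokens @ Wv)
print(out.shape)   # (5, 8): one contextualized vector per token
```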
Stable Diffusion is a text-to-image deep learning model based on latent diffusion techniques. Diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images, which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion to recover a clean latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space.
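The forward diffusion process has a convenient closed form: the noisy latent at step t can be sampled directly from the clean input. A sketch in NumPy, with an illustrative linear noise schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(4, 4))      # stand-in for a latent image

# Linear variance schedule; the beta values here are illustrative.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal retention

def forward_diffusion(x0, t):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

print(alpha_bar[10], alpha_bar[999])   # early: mostly signal; late: ~all noise
```

A denoiser such as the U-Net is then trained to predict the injected noise at each step, so that sampling can run the process in reverse.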
A generative adversarial network (GAN) is another framework for approaching generative AI. In a GAN, two neural networks contest with each other in a game. The generative network generates candidates while the discriminative network evaluates them. The contest operates in terms of data distributions. Typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution. The generative network's training objective is to increase the error rate of the discriminative network.
A known dataset serves as the initial training data for the discriminator. Typically, the generator is seeded with randomized input that is sampled from a predefined latent space. Thereafter, candidates synthesized by the generator are evaluated by the discriminator. Backpropagation is applied in both networks so that the generator produces better images, while the discriminator becomes more skilled at flagging synthetic images.
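A minimal GAN sketch in PyTorch, learning a one-dimensional Gaussian; all architectures and hyperparameters below are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 4.0   # true data: N(4, 0.5)
    fake = G(torch.randn(64, 2))            # candidates from random latents

    # Discriminator: label real samples 1 and generated samples 0.
    d_loss = (loss_fn(D(real), torch.ones(64, 1)) +
              loss_fn(D(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to make the discriminator label fakes as real.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

samples = G(torch.randn(1000, 2))
print(samples.mean().item(), samples.std().item())  # drift toward 4.0 and 0.5
```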
Reinforcement Learning
Reinforcement learning is about a learning agent interacting with its environment to achieve a goal. The learning agent has to map situations to actions to maximize a numerical reward signal. Different from supervised learning, the learner is not told which actions to take but instead must discover which actions yield the most reward by trying them. Moreover, actions may affect not only the immediate reward but also all subsequent rewards. Trial-and-error search and delayed reward are the most important features of reinforcement learning.
Markov decision processes (MDPs) provide a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. In contrast, deep reinforcement learning uses a deep neural network without explicitly designing the state space.
The major challenge in reinforcement learning is the tradeoff between exploration and exploitation. Reinforcement learning focuses on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). To obtain a lot of reward, an agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future. Reinforcement learning requires clever exploration mechanisms. Randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite MDPs is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical.
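The simplest such mechanism is epsilon-greedy: act greedily with respect to the current value estimates most of the time, but pick a random action with small probability epsilon. The sketch below uses it inside tabular Q-learning on a toy corridor MDP (the states, rewards, and hyperparameters are illustrative); it also shows the delayed reward and value estimation discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2            # corridor: action 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # value estimates, refined over time
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(500):
    state = 0
    while state != n_states - 1:      # rightmost state is terminal
        # Epsilon-greedy: mostly exploit current value estimates,
        # but explore a random action with probability epsilon.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = (min(state + 1, n_states - 1) if action == 1
                      else max(state - 1, 0))
        reward = 1.0 if next_state == n_states - 1 else 0.0   # delayed reward
        # Q-learning: nudge the estimate toward reward + discounted value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                     - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))   # learned policy: prefers "right" everywhere
```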
There are four main components in a reinforcement learning system: a policy, a reward signal, a value function, and optionally a model of the environment.
- Policy
A policy defines the learning agent's way of behaving at a given time.
- Reward signal
A reward signal defines the goal of a reinforcement learning problem. The agent's objective is to maximize the total reward it receives over the long run. The reward signal is the primary basis for altering the policy; if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future.
- Value function
While the reward is an immediate signal, a value function specifies what is good in the long run. The value of a state may be regarded as the total amount of reward an agent can expect to accumulate over the future, starting from that state. Action choices are made based on value judgments: we seek actions that bring about states of highest value, not highest reward. Unfortunately, it is much harder to determine values than rewards. Rewards are given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime.
- Model of the environment
Optionally, some reinforcement learning systems have a model of the environment, which allows inferences to be made about how the environment will behave. For example, given a state and an action, the model might predict the resulting next state and next reward, which can be used for planning.