### Glossary

#### Algorithm

An algorithm is a finite sequence of well-defined instructions that transforms an input into an output. Each step is unambiguous, and the sequence has a clear goal: to produce a result. An algorithm may be compared to a recipe, where the ingredients are the input and the finished dish is the output.

#### Chatbots

A chatbot is a software program that communicates directly with users to assist them with basic activities on the internet. A "conversation" takes place between a user and the computer program. Customer support applications are one of the most frequent uses for chatbots, but there are many other possibilities.

#### Cluster

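A cluster, in machine learning and deep learning, is a group of data points (or their representations) that are more similar to one another than to points in other groups, according to some similarity or distance measure. Clusters may contain different numbers of points, and their membership may change as the algorithm iterates or as new data arrives.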

#### Component

A component in machine learning is a reusable piece of an algorithm that encapsulates some behavior you want to invoke at different times during the algorithm's execution. One example is a learning-rate schedule, which adjusts the "learning rate" scalar throughout training to help improve results. Many types of components can be added to algorithms to change their behavior.

#### Concept Drift

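Concept drift is a change over time in the statistical relationship between a model's inputs and the target it predicts, so that a model trained on past data becomes less accurate on new data. It affects any system that learns from a stream of incoming data or feedback, including reinforcement learning and active learning systems. The problem is to detect when the underlying relationship has changed and to decide how much to trust older data relative to recent feedback when updating the model.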

#### Convolutional Neural Networks (CNN)

A convolutional neural network applies small learned filters (kernels) across its input. Each filter shares the same weights at every position, and the output it produces is called a feature map. Stacking convolutional layers, usually interleaved with pooling layers, lets the network build representations at multiple levels of abstraction. Unlike a fully connected network, a CNN processes its input locally: each filter works on one fixed-size patch at a time, sliding this window over the entire input. Convolutional networks are typically trained on large sets of image or video data such as ImageNet.
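
The sliding-window idea can be sketched in a few lines of plain Python. This is only an illustration, not how a real framework implements it: the function name `conv2d`, the toy image and the hand-written `vertical_edge` kernel are all assumptions made for the example (in a trained CNN the kernel values are learned).

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # The same kernel weights are reused at every (r, c) position.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 3x4 "image": dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

The feature map is largest exactly where the patch straddles the edge, which is the sense in which a filter "detects" a feature wherever it occurs in the input.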

#### Deep Learning

Deep learning uses artificial neural networks (ANNs) composed of multiple layers of nonlinear transformations. At its core, deep learning learns representations of data directly from examples, with each layer building more abstract features from the output of the layer before it.

#### Deep Q Network (DQN)

A deep reinforcement learning technique first published by Google DeepMind. A DQN uses a neural network to approximate the action-value (Q) function and learns a controller that maximizes expected future reward. It was demonstrated on games from the Atari 2600 console, where a single architecture showed strong performance across many different game types. Initial results demonstrated that a DQN using only raw pixels and the game score as inputs could learn to match or outperform human players at many Atari games. *Reportedly used as the algorithm powering some AI applications within companies such as Facebook.*
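
As a rough sketch of what "maximizing future rewards" means in Q-learning-style methods, the snippet below computes the one-step learning target for a single transition. The function names and the example numbers are assumptions for illustration; in a DQN the list `next_q_values` would come from a neural network rather than being written by hand.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target: reward plus discounted best next-state value."""
    if done:
        # No future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, target):
    """Difference the network is trained to shrink."""
    return target - q_value


# Hypothetical transition: reward 1.0, three possible actions in the next state.
target = td_target(1.0, [0.5, 2.0, 1.5], gamma=0.9)
error = td_error(q_value=2.0, target=target)
print(target, error)
```

The discount factor `gamma` controls how strongly rewards far in the future count relative to immediate ones.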

#### Evaluation Metric

An evaluation metric is the score or function used to quantify how well an algorithm performed. There are many different types of metrics used to judge the performance of algorithms, each useful for different purposes.
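
A minimal sketch of three common metrics for a binary classifier follows. The function names and the toy labels are assumptions made for this example; libraries such as Scikit-learn provide tested equivalents.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision: how many predicted positives are right. Recall: how many positives are found."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

Which metric is appropriate depends on the task: accuracy can be misleading when one class is rare, while precision and recall make the two kinds of error visible separately.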

#### Gradient Descent

Gradient descent is an optimization method that moves a parameter vector in the direction opposite to the gradient of an objective function, with steps proportional to the magnitude of the gradient (or slope). It is a first-order method: it uses only first-order partial derivatives and turns them into discrete updates. To minimize a loss $L(\theta)$ over parameters $\theta$, it applies the update rule $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$. For example, with the negative log-likelihood $L(\theta) = -\log p(x)$, where $p(x)$ is an estimated distribution (likelihood) for $x$, each update increases the likelihood of the data. The step size $\alpha$ controls how large a step you take; larger step sizes can reduce the number of iterations needed, but could result in undesirable oscillations or even divergence.
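
The effect of the step size can be seen on a toy problem. The sketch below assumes a simple quadratic objective f(θ) = (θ − 3)², chosen only for illustration; the function names are likewise hypothetical.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Repeatedly step against the gradient with a fixed step size alpha."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_f(theta):
    """Gradient of the illustrative objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


# A small step size converges towards the minimizer theta = 3.
theta_small = gradient_descent(grad_f, theta0=0.0, alpha=0.1, steps=100)

# A step size that is too large overshoots more on every step and diverges.
theta_large = gradient_descent(grad_f, theta0=0.0, alpha=1.1, steps=20)

print(theta_small, theta_large)
```

On this objective each update multiplies the distance to the minimum by (1 − 2α), so any α above 1 makes the iterates oscillate with growing amplitude.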

#### Hierarchical Clustering

A type of clustering that builds a nested hierarchy of clusters rather than a single flat partition, aiming for high intra-cluster similarity and high inter-cluster dissimilarity at each level. The agglomerative (bottom-up) form starts with each element in its own cluster at the "bottom" level and progressively merges the most similar clusters until all points belong to a single cluster. The divisive (top-down) form starts with one cluster containing every point and recursively splits it. By contrast, k-means is a flat (partitional) method rather than a hierarchical one: you specify K centroids from which to begin, and the cluster assignments are refined iteratively until each point is assigned to its nearest centroid and the assignments stop changing.
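
A bottom-up procedure can be sketched as follows. This example assumes one-dimensional points and single linkage (the distance between two clusters is the distance between their closest members); both choices, and the function names, are assumptions made to keep the sketch short.

```python
def single_linkage(a, b):
    """Distance between two clusters: the gap between their closest members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerative(points, n_clusters):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return [sorted(c) for c in clusters]


points = [1.0, 1.2, 5.0, 5.1, 9.0]
print(agglomerative(points, n_clusters=3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Stopping the merging at different values of `n_clusters` corresponds to cutting the hierarchy at different heights; running until one cluster remains yields the full tree.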

#### Loss Functions

Loss functions measure how far a model's output is from the desired output. In probabilistic models, this is how far an estimated distribution (likelihood) $p(x)$ is from the true distribution $q(x)$. Differentiable loss functions play a critical role in optimization algorithms such as gradient descent, since their gradient specifies the nature of the update. A common choice is the negative log probability, or cross entropy: $L = -\sum_x q(x) \log p(x)$.
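
As an illustration, the sketch below evaluates the cross entropy $-\sum_x q(x) \log p(x)$ between a true distribution and two candidate predictions. The function name, the clipping constant `eps` and the example distributions are assumptions made for this example.

```python
import math


def cross_entropy(q, p, eps=1e-12):
    """Cross entropy between true distribution q and predicted distribution p."""
    # Clip p away from zero so that log() is always defined.
    return -sum(qi * math.log(max(pi, eps)) for qi, pi in zip(q, p))


q = [0.0, 1.0, 0.0]        # true class is the second one
p_good = [0.1, 0.8, 0.1]   # confident and correct
p_bad = [0.6, 0.2, 0.2]    # confident and wrong

loss_good = cross_entropy(q, p_good)
loss_bad = cross_entropy(q, p_bad)
print(loss_good, loss_bad)
```

When `q` puts all its mass on one class, the sum reduces to the negative log probability that the model assigns to that class, so the loss grows quickly as the prediction moves away from the truth.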

#### Max-margin classification

A supervised learning approach that constructs a separating hyperplane between classes of training data by maximizing the margin, where the margin is the distance from the hyperplane to the nearest training point. The support vector machine (SVM) is the best-known max-margin classifier. Max-margin classification has often been used in computer vision tasks where objects are detected, localized and classified given data from an imaging sensor under real-world conditions. For these applications, margin maximization plays a role similar to Fisher's linear discriminant analysis (LDA) for separating classes of multi-dimensional data.
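
To make the notion of margin concrete, the sketch below measures the margin of two candidate hyperplanes on a toy dataset. It only evaluates given hyperplanes and does not perform the maximization itself; the function name, the data and the two candidate hyperplanes are assumptions for illustration.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Toy data: labels are +1 or -1.
points = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]

# Two hyperplanes that both separate the data, with different margins.
margin_a = geometric_margin((1.0, 0.0), 0.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), -0.5, points, labels)
print(margin_a, margin_b)
```

Both hyperplanes classify every point correctly, but the second sits midway between the two classes and therefore has the larger margin; a max-margin method searches for the hyperplane with the largest such value.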

#### Multi-class Classification

An extension of binary classification to more than two classes, where exactly one class label is typically predicted per observation. The most common setting has K mutually exclusive classes. Multi-class problems can be solved directly or reduced to binary problems: a one-vs-rest scheme trains one binary classifier per class, while a one-vs-one scheme trains one classifier for each of the N(N − 1)/2 pairs among N classes.
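
The N(N − 1)/2 count is the number of distinct pairs of classes, which is how many binary classifiers a one-vs-one reduction needs. A short sketch, with hypothetical class names and function name:

```python
from itertools import combinations


def one_vs_one_pairs(classes):
    """List every unordered pair of classes; one binary classifier is trained per pair."""
    return list(combinations(classes, 2))


classes = ["cat", "dog", "bird", "fish"]
pairs = one_vs_one_pairs(classes)

n = len(classes)
print(pairs)
print(len(pairs), n * (n - 1) // 2)  # both are 6 for N = 4
```

A one-vs-rest reduction would instead need only N classifiers, each separating one class from all the others.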

#### Non-negative matrix factorization

A matrix factorization which produces non-negative values for its factors. In non-negative matrix factorization (NMF), the input data is a matrix whose entries are all non-negative, such as word counts or pixel intensities. The goal is to find two non-negative factors that, when multiplied together, reconstruct the original data as well as possible, minimizing a reconstruction error such as the squared Euclidean (Frobenius) distance. Generalized variants of NMF exist where, instead of requiring all entries to be non-negative, only a subset is constrained (for example, semi-NMF). Because the factors are non-negative, they tend to be additive and parts-based, which makes NMF popular for topic modeling, image decomposition and recommender systems.

#### Non-negative least squares

An optimization problem that minimizes a least-squares loss under the constraint that each of its variables must be non-negative (positive or zero). The problem can be written as: minimize $f(x) = \lVert Ax - b \rVert_2^2$ subject to $x_i \ge 0$ for all $i = 1, \dots, N$, where $A$ is the data matrix and $b$ is the vector of observations. Non-negative least squares problems exist in many domains other than machine learning, including economics and finance. One example of their use is to create demand models for household products like groceries. For these sorts of problems, the goal is to estimate total revenue (price per unit multiplied by quantity sold) given market size (number of households) and per capita income (income per household), where negative coefficients would have no sensible interpretation.
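
One simple way to solve a small problem of this kind is projected gradient descent: take an ordinary gradient step on the least-squares loss, then clip any negative variable back to zero. The sketch below assumes the least-squares form ‖Ax − b‖² with a small hand-made `A` and `b`; the function name, step size and data are assumptions for illustration, and production code would more likely call a dedicated solver such as `scipy.optimize.nnls`.

```python
def nnls_projected_gradient(A, b, alpha=0.1, steps=500):
    """Minimize 0.5 * ||Ax - b||^2 subject to x >= 0 by projected gradient descent."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(steps):
        # residual = Ax - b
        residual = [
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
            for row, b_i in zip(A, b)
        ]
        # gradient = A^T (Ax - b)
        grad = [
            sum(A[i][j] * residual[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x_j - alpha * g_j) for x_j, g_j in zip(x, grad)]
    return x


# The unconstrained least-squares solution of this system has a negative
# second component, so the constraint is active.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, -2.0, 0.0]

x = nnls_projected_gradient(A, b)
print(x)  # approximately [0.5, 0.0]
```

The fixed step size `alpha=0.1` is small enough for this particular matrix; for other data it would need to be chosen relative to the largest eigenvalue of AᵀA.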

#### Overfitting

A model that fits its training data very closely but has poor predictive power on new, real-world data is said to have "overfit". Overfitting is common when training an overly complex model, or one with too many parameters relative to the number of examples it is trained on. It is especially likely when the training set closely matches one specific distribution while the examples for which predictions are required cover a much broader range.

#### Pipeline

A sequence of data processing steps in which the output of each step becomes the input of the next, packaged so that the whole sequence can be reused. Pipelines commonly include tasks such as preprocessing, feature extraction, model building, training, evaluation and deployment. In both machine learning and computer vision applications, pipelines take a variety of forms, for example image preprocessing feeding a convolutional network for image classification, or audio feature extraction feeding a recurrent neural network for speech recognition.

#### Quantization

Reducing the number of distinct values that a variable or data point can take, by mapping a range or set of values to a smaller subset. Quantization is often used in resource-constrained applications such as mobile devices and embedded systems to reduce memory use by narrowing variable ranges, decreasing precision or removing values that are unnecessary for the task at hand. For example, mapping an 8-bit unsigned integer to the nearest value in a 4-bit range is a form of quantization. In machine learning, the concept is most commonly applied to a model's weights and activations, for example by converting 32-bit floating-point values to 8-bit integers, either after training or during training so that the model can adapt to the reduced precision.
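
A minimal sketch of one common scheme, symmetric linear quantization to signed 8-bit integers, is shown below. The function names and the example weights are assumptions; real toolchains add details such as per-channel scales and zero points that are omitted here.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # [50, -127, 0, 100]
print(restored)  # close to the originals, but 0.003 has been rounded to 0.0
```

Each stored value now needs one byte instead of four, at the cost of a rounding error of at most half the scale per value.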

#### Recommender systems

Recommender systems are data-driven algorithms that predict the preferences of users for various items such as products, services or media. The goal is to generate personalized content for each user within an application by leveraging large datasets describing user behavior and interactions with different types of content over time. This process is traditionally built around collaborative filtering methods, which often use matrix factorization techniques to predict how users will rate new items based on their similarity (in terms of preference) to other users and items. Popular recommender system implementations include Amazon's product recommendations and Facebook's news feed.

#### Regularization

The process of penalizing certain coefficients in a model in order to reduce model complexity and decrease the chance of overfitting. Regularization is often achieved by adding a penalty term to the objective function that is evaluated during optimization, for example the sum of squared coefficients (L2) or the sum of their absolute values (L1). This additional term represents some measure of model complexity, and its weight determines the relative emphasis placed on fitting the training data as opposed to avoiding overfitting.
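
The effect of a penalty term is easiest to see in one dimension. The sketch below assumes a model y ≈ w·x with no intercept and a squared (L2) penalty λ·w², for which the minimizer has the closed form w = Σxy / (Σx² + λ); the function names and data are assumptions for illustration.

```python
def penalized_loss(w, xs, ys, lam):
    """Squared error on the data plus an L2 penalty on the coefficient."""
    fit = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return fit + lam * w * w


def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of penalized_loss for a single coefficient."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

w_unregularized = ridge_slope(xs, ys, lam=0.0)
w_regularized = ridge_slope(xs, ys, lam=14.0)
print(w_unregularized, w_regularized)  # 2.0 1.0
```

With no penalty the fit recovers the true slope; increasing λ shrinks the coefficient towards zero, trading some fit to the training data for a simpler model.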

#### Unsupervised learning

The task of gaining knowledge from unlabeled datasets, without the guidance or supervision of labeled examples that humans would otherwise have to provide to the algorithm in advance. Unsupervised learning algorithms can produce meaningful outputs such as low-dimensional embeddings representing groups and clusters found within high-dimensional data, sparsity patterns describing how variables relate to one another, and even complete generative models, for example models that generate images depicting features found in the data.

#### Visualization

The process of deriving meaning from large and complex datasets through visual means, by encoding information in a way that is easily digestible to humans and can be communicated efficiently and effectively. Visualizations for machine learning applications (e.g. embedding plots) often represent high-dimensional data with lower-dimensional representations such as scatterplots, heatmaps, parallel coordinate plots or small multiples, to help gain insight into relationships between different features within the dataset. Common examples also include maps (which encode geographic entities visually), histograms and pie charts.

#### Word2vec

A family of unsupervised (self-supervised) algorithms that learn to represent words as vectors, or embeddings, where each word is mapped to a dense vector of typically a few hundred dimensions. Word2vec trains a shallow neural network, in either the skip-gram or the continuous bag-of-words (CBOW) configuration, to predict words from their surrounding context, so that words appearing in similar contexts end up with similar vectors. These embeddings have formed the basis of many NLP applications because they capture semantic and syntactic relationships between words without requiring any supervision beyond raw text alone.
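
The reason raw text is enough is that the training examples are generated from the text itself. The sketch below shows how skip-gram style (center word, context word) pairs can be extracted with a sliding window; the function name, the window size and the example sentence are assumptions, and the neural network that is trained on these pairs is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the cat sat".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

A model trained to predict the second element of each pair from the first is pushed to give similar vectors to words that share many context words.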

#### XGBoost

Extreme gradient boosting (XGBoost) is an optimized, scalable implementation of gradient tree boosting designed around parallel and distributed computation, using multiple CPU cores on one machine or many machines in a cluster. It adds regularization terms to the boosting objective to help control overfitting, and it can also train on GPUs. The core is implemented in C++, with bindings for languages such as Python and R, and the Python package includes estimators that follow the Scikit-learn interface so that it can be used inside existing Scikit-learn workflows. Its speed and low memory overhead make XGBoost especially well suited for structured (tabular) data.

#### Gradient Boosting

A type of ensemble learning method that builds an ensemble of weak learners, called base learners, one at a time, in a process that can be viewed as gradient descent in function space. Each iteration of gradient boosting fits a new weak learner to the negative gradient of the loss with respect to the current ensemble's predictions (for squared error, these are simply the residuals), and adds it to the ensemble scaled by a learning rate. Gradient boosting is commonly used for classification tasks, where predicting one value from a set is desired, and also for regression problems; in both cases it requires only a differentiable loss function.
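
The iterative idea can be sketched for regression with squared error, where the quantity each new learner is fitted to is simply the current residual. The example assumes one-dimensional inputs with at least two distinct values and uses single-split "stumps" as weak learners; all names, the data and the hyperparameters are assumptions for illustration, not a reference implementation.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None  # (squared error, threshold, left mean, right mean)
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared error; returns a prediction function."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]

model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
mse_baseline = mse(lambda x: mean_y, xs, ys)
mse_boosted = mse(model, xs, ys)
print(mse_baseline, mse_boosted)
```

Each round removes only a fraction (`learning_rate`) of the error the new stump could explain, which is why many small steps are used instead of a few large ones.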

#### Gradient Tree Boosting

A type of ensemble learning method which uses decision trees as the weak learners in gradient boosting. Gradient tree boosting iteratively learns a sequence of shallow decision trees, each correcting the errors of the trees before it, in order to construct one strong learner that can generalize well to unseen data. Because trees are added sequentially rather than grown independently, gradient tree boosting can often fit more complex relationships than other tree ensembles like Random Forests or Extremely Randomized Trees, at the cost of being more sensitive to its hyperparameters. It frequently performs well on structured data, provided that the learning rate, tree depth and number of trees are tuned to avoid overfitting.

#### Stochastic Gradient Boosted Trees

A variant of Gradient Tree Boosting in which each tree is fitted on a random subsample of the training data, drawn without replacement, instead of on the full training set. Stochastic Gradient Boosted Trees are commonly used for regression tasks, where the prediction is the sum of the outputs of many individual trees. The subsampling generally makes training faster and acts as a regularizer, and the samples left out of each subsample can be used for out-of-bag error estimates. Like other tree ensembles, the method can report feature importances to aid interpretability.

#### Random Forests

An ensemble learning technique that builds a collection of base learners using randomized decision trees. Each tree is trained on a bootstrap sample of the training data, and each split considers only a random subset of the features; the forest predicts by majority vote (classification) or by averaging (regression). Random Forests are among the most popular machine learning techniques in use today because they generalize well to new data: the individual trees are usually grown deep and have low bias, and averaging many decorrelated trees reduces variance. The randomized selection of features helps reduce overfitting by making the trees less correlated with one another.
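
The sources of randomness and the final aggregation can be sketched without building any trees. The helpers below are hypothetical names chosen for this example: they draw a bootstrap sample of the rows, pick a random subset of feature indices, and combine per-tree predictions by majority vote. Fitting the decision trees themselves is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; some rows repeat, others are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Choose k distinct feature indices to consider for a split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Combine the class predictions of individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(5.1, 3.5, "a"), (4.9, 3.0, "a"), (6.2, 2.9, "b"), (5.9, 3.0, "b")]

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=2, k=1, rng=rng)
print(sample, features)
print(majority_vote(["a", "b", "a"]))  # a
```

Because every tree sees a different sample and different candidate features, the trees make partly independent errors, and voting or averaging cancels some of them out.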