Glossary, Un-Alphabetized

Ordered out of order

Nicholas Teague
From the Diaries of John Henry
13 min read · Aug 25, 2022

  1. Machine Learning — the algorithmic derivation of a modeled data generating function
  2. Artificial Intelligence — a superset that includes machine learning and other conventions of modeling knowledge
  3. Supervised Learning — a type of machine learning that relies on the convention of a training phase comparing training data and corresponding labels to derive a model that through inference translates test data to predicted labels
  4. Training — the stage of supervised learning in which a model is derived
  5. Inference — the stage of supervised learning in which a model is used to generate predictions
  6. Train and Test Data — common terms for the data sets utilized in training and inference for supervised learning. Note that increasing the scale of training data inspected in training is one of the easiest ways to improve the performance of a model.
  7. Tabular Data — data aggregated into tables of columns (features) and rows (samples), which may have feature types like numeric or categoric entries (“tabular learning” refers to supervised learning applied to tabular data)
  8. Dataframe — a convention for representing tabular data that includes column headers and a distinct index column to help with munging
  9. Munging — slang for data transformations
  10. Imputations — replacement values inserted into a data set in place of missing entries
  11. ML infill — a convention for imputations in which a distinct supervised learning model is trained for a feature based on properties of surrounding features
  12. Validation Data — data that is segregated from the training data before training to validate or tune a model in supervised learning
  13. Cross Validation — a variation on supervised learning in which training rotates through distinct data partitions so as to validate a model without discarding any training data
  14. Classification — a common application of supervised learning in which inference classifies a test data sample between a set of labels
  15. Regression — a common application of supervised learning in which inference translates a test data sample to a numeric value
  16. Linear Regression / Support Vector Machines / Decision Trees — some common conventions for supervised learning that pre-dated neural networks
  17. Random Forests — a variation on decision tree learning in which an aggregation of models is derived by training on randomized subsets of features and samples
  18. Boosting — a variation on supervised learning in which an aggregation of models is built sequentially, with each successive model trained to refine the output of the models preceding it
  19. Gradient Boosting — a variation on ensembles of decision trees (like random forests) relying on a form of boosting. Unlike random forests, common libraries can be accelerated on GPU hardware. Gradient boosting remains one of the top performing options for tabular learning, albeit with more potential to overfit than random forests.
  20. GPU — a form of chipset hardware originally designed for computer graphics applications (graphics processing unit). One of the enabling factors for deep learning was the realization that certain conventions for GPU architectures were quite good at accelerating backpropagation by parallelized matrix multiplications.
  21. Weights — in supervised learning the training of a model amounts to tuning a set of weights, where each learning convention may have different ways of assembling or interacting between weights
  22. Parameters — a common alternate term for weights
  23. Hyperparameters — generally refers to configuration options for an architecture selected prior to a weight tuning step
  24. Hyperparameter Tuning — a meta form of training in which tuning of hyperparameters is conducted. Basic forms include grid or random search, while others may make use of a more sophisticated optimizer.
  25. Neural Networks — a common convention for supervised learning which in a basic form relies on layers of weighted activation functions. Unlike gradient boosting, it is suitable for data types other than tabular and has countless variations on conventions for different applications.
  26. Deep Learning — slang for neural networks with a larger number of layers and weights, which may be more difficult to train but usually perform much better. Also the name of an influential textbook by Goodfellow, Bengio, and Courville.
  27. Architecture — broadly refers to the configuration of weights, e.g. depth of layers, width of layers, or other more advanced configurations applied in modern practice
  28. Dense Layer — a neural network layer in which every activation function in the layer has a weighted connection to every activation function in the subsequent layer
  29. Activation Function — a mathematical form to translate a collection of inputted weighted signals to a common returned signal
  30. ReLU Activation — this activation function returns zero for negative input or returns the input value for positive input. The integration of ReLU activations into neural networks was one of the enabling factors for deep learning. Modern practice may include subtle variations on this original form.
  31. Sigmoid Activation — this activation function is commonly applied for an output signal in neural network classification applications
  32. Linear Activation — this activation function is commonly applied for an output signal in neural network regression applications
  33. Backpropagation — the optimization algorithm applied to train a neural network, in which a gradient signal, derived by feeding the forward pass output for training data and the corresponding labels to a loss function, is channeled back via the chain rule to tune earlier layers in a backward pass of gradient descent. Involves lots and lots of matrix multiplication operations.
  34. Loss Function — applied at the point of backpropagation where the output of a forward pass is compared to labels to derive a gradient signal. May include algorithmic components to shape the comparison (e.g. cross entropy loss) and may also include regularizing components.
  35. Learning Rate — an important hyperparameter of the optimizer that scales the updates applied. May be static or adaptive (e.g. may be subject to momentum based optimizers, a cyclic schedule, or an annealing regime)
  36. Regularization — constraints added to dampen or sparsify weights that may be expected to improve generalization characteristics, e.g. L1 and L2 regularizers.
  37. Dropout Regularization — a form of regularizer in which during training a random subset of a layer’s units (activations) is zeroed out, with a different subset sampled in each pass. Commonly specified by setting a ratio for a layer.
  38. Batch Normalization — a common operation to rescale the activations of a layer across each batch during training, which may help retain a robust gradient signal
  39. Stochastic Gradient Descent (SGD) — a common variant of gradient descent in which only a subset (batch) of the training data is inspected during each update step
  40. Batch Size — the size of the inspected batch of training data during stochastic gradient descent, often an influential hyperparameter both for the tuning of other parameters and for hardware utilization characteristics.
  41. Epoch — one complete pass through the training data, typically made up of many batch-sized forward and backward passes
  42. Performance Metric — the score monitored in training which may help to compare model performance on train and test data, e.g. AUC, F1, MSE, etc.
  43. Confusion Matrix — refers to a comparison of model performance, in the binary classification case by a 2x2 matrix of true or false positive predictions and true or false negative predictions. Some performance metrics weigh these scenarios differently.
  44. Generalization — in supervised learning, refers to the trained model’s performance on test data, which will generally be lower than the model’s performance on train data
  45. Overfit — a state of high divergence between train and test data performance
  46. Early Stopping — one way to mitigate risk of overfit is to monitor performance metrics and halt training once train or validation performance has gone a set number of epochs without improvement
  47. Double Descent — a surprising phenomenon found in overparameterized networks in which models trained to a state of overfit, when progressed through sufficient additional epochs, will actually recover test data performance and then often outperform the model’s state prior to reaching overfit
  48. Overparameterization — as a surprising rule of thumb, has been found to occur when the number of weights in an architecture sufficiently exceeds the number of samples found in the training corpus. Has been considered one of the predominant mysteries of recent practice, as increasing scales of parameters appear to directly correlate with improvements to performance.
  49. Geometric Regularization Conjecture — this author’s hypothesis that phenomena found with overparameterization arise from a similar property displayed by hyperspheres with increasing dimensionality, mainly that volume approaches zero as dimensionality grows. If we consider the distribution of possible weights associated with a loss value to itself be a high dimensional geometric figure, then where the hypersphere effect is present, increasing parameterization should constrain the degrees of freedom available to a training path, which will have a regularizing effect.
  50. Hypersphere — also known as the n-sphere, refers to a geometric shape similar to a spherical ball but potentially with a number of dimensions other than 3. Is one of the few tractable geometries in higher dimensions to modern theory. Has been found to have a characteristic curve of volume and surface area with increasing dimensionality. One way theorists evaluate hyperspheres is by their maximum packing density at different dimensionalities.
  51. Ratio of Influence — a metric proposed by this author deriving a ratio of the total count of weight interactions in a forward pass divided by total count of weights. For example for a three layer dense network, the input weights to the first layer influence all three layers while the weights of the second layer only influence layers 2 and 3.
  52. Fitness Landscape — refers to the landscape of collective weight configurations that may be traversed through backpropagation
  53. Training Path — the sequence of weight configurations traversed through backpropagation
  54. Degrees of Freedom — refers to how much variability may result in a realized training path from the stochasticity inherent in the training loop (related to Fisher Information)
  55. Entropy — one of the fundamental properties of physics described by laws of thermodynamics, however in machine learning practice we are more commonly referring to entropy in the information theory sense. In the context of random sampling, sufficiently increasing entropy aligns with i.i.d. stochasticity where all outcome scenarios are equally possible.
  56. Shannon Entropy — a measure of information content proposed by Claude Shannon, characterized by a probability weighted sum of negative log likelihoods
  57. Von Neumann Entropy — similar to Shannon entropy but measures in the quantum information setting
  58. Noise — in machine learning practice commonly refers to sources of stochasticity injected into data or training. Note that in quantum computing literature “noise” more commonly refers to qubit state error channels arising from hardware or environmental factors.
  59. Isotropic Noise — stochasticity sampled from a tame distribution, e.g. Gaussian or Laplace distributions
  60. Distribution — a term from probability theory describing the range and density of possible values when sampling from a system. In simple systems can be approximated by a histogram of samples.
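  61. Bernoulli Distribution — a discrete distribution that samples between two possible values, similar to flipping a coin (the generalization to a larger set of possible values, similar to rolling a die, is known as the categorical distribution). Scenarios may be characterized as equally likely or weighted.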
  62. Gaussian Distribution — also known as the Normal Distribution, refers to a tame univariate continuous distribution common in nature. Arises naturally from sums over a collection of diverse distributions by the central limit theorem. Can be characterized by a mean and standard deviation.
  63. Laplace Distribution — a distribution sometimes used in place of Gaussian in machine learning practice characterized by a sharper peak and fatter tails.
  64. Tail Thickness — in univariate distributions, with increasing tail thickness a sampled value is more likely to fall far from the mean. Fatter tails are less statistically tractable, as sampling a single outlier point may materially shift summary statistics (the proverbial calculation of average wealth of a population with or without including Warren Buffett). One example of a fat tailed distribution is the power law distribution.
  65. Power Law Distribution — a univariate distribution characterized by an exponent term which may have intractable summary statistics like mean or standard deviation. Examples of power laws found in nature include earthquake intensity or market drawdowns.
  66. Stochastic Perturbations — refers to the practice of injecting isotropic noise into the features of a tabular data set. When applied to test data results in non-deterministic inference.
  67. Non-Deterministic Inference — this author’s proposal to lift determinism from an inference operation by stochastic perturbations to test data so that model predictions may be sampled from a distribution instead of returning a fixed value corresponding to a given set of inputs.
  68. Pseudo Random Number Generator (PRNG) — a deterministic form of algorithmically sampling random numbers that relies on some form of entropy seeding, often sourced from the operating system, to make the output more unpredictable. A popular version in current use is known as the PCG generator.
  69. Entropy Seed — a source of unpredictability to a sampling operation that is magnified through the pseudo random number generator. Example resources for input could include clock states, memory states in an operating system, or other natural phenomena.
  70. Quantum Random Number Generator (QRNG) — a truly random form of sampling from i.i.d. stochasticity that relies on measurements from a quantum circuit. Different forms of quantum sampling circuits may have different benefits such as circuit depth efficiency, certifiable output on untrusted hardware, or other measures of performance.
  71. Qubit — the fundamental building block of quantum computation. May be realized by one of many hardware implementations, and generally relies on a nano scale system that may be manipulated into a shaped superposition by application of quantum gates. A rule of thumb is that the information capacity of a quantum circuit with n qubits scales as 2^n, although even a small number of qubits may be useful in some applications.
  72. <Bra| |Ket> Convention — a way to represent matrices of qubits whereby the bra represents the amplitude of a state and the corresponding ket represents a particular collective measurement set scenario
  73. Amplitude — an extension of probability theory to the quantum setting in which values may be positive or negative and may also include complex number components. For a set of qubits the sum of amplitude squares across measurement set scenarios resembles classical probability in that it is fixed to a constant of 1.
  74. Cross Product — when adding qubits to a system the measurement set scenarios are a result of a cross product between states (more formally known as the tensor or Kronecker product), e.g. for a 1-qubit system with kets |0> and |1> adding two more qubits would give kets of |000> |001> |010> |011> |100> |101> |110> |111>, with a similar operation applied to the combined amplitudes. Cross products in this sense are a form of matrix multiplication that return a matrix.
  75. Dot Product — the bra ket notation is a result of multiplication between an amplitude vector and a measurement set vector. Dot products are a form of matrix multiplication that return a scalar.
  76. Complex Conjugate — a simple mathematical operation that can be applied to translate a bra representation to a corresponding ket representation or vice versa
  77. Bloch Sphere — a visual representation of a single qubit’s superposition as a sphere, where in an ideal state the superposition is constrained to the surface and can thus be described by two variables.
  78. Quantum Gate — a transformation applied to translate the superposition of one or more qubits to some alternate state. In physical space the implementation relies on the hardware, a common form is by microwave pulses. In mathematical space the gate can be described as a unitary matrix. The realization of a quantum circuit is nothing more than a set of gates applied to a set of qubits followed by one or more measurements.
  79. Measurement — the accessing of a qubit state that results in a collapse of the superposition to a classical value. (In some cases the measurement basis can be rotated in a manner similar to a gate’s rotation on the Bloch sphere.)
  80. Pauli Gates — a set of single qubit gates that can each be represented by a 2x2 matrix, including X, Y, and Z (the Hadamard gate H is a closely related single qubit gate, although not itself a Pauli gate)
  81. Hadamard Gate — a single qubit gate that translates a qubit from the 0/1 basis to the +/- basis, or if considered from the 0/1 basis translates a 0 or 1 state to an equal superposition between 0 and 1
  82. Bell State — the purest form of entanglement between two qubits, can be generated by a simple set of gates
  83. Entanglement — refers to correlations between measurement outcomes, even in distant devices. Some forms of quantum information transmission rely on sharing an entangled state across some distance. As an example, if we have two qubits in a Bell state, then a measurement outcome on one will give us certainty about a measurement outcome on the other.
  84. Teleportation — a means to transfer a superposition state to a different qubit. Relies on access to entanglement and an accompanying classical signal, and in the process of teleportation the original state is lost by measurement.
  85. NISQ (noisy intermediate scale quantum devices) — refers to the paradigm of quantum hardware without access to fault tolerant error correction
  86. PQC (parameterized quantum circuits) — a paradigm of quantum circuit algorithm in which parameters of gate operations are tuned in a manner resembling the training of a neural network.
  87. Ansatz — a quantum circuit prepared as a set of gates serving as the initial state of a PQC prior to training. Some types of ansatz configurations may be more suitable to different applications.
  88. Barren Plateau — a challenging phenomenon sometimes found when training PQC algorithms referring to the loss of a gradient signal necessitating an exponential number of shots in order to train the circuit.
  89. Shots — common terminology from quantum computing cloud vendors referring to a round of circuit initialization to measurement, often a large part of the pricing basis. Some algorithms only require a small number of shots, others more.
  90. QML (quantum machine learning) — a catch-all term for machine learning algorithms that make use of quantum algorithms
  91. QNN (quantum neural network) — a type of QML utilizing a PQC and training in a manner similar to backpropagation. In some cases may have modular components mixed with classical neural networks.
  92. Quantum Kernel Learning — a form of QML that does not rely on PQC
  93. QAOA — a form of optimization performed on gate based quantum circuits
  94. Annealing — a form of optimization performed on adiabatic quantum circuits
  95. Adiabatic Quantum Circuits — a specialized form of quantum hardware for performing optimization
  96. Ising Model — a phenomenon similar to atomic configurations in magnets, in which reductions to a temperature or similar setting cause binary pairwise interactions in this type of system to become aligned to some low energy state, which in annealing could be shaped to align with the objective of an optimization function
  97. Shor’s Factoring Algorithm — an early high profile quantum algorithm relying on a quantum Fourier transform that offers an exponential speedup over the best known classical methods for integer factoring, and so on sufficiently capable hardware would be expected to break common public key encryption schemes. Drew a lot of interest and funding to the quantum ecosystem.
  98. Fourier Transform — an important algorithm that can be used to translate functions between geometric representations and a composition of frequencies and wavelengths, variations include the Fast Fourier Transform and the Quantum Fourier Transform
  99. Grover’s Search Algorithm — an important algorithm for unsorted database search that provides a quadratic speedup, can be adapted to realize quadratic speedups in many adjacent applications
  100. Tensor — an aggregation of data into compositions potentially of higher dimensions than matrices
  101. Tensor Networks — a mathematical trick for performing linear algebra operations on tensors
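  102. Eigenvalues / Eigenvectors — a characteristic decomposition of a square matrix into directions (eigenvectors) that the matrix only rescales and their corresponding scaling factors (eigenvalues). The leading terms often contain most of the information in a compressed representation.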
  103. Principal Component Analysis — a form of dimensionality reduction that can be fit to a training set
  104. Dimensionality Reduction — the collective translation of a feature set to a compressed form
  105. Embedding — the translation of a category to a vectorized form
  106. Bottleneck — associated with the principle that channeling information into a compressed representation often helps with learning (in literature this same concept can be found in aphorisms)
  107. Large Language Model — a modern convention utilizing overparameterized transformer architectures, trained in a self-supervised fashion on a massive training corpus, with auto-regressive inference used to generate sequences of natural language with only few shot supervision
  108. Few Shot Learning — refers to generative models that can realistically mimic a demonstrated form with only a small amount of instruction
  109. Zero Shot Learning — what will be realized with artificial general intelligence
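
As a small numerical illustration of the Hypersphere entry (item 50), and of the shrinking volume referenced in the Geometric Regularization Conjecture (item 49), here is a minimal Python sketch. It assumes the standard closed form for the volume enclosed by a unit-radius hypersphere in n dimensions, V_n = π^(n/2) / Γ(n/2 + 1), with surface area n · V_n. The function names are this sketch's own, not taken from any library or from the author's code, and it says nothing about whether the conjecture itself holds.

```python
import math


def unit_ball_volume(n: int) -> float:
    """Volume enclosed by a unit-radius hypersphere in n dimensions."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def unit_sphere_surface_area(n: int) -> float:
    """Surface area bounding the unit-radius ball in n dimensions (n * V_n)."""
    return n * unit_ball_volume(n)


if __name__ == "__main__":
    # Tabulate the characteristic curve of volume and surface area
    # as the number of dimensions increases.
    for n in (1, 2, 3, 5, 7, 10, 20, 50, 100):
        vol = unit_ball_volume(n)
        area = unit_sphere_surface_area(n)
        print(f"n={n:>3}  volume={vol:.3e}  surface_area={area:.3e}")
```

Under this formula the unit volume peaks at n = 5 and the surface area at n = 7, after which both decay toward zero as dimensionality grows.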
Highway 61 Revisited — Bob Dylan

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

Nicholas Teague
From the Diaries of John Henry

Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.