Modeling Paradigms

This is a Twitter series on #FoundationsOfML. Today, I want to talk about another fundamental question:

❓ What is a model in Machine Learning?

Let's take a bird's-eye view of different modeling paradigms... 👇🧵


Remember our purpose is to find some "optimal" program P for solving a task T, by maximizing a performance metric M using some experience E.

📝 In ML lingo, that program P is called a "model", or alternatively, a "hypothesis".


👉 The first step in coming up with a suitable model is to think of a model family, or hypothesis space.

We can think of it as a class of programs, all very similar, such that one instance of that class is the actual program P that solves our task T optimally.
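
A minimal sketch of that idea, assuming a made-up family of straight lines and a tiny made-up dataset (the name `make_line` and the candidate grid are hypothetical, purely for illustration). Every member of the family shares the same template and only the values plugged into it change:

```python
from itertools import product

def make_line(w, b):
    """Build one member of the hypothetical family {x -> w*x + b}."""
    def model(x):
        return w * x + b
    return model

# A tiny, finite slice of the hypothesis space: six (w, b) combinations.
candidates = list(product((0.0, 1.0, 2.0), (0.0, 1.0)))

# Made-up experience E: points that lie exactly on y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

def error(params):
    """Sum of squared errors of the candidate built from params."""
    model = make_line(*params)
    return sum((model(x) - y) ** 2 for x, y in data)

# "Learning" here is just picking the best instance of the class.
best = min(candidates, key=error)
```

Real hypothesis spaces are far too large to enumerate like this, which is why the search strategy matters so much.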


All the possible models of a given model family share a common structure, and differ only in specific decisions (or instructions, if you will) inside this common template.

A useful distinction here is between 1️⃣ Parametric and 2️⃣ Non-Parametric models.


1️⃣ Parametric models

All models inside our model family have the same fixed number of parameters (or weights), and differ from each other only in the specific values of those parameters.


📝 Most of the models you've probably heard about, including logistic regression, naive Bayes, and all neural networks (with fixed architecture), are in this category.


🔥 We can cast the problem of finding the best model as the problem of finding the optimal value for each parameter, such that some performance metric M is maximized (or some error metric M' is minimized).
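
As a rough illustration, here is a sketch that fits a two-parameter line by plain gradient descent on mean squared error. The toy data, the learning rate and the step count are all assumptions chosen so that this small example converges. It is one possible way to search the parameter space, not the only one:

```python
# Made-up training data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def mse(w, b):
    """Error metric M': mean squared error of the model y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(steps=2000, lr=0.05):
    """Gradient descent over the two parameters (w, b), starting from zero."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit()  # ends up very close to w = 2, b = 1 on this data
```

Note that the fitted model is just the pair `(w, b)`, no matter how many training points we had.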


2️⃣ Non-parametric models

Different models of the same model family have a variable number of parameters that often depends on the size of the training set.


📝 Some simple examples of non-parametric models are K-nearest neighbors, support vector machines (with non-linear kernels), and decision trees.
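
To make the "variable number of parameters" point concrete, here is a bare-bones K-nearest neighbors sketch. The `KNN` class and its `size` method are hypothetical names written for this example, not any library's API, and the six labeled points are made up:

```python
from collections import Counter

class KNN:
    """Toy K-nearest-neighbors classifier using squared Euclidean distance."""

    def __init__(self, k=3):
        self.k = k
        self.points = []  # the "parameters" are the stored training examples

    def fit(self, X, y):
        # Training just memorizes the data.
        self.points = list(zip(X, y))
        return self

    def predict(self, x):
        def dist(p):
            return sum((a - b) ** 2 for a, b in zip(p[0], x))
        nearest = sorted(self.points, key=dist)[: self.k]
        # Majority vote among the k closest stored examples.
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    def size(self):
        # Model size grows one-to-one with the training set.
        return len(self.points)

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
model = KNN(k=3).fit(X, y)
```

Here `model.size()` equals the number of training points, so doubling the data doubles the model.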


1️⃣👍 One advantage of parametric models is that they often have very efficient training algorithms, since you can exploit the structure of the parameter space (e.g., you have gradients to follow).


1️⃣👍 Another advantage is that the size of the model is independent of the size of the training set, and is often proportional to the number of features, i.e., these models "compress" the training set into a fixed-size "formula" of sorts.


1️⃣👍 Additionally, parametric models are often easy to regularize by adding some cost associated with the model complexity (e.g., the number of non-zero parameter values or their magnitude).
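
A small worked sketch of the "magnitude" flavor, assuming a single-weight model y = w*x with a squared (L2) penalty. The function name `ridge_weight` and the data are made up for illustration. Setting the derivative of sum((w*x - y)^2) + lam * w^2 to zero gives the closed form used below:

```python
def ridge_weight(xs, ys, lam):
    """Weight minimizing sum((w*x - y)^2) + lam * w^2 for the model y = w*x."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_free = ridge_weight(xs, ys, 0.0)   # no penalty: recovers w = 2.0
w_reg = ridge_weight(xs, ys, 14.0)   # penalty as large as sum(x^2): w = 1.0
```

The larger the penalty `lam`, the more the weight is pulled toward zero, trading some fit for a "simpler" model.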


In short, parametric models are mathematically elegant and very malleable.

1️⃣👎 The biggest downside is that every model family implies significant assumptions about the data, and if we assume the wrong ones, we're very likely to underfit.


In contrast, non-parametric models are much more ad hoc.

2️⃣👍 The main advantage of these models is that they often imply weaker assumptions about the data, and can adapt to difficult datasets more easily than similarly complex parametric models.


2️⃣👎 The biggest downside is that each model family is its own world, with ad-hoc training algorithms and lots of model-specific decisions to fine-tune.


2️⃣👎 Another downside is that the model size is sometimes proportional to the training set size, which makes them less suitable for learning from very large datasets.


❓ Which is better?

The answer is, of course, it depends.

👉 It depends on the nature of the problem to solve, the amount and quality of the available data, and what you intend to do with that solution.


🔹 If you know the data well, and you're clear about your assumptions (e.g., linear relationships), go for the parametric model that best suits those assumptions.


🔹 If you know nothing, Jon Snow, some of the most powerful non-parametric models (e.g., decision trees and SVMs) will often perform near the state of the art with little fine-tuning.


☝️ In any case, there is no substitute for experimentation. Make sure to evaluate different models and decide based on actual performance rather than intuition.
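
A minimal sketch of what that can look like, assuming made-up data from y = x^2 and three hypothetical candidates (`fit_mean`, `fit_line` and `fit_1nn` are names invented for this example). Each one is fitted on half of the points and scored on the other half:

```python
def fit_mean(xs, ys):
    """Parametric, one parameter: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Parametric, two parameters: least-squares line y = w*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return lambda x: w * x + b

def fit_1nn(xs, ys):
    """Non-parametric: copy the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data: y = x^2, even x values for training, odd ones held out.
data = [(float(x), float(x * x)) for x in range(10)]
train_x, train_y = zip(*data[0::2])
test_x, test_y = zip(*data[1::2])

fitters = {"mean": fit_mean, "line": fit_line, "1nn": fit_1nn}
scores = {name: mse(fit(train_x, train_y), test_x, test_y) for name, fit in fitters.items()}
best = min(scores, key=scores.get)
```

On this particular toy split the straight line happens to get the lowest held-out error, which says nothing general. With different data the ranking can easily flip, and that is exactly why you measure.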

⏳ We'll talk more about evaluation in a later thread.