Using a model post-processor within scikit-learn pipelines

On the lack of post-processors

Pipelines are a very handy and central feature of the scikit-learn library: they make it possible to chain sequences (or even, in a limited way, DAGs) of data transformers together with a machine learning model. They greatly facilitate cross-validation scoring or hyper-parameter search over the resulting pipeline as a whole. They also ease the productization of a machine learning exercise by providing a single encapsulated trained model that can be applied as one instance to a test set or integrated into a production system.
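As a minimal sketch of that workflow (the dataset, pipeline steps and parameter grid below are illustrative assumptions, not the post-processor discussed in this post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; any (X, y) pair would do here.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A pipeline chains transformers with a final estimator into a single object.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# The whole pipeline is then cross-validated / grid-searched as one model.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The fitted `search.best_estimator_` is the single encapsulated object that can be shipped and applied to new data.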

For a very clear introduction, see also the excellent post Using scikit-learn Pipelines and FeatureUnions.

Continue reading »

How to normalize log-likelihood vectors

What is the best way to convert a log-likelihood vector into the corresponding log-probability vector, i.e. a vector of values whose exponentials sum up to one?

TLDR: just subtract the scipy.misc.logsumexp() of that vector.
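A minimal sketch of that one-liner (the log-likelihood values are made up; in recent SciPy releases the function lives in scipy.special rather than scipy.misc):

```python
import numpy as np
from scipy.special import logsumexp  # scipy.misc.logsumexp in older SciPy releases

# A made-up vector of (unnormalized) log-likelihoods.
log_likelihoods = np.array([-1000.0, -1001.0, -998.5])

# Subtracting logsumexp yields log-probabilities: their exponentials sum to one,
# without ever exponentiating the huge negative values directly.
log_probs = log_likelihoods - logsumexp(log_likelihoods)

print(np.exp(log_probs))        # well-defined probabilities
print(np.exp(log_probs).sum())  # 1.0
```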

Note: this notebook is available on GitHub in case you want to play around with it.

Usage of logs in probability and likelihood computations

Given a set of independent events $D = \{E_i\}$ with probabilities $\{p_i\}$, the probability of the joint occurrence of all events in $D$ is simply

$$P_D = \prod\limits_{i=1}^{N} p_i$$

This quantity, however, quickly becomes impossible to compute as written above: since each $p_i \in [0, 1]$, the result gets smaller and smaller as the multiplication proceeds and quickly underflows. Using logarithms of probabilities and likelihoods is a very common computational trick to avoid that underflow; it exploits the fact that a logarithm transforms a product into a sum, i.e. $\log P_D = \sum\limits_{i=1}^{N} \log p_i$.
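A small sketch of that failure mode and of the log-based fix (the probabilities below are randomly generated, purely for illustration):

```python
import numpy as np

# 2000 made-up event probabilities, each well inside (0, 1).
rng = np.random.default_rng(0)
p = rng.uniform(0.001, 0.1, size=2000)

# The naive product underflows to exactly 0.0 in double precision.
print(np.prod(p))

# Summing logarithms instead stays perfectly representable:
# a large negative, but finite, log-probability.
print(np.sum(np.log(p)))
```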

Continue reading »