# Vfm

Variational Factorization Machines
## Install / Use
Chainer Variational Factorization Machine implementation.
Run `python run_movielens.py` to download, extract, and run the FM model on the MovieLens 1M data.
| model | batchsize | rank | intx term | lambda0 | lambda1 | lambda2 | RMSE | Notes |
|-------|-----------|------|-----------|---------|---------|---------|--------|-------|
| FM | 8192 | 0 | N | 0 | 1e-2 | 0 | 0.9305 | Regression with regularization |
| FM | 8192 | 0 | N | 0 | 0 | 0 | 0.9115 | Regression with no regularization |
| FM | 8192 | 0 | N | 0 | 1e-3 | 0 | 0.9112 | Regression with less regularization |
| FM | 8192 | 20 | Y | 0 | 0 | 1e-3 | 0.8633 | FM model w/ 20D latent vector |
| FM | 8192 | 20 | Y | 0 | 1e-3 | 1e-3 | 0.8618 | FM model w/ 20D latent vector and regularization |
| VFM | 8192 | 20 | Y | 0 | 1e-3 | 1e-3 | 0.8625 | Variational FM model with arbitrary regularization |
| VFM | 8192 | 20 | Y | 1 | 1 | 1 | 0.8620 | Variational FM model with default priors |
| VFM | 8192 | 20 | Y | 1 | 1 | 1 | 0.8585 | Variational FM model with grouping |
| VFM | 8192 | 64 | Y | 1 | 1 | 1 | 0.8800 | Higher-rank model does worse |
After Dec 19:
| model | batchsize | rank | intx term | lambda0 | lambda1 | lambda2 | RMSE | Notes |
|-------|-----------|------|-----------|---------|---------|---------|--------|-------|
| VFM | 4096 | 8 | Y | 0 | 0 | 0 | 0.8782 | no regularization |
| VFM | 4096 | 8 | Y | 0 | 1 | 1 | 0.8775 | |
| VFM | 4096 | 8 | Y | 1 | 1 | 1 | 0.8870 | with alpha=1e-2, fast but inaccurate |
| VFM | 4096 | 8 | Y | 10 | 10 | 10 | 0.8628 | more regularization than default |
| VFM | 4096 | 8 | Y | 1 | 1 | 1 | 0.8805 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 10 | 1 | 1 | 0.8805 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1 | 10 | 1 | 0.8793 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1 | 1 | 10 | 0.8623 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 10 | 10 | 10 | 0.8619 | added 300 epochs to 10-10-10 run |
| VFM | 4096 | 8 | Y | 0 | 1 | 10 | 0.8629 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0 | 0 | 10 | 0.8793 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0 | 0 | 1 | 0.8815 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1 | 1 | 50 | 0.8561 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0 | 1 | 50 | 0.8561 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0 | 1 | 100 | 0.8672 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1 | 1 | 100 | 0.8673 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 100 | 100 | 100 | 0.8708 | initialized from 10-10-10 model |
Yamada [1] reports the following root mean squared errors (RMSE) on a 25% test set of the same ML-1M dataset:
| Model | RMSE |
|-------|-------|
| libFM ALS | 0.981 |
| libFM SGD | 0.943 |
| libFM MCMC 0.05 | 0.877 |
| CFM | 0.866 |
| CFM (BCD) | 0.850 |
| libFM MCMC 0.10 | 0.846 |
[1] https://arxiv.org/pdf/1507.01073.pdf
## Discussion
Within the variational FM framework we get more than a good point estimate: we get an estimate of the mean and variance of a single feature. This means we can estimate the variance conditioned on a few active features (e.g. conditioned on a single user) and retrieve the most uncertain item for that user. The idea here is to switch inference from a gradient descent model (which makes point estimates) to variational stochastic gradient descent (which estimates approximate posteriors) to build an active learning model, as in the sketch below.
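A minimal sketch of that exploration step, assuming we've already extracted per-item predictive means and variances for one user from a trained model (the array values below are illustrative, not the repo's API):

```python
import numpy as np

# Hypothetical posterior moments for one user's candidate items:
# mu[i] = E[score of item i], var[i] = Var[score of item i].
mu = np.array([3.9, 4.1, 2.7, 3.3])
var = np.array([0.02, 0.40, 0.15, 0.90])

# Exploit: recommend the best point estimate.
best_item = int(np.argmax(mu))             # -> item 1

# Explore: query the item the model is most uncertain about.
most_uncertain_item = int(np.argmax(var))  # -> item 3
```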
## Regression
For typical linear regression with interactions we have (writing y for the response, beta for the linear coefficients, and w_ij for the interaction weights):
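$$y = \beta_0 + \sum_i \beta_i x_i + \sum_{i < j} w_{ij} \, x_i x_j$$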
Note that x_i is usually a sparse feature vector (but doesn't have to be). In the land of recommenders, we're usually interested in the coefficient w_ij in front of an interaction such as x_i x_j, where x_i might be a dummy-encoded user id and x_j an item id, as in the toy sketch below. The big problem here is that the number of w_ij parameters is quadratic in the number of features (e.g. # of users + # of items), so there are lots of parameters to estimate from sparse observations. (Note: we've also left off any regularization, but we might choose to L2-penalize w_ij or beta_i.)
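For concreteness, here's what that dummy encoding looks like (the sizes and ids are made up):

```python
import numpy as np

# Toy sizes: 3 users and 4 items, dummy-encoded into one vector.
n_users, n_items = 3, 4
user_id, item_id = 1, 2

x = np.zeros(n_users + n_items)
x[user_id] = 1.0             # one-hot user block
x[n_users + item_id] = 1.0   # one-hot item block

# Only one user feature and one item feature are active, so the
# interaction x_i * x_j (and its weight w_ij) fires for exactly
# this (user, item) pair -- one parameter per pair, hence O(N^2).
print(x)  # [0. 1. 0. 0. 0. 1. 0.]
```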
## FMs
FMs fix this with a low-rank approximation to w_ij, saying that w_ij = v_i0 * v_j0 + ... + v_ik * v_jk, where each feature i has a latent rank-k vector v_i. Instead of computing an N x N w_ij matrix, we compute N x k parameters in the form of N v_i vectors, yielding a new objective function:
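$$y = \beta_0 + \sum_i \beta_i x_i + \sum_{i < j} \langle \vec{v}_i, \vec{v}_j \rangle \, x_i x_j$$

As a minimal NumPy sketch (not the repo's Chainer code), the pairwise term can be computed in O(Nk) rather than O(N^2) using the standard algebraic identity:

```python
import numpy as np

def fm_score(x, beta0, beta, V):
    """Rank-k FM prediction for one dense feature vector x.

    V has shape (n_features, k); row i is the latent vector v_i.
    Uses the identity
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * (||V.T @ x||^2 - sum_i ||v_i||^2 * x_i^2).
    """
    vx = V.T @ x                                        # shape (k,)
    pairwise = 0.5 * (vx @ vx - ((V ** 2).T @ (x ** 2)).sum())
    return beta0 + beta @ x + pairwise
```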
## VFMs
In variational FMs we impose a bit more hierarchy by grouping feature vectors, and we swap out L2 regularization for Gaussian priors:
$$\beta_i \sim \mathcal{N}(\mu_\beta, \sigma_\beta)$$

$$\vec{v}_i \sim \mathcal{N}(\vec{\mu}_v, \vec{\sigma}_v)$$
And then we group these (hyper)priors together, assuming a normal prior with unit variance. The vectors v_i are drawn from a multivariate prior with a diagonal covariance matrix. The assumption is that there's a group feature-vector, and individual feature vectors need evidence to deviate from that group vector. The log-normal prior on the variance isn't the disciplined choice (an inverse Wishart, I think?) but it is convenient and amenable to stochastic variational Bayes inference.
$$\mu_\beta \sim \mathcal{N}(0, 1)$$

$$\log\sigma_\beta \sim \mathcal{N}(0, 1)$$
As you can see in the results table, shrinking toward the groups improves the test set score (0.8620 -> 0.8585).
This forms a deep model: the hyperpriors mu_beta and sigma_beta pick the group mean and group variance from which the individual beta_i and v_i are drawn. In variational inference, those beta_i and v_i in turn have their own means and variances, so we're not just point-estimating beta_i but in fact estimating mu_beta_i and sigma_beta_i. If you're curious how this mode of inference works, read this or this for the trick in 140 characters -- it's at the heart of Bayesian deep learning techniques.
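That trick, the reparameterization trick, really does fit in a few lines; a toy sketch with illustrative values:

```python
import numpy as np

# Instead of sampling beta ~ N(mu, sigma) directly, sample
# eps ~ N(0, 1) and set beta = mu + sigma * eps. The sample is
# now a deterministic function of (mu, log_sigma), so gradients
# of the loss flow back to both variational parameters.
mu, log_sigma = 0.3, -1.0
eps = np.random.randn()
beta = mu + np.exp(log_sigma) * eps
```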
## Model uncertainty
With estimates of mu_v_i = E[v_i] and sigma_v_i^2 = Var[v_i] we finally get the critical ingredient for active learning on FMs -- an uncertainty estimate around the feature vector v_i. But we need the uncertainty for the whole model, which is composed of interactions on v_i. For a single interaction of independent terms, the variance is:
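$$\operatorname{Var}[v_i v_j] = \sigma_{v_i}^2 \sigma_{v_j}^2 + \sigma_{v_i}^2 \mu_{v_j}^2 + \sigma_{v_j}^2 \mu_{v_i}^2$$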
Note that the above is just the identity for the variance of a product of two independent random variables. Technically v_i is a vector, but its components are independent, so replace v_i above with an arbitrary component v_ik and sum the per-component variances over the rank:
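$$\operatorname{Var}\!\left[\langle \vec{v}_i, \vec{v}_j \rangle\right] = \sum_k \left( \sigma_{ik}^2 \sigma_{jk}^2 + \sigma_{ik}^2 \mu_{jk}^2 + \sigma_{jk}^2 \mu_{ik}^2 \right)$$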
The variances of the beta components do not covary with the v_i components, so, treating the features x as fixed, the full model variance decomposes into the sum of the individual terms' variances:
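$$\operatorname{Var}[y] = \operatorname{Var}[\beta_0] + \sum_i x_i^2 \operatorname{Var}[\beta_i] + \sum_{i < j} x_i^2 x_j^2 \operatorname{Var}\!\left[\langle \vec{v}_i, \vec{v}_j \rangle\right]$$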