I've been trying to wrap my head around [factor analysis][FA] as a theory for designing and understanding test and survey results.  This has turned out to be [[another|Gumbel-Fisher-Tippett_distributions]] one of those fields where the going has been a bit rough.  I think the key factors in making these older topics difficult are:

* “Everybody knows this, so we don't need to write up the details.”
* “Hey, I can do better than Bob if I just tweak this knob…”

The resulting discussion ends up being overly complicated, and it's hard for newcomers to decide if people using similar terminology are in fact talking about the same thing.

Some of the better open sources for background have been [Tucker and MacCallum's “Exploratory Factor Analysis” manuscript][TM] and [Max Welling's notes][MW].  I'll use Welling's terminology for this discussion.

The basic idea of factor analysis is to model $d$ measurable attributes as generated by $k < d$ common factors and $d$ unique factors.  With $d = 4$ and $k = 2$, you get something like:

[[!img factors.png alt="Relationships between factors and measured attributes" caption="Relationships between factors and measured attributes (adapted from Tucker and MacCallum's Figure 1.2)" ]]

Corresponding to the equation ([Welling's eq. 1][MW-FA]):

\[
  \mathbf{x} = \mathbf{A}\mathbf{y} + \mathbf{\mu} + \mathbf{\nu}
\]

The independent random variables $\mathbf{y}$ are distributed according to a Gaussian with zero mean and unit variance $\mathcal{G}_\mathbf{y}[0,\mathbf{I}]$ (zero mean because constant offsets are handled by $\mathbf{\mu}$; unit variance because scaling is handled by $\mathbf{A}$).  The independent random variables $\mathbf{\nu}$ are distributed according to $\mathcal{G}_\mathbf{\nu}[0,\mathbf{\Sigma}]$, with (Welling's eq. 2):

\[
  \mathbf{\Sigma} \equiv \text{diag}[\sigma_1^2, \ldots, \sigma_d^2]
\]

Because the only source of constant offset is $\mathbf{\mu}$, we can calculate it by averaging out the random noise (Welling's eq. 6):

\[
  \mathbf{\mu} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n
\]

where $N$ is the number of measurements (survey responders) and $\mathbf{x}_n$ is the response vector for the $n^\text{th}$ responder.
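To make the generative model concrete, here's a minimal sketch that draws samples from it.  The dimensions and parameter values are invented for illustration; they don't come from Welling's notes or from the survey discussed below.

    import numpy

    numpy.random.seed(0)
    d, k, N = 4, 2, 10000                        # attributes, common factors, samples
    A = numpy.random.randn(d, k)                 # factor weights (arbitrary example)
    mu = numpy.array([3., 5., 4., 6.])           # constant offsets
    sigma = numpy.array([0.5, 0.2, 0.3, 0.4])    # unique variances (diagonal of Sigma)

    y = numpy.random.randn(N, k)                       # common factors ~ G_y[0, I]
    nu = numpy.random.randn(N, d) * numpy.sqrt(sigma)  # unique factors ~ G_nu[0, Sigma]
    x = y.dot(A.T) + mu + nu                           # Welling's eq. 1, one row per sample

    print(x.mean(axis=0))   # approaches mu as N grows (Welling's eq. 6)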
How do we find $\mathbf{A}$ and $\mathbf{\Sigma}$?  This is the tricky bit, and there are a number of possible approaches.  Welling suggests using expectation maximization (EM), and there's an excellent example of the procedure with a colorblind experimenter drawing colored balls in his [EM notes][EM] (to test my understanding, I wrote [[color-ball.py]]).

To simplify calculations, Welling defines (before eq. 15):

\[
  \begin{aligned}
    \mathbf{A}' &\equiv [\mathbf{A}, \mathbf{\mu}] \\
    \mathbf{y}' &\equiv [\mathbf{y}^T, 1]^T
  \end{aligned}
\]

which reduce the model to

\[
  \mathbf{x} = \mathbf{A}'\mathbf{y}' + \mathbf{\nu}
\]

After some manipulation, Welling works out the maximizing updates (eq'ns 16 and 17):

\[
  \begin{aligned}
    \mathbf{A}'^\text{new}
      &= \left( \sum_{n=1}^N \mathbf{x}_n \mathbf{E}[\mathbf{y}'|\mathbf{x}_n]^T \right)
         \left( \sum_{n=1}^N \mathbf{E}[\mathbf{y}'\mathbf{y}'^T|\mathbf{x}_n] \right)^{-1} \\
    \mathbf{\Sigma}^\text{new}
      &= \frac{1}{N}\sum_{n=1}^N
         \text{diag}[\mathbf{x}_n\mathbf{x}_n^T
           - \mathbf{A}'^\text{new} \mathbf{E}[\mathbf{y}'|\mathbf{x}_n]\mathbf{x}_n^T]
  \end{aligned}
\]

The expectation values used in these updates are given by (Welling's eq'ns 12 and 13):

\[
  \begin{aligned}
    \mathbf{E}[\mathbf{y}|\mathbf{x}_n]
      &= \mathbf{A}^T (\mathbf{A}\mathbf{A}^T + \mathbf{\Sigma})^{-1}
         (\mathbf{x}_n - \mathbf{\mu}) \\
    \mathbf{E}[\mathbf{y}\mathbf{y}^T|\mathbf{x}_n]
      &= \mathbf{I} - \mathbf{A}^T (\mathbf{A}\mathbf{A}^T + \mathbf{\Sigma})^{-1} \mathbf{A}
         + \mathbf{E}[\mathbf{y}|\mathbf{x}_n] \mathbf{E}[\mathbf{y}|\mathbf{x}_n]^T
  \end{aligned}
\]
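To check my reading of those equations, here is a rough numpy translation of a single EM iteration (my own sketch of eq'ns 12, 13, 16, and 17 with the sums written as matrix products over the whole data set; treat it as an illustration, not a reference implementation):

    import numpy

    def em_step(x, A, mu, sigma):
        """One EM iteration for the factor analysis model.

        x     -- (N, d) array of measurements, one row per responder
        A     -- (d, k) current estimate of the factor weights
        mu    -- (d,)   mean from Welling's eq. 6 (x.mean(axis=0))
        sigma -- (d,)   current estimate of the unique variances
        """
        N, d = x.shape
        k = A.shape[1]
        # E-step (eq'ns 12 and 13); the d-by-d inverse is shared by every n
        inv = numpy.linalg.inv(A.dot(A.T) + numpy.diag(sigma))
        Ey = (x - mu).dot(inv).dot(A)          # row n is E[y|x_n]
        cov_y = numpy.eye(k) - A.T.dot(inv).dot(A)
        # augment with the constant component of y' = [y^T, 1]^T
        Ey1 = numpy.hstack([Ey, numpy.ones((N, 1))])
        sum_xy = x.T.dot(Ey1)                  # sum_n x_n E[y'|x_n]^T
        sum_yy = numpy.empty((k + 1, k + 1))   # sum_n E[y' y'^T|x_n]
        sum_yy[:k, :k] = N * cov_y + Ey.T.dot(Ey)
        sum_yy[:k, k] = sum_yy[k, :k] = Ey.sum(axis=0)
        sum_yy[k, k] = N
        # M-step (eq'ns 16 and 17); A1 is the augmented A' = [A, mu]
        A1 = sum_xy.dot(numpy.linalg.inv(sum_yy))
        recon = Ey1.dot(A1.T)                  # row n is A' E[y'|x_n]
        sigma_new = ((x * x).sum(axis=0) - (recon * x).sum(axis=0)) / N
        return A1[:, :k], A1[:, k], sigma_new

Seeding $\mathbf{A}$ with a random matrix and $\mathbf{\Sigma}$ with the diagonal of the score covariance (the same priors MDP uses below) and repeating `em_step` until the parameters stop changing should give the maximum-likelihood fit.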
Survey analysis
===============

Enough abstraction!  Let's look at an example: [survey results][survey]:

    >>> import numpy
    >>> scores = numpy.genfromtxt('Factor_analysis/survey.data', delimiter='\t')
    >>> scores
    array([[ 1.,  3.,  4.,  6.,  7.,  2.,  4.,  5.],
           [ 2.,  3.,  4.,  3.,  4.,  6.,  7.,  6.],
           [ 4.,  5.,  6.,  7.,  7.,  2.,  3.,  4.],
           [ 3.,  4.,  5.,  6.,  7.,  3.,  5.,  4.],
           [ 2.,  5.,  5.,  5.,  6.,  2.,  4.,  5.],
           [ 3.,  4.,  6.,  7.,  7.,  4.,  3.,  5.],
           [ 2.,  3.,  6.,  4.,  5.,  4.,  4.,  4.],
           [ 1.,  3.,  4.,  5.,  6.,  3.,  3.,  4.],
           [ 3.,  3.,  5.,  6.,  6.,  4.,  4.,  3.],
           [ 4.,  4.,  5.,  6.,  7.,  4.,  3.,  4.],
           [ 2.,  3.,  6.,  7.,  5.,  4.,  4.,  4.],
           [ 2.,  3.,  5.,  7.,  6.,  3.,  3.,  3.]])

`scores[i,j]` is the answer the `i`th respondent gave for the `j`th question.  We're looking for underlying factors that can explain covariance between the different questions.  Do the question answers ($\mathbf{x}$) represent some underlying factors ($\mathbf{y}$)?  Let's start off by calculating $\mathbf{\mu}$:

    >>> def print_row(row):
    ...     print(' '.join('{: 0.2f}'.format(x) for x in row))
    >>> mu = scores.mean(axis=0)
    >>> print_row(mu)
     2.42  3.58  5.08  5.75  6.08  3.42  3.92  4.25

Next we need priors for $\mathbf{A}$ and $\mathbf{\Sigma}$.  [[MDP]] has an implementation for [[Python]], and their [FANode][] uses a Gaussian random matrix for $\mathbf{A}$ and the diagonal of the score covariance for $\mathbf{\Sigma}$.  They also use the score covariance to avoid repeated summations over $n$.

    >>> import mdp
    >>> def print_matrix(matrix):
    ...     for row in matrix:
    ...         print_row(row)
    >>> fa = mdp.nodes.FANode(output_dim=3)
    >>> numpy.random.seed(1)  # for consistent doctest results
    >>> responder_scores = fa(scores)  # hidden factors for each responder
    >>> print_matrix(responder_scores)
    -1.92 -0.45  0.00
     0.67  1.97  1.96
     0.70  0.03 -2.00
     0.29  0.03 -0.60
    -1.02  1.79 -1.43
     0.82  0.27 -0.23
    -0.07 -0.08  0.82
    -1.38 -0.27  0.48
     0.79 -1.17  0.50
     1.59 -0.30 -0.41
     0.01 -0.48  0.73
    -0.46 -1.34  0.18
    >>> print_row(fa.mu.flat)
     2.42  3.58  5.08  5.75  6.08  3.42  3.92  4.25
    >>> fa.mu.flat == mu  # MDP agrees with our earlier calculation
    array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
    >>> print_matrix(fa.A)  # factor weights for each question
     0.80 -0.06 -0.45
     0.17  0.30 -0.65
     0.34 -0.13 -0.25
     0.13 -0.73 -0.64
     0.02 -0.32 -0.70
     0.61  0.23  0.86
     0.08  0.63  0.59
    -0.09  0.67  0.13
    >>> print_row(fa.sigma)  # unique noise for each question
     0.04  0.02  0.38  0.55  0.30  0.05  0.48  0.21

Because the covariance is unaffected by the rotation $\mathbf{A}\rightarrow\mathbf{A}\mathbf{R}$, the estimated weights $\mathbf{A}$ and responder scores $\mathbf{y}$ can be quite sensitive to the seed priors.  The width $\mathbf{\Sigma}$ of the unique noise $\mathbf{\nu}$ is more robust, because $\mathbf{\Sigma}$ is unaffected by rotations on $\mathbf{A}$.
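To see that invariance concretely, here's a short continuation of the session above (not part of the original doctest, and assuming `fa.sigma` is the 1-D vector of unique variances printed above; `R` is just an arbitrary orthogonal matrix built for the demonstration):

    >>> model_cov = fa.A.dot(fa.A.T) + numpy.diag(fa.sigma)   # what the fit actually pins down
    >>> R, _ = numpy.linalg.qr(numpy.random.randn(3, 3))      # arbitrary orthogonal rotation
    >>> rotated_A = fa.A.dot(R)
    >>> numpy.allclose(model_cov, rotated_A.dot(rotated_A.T) + numpy.diag(fa.sigma))
    True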
Nomenclature
============

$\mathbf{A}_{ij}$
The element from the $i^\text{th}$ row and $j^\text{th}$ column of a matrix $\mathbf{A}$.  For example, here is a 2-by-3 matrix in terms of its components:

\[
  \mathbf{A} = \begin{pmatrix}
    \mathbf{A}_{11} & \mathbf{A}_{12} & \mathbf{A}_{13} \\
    \mathbf{A}_{21} & \mathbf{A}_{22} & \mathbf{A}_{23}
  \end{pmatrix}
\]
$\mathbf{A}^T$
The transpose of a matrix (or vector) $\mathbf{A}$. $\mathbf{A}_{ij}^T=\mathbf{A}_{ji}$
$\mathbf{A}^{-1}$
The inverse of a matrix $\mathbf{A}$.  $\mathbf{A}^{-1}\mathbf{A}=\mathbf{I}$
$\text{diag}[\mathbf{A}]$
A matrix containing only the diagonal elements of $\mathbf{A}$, with the off-diagonal values set to zero.
$\mathbf{E}[f(\mathbf{x})]$
Expectation value for a function $f$ of a random variable $\mathbf{x}$.  If the probability density of $\mathbf{x}$ is $p(\mathbf{x})$, then $\mathbf{E}[f(\mathbf{x})]=\int d\mathbf{x}\, p(\mathbf{x}) f(\mathbf{x})$.  For example, normalization requires $\mathbf{E}[1]=\int d\mathbf{x}\, p(\mathbf{x})=1$.
$\mathbf{\mu}$
The mean of a random variable $\mathbf{x}$ is given by $\mathbf{\mu}=\mathbf{E}[\mathbf{x}]$.
$\mathbf{\Sigma}$
The covariance of a random variable $\mathbf{x}$ is given by $\mathbf{\Sigma}=\mathbf{E}[(\mathbf{x}-\mathbf{\mu}) (\mathbf{x}-\mathbf{\mu})^\mathbf{T}]$. In the factor analysis model discussed above, $\mathbf{\Sigma}$ is restricted to a diagonal matrix.
$\mathcal{G}_\mathbf{x}[\mu,\mathbf{\Sigma}]$
A Gaussian probability density for the random variables $\mathbf{x}$ with a mean $\mathbf{\mu}$ and a covariance $\mathbf{\Sigma}$, where $D$ is the dimensionality of $\mathbf{x}$ (a short numerical check follows this nomenclature list).

\[
  \mathcal{G}_\mathbf{x}[\mathbf{\mu},\mathbf{\Sigma}]
    = \frac{1}{(2\pi)^{\frac{D}{2}}\sqrt{\det[\mathbf{\Sigma}]}}
      e^{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu})}
\]
$p(\mathbf{y}|\mathbf{x})$
Probability of $\mathbf{y}$ occurring given that $\mathbf{x}$ occurred.  This is commonly used in Bayesian statistics.
$p(\mathbf{x}, \mathbf{y})$
Probability of $\mathbf{y}$ and $\mathbf{x}$ occurring simultaneously (the joint density).  $p(\mathbf{x},\mathbf{y})=p(\mathbf{x}|\mathbf{y})p(\mathbf{y})$
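As a quick numerical sanity check on the Gaussian density defined above, here's a small helper (my own sketch, not part of the doctest session) that evaluates $\mathcal{G}_\mathbf{x}[\mathbf{\mu},\mathbf{\Sigma}]$ at a point:

    import numpy

    def gaussian(x, mu, Sigma):
        """Evaluate the Gaussian density G_x[mu, Sigma] at the point x."""
        D = len(mu)
        diff = x - mu
        norm = (2 * numpy.pi) ** (D / 2.) * numpy.sqrt(numpy.linalg.det(Sigma))
        return numpy.exp(-diff.dot(numpy.linalg.inv(Sigma)).dot(diff) / 2.) / norm

    # one-dimensional standard normal at the origin: 1/sqrt(2*pi) ~ 0.399
    print(gaussian(numpy.zeros(1), numpy.zeros(1), numpy.eye(1)))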
Note: if you have trouble viewing some of the more obscure [Unicode][] used in this post, you might want to install the [STIX fonts][STIX].

[FA]: http://en.wikipedia.org/wiki/Factor_analysis
[TM]: http://www.unc.edu/~rcm/book/factornew.htm
[MW]: http://www.ics.uci.edu/~welling/classnotes/classnotes.html
[MW-FA]: http://www.ics.uci.edu/~welling/classnotes/papers_class/LinMod.ps.gz
[survey]: http://web.archive.org/web/20051125011642/http://www.ncl.ac.uk/iss/statistics/docs/factoranalysis.html
[EM]: http://www.ics.uci.edu/~welling/classnotes/papers_class/EM.ps.gz
[FANode]: https://github.com/mdp-toolkit/mdp-toolkit/blob/master/mdp/nodes/em_nodes.py
[Unicode]: http://en.wikipedia.org/wiki/Unicode
[STIX]: http://www.stixfonts.org/

[[!tag tags/teaching]]
[[!tag tags/theory]]
[[!tag tags/tools]]