I've been trying to wrap my head around [factor analysis][FA] as a theory for designing and understanding test and survey results.  This has turned out to be [[another|Gumbel-Fisher-Tippett_distributions]] one of those fields where the going has been a bit rough.  I think the key factors in making these older topics difficult are:

* “Everybody knows this, so we don't need to write up the details.”
* “Hey, I can do better than Bob if I just tweak this knob…”

The resulting discussion ends up being overly complicated, and it's hard for newcomers to decide if people using similar terminology are in fact talking about the same thing.

Some of the better open sources for background have been [Tucker and MacCallum's “Exploratory Factor Analysis” manuscript][TM] and [Max Welling's notes][MW].  I'll use Welling's terminology for this discussion.

The basic idea of factor analysis is to model $d$ measurable attributes as generated by $k < d$ common factors and $d$ unique factors.  With $d = 4$ and $k = 2$, you get something like:

[[!img factors.png alt="Relationships between factors and measured attributes" caption="Relationships between factors and measured attributes (adapted from Tucker and MacCallum's Figure 1.2)" ]]

Corresponding to the equation ([Welling's eq. 1][MW-FA]):

\[
  \mathbf{x} = \mathbf{A}\mathbf{y} + \mathbf{\mu} + \mathbf{\nu}
\]

The independent random variables $\mathbf{y}$ are distributed according to a Gaussian with zero mean and unit variance $\mathcal{G}_\mathbf{y}[0,\mathbf{I}]$ (zero mean because constant offsets are handled by $\mathbf{\mu}$; unit variance because scaling is handled by $\mathbf{A}$).  The independent random variables $\mathbf{\nu}$ are distributed according to $\mathcal{G}_\mathbf{\nu}[0,\mathbf{\Sigma}]$, with (Welling's eq. 2):

\[
  \mathbf{\Sigma} \equiv \text{diag}[\sigma_1^2, \ldots, \sigma_d^2]
\]

Because the only source of constant offset is $\mathbf{\mu}$, we can calculate it by averaging out the random noise (Welling's eq. 6):

\[
  \mathbf{\mu} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n
\]

where $N$ is the number of measurements (survey responders) and $\mathbf{x}_n$ is the response vector for the $n^\text{th}$ responder.
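To make the generative model concrete, here's a minimal sketch that draws samples from it.  The dimensions and parameter values are invented for illustration; they don't come from Welling's notes or from the survey discussed below.

    import numpy

    numpy.random.seed(0)
    d, k, N = 4, 2, 10000                        # attributes, common factors, samples
    A = numpy.random.randn(d, k)                 # factor weights (arbitrary example)
    mu = numpy.array([3., 5., 4., 6.])           # constant offsets
    sigma = numpy.array([0.5, 0.2, 0.3, 0.4])    # unique variances (diagonal of Sigma)

    y = numpy.random.randn(N, k)                       # common factors ~ G_y[0, I]
    nu = numpy.random.randn(N, d) * numpy.sqrt(sigma)  # unique factors ~ G_nu[0, Sigma]
    x = y.dot(A.T) + mu + nu                           # Welling's eq. 1, one row per sample

    print(x.mean(axis=0))   # approaches mu as N grows (Welling's eq. 6)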
How do we find $\mathbf{A}$ and $\mathbf{\Sigma}$?  This is the tricky bit, and there are a number of possible approaches.  Welling suggests using expectation maximization (EM), and there's an excellent example of the procedure with a colorblind experimenter drawing colored balls in his [EM notes][EM] (to test my understanding, I wrote [[color-ball.py]]).

To simplify calculations, Welling defines (before eq. 15):

\[
  \begin{aligned}
    \mathbf{A}' &\equiv [\mathbf{A}, \mathbf{\mu}] \\
    \mathbf{y}' &\equiv [\mathbf{y}^T, 1]^T
  \end{aligned}
\]

which reduce the model to

\[
  \mathbf{x} = \mathbf{A}'\mathbf{y}' + \mathbf{\nu}
\]

After some manipulation, Welling works out the maximizing updates (eq'ns 16 and 17):

\[
  \begin{aligned}
    \mathbf{A}'^\text{new}
      &= \left( \sum_{n=1}^N \mathbf{x}_n \mathbf{E}[\mathbf{y}'|\mathbf{x}_n]^T \right)
         \left( \sum_{n=1}^N \mathbf{E}[\mathbf{y}'\mathbf{y}'^T|\mathbf{x}_n] \right)^{-1} \\
    \mathbf{\Sigma}^\text{new}
      &= \frac{1}{N}\sum_{n=1}^N
         \text{diag}[\mathbf{x}_n\mathbf{x}_n^T
           - \mathbf{A}'^\text{new} \mathbf{E}[\mathbf{y}'|\mathbf{x}_n]\mathbf{x}_n^T]
  \end{aligned}
\]

The expectation values used in these updates are given by (Welling's eq'ns 12 and 13):

\[
  \begin{aligned}
    \mathbf{E}[\mathbf{y}|\mathbf{x}_n]
      &= \mathbf{A}^T (\mathbf{A}\mathbf{A}^T + \mathbf{\Sigma})^{-1}
         (\mathbf{x}_n - \mathbf{\mu}) \\
    \mathbf{E}[\mathbf{y}\mathbf{y}^T|\mathbf{x}_n]
      &= \mathbf{I} - \mathbf{A}^T (\mathbf{A}\mathbf{A}^T + \mathbf{\Sigma})^{-1} \mathbf{A}
         + \mathbf{E}[\mathbf{y}|\mathbf{x}_n] \mathbf{E}[\mathbf{y}|\mathbf{x}_n]^T
  \end{aligned}
\]
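To check my reading of those equations, here is a rough numpy translation of a single EM iteration (my own sketch of eq'ns 12, 13, 16, and 17 with the sums written as matrix products over the whole data set; treat it as an illustration, not a reference implementation):

    import numpy

    def em_step(x, A, mu, sigma):
        """One EM iteration for the factor analysis model.

        x     -- (N, d) array of measurements, one row per responder
        A     -- (d, k) current estimate of the factor weights
        mu    -- (d,)   mean from Welling's eq. 6 (x.mean(axis=0))
        sigma -- (d,)   current estimate of the unique variances
        """
        N, d = x.shape
        k = A.shape[1]
        # E-step (eq'ns 12 and 13); the d-by-d inverse is shared by every n
        inv = numpy.linalg.inv(A.dot(A.T) + numpy.diag(sigma))
        Ey = (x - mu).dot(inv).dot(A)          # row n is E[y|x_n]
        cov_y = numpy.eye(k) - A.T.dot(inv).dot(A)
        # augment with the constant component of y' = [y^T, 1]^T
        Ey1 = numpy.hstack([Ey, numpy.ones((N, 1))])
        sum_xy = x.T.dot(Ey1)                  # sum_n x_n E[y'|x_n]^T
        sum_yy = numpy.empty((k + 1, k + 1))   # sum_n E[y' y'^T|x_n]
        sum_yy[:k, :k] = N * cov_y + Ey.T.dot(Ey)
        sum_yy[:k, k] = sum_yy[k, :k] = Ey.sum(axis=0)
        sum_yy[k, k] = N
        # M-step (eq'ns 16 and 17); A1 is the augmented A' = [A, mu]
        A1 = sum_xy.dot(numpy.linalg.inv(sum_yy))
        recon = Ey1.dot(A1.T)                  # row n is A' E[y'|x_n]
        sigma_new = ((x * x).sum(axis=0) - (recon * x).sum(axis=0)) / N
        return A1[:, :k], A1[:, k], sigma_new

Seeding $\mathbf{A}$ with a random matrix and $\mathbf{\Sigma}$ with the diagonal of the score covariance (the same priors MDP uses below) and repeating `em_step` until the parameters stop changing should give the maximum-likelihood fit.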
Survey analysis
===============

Enough abstraction!  Let's look at an example: [survey results][survey]:

    >>> import numpy
    >>> scores = numpy.genfromtxt('Factor_analysis/survey.data', delimiter='\t')
    >>> scores
    array([[ 1.,  3.,  4.,  6.,  7.,  2.,  4.,  5.],
           [ 2.,  3.,  4.,  3.,  4.,  6.,  7.,  6.],
           [ 4.,  5.,  6.,  7.,  7.,  2.,  3.,  4.],
           [ 3.,  4.,  5.,  6.,  7.,  3.,  5.,  4.],
           [ 2.,  5.,  5.,  5.,  6.,  2.,  4.,  5.],
           [ 3.,  4.,  6.,  7.,  7.,  4.,  3.,  5.],
           [ 2.,  3.,  6.,  4.,  5.,  4.,  4.,  4.],
           [ 1.,  3.,  4.,  5.,  6.,  3.,  3.,  4.],
           [ 3.,  3.,  5.,  6.,  6.,  4.,  4.,  3.],
           [ 4.,  4.,  5.,  6.,  7.,  4.,  3.,  4.],
           [ 2.,  3.,  6.,  7.,  5.,  4.,  4.,  4.],
           [ 2.,  3.,  5.,  7.,  6.,  3.,  3.,  3.]])

`scores[i,j]` is the answer the `i`th respondent gave for the `j`th question.  We're looking for underlying factors that can explain covariance between the different questions.  Do the question answers ($\mathbf{x}$) represent some underlying factors ($\mathbf{y}$)?  Let's start off by calculating $\mathbf{\mu}$:

    >>> def print_row(row):
    ...     print(' '.join('{: 0.2f}'.format(x) for x in row))
    >>> mu = scores.mean(axis=0)
    >>> print_row(mu)
     2.42  3.58  5.08  5.75  6.08  3.42  3.92  4.25

Next we need priors for $\mathbf{A}$ and $\mathbf{\Sigma}$.  [[MDP]] has an implementation for [[Python]], and their [FANode][] uses a Gaussian random matrix for $\mathbf{A}$ and the diagonal of the score covariance for $\mathbf{\Sigma}$.  They also use the score covariance to avoid repeated summations over $n$.

    >>> import mdp
    >>> def print_matrix(matrix):
    ...     for row in matrix:
    ...         print_row(row)
    >>> fa = mdp.nodes.FANode(output_dim=3)
    >>> numpy.random.seed(1)  # for consistent doctest results
    >>> responder_scores = fa(scores)  # hidden factors for each responder
    >>> print_matrix(responder_scores)
    -1.92 -0.45  0.00
     0.67  1.97  1.96
     0.70  0.03 -2.00
     0.29  0.03 -0.60
    -1.02  1.79 -1.43
     0.82  0.27 -0.23
    -0.07 -0.08  0.82
    -1.38 -0.27  0.48
     0.79 -1.17  0.50
     1.59 -0.30 -0.41
     0.01 -0.48  0.73
    -0.46 -1.34  0.18
    >>> print_row(fa.mu.flat)
     2.42  3.58  5.08  5.75  6.08  3.42  3.92  4.25
    >>> fa.mu.flat == mu  # MDP agrees with our earlier calculation
    array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
    >>> print_matrix(fa.A)  # factor weights for each question
     0.80 -0.06 -0.45
     0.17  0.30 -0.65
     0.34 -0.13 -0.25
     0.13 -0.73 -0.64
     0.02 -0.32 -0.70
     0.61  0.23  0.86
     0.08  0.63  0.59
    -0.09  0.67  0.13
    >>> print_row(fa.sigma)  # unique noise for each question
     0.04  0.02  0.38  0.55  0.30  0.05  0.48  0.21

Because the covariance is unaffected by the rotation $\mathbf{A}\rightarrow\mathbf{A}\mathbf{R}$, the estimated weights $\mathbf{A}$ and responder scores $\mathbf{y}$ can be quite sensitive to the seed priors.  The width $\mathbf{\Sigma}$ of the unique noise $\mathbf{\nu}$ is more robust, because $\mathbf{\Sigma}$ is unaffected by rotations on $\mathbf{A}$.
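To see that invariance concretely, here's a short continuation of the session above (not part of the original doctest, and assuming `fa.sigma` is the 1-D vector of unique variances printed above; `R` is just an arbitrary orthogonal matrix built for the demonstration):

    >>> model_cov = fa.A.dot(fa.A.T) + numpy.diag(fa.sigma)   # what the fit actually pins down
    >>> R, _ = numpy.linalg.qr(numpy.random.randn(3, 3))      # arbitrary orthogonal rotation
    >>> rotated_A = fa.A.dot(R)
    >>> numpy.allclose(model_cov, rotated_A.dot(rotated_A.T) + numpy.diag(fa.sigma))
    True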
Nomenclature
============

$\mathbf{A}_{ij}$
The element from the $i^\text{th}$ row and $j^\text{th}$ column of a matrix $\mathbf{A}$.  For example, here is a 2-by-3 matrix in terms of its components:

\[
  \mathbf{A} = \begin{pmatrix}
    \mathbf{A}_{11} & \mathbf{A}_{12} & \mathbf{A}_{13} \\
    \mathbf{A}_{21} & \mathbf{A}_{22} & \mathbf{A}_{23}
  \end{pmatrix}
\]
$\mathbf{A}^T$
The transpose of a matrix (or vector) $\mathbf{A}$. $\mathbf{A}_{ij}^T=\mathbf{A}_{ji}$
$\mathbf{A}^{-1}$
The inverse of a matrix $\mathbf{A}$.  $\mathbf{A}^{-1}\mathbf{A}=\mathbf{I}$
$\text{diag}[\mathbf{A}]$
A matrix containing only the diagonal elements of $\mathbf{A}$, with the off-diagonal values set to zero.
$\mathbf{E}[f(\mathbf{x})]$
Expectation value for a function $f$ of a random variable $\mathbf{x}$.  If the probability density of $\mathbf{x}$ is $p(\mathbf{x})$, then $\mathbf{E}[f(\mathbf{x})]=\int d\mathbf{x}\, p(\mathbf{x}) f(\mathbf{x})$.  For example, normalization requires $\mathbf{E}[1]=\int d\mathbf{x}\, p(\mathbf{x})=1$.
$\mathbf{\mu}$
The mean of a random variable $\mathbf{x}$ is given by $\mathbf{\mu}=\mathbf{E}[\mathbf{x}]$.
$\mathbf{\Sigma}$
The covariance of a random variable $\mathbf{x}$ is given by $\mathbf{\Sigma}=\mathbf{E}[(\mathbf{x}-\mathbf{\mu}) (\mathbf{x}-\mathbf{\mu})^\mathbf{T}]$. In the factor analysis model discussed above, $\mathbf{\Sigma}$ is restricted to a diagonal matrix.
$\mathcal{G}_\mathbf{x}[\mu,\mathbf{\Sigma}]$
A Gaussian probability density for the random variables $\mathbf{x}$ with a mean $\mathbf{\mu}$ and a covariance $\mathbf{\Sigma}$, where $D$ is the dimensionality of $\mathbf{x}$ (a short numerical check follows this nomenclature list).

\[
  \mathcal{G}_\mathbf{x}[\mathbf{\mu},\mathbf{\Sigma}]
    = \frac{1}{(2\pi)^{\frac{D}{2}}\sqrt{\det[\mathbf{\Sigma}]}}
      e^{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu})}
\]
$p(\mathbf{y}|\mathbf{x})$
Probability of $\mathbf{y}$ occurring given that $\mathbf{x}$ occurred.  This is commonly used in Bayesian statistics.
$p(\mathbf{x}, \mathbf{y})$
Probability of $\mathbf{y}$ and $\mathbf{x}$ occurring simultaneously (the joint density).  $p(\mathbf{x},\mathbf{y})=p(\mathbf{x}|\mathbf{y})p(\mathbf{y})$
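As a quick numerical sanity check on the Gaussian density defined above, here's a small helper (my own sketch, not part of the doctest session) that evaluates $\mathcal{G}_\mathbf{x}[\mathbf{\mu},\mathbf{\Sigma}]$ at a point:

    import numpy

    def gaussian(x, mu, Sigma):
        """Evaluate the Gaussian density G_x[mu, Sigma] at the point x."""
        D = len(mu)
        diff = x - mu
        norm = (2 * numpy.pi) ** (D / 2.) * numpy.sqrt(numpy.linalg.det(Sigma))
        return numpy.exp(-diff.dot(numpy.linalg.inv(Sigma)).dot(diff) / 2.) / norm

    # one-dimensional standard normal at the origin: 1/sqrt(2*pi) ~ 0.399
    print(gaussian(numpy.zeros(1), numpy.zeros(1), numpy.eye(1)))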
Note: if you have trouble viewing some of the more obscure [Unicode][] used in this post, you might want to install the [STIX fonts][STIX].

[FA]: http://en.wikipedia.org/wiki/Factor_analysis
[TM]: http://www.unc.edu/~rcm/book/factornew.htm
[MW]: http://www.ics.uci.edu/~welling/classnotes/classnotes.html
[MW-FA]: http://www.ics.uci.edu/~welling/classnotes/papers_class/LinMod.ps.gz
[survey]: http://web.archive.org/web/20051125011642/http://www.ncl.ac.uk/iss/statistics/docs/factoranalysis.html
[EM]: http://www.ics.uci.edu/~welling/classnotes/papers_class/EM.ps.gz
[FANode]: https://github.com/mdp-toolkit/mdp-toolkit/blob/master/mdp/nodes/em_nodes.py
[Unicode]: http://en.wikipedia.org/wiki/Unicode
[STIX]: http://www.stixfonts.org/

[[!tag tags/teaching]]
[[!tag tags/theory]]
[[!tag tags/tools]]