posts/Factor_analysis.mdwn_itex

   1 I've been trying to wrap my head around [factor analysis][FA] as a
   2 theory for designing and understanding test and survey results.  This
   3 has turned out to be [[another|Gumbel-Fisher-Tippett_distributions]]
   4 one of those fields where the going has been a bit rough.  I think the
   5 key factors in making these older topics difficult are:
   6
   7 * “Everybody knows this, so we don't need to write up the details.”
   8 * “Hey, I can do better than Bob if I just tweak this knob…”
   9
  10 The resulting discussion ends up being overly complicated, and it's
  11 hard for newcomers to decide if people using similar terminology are
  12 in fact talking about the same thing.
  13
  14 Some of the better open sources for background has been [Tucker and
  15 MacCallum's “Exploratory Factor Analysis” manuscript][TM] and [Max
  16 Welling's notes][MW].  I'll use Welling's terminology for this
  17 discussion.
  18
  19 The basic idea of factor analsys is to model $d$ measurable attributes
  20 as generated by $k &lt; d$ common factors and $d$ unique factors.
  21 With $n = 4$ and $k = 2$, you get something like:
  22
  23 [[!img factors.png
  24  alt="Relationships between factors and measured attributes"
  25  caption="Relationships between factors and measured attributes
  26   (adapted from Tucker and MacCallum's Figure 1.2)"
  27  ]]
  28
  29 Corresponding to the equation ([Welling's eq. 1][MW-FA]):
  30
  31 \[
  32   \mathbf{x} = \mathbf{A}\mathbf{y} + \mathbf{\mu} + \mathbf{\nu}
  33 \]
  34
  35 The independent random variables $\mathbf{y}$ are distributed
  36 according to a Gaussian with zero mean and unit
  37 variance $\mathcal{G}_\mathbf{y}[0,\mathbf{I}]$ (zero mean because
  38 constant offsets are handled by $\mathbf{\mu}$; unit variance becase
  39 scaling is handled by $\mathbf{A}$).  The independent random
  40 variables $\mathbf{\nu}$ are distributed according
  41 to $\mathcal{G}_\mathbf{\nu}[0,\mathbf{\Sigma}]$, with (Welling's
  42 eq. 2):
  43
  44 \[
  45   \mathbf{\Sigma} \equiv \text{diag}[\sigma_1^2, \ldots, \sigma_d^2]
  46 \]
  47
  48 Because the only source of constant offset is $\mathbf{\mu}$, we can
  49 calculate it by averaging out the random noise (Welling's eq. 6):
  50
  51 \[
  52   \mathbf{\mu} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n
  53 \]
  54
  55 where $N$ is the number of measurements (survey responders)
  56 and $\mathbf{x}_n$ is the response vector for the $n^\text{th}$
  57 responder.
  58
  59 How do we find $\mathbf{A}$ and $\mathbf{\Sigma}$?  This is the tricky
  60 bit, and there are a number of possible approaches.  Welling suggests
  61 using expectation maximization (EM), and there's an excellent example
  62 of the procedure with a colorblind experimenter drawing colored balls
  63 in his [EM notes][EM] (to test my understanding, I wrote
  64 [[color-ball.py]]).
  65
  66 To simplify calculations, Welling defines (before eq. 15):
  67
  68 \[
  69 \begin{aligned}
  70   \mathbf{A}' &\equiv [\mathbf{A}, \mathbf{\mu}] \\
  71   \mathbf{y}' &\equiv [\mathbf{y}^T, 1]^T
  72 \end{aligned}
  73 \]
  74
  75 which reduce the model to
  76
  77 \[
  78   \mathbf{x} = \mathbf{A}'\mathbf{y}' + \mathbf{\nu}
  79 \]
  80
  81 After some manipulation Welling works out the maximizing updates
  82 (eq'ns 16 and 17):
  83
  84 \[
  85 \begin{aligned}
  86   \mathbf{A}'^\text{new}
  87     &= \left( \sum_{n=1}^N \mathbf{x}_n
  88                 \mathbf{E}[\mathbf{y}'|\mathbf{x}_n]^T \right)
  89        \left( \sum_{n=1}^N \mathbf{x}_n
  90                 \mathbf{E}[\mathbf{y}'\mathbf{y}'^T|\mathbf{x}_n]
  91          \right)^{-1} \\
  92   \mathbf{\Sigma}^\text{new}
  93     &= \frac{1}{N}\sum_{n=1}^N
  94          \text{diag}[\mathbf{x}_n\mathbf{x}_n^T -
  95             \mathbf{A}'^\text{new}
  96             \mathbf{E}[\mathbf{y}'|\mathbf{x}_n]\mathbf{x}_n^T]
  97 \end{aligned}
  98 \]
  99
 100 The expectation values used in these updates are given by (Welling's
 101 eq'ns 12 and 13):
 102
 103 \[
 104 \begin{aligned}
 105   \mathbf{E}[\mathbf{y}|\mathbf{x}_n]
 106     &= \mathbf{A}^T (\mathbf{A}\mathbf{A}^T + \mathbf{\Sigma})^{-1}
 107        (\mathbf{x}_n - \mathbf{\mu}) \\
 108   \mathbf{E}[\mathbf{y}\mathbf{y}^T|\mathbf{x}_n]
 109     &= \mathbf{I} -
 110        \mathbf{A}^T (\mathbf{A}\mathbf{A}^T + \mathbf{\Sigma})^{-1} \mathbf{A} +
 111        \mathbf{E}[\mathbf{y}|\mathbf{x}_n] \mathbf{E}[\mathbf{y}|\mathbf{x}_n]^T
 112 \end{aligned}
 113 \]
 114
 115 Survey analysis
 116 ===============
 117
 118 Enough abstraction!  Let's look at an example: [survey
 119 results][survey]:
 120
 121     >>> import numpy
 122     >>> scores = numpy.genfromtxt('Factor_analysis/survey.data', delimiter='\t')
 123     >>> scores
 124     array([[ 1.,  3.,  4.,  6.,  7.,  2.,  4.,  5.],
 125            [ 2.,  3.,  4.,  3.,  4.,  6.,  7.,  6.],
 126            [ 4.,  5.,  6.,  7.,  7.,  2.,  3.,  4.],
 127            [ 3.,  4.,  5.,  6.,  7.,  3.,  5.,  4.],
 128            [ 2.,  5.,  5.,  5.,  6.,  2.,  4.,  5.],
 129            [ 3.,  4.,  6.,  7.,  7.,  4.,  3.,  5.],
 130            [ 2.,  3.,  6.,  4.,  5.,  4.,  4.,  4.],
 131            [ 1.,  3.,  4.,  5.,  6.,  3.,  3.,  4.],
 132            [ 3.,  3.,  5.,  6.,  6.,  4.,  4.,  3.],
 133            [ 4.,  4.,  5.,  6.,  7.,  4.,  3.,  4.],
 134            [ 2.,  3.,  6.,  7.,  5.,  4.,  4.,  4.],
 135            [ 2.,  3.,  5.,  7.,  6.,  3.,  3.,  3.]])
 136
 137 `scores[i,j]` is the answer the `i`th respondent gave for the `j`th
 138 question.  We're looking for underlying factors that can explain
 139 covariance between the different questions.  Do the question answers
 140 ($\mathbf{x}$) represent some underlying factors ($\mathbf{y}$)?
 141 Let's start off by calculating $\mathbf{\mu}$:
 142
 143     >>> def print_row(row):
 144     ...     print('  '.join('{: 0.2f}'.format(x) for x in row))
 145     >>> mu = scores.mean(axis=0)
 146     >>> print_row(mu)
 147      2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25
 148
 149 Next we need priors for $\mathbf{A}$ and $\mathbf{\Sigma}$.  [[MDP]]
 150 has an implementation for [[Python]], and their [FANode][] uses a
 151 Gaussian random matrix for $\mathbf{A}$ and the diagonal of the score
 152 covariance for $\mathbf{\Sigma}$.  They also use the score covariance
 153 to avoid repeated summations over $n$.
 154
 155     >>> import mdp
 156     >>> def print_matrix(matrix):
 157     ...     for row in matrix:
 158     ...         print_row(row)
 159     >>> fa = mdp.nodes.FANode(output_dim=3)
 160     >>> numpy.random.seed(1)  # for consistend doctest results
 161     >>> responder_scores = fa(scores)   # hidden factors for each responder
 162     >>> print_matrix(responder_scores)
 163     -1.92  -0.45   0.00
 164      0.67   1.97   1.96
 165      0.70   0.03  -2.00
 166      0.29   0.03  -0.60
 167     -1.02   1.79  -1.43
 168      0.82   0.27  -0.23
 169     -0.07  -0.08   0.82
 170     -1.38  -0.27   0.48
 171      0.79  -1.17   0.50
 172      1.59  -0.30  -0.41
 173      0.01  -0.48   0.73
 174     -0.46  -1.34   0.18
 175     >>> print_row(fa.mu.flat)
 176      2.42   3.58   5.08   5.75   6.08   3.42   3.92   4.25
 177     >>> fa.mu.flat == mu  # MDP agrees with our earlier calculation
 178     array([ True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)
 179     >>> print_matrix(fa.A)  # factor weights for each question
 180      0.80  -0.06  -0.45
 181      0.17   0.30  -0.65
 182      0.34  -0.13  -0.25
 183      0.13  -0.73  -0.64
 184      0.02  -0.32  -0.70
 185      0.61   0.23   0.86
 186      0.08   0.63   0.59
 187     -0.09   0.67   0.13
 188     >>> print_row(fa.sigma)  # unique noise for each question
 189      0.04   0.02   0.38   0.55   0.30   0.05   0.48   0.21
 190
 191 Because the covariance is unaffected by the
 192 rotation $\mathbf{A}\rightarrow\mathbf{A}\mathbf{R}$, the estimated
 193 weights $\mathbf{A}$ and responder scores $\mathbf{y}$ can be quite
 194 sensitive to the seed priors.  The width $\mathbf{\Sigma}$ of the
 195 unique noise $\mathbf{\nu}$ is more robust, because $\mathbf{\Sigma}$
 196 is unaffected by rotations on $\mathbf{A}$.
 197
 198 Nomenclature
 199 ============
 200
 201 <dl>
 202   <dt id="subsripts">$\mathbf{A}_{ij}$</dt>
 203     <dd>The element from the $i^\text{th}$ row and $j^\text{th}$
 204     column of a matrix $\mathbf{A}$.  For example here is a 2-by-3
 205     matrix terms of components:
 206
 207 \[
 208   \mathbf{A} = \begin{pmatrix}
 209       \mathbf{A}_{11} & \mathbf{A}_{12} & \mathbf{A}_{13} \\
 210       \mathbf{A}_{21} & \mathbf{A}_{22} & \mathbf{A}_{23}
 211     \end{pmatrix}
 212 \]
 213
 214     </dd>
 215   <dt id="transpose">$\mathbf{A}^T$</dt>
 216     <dd>The transpose of a matrix (or vector) $\mathbf{A}$.
 217     $\mathbf{A}_{ij}^T=\mathbf{A}_{ji}$</dd>
 218   <dt id="inverse">$\mathbf{A}^{-1}$</dt>
 219     <dd>The inverse of a matrix $\mathbf{A}$.
 220     $\mathbf{A}^{-1}\dot\mathbf{A}=1$</dd>
 221   <dt id="diag">$\text{diag}[\mathbf{A}]$</dt>
 222     <dd>A matrix containing only the diagonal elements of
 223     $\mathbf{A}$, with the off-diagonal values set to zero.</dd>
 224   <dt id="expect">$\mathbf{E}[f(\mathbf{x})]$</dt>
 225     <dd>Expectation value for a function $f$ of a random variable
 226     $\mathbf{x}$.  If the probability density of $\mathbf{x}$ is
 227     $p(\mathbf{x})$, then $\mathbf{E}[f(\mathbf{x})]=\int d\mathbf{x}
 228     p(\mathbf{x}) f(\mathbf{x})$.  For example,
 229     $\mathbf{E}[p(\mathbf{x})]=1$.</dd>
 230   <dt id="mean">$\mathbf{\mu}$</dt>
 231     <dd>The mean of a random variable $\mathbf{x}$ is given by
 232     $\mathbf{\mu}=\mathbf{E}[\mathbf{x}]$.</dd>
 233   <dt id="variance">$\mathbf{\Sigma}$</dt>
 234     <dd>The covariance of a random variable $\mathbf{x}$ is given by
 235     $\mathbf{\Sigma}=\mathbf{E}[(\mathbf{x}-\mathbf{\mu})
 236     (\mathbf{x}-\mathbf{\mu})^\mathbf{T}]$.  In the factor analysis
 237     model discussed above, $\mathbf{\Sigma}$ is restricted to a
 238     diagonal matrix.</dd>
 239   <dt id="gaussian">$\mathcal{G}_\mathbf{x}[\mu,\mathbf{\Sigma}]$</dd>
 240     <dd>A Gaussian probability density for the random variables
 241     $\mathbf{x}$ with a mean $\mathbf{\mu}$ and a covariance
 242     $\mathbf{\Sigma}$.
 243
 244 \[
 245   \mathcal{G}_\mathbf{x}[\mathbf{\mu},\mathbf{\Sigma}]
 246     = \frac{1}{(2\pi)^{\frac{D}{2}}\sqrt{\det[\mathbf{\Sigma}]}}
 247       e^{-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T
 248                      \mathbf{\Sigma}^{-1}
 249                      (\mathbf{x}-\mathbf{\mu})}
 250 \]
 251
 252     </dd>
 253   <dt id="probability-of-y-given-x">$p(\mathbf{y}|\mathbf{x})$</dt>
 254     <dd>Probability of $\mathbf{y}$ occurring given that $\mathbf{x}$
 255     occured.  This is commonly used in <a
 256     href="http://en.wikipedia.org/wiki/Bayes%27_theorem">Bayesian
 257     statistics</a>.</dd>
 258   <dt id="probability-of-x-and-y">$p(\mathbf{x}, \mathbf{y})$</dt>
 259     <dd>Probability of $\mathbf{y}$ and $\mathbf{x}$ occuring
 260     simultaneously (the joint density).
 261     $p(\mathbf{x},\mathbf{y})=p(\mathbf{x}|\mathbf{y})p(\mathbf{y})$</dd>
 262 </dl>
 263
 264 Note: if you have trouble viewing some of the more obscure [Unicode][]
 265 used in this post, you might want to install the [STIX fonts][STIX].
 266
 267 [FA]: http://en.wikipedia.org/wiki/Factor_analysis
 268 [TM]: http://www.unc.edu/~rcm/book/factornew.htm
 269 [MW]: http://www.ics.uci.edu/~welling/classnotes/classnotes.html
 270 [MW-FA]: http://www.ics.uci.edu/~welling/classnotes/papers_class/LinMod.ps.gz
 271 [survey]: http://web.archive.org/web/20051125011642/http://www.ncl.ac.uk/iss/statistics/docs/factoranalysis.html
 272 [EM]: http://www.ics.uci.edu/~welling/classnotes/papers_class/EM.ps.gz
 273 [FANode]: https://github.com/mdp-toolkit/mdp-toolkit/blob/master/mdp/nodes/em_nodes.py
 274 [Unicode]: http://en.wikipedia.org/wiki/Unicode
 275 [STIX]: http://www.stixfonts.org/
 276
 277 [[!tag tags/teaching]]
 278 [[!tag tags/theory]]
 279 [[!tag tags/tools]]