  
{{ :cs401r_w2016:lab8_pdtm.png?direct&500 |}}

{{ :cs401r_w2016:gibbs_sampler_results.png?direct&500|}}

Here, you can see how documents that are strongly correlated with Topic #3 appear every six months; these are the sustainings of church officers and statistical reports.
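
If you would like to produce a similar visualization, a heat map of the per-document topic matrix is easy to generate with matplotlib.  The following is only a minimal sketch, not the official helper code; it assumes a ''K x D'' array of per-document topic proportions called ''pis'' (the random matrix is fabricated just so the snippet runs on its own).

<code python>
import numpy as np
import matplotlib.pyplot as plt

K, D = 10, 50
# stand-in for your sampled per-document topic mixtures (K x D);
# replace this with the pis array from your sampler
pis = np.random.dirichlet(np.ones(K), size=D).T

plt.figure(figsize=(10, 4))
plt.imshow(pis, aspect='auto', interpolation='nearest', cmap='hot')
plt.xlabel('document')
plt.ylabel('topic')
plt.colorbar(label='topic proportion')
plt.title('distribution of topics over documents')
plt.savefig('pdtm_heatmap.png')
</code>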

Your notebook must also produce a plot of the log posterior of the data over time, as your sampler progresses.  You should produce a single plot comparing the regular Gibbs sampler and the collapsed Gibbs sampler.

To the right is an example of my log posterior plots.
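
A single comparison plot is easy to produce if you record the log posterior once per sweep for each sampler.  Here is a minimal sketch; the names ''lps_gibbs'' and ''lps_collapsed'' are just placeholders for whatever lists you accumulate during sampling (random data is used only so the snippet runs).

<code python>
import numpy as np
import matplotlib.pyplot as plt

# replace these with the log posterior values you record at each sweep
lps_gibbs = np.cumsum(np.random.rand(100))
lps_collapsed = np.cumsum(np.random.rand(100))

plt.figure()
plt.plot(lps_gibbs, label='Gibbs sampler')
plt.plot(lps_collapsed, label='collapsed Gibbs sampler')
plt.xlabel('sweep')
plt.ylabel('log posterior')
plt.legend(loc='lower right')
plt.title('log posterior over time')
plt.savefig('log_posterior_comparison.png')
</code>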
  
----
  * 40% Correct implementation of Gibbs sampler
  * 40% Correct implementation of collapsed Gibbs sampler
  * 20% Final plots are tidy and legible (at least two plots: the log posterior over time for both samplers, and a heat map of the distribution of topics over documents)
  
----
For this lab, you will code two different inference algorithms on the Latent Dirichlet Allocation (LDA) model.
  
You will use [[https://www.dropbox.com/s/yr3n9w61ifon04h/files.tar.gz?dl=0|a dataset of general conference talks]].  Download and untar these files; there is helper code in the ''Hints'' section to help you process them.
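
The helper code in the ''Hints'' section shows the intended way to process the files, but as a rough sketch (and assuming the tarball simply expands into a directory of plain-text talk files, which you should verify), loading the documents could look something like this:

<code python>
import glob
import tarfile

# assumes files.tar.gz is in the working directory; the directory layout
# and file extension below are guesses -- adjust to match the real data
with tarfile.open('files.tar.gz', 'r:gz') as tar:
    tar.extractall('talks')

docs = []
for fn in sorted(glob.glob('talks/**/*.txt', recursive=True)):
    with open(fn, encoding='utf-8', errors='ignore') as f:
        docs.append(f.read())

print('loaded %d documents' % len(docs))
</code>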
  
**Part 1: Basic Gibbs Sampler**
  
The LDA model consists of several variables: the per-word topic assignments, the per-document topic mixtures, and the overall topic distributions.  Your Gibbs sampler should alternate between sampling each variable, conditioned on all of the others.  Equations for this were given in the lecture slides.
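
To make the structure concrete, here is a rough sketch of what one sweep might look like.  It is not the official solution, and the data layout is assumed: ''docs[d]'' is a list of word ids, ''qs[d]'' a list of topic assignments of the same length, ''bs'' a ''V x K'' array of topic-word distributions, ''pis'' a ''K x D'' array of per-document mixtures, and ''alpha'' and ''beta'' symmetric Dirichlet hyperparameters.

<code python>
import numpy as np

def gibbs_sweep(docs, qs, bs, pis, alpha, beta):
    V, K = bs.shape
    D = len(docs)

    # resample per-word topic assignments qs
    for d in range(D):
        for n, w in enumerate(docs[d]):
            probs = pis[:, d] * bs[w, :]      # p(topic) * p(word | topic)
            probs = probs / probs.sum()
            qs[d][n] = np.random.choice(K, p=probs)

    # resample per-document topic mixtures pis
    for d in range(D):
        counts = np.bincount(qs[d], minlength=K)
        pis[:, d] = np.random.dirichlet(alpha + counts)

    # resample topic-word distributions bs
    wcounts = np.zeros((V, K))
    for d in range(D):
        for n, w in enumerate(docs[d]):
            wcounts[w, qs[d][n]] += 1
    for k in range(K):
        bs[:, k] = np.random.dirichlet(beta + wcounts[:, k])
</code>

The exact conditional distributions should come from the lecture slides; the sketch above only shows the overall alternation.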
  
//Hint: it may be helpful to define a data structure, perhaps called ''civk'', that contains counts of how many words are assigned to which topics in which documents.  My ''civk'' data structure was simply a 3-dimensional array.//
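
For example, one way (among many) to build such a count array from the current assignments, using the same assumed layout as in the sketch above:

<code python>
import numpy as np

def build_civk(docs, qs, D, V, K):
    # civk[i, v, k] = number of times word v in document i is currently
    # assigned to topic k
    civk = np.zeros((D, V, K), dtype=int)
    for i in range(D):
        for w, k in zip(docs[i], qs[i]):
            civk[i, w, k] += 1
    return civk
</code>

Summing over the word axis gives per-document topic counts, and summing over the document axis gives per-topic word counts.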
  
You should run 100 sweeps through all of the variables.
  
Note: this can be quite slow.  As a result, you are welcome to run your code off-line and simply include the final images / visualizations in your notebook (along with your code, of course!).
  
**Part 2: Collapsed Gibbs Sampler**
  
The collapsed Gibbs sampler integrates out the topic distributions and the per-document topic mixtures.  The only variables left, therefore, are the per-word topic assignments.  A collapsed Gibbs sampler must therefore iterate over every word in every document and resample its topic assignment.
  
Again, computing and maintaining a data structure that tracks counts will be very helpful.  This will again be somewhat slow.
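
As a sketch of the inner loop (assuming symmetric priors ''alpha'' and ''beta'' and a set of count arrays that you could derive from a ''civk''-style structure; the function and argument names here are only illustrative, and the exact conditional you implement should match the lecture slides):

<code python>
import numpy as np

def resample_word(d, w, old_k, ndk, nkv, nk, alpha, beta, V):
    # ndk[d, k]: topic counts per document, nkv[k, v]: word counts per topic,
    # nk[k]: total words per topic; all counts include the current word on
    # entry and are updated in place
    ndk[d, old_k] -= 1
    nkv[old_k, w] -= 1
    nk[old_k] -= 1

    # collapsed conditional: p(z = k) is proportional to
    # (nkv[k, w] + beta) / (nk[k] + V*beta) * (ndk[d, k] + alpha)
    probs = (nkv[:, w] + beta) / (nk + V * beta) * (ndk[d, :] + alpha)
    probs = probs / probs.sum()
    new_k = np.random.choice(len(nk), p=probs)

    ndk[d, new_k] += 1
    nkv[new_k, w] += 1
    nk[new_k] += 1
    return new_k
</code>

One sweep then calls this for every word in every document; after your 100 sweeps, point estimates of the topics and mixtures can be recovered from the counts.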
  
----
  
# topic distributions
bs = np.zeros((V,K)) + (1/V)
# how should this be initialized?
  
# per-document-topic distributions
pis = np.zeros((K,D)) + (1/K)
# how should this be initialized?
  
for iters in range(0,100):
    p = compute_data_likelihood( docs_i, qs, bs, pis )
    print("Iter %d, p=%.2f" % (iters,p))
  
    # resample per-word topic assignments qs
  
    # resample per-document topic mixtures pis
  
    # resample topics bs
  
  
</code>