講解:CSCI 567 Python、Python、Python、Python

CSCI 567, Fall’18Haipeng LuoHomework #4 Programming AssignmentDue: 11:59 pm, November 4, 2018General instructionsYour repository will have now a directory “P4/”. Please do not change the name of this repository or thenames of any files we have added to it. Please perform. a git pull to retrieve these files. You will findwithin it: baboon.tiff, data loader.py, gmm.py, gmm.sh, gmmTest.py, kmeans.py, kmeans.sh,kmeansTest.py, utils.py. Depending on your environment you may need to install the python librarynamed, ”pillow”, which is used by matplotlib to process some of the images needed for this assignment.You do not need to import pillow in the files we provide, it only needs to be installed in your environment.You can install it by running ’pip install pillow’ in your command line.High Level Description0.1 Tasks In this assignment you are asked to implement K-means clustering to identify main clusters inthe data, use the discovered centroid of cluster for classification, and implement Gaussian Mixture Modelsto learn a generative model of the data. Specifically, you willImplement K-means clustering algorithm to identify clusters in a two-dimensional toy-dataset.Implement image compression using K-means clustering algorithm.Implement classification using the centroids identified by clustering.Implement Gaussian Mixture Models to learn a generative model and generate samples from a mix-ture distribution.0.2 Running the code We have provided two scripts to run the code. Run kmeans.sh after you fin-ish implementation of k-means clustering, classification and compression. Run gmm.sh after you finishimplementing Gaussian Mixture Models.0.3 Dataset Through out the assignment we will use two datasets (See fig. 1) — Toy Dataset and DigitsDataset (you do not need to download).Toy Dataset is a two-dimensional dataset generated from 4 Gaussian distributions. We will use thisdataset to visualize the results of our algorithm in two dimensions.We will use digits dataset from sklearn [1] to test K-means based classifier and generate digits usingGaussian Mixture model. Each data point is a 8 8 image of a digit. This is similar to MNIST but lesscomplex.1(a) Digits (b) 2-D Toy DatasetFigure 1: Datasets0.4 Cautions Please DO NOT import packages that are not listed in the provided code. Follow the in-structions in each section strictly to code up your solutions. Do not change the output format. Do notmodify the code unless we instruct you to do so. A homework solution that does not match the providedsetup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make surethat your code runs with the provided commands and scripts on the VM. Finally, make sure that you gitadd, commit, and push all the required files, including your code and generated output files.0.5 Final submission After you have solved problem 1 and 2, execute bash kmeans.sh command andbash gmm.sh command. Git add, commit and push plots and results folder and all the *.py files.Problem 1 K-means ClusteringRecall that for a dataset x1, . . . , xN 2RD, the K-means distortion objective is2: Outputs:Array of size K D of means, fmkgKk=1Membership vector R of size N, where R[i]2[K] is the index of the cluster thatexample i belongs to.3: Initialize:Set meansfmkgKk=1 to be K points selected from x uniformly at random (withreplacement), and J to be a large number (e.g. 1010)4: repeat5: Compute membership rik using eq. 26: Compute distortion measure Jnew using eq. 17: ifjJ Jnewj e thenSTOP8: end if9: Set J := Jnew10: Compute means mk using eq. 311: until maximum iteration is reached1.1 Implementing K-means clustering algorithm Implement Algorithm 1 by filling out the TODO partsin class KMeans of file kmeans.py. Note the following:Use numpy.random.choice for the initialization step.If at some iteration, there exists a cluster k with no points assigned to it, then do not update thecentroid of this cluster for this round.While assigning a sample to a cluster, if there’s a tie (i.e. the sample is equidistant from two centroids),you should choose the one with smaller index (like what numpy.argmin does).After you complete the implementation, execute bash kmeans.sh command to run k-means on toydataset. You should be able to see three images generated in plots folder. In particular, you can seetoy dataset predicted labels.png and toy dataset real labels.png and compare the clus-ters identified by the algorithm against the real clusters. Your implementation should be able to recover thecorrect clusters sufficiently well. Representative images are shown in fig. 2. Red dots are cluster centroids.Note that color coding of recovered clusters may not match that of correct clusters. This is due to mis-matchin ordering of retrieved clusters and correct clusters (which is fine).3(a) Predicted Clusters (b) Real ClustersFigure 2: Clustering on toy dataset1.2 Image compression with K-means In the next part, we will look at lossy image compression as anapplication of clustering. The idea is simply to treat each pixel of an image as a point xi, then pe代寫CSCI 567 Python實(shí)驗(yàn)、Python程序代做、代做Python程序、Python語言代做留學(xué)生rformK-means algorithm to cluster these points, and finally replace each pixel with its centroid.What you need to implement is to compress an image with K centroids given. Specifically, complete thefunction transform. image in the file kmeansTest.py.After your implementation, execute bash kmeans.sh again and you should be able to see an imagebaboon compressed.png in the plots folder. You can see that this image is distorted as compared tothe original baboon.tiff.1.3 Classification with k-means Another application of clustering is to obtain a faster version of the near-est neighbor algorithm. Recall that nearest neighbor evaluates the distance of a test sample from everytraining point to predict its class, which can be very slow. Instead, we can compress the entire trainingset to just the K centroids, where each centroid is now labeled as the majority class of the correspondingcluster. After this compression the prediction time of nearest neighbor is reduced from O(N) to just O(K)(see Algorithm 2 for the pseudocode).Algorithm 2 Classification with K-means clustering1: Inputs:Training Data :fX,YgParameters for running K-means clustering2: Training:Run K-means clustering to find centroids and membership (reuse your code from Problem 1.1)Label each centroid with majority voting from its members. i.e. arg maxc i rikIfyi = cg3: Prediction:Predict the same label as the nearest centroid (that is, 1-NN on centroids).Note: 1) break ties in the same way as in previous problems; 2) if some centroid doesn’t contain anypoint, set the label of this centroid as 0.Complete the fit and predict function in KMeansClassifier in file kmeans.py. Once completed,run kmeans.sh to evaluate the classifier on a test set. For comparison, the script. will also print accuracy ofa logistic classifier and a nearest neighbor classifier. (Note: a naive K-means classifier may not do well butit can be an effective unsupervised method in a classification pipeline [2].)4Problem 2 Gaussian Mixture ModelNext you will implement Gaussian Mixture Model (GMM) for clustering and also generate data after learn-ing the model. Recall the key steps of EM algorithm for learning GMMs on Slide 52 of Lec 8 (we change thenotation wk to pk):2.1 Implementing EM Implement EM algorithm (class Gaussian pdf, function fit and function compute log likelihood) in file gmm.py to estimate mixture model parameters. Please note the following:For K-means initialization, the inputs of K-means are the same as those of EM’s.When computing the density of a Gaussian with covariance matrix S, use S0 = S+ 10 3I when S isnot invertible (in case it’s still not invertible, keep adding 10 3I until it is invertible).After implementation execute bash gmm.sh command to estimate mixture parameters for toy dataset.You should see a Gaussian fitted to each cluster in the data. A representative image is shown in fig. 3.We evaluate both initialization methods and you should observe that initialization with K-means usuallyconverges faster.Figure 3: Gaussian Mixture model on toy datasetAlgorithm 3 EM algorithm for estimating GMM parameters1: Inputs:An array of size N D denoting the training set, xMaximum number of iterations, max iterNumber of clusters, KError tolerance, eInit method — K-means or random2: Outputs:Array of size K D of means, fmkgKk=1Variance matrix Sk of size K D DA vector of size K denoting the mixture weights, pi k3: Initialize:For “random” init method: initialize means uniformly at random from [0,1)for each dimension (use numpy.random.rand), initialize variance to beidentity matrix for each component, initialize mixture weight to be uniform.For “K-means” init method: run K-means, initialize means as the centroidsfound by K-means, and initialize variance and mixture weight according toEq. (7) and Eq. (8) where gik is the binary membership found by K-means.4: Compute the log-likelihood l using Eq. (9)5: repeat6: E Step: Compute responsibilities using Eq. (4)7: M Step:Estimate means using Eq. (6)Estimate variance using Eq. (7)Estimate mixture weight using Eq. (8)8: Compute new log-likelihood lnew9: ifjl lnewj e thenSTOP10: end if11: Set l := lnew12: until maximum iteration is reached2.2 Implementing sampling We also fit a GMM with K = 30 using the digits dataset. An advantage ofGMM compared to K-means is that we can sample from the learned distribution to generate new syntheticexamples which look similar to the actual data.To do this, implement sample function in gmm.py which uses self.means, self.variances andself.pi k to generate digits. Recall that sampling from a GMM is a two step process: 1) first sample acomponent k according to the mixture weight; 2) then sample from a Gaussian distribution with mean mkand variance Sk. Use numpy.random for these sampling steps.After implementation, execute bash gmm.sh again. This should produce visualization of means mkand some generated samples for the learned GMM. Representative images are shown in fig. 4.6(a) Means of GMM learnt on digits (b) Random digits sample generated from GMM轉(zhuǎn)自:http://ass.3daixie.com/2018110120083459.html

?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

  • 愛你三千遍。 我不是一個(gè)熱衷的漫威迷,沒那么多的情懷。 影片中打動(dòng)人的片段,不少,其實(shí)笑點(diǎn)也不少。 人生有億萬種可...
    東寶你造么閱讀 566評(píng)論 0 5
  • 看著老人堅(jiān)實(shí)的背影,為老人感動(dòng),他本可以選擇在家安逸度日,享受兒子的孝心,在沙發(fā)和床頭度過余身,但是他選擇了...
    ed2b0c62bc34閱讀 296評(píng)論 0 0
  • A+B問題描述:給出兩個(gè)整數(shù) aa 和 bb , 求他們的和。 算法思路 在十進(jìn)制的加法中,例如 6+7,個(gè)位為3...
    愛吃饅頭的二餅閱讀 1,210評(píng)論 1 0
  • 我屋外陽臺(tái)上種了一些花草,其中我最喜歡的是那叢茂盛的風(fēng)雨蘭。風(fēng)雨蘭的葉子長長的,柔柔的,看起來很像韭菜的長葉,但是...
    醉清風(fēng)_于叢洋閱讀 278評(píng)論 2 9

友情鏈接更多精彩內(nèi)容