Expectation maximization (EM) is a probabilistic method for [[clustering]] that assumes the observed data points were generated by a mixture of probabilistic models.
In the first step (the expectation, or E, step), calculate the conditional probability that each element $x_i$ came from probabilistic model $\Theta_j$:
$P(\Theta_j | x_i, \Theta) = \frac{P(x_i | \Theta_j)}{\sum_{l=1}^k P(x_i | \Theta_l)}$
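As a minimal sketch, assuming Gaussian components whose parameters are stored in arrays `means` and `sigmas` (names chosen here for illustration), this step computes, for a single point `x`:

```python
import numpy as np

def e_step(x, means, sigmas):
    # Gaussian density of the point under each of the k components
    coef = 1.0 / (np.sqrt(2 * np.pi) * sigmas)
    likelihoods = coef * np.exp(-((x - means) ** 2) / (2 * sigmas ** 2))
    # normalize so the responsibilities for x sum to one
    return likelihoods / likelihoods.sum()
```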
Next, in the maximization (M) step, update the mean of model $\Theta_j$:
$\mu_j = \sum_{i=1}^n x_i \frac{P(\Theta_j | x_i, \Theta)}{\sum_{l=1}^n P(\Theta_j | x_l, \Theta)}$
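Vectorized with NumPy, and assuming `resp` holds the responsibilities $P(\Theta_j | x_i, \Theta)$ of model $\Theta_j$ for every point in `data` (both names are illustrative), this is a single weighted average:

```python
import numpy as np

def weighted_mean(data, resp):
    # responsibility-weighted average of the points
    return np.sum(resp * data) / np.sum(resp)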
And the standard deviation of model $\Theta_j$:
$\sigma_j = \sqrt{\frac{\sum_{i=1}^n P(\Theta_j | x_i, \Theta)(x_i - \mu_j)^2}{\sum_{i=1}^n P(\Theta_j | x_i, \Theta)}}$
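Using the same illustrative `data` and `resp` arrays, this update is a responsibility-weighted variance followed by a square root:

```python
import numpy as np

def weighted_std(data, resp, mu):
    # responsibility-weighted variance around the updated mean mu
    var = np.sum(resp * (data - mu) ** 2) / np.sum(resp)
    return np.sqrt(var)
```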
A basic implementation in [[Python]] is provided below.
```python
import math

import numpy as np


def em_clustering(centroids, dataset):
    # Input:
    #   1. centroids - A list of [mean, standard deviation] pairs, each drawn
    #      from a Gaussian distribution.
    #   2. dataset - A list of randomly picked points.
    # Output:
    #   1. new_centroids - The updated centroids (mean and std values after
    #      one EM step).

    # helper function for the normal (Gaussian) probability density
    def normal_pdf(x, mu, sigma):
        coef = 1 / (math.sqrt(2 * math.pi) * sigma)
        exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
        return coef * math.exp(exponent)

    dataset = np.array(dataset)
    K = len(centroids)
    N = len(dataset)

    # E-step: calculate the expectation (responsibility) of each
    # component for every point
    expectations = np.zeros((N, K))
    for i, x in enumerate(dataset):
        probs = np.array([normal_pdf(x, mu, sigma) for mu, sigma in centroids])
        expectations[i] = probs / np.sum(probs)

    # M-step: update each centroid with responsibility-weighted statistics
    new_centroids = []
    for k in range(K):
        resp = expectations[:, k]
        weight_sum = np.sum(resp)
        new_mu = np.sum(resp * dataset) / weight_sum
        new_var = np.sum(resp * (dataset - new_mu) ** 2) / weight_sum
        new_sigma = math.sqrt(new_var)
        new_centroids.append([new_mu, new_sigma])
    return new_centroids
```
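For example, one EM step on a small made-up dataset (the initial guesses and points below are arbitrary):

```python
centroids = [[0.0, 1.0], [5.0, 1.0]]        # initial [mean, std] guesses
dataset = [-0.5, 0.2, 0.1, 4.8, 5.3, 5.1]   # sample 1-D points
print(em_clustering(centroids, dataset))    # centroids after one EM step
```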