Expectation maximization (EM) is a probabilistic method for [[clustering]] that assumes the observed data points were generated by a mixture of probabilistic models.
In the first step (the expectation, or E, step), calculate the conditional probability that each element $x_i$ came from probabilistic model $\Theta_j$:
$P(\Theta_j | x_i, \Theta) = \frac{P(x_i | \Theta_j)}{\sum_{l=1}^k P(x_i | \Theta_l)}$
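As a minimal sketch, assuming Gaussian components whose parameters are stored in arrays `means` and `sigmas` (names chosen here for illustration), this step computes, for a single point `x`:

```python
import numpy as np

def e_step(x, means, sigmas):
    # Gaussian density of the point under each of the k components
    coef = 1.0 / (np.sqrt(2 * np.pi) * sigmas)
    likelihoods = coef * np.exp(-((x - means) ** 2) / (2 * sigmas ** 2))
    # normalize so the responsibilities for x sum to one
    return likelihoods / likelihoods.sum()
```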
Next, in the maximization (M) step, update the mean of model $\Theta_j$:
$\mu_j = \sum_{i=1}^n x_i \frac{P(\Theta_j | x_i, \Theta)}{\sum_{l=1}^n P(\Theta_j | x_l, \Theta)}$
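Vectorized with NumPy, and assuming `resp` holds the responsibilities $P(\Theta_j | x_i, \Theta)$ of model $\Theta_j$ for every point in `data` (both names are illustrative), this is a single weighted average:

```python
import numpy as np

def weighted_mean(data, resp):
    # responsibility-weighted average of the points
    return np.sum(resp * data) / np.sum(resp)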
And the standard deviation of model $\Theta_j$:
$\sigma_j = \sqrt{\frac{\sum_{i=1}^n P(\Theta_j | x_i, \Theta)(x_i - \mu_j)^2}{\sum_{i=1}^n P(\Theta_j | x_i, \Theta)}}$
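Using the same illustrative `data` and `resp` arrays, this update is a responsibility-weighted variance followed by a square root:

```python
import numpy as np

def weighted_std(data, resp, mu):
    # responsibility-weighted variance around the updated mean mu
    var = np.sum(resp * (data - mu) ** 2) / np.sum(resp)
    return np.sqrt(var)
```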
A basic implementation in [[Python]] is provided below.
```python
import math

import numpy as np


def em_clustering(centroids, dataset):
    # Input:
    #   1. centroids - A list of [mean, standard deviation] pairs, each drawn
    #      from a Gaussian distribution.
    #   2. dataset - A list of randomly picked points.
    # Output:
    #   1. new_centroids - The updated centroids (mean and std values after
    #      one EM step).

    # helper function for the normal (Gaussian) probability density
    def normal_pdf(x, mu, sigma):
        coef = 1 / (math.sqrt(2 * math.pi) * sigma)
        exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
        return coef * math.exp(exponent)

    dataset = np.array(dataset)
    K = len(centroids)
    N = len(dataset)

    # E-step: calculate the expectation (responsibility) of each
    # component for every point
    expectations = np.zeros((N, K))
    for i, x in enumerate(dataset):
        probs = np.array([normal_pdf(x, mu, sigma) for mu, sigma in centroids])
        expectations[i] = probs / np.sum(probs)

    # M-step: update each centroid with responsibility-weighted statistics
    new_centroids = []
    for k in range(K):
        resp = expectations[:, k]
        weight_sum = np.sum(resp)
        new_mu = np.sum(resp * dataset) / weight_sum
        new_var = np.sum(resp * (dataset - new_mu) ** 2) / weight_sum
        new_sigma = math.sqrt(new_var)
        new_centroids.append([new_mu, new_sigma])
    return new_centroids
```
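For example, one EM step on a small made-up dataset (the initial guesses and points below are arbitrary):

```python
centroids = [[0.0, 1.0], [5.0, 1.0]]        # initial [mean, std] guesses
dataset = [-0.5, 0.2, 0.1, 4.8, 5.3, 5.1]   # sample 1-D points
print(em_clustering(centroids, dataset))    # centroids after one EM step
```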