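Cross-entropy is the average negative per-token log probability that the model assigns to the validation set: $H(T) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{1:i-1})$, where $N$ is the number of tokens in $T$ and $w_{1:i-1}$ is the context preceding token $w_i$. Lower values mean the model assigns higher probability to the data. For the binary case, the per-example loss is $\ell_{CE}(y, \hat y) = -[\, y \log \hat y + (1 - y) \log (1 - \hat y) \,]$, where $y \in \{0, 1\}$ is the true label and $\hat y$ is the predicted probability that $y = 1$.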
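
As a minimal sketch of both formulas, the Python below computes the per-token cross-entropy in bits from a list of probabilities the model assigned to each observed token, and the binary loss for a single example. The function names, the `eps` clamp (added only to avoid `log(0)`), and the example probabilities are illustrative assumptions, not part of the definition.

```python
import math


def cross_entropy_bits(token_probs):
    """Average negative log2 probability per token.

    token_probs[i] is assumed to be P(w_i | w_1..w_{i-1}), the probability
    the model gave to the token that actually occurred at position i.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one example, using the natural log.

    y is the true label (0 or 1); y_hat is the predicted P(y = 1).
    eps is a hypothetical safeguard that keeps y_hat away from 0 and 1.
    """
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


# Three tokens predicted with probabilities 1/2, 1/4, 1/8:
# the log2 values are -1, -2, -3, so the average is 2 bits per token.
print(cross_entropy_bits([0.5, 0.25, 0.125]))  # 2.0

# A confident correct prediction costs little; a confident wrong one costs a lot.
print(binary_cross_entropy(1, 0.9))
print(binary_cross_entropy(1, 0.1))
```

Note that the language-model formula uses $\log_2$ (bits), while the binary loss is conventionally computed with the natural log (nats); the two differ only by a constant factor of $\ln 2$.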