The statement that layer normalization "normalizes the input across the features" was hard for me to understand at first. Here's what made it click for me, straight from the abstract of the Layer Normalization paper:

"In this paper, we transpose batch normalization into layer normalization by **computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case**. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times."

https://arxiv.org/pdf/1607.06450.pdf
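
To make the "single training case" part concrete, here is a minimal NumPy sketch of how I read that abstract. It is my own illustration, not code from the paper: the function names `layer_norm` and `batch_norm_train`, the `(batch, features)` layout, and the small `eps` added for numerical stability are all assumptions on my part.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # x has shape (batch, features): one row per training case,
    # one column per neuron's summed input in the layer.
    # Statistics are taken over the feature axis, separately for each row.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # Per-neuron gain and bias are applied after the normalization.
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def batch_norm_train(x, gain, bias, eps=1e-5):
    # For contrast: training-time batch norm takes statistics over the
    # batch axis, separately for each feature (no running averages here).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # 4 training cases, 8 neurons
gain = np.ones(8)   # adaptive gain, one per neuron
bias = np.zeros(8)  # adaptive bias, one per neuron

y = layer_norm(x, gain, bias)
row_means = y.mean(axis=-1)  # one value per training case, each close to 0
row_stds = y.std(axis=-1)    # one value per training case, each close to 1

# Normalizing one training case alone gives the same result as normalizing
# it inside the batch, because no other case enters its statistics.
y_single = layer_norm(x[:1], gain, bias)
same_as_in_batch = np.allclose(y_single, y[:1])
```

Since each row only uses its own mean and variance, the output for a training case does not depend on what else is in the batch. That is why, as the abstract says, the computation is the same at training and test time, whereas the batch-norm version above would give a different answer for a batch of one.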