The goal of PCA is to find a lower-dimensional subspace and project the cloud of data points onto it without losing the information embedded in the original, higher-dimensional data. In other words, we are looking for the vectors that maximize the variance of the data.
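To make this concrete, here is a minimal sketch of PCA on a toy point cloud: center the data, take the eigenvectors of the covariance matrix as the variance-maximizing directions, and project onto the top components. The array shapes and the choice of NumPy are illustrative assumptions, not part of the original write-up.

```python
import numpy as np

# Toy example: project 3-D points onto the top-2 principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # cloud of data points (n_samples, n_features)

X_centered = X - X.mean(axis=0)        # PCA assumes zero-mean data
cov = np.cov(X_centered, rowvar=False) # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov) # symmetric matrix -> real eigenvalues/vectors

order = np.argsort(eigvals)[::-1]      # sort components by explained variance
components = eigvecs[:, order[:2]]     # directions that maximize the variance

X_projected = X_centered @ components  # (200, 2) points in the subspace
print(X_projected.shape)
```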
To train the model, the loss was calculated as the mean squared error between the labeled key points and the predicted key points. The Adam optimizer was found to provide the best results. It was also found that the lowest loss was achieved by stepping through increasing batch sizes and decreasing learning rates. Starting with a learning rate of 0.001, the model was trained for 15 epochs at batch sizes of 32, 64, and 128; this was repeated for learning rates of 0.0001 and 0.00001. The reasoning is that with a smaller batch size the gradient descent step is more stochastic (more random), since the gradient is averaged over fewer examples. As the optimization approaches a minimum, the parameter steps should represent a more general solution, which is obtained by averaging the gradient over a larger batch size. A sketch of this training sweep is shown below.
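The sketch below shows one way this sweep could look, assuming a PyTorch model (the framework is not stated in the original); `KeypointNet` and `train_dataset` are hypothetical placeholders for the actual model and dataset.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train(model, dataset, lr, batch_size, epochs=15, device="cpu"):
    """Train the key-point model with MSE loss and the Adam optimizer."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    criterion = nn.MSELoss()                          # loss between labeled and predicted key points
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.to(device)

    for epoch in range(epochs):
        running_loss = 0.0
        for images, keypoints in loader:
            images, keypoints = images.to(device), keypoints.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), keypoints)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"lr={lr} batch={batch_size} epoch={epoch + 1} "
              f"loss={running_loss / len(loader):.4f}")

# Sweep: each learning rate is tried with progressively larger batch sizes.
# KeypointNet and train_dataset are assumed names, not from the original text.
# for lr in (1e-3, 1e-4, 1e-5):
#     for batch_size in (32, 64, 128):
#         train(KeypointNet(), train_dataset, lr, batch_size)
```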