Neural network layer: WX+b vs XW+b - why do theory and implementation use different formulas?