The labels used for both the final outputs and the aux_outputs are the same. Wouldn’t the label be something else if the network had exited at the auxiliary point? Wouldn’t the additional layers between the aux output and the actual output cause any change to the values of the label? I am concerned that the loss function, which is meant for aux_output, would produce higher error values because of that.
Auxiliary outputs were used in the 2014 Inception v1 (GoogLeNet) paper, which was quite a deep network back then. It suffered from the problem of vanishing gradients (and ResNet wasn’t published until the end of 2015). So they attached auxiliary losses at a few intermediate layers so that sufficient gradient signal would reach the initial layers too.
That’s the whole point: to boost the information flow wherever the signal attenuates. The auxiliary classifiers are trained against the same ground-truth labels as the final classifier; their loss is simply down-weighted and added to the main loss during training (and the auxiliary heads are discarded at inference time).
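A minimal NumPy sketch of how the losses are combined. The 0.3 weight on the auxiliary loss is the value used in the GoogLeNet paper; the logits here are made-up numbers purely for illustration:

```python
import numpy as np

def cross_entropy(logits, label):
    # numerically stable softmax cross-entropy for one example
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

# hypothetical logits from the final head and one auxiliary head
main_logits = np.array([2.0, 0.5, -1.0])
aux_logits = np.array([1.2, 0.8, -0.5])

label = 0  # the SAME ground-truth label is used for both heads

main_loss = cross_entropy(main_logits, label)
aux_loss = cross_entropy(aux_logits, label)

# the auxiliary loss is down-weighted (0.3 in the GoogLeNet paper),
# so an earlier, less accurate head cannot dominate training
total_loss = main_loss + 0.3 * aux_loss
```

So a higher error from the auxiliary head is expected (its features are less refined), but the weighting keeps it from overwhelming the main objective while still injecting gradient into the early layers.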