I’ve been studying the Inception V3 network, and I read that one can take the 1×1×2048 output of the last pooling layer (ignoring the output layer) and treat it as a 2048-dimensional feature vector. I don’t understand how it is 2048-dimensional. Shouldn’t it be 3-dimensional, since I read that the dimension of x is len(x.shape)? I’m also confused because by that definition an image is 3-dimensional, yet I read somewhere that an image is 5-dimensional, as in (x, y, r, g, b). Please clear up the confusion.

Inception v3 is a CNN architecture that, in its standard form, has two outputs: a primary output and an auxiliary output. Could you share more details about your implementation so the problem is easier to understand?
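
In the meantime, the confusion seems to come from "dimension" being used in two senses: the number of axes of an array (what len(x.shape) returns) and the number of values in a feature vector. Here is a minimal sketch of the difference, using plain tuples in place of array shapes. The helper names `rank` and `num_elements` are made up for illustration, and the 299×299×3 image shape is an assumption about the input size, not something taken from your setup.

```python
import math


def rank(shape):
    """Number of axes of an array with this shape, i.e. len(x.shape)."""
    return len(shape)


def num_elements(shape):
    """Total count of values stored in an array with this shape."""
    return math.prod(shape)


pool_shape = (1, 1, 2048)    # pooling output quoted in the question
image_shape = (299, 299, 3)  # assumed input: height, width, RGB channels

# Three axes, but only 2048 values in total.
print(rank(pool_shape), num_elements(pool_shape))

# Also three axes, with one value per pixel per channel.
print(rank(image_shape), num_elements(image_shape))

# Dropping the two size-1 axes leaves a flat vector with 2048 entries,
# which is the sense in which the features are "2048-dimensional".
flat_shape = tuple(n for n in pool_shape if n != 1)
print(flat_shape)
```

So the pooling output has three axes but holds a single vector of 2048 values. The (x, y, r, g, b) description appears to count something else again: it treats each pixel as one point described by five numbers, which is a different convention from the array's number of axes.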