I didn't completely understand the question, but let me share my understanding of the convolution operation:

What is the 2D convolution operation?

The two dimensions refer to the movement of the filter over the image (or over the layers of a previous output) in the X and Y directions.
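To make the X/Y movement concrete, here is a minimal pure-Python sketch of a single-channel 2D convolution. The function name `conv2d_single`, the unpadded ("valid") output and the stride of 1 are my own assumptions for illustration. Like most CNN libraries, it does not flip the kernel.

```python
def conv2d_single(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1).

    Computes cross-correlation (no kernel flip), as CNN layers usually do.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for y in range(img_h - k_h + 1):        # movement along Y
        row = []
        for x in range(img_w - k_w + 1):    # movement along X
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[y + i][x + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_single(image, kernel)  # a 3 x 3 output map
```

The filter only ever moves in two directions here, which is why the operation is called 2D.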

What is the dimension of the 3D filter/kernel?

The width and height are decided by the user, say 5X5, 3X3, 2X2 or 1X1, but the depth is decided by the number of input channels. When we consider an RGB image there are 3 channels, hence the filter depth will be 3, making it a 5X5X3, 3X3X3, 2X2X3 or 1X1X3 filter.

Let's assume we are using a 5X5X3 filter (a 3D filter) on an RGB image (3 layers/channels). When this filter does a 2D convolution it generates 1 output channel. If we use 10 such filters we generate 10 output channels/layers. If I now want to do 2D convolutions on this set of layers, I have to use a filter of height * width * 10, e.g. 5X5X10 or 3X3X10.
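The channel counting above can be sketched in pure Python. The names `conv2d_multichannel` and `conv2d_layer`, the 8X8 image size and the all-ones values are assumptions chosen only to keep the example small. The point is that a filter whose depth equals the number of input channels still moves only in X and Y and returns one 2D map.

```python
def conv2d_multichannel(image, filt):
    """image[y][x][c] and filt[i][j][c] -> one 2D output map (no padding, stride 1)."""
    img_h, img_w, channels = len(image), len(image[0]), len(image[0][0])
    k_h, k_w, k_d = len(filt), len(filt[0]), len(filt[0][0])
    assert k_d == channels, "filter depth must equal the number of input channels"
    out = []
    for y in range(img_h - k_h + 1):
        row = []
        for x in range(img_w - k_w + 1):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    for c in range(channels):   # sum over depth, no movement in Z
                        total += image[y + i][x + j][c] * filt[i][j][c]
            row.append(total)
        out.append(row)
    return out


def conv2d_layer(image, filters):
    """One output map per filter (returned as a list of maps)."""
    return [conv2d_multichannel(image, f) for f in filters]


# Assumed toy data: an 8X8 RGB image and ten 5X5X3 filters, all filled with ones.
image = [[[1] * 3 for _ in range(8)] for _ in range(8)]
filters = [[[[1] * 3 for _ in range(5)] for _ in range(5)] for _ in range(10)]
maps = conv2d_layer(image, filters)  # 10 maps, each 4 x 4
```

Ten filters give ten output maps, so a filter in the next layer would need a depth of 10.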

What is 3D convolution?

In 3D convolution the filter has to move in the X, Y and Z directions. It is commonly used in medical applications, where people study stacked images. Say we have 3 stacked images: for example's sake, the first is a normal photograph, below that is an infrared image, and at the bottom is an X-ray. In such a situation we perform a 3D convolution. To keep it simple, assume that all 3 stacked images have RGB channels, so our input dimension is (img_height * img_width * 9).

Now we use a kernel/filter of size 5X5X3. This filter moves in all three directions: X, Y and Z. Logically, it first moves in X and Y and performs a 2D convolution on the photograph, then moves deeper and performs another 2D convolution on the infrared image, and finally one on the X-ray. This assumes the filter steps 3 channels at a time along Z (a stride of 3), so that each position covers exactly one image. The 5X5X3 filter therefore generates 3 output channels/layers, which is (channels of input / depth of filter) = 9/3 = 3.

Note the difference: a 2D convolution using a single filter gives one output channel, whereas a 3D convolution using a single filter gives multiple output channels (one per position along Z). In the above case, if we use 10 such 3D filters we produce an output of 30 channels.
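The 9/3 = 3 count can be checked with the usual output-size formula for an unpadded convolution, (input - kernel) // stride + 1, applied along Z. This is only a sketch: `output_size` is a hypothetical helper, and the stride of 3 along Z is an assumption implied by dividing 9 by 3. With a stride of 1 the same filter would visit overlapping positions and give a different count.

```python
def output_size(input_size, kernel_size, stride=1):
    """Number of positions a kernel can take along one axis (no padding)."""
    return (input_size - kernel_size) // stride + 1


input_depth = 9    # 3 stacked images x 3 RGB channels
filter_depth = 3   # depth of the 5X5X3 filter

# Assumed stride of 3 along Z: one position per stacked image.
slices_per_filter = output_size(input_depth, filter_depth, stride=3)

# For comparison, a stride of 1 along Z gives overlapping positions.
slices_stride_one = output_size(input_depth, filter_depth, stride=1)

# Ten such filters, each producing `slices_per_filter` output layers.
total_layers = 10 * slices_per_filter
```

Under the stride-3 assumption this reproduces the 3 layers per filter and 30 layers for 10 filters described above.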

The 3D convolution operation can be continued further down the CNN, but the depth of the filters needs to be taken into consideration.

Something to keep in mind: when we talk about convolution operations we are just defining the connections of the neural network. In the end everything can be laid out as one dimension, with the number of images as a second dimension and the number of batches as a third. Dimensions depend on how you define them, so don't get confused by them. The 2D of a convolutional network is different from the 4D of an image, the 5D of the data or the 3D of the filter.

Hope this helps!!