Profiling a GPU-OPU-GPU model

I have a heterogeneous model, where an OPU receives input from a torch layer and forwards its output to the rest of the model. On the real OPU it is very slow, even slower than with the simulated device (which performs the matrix multiplication on the CPU). I suspect the bottleneck is the data transfer from the GPU, and in my case it happens twice (the input comes from the GPU and the output goes back to the GPU). Am I correct? Also, how does the transfer to the GPU work? Am I correct that data from the OPU first goes to the CPU and then to the GPU (and vice versa)?
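For reference, a minimal sketch of such a pipeline, assuming a hypothetical `opu` object whose `transform` method stands in for the actual OPU call (the exact API and dimensions may differ):

```python
import torch
import torch.nn as nn

class HybridModel(nn.Module):
    """GPU front-end -> OPU projection -> GPU back-end (inference-only sketch)."""
    def __init__(self, opu, in_dim, opu_in, opu_out, n_classes):
        super().__init__()
        self.front = nn.Linear(in_dim, opu_in)     # runs on the GPU
        self.opu = opu                             # hypothetical OPU handle
        self.back = nn.Linear(opu_out, n_classes)  # runs on the GPU

    def forward(self, x):
        x = self.front(x)
        # Transfer 1: GPU -> CPU; the OPU cannot read GPU memory directly.
        x_cpu = x.detach().cpu()                   # OPU step is non-differentiable
        y_cpu = self.opu.transform(x_cpu)          # assumed to return a CPU tensor
        # Transfer 2: CPU -> GPU to continue with the remaining torch layers.
        y = y_cpu.to(x.device)
        return self.back(y)
```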

The relative performance depends on the size of the layer, with the OPU starting to be competitive at an input/output size of about 10k. Indeed, data transfer can currently incur a significant performance hit, and you are correct: at the moment the transfer happens in two stages, going through the CPU.
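One way to see this cost in isolation is to time the round trip by itself. A rough sketch (the shapes and iteration count are arbitrary):

```python
import time
import torch

def time_round_trip(x_gpu, n_iters=100):
    """Average the cost of a GPU -> CPU -> GPU round trip for one batch."""
    torch.cuda.synchronize()                # make sure prior GPU work is done
    start = time.perf_counter()
    for _ in range(n_iters):
        x_cpu = x_gpu.cpu()                 # stage 1: device -> host
        _ = x_cpu.cuda()                    # stage 2: host -> device
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

x = torch.randn(3000, 10_000, device="cuda")
print(f"round trip: {time_round_trip(x) * 1e3:.2f} ms per batch")
```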

Can you give some tips on minimizing the cost of transferring the data? For example, how does batch size affect the transfer cost (e.g. if the bus is slow but wide, it is beneficial to send large chunks of data)?

We do not have a thorough benchmark, but in general fewer transfers of more data at a time are better than many transfers of small amounts.
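As an illustration, one could accumulate several mini-batches and pay for a single round trip instead of one per batch; `opu.transform` is again an assumed stand-in for the real OPU call:

```python
import torch

def opu_forward_chunked(opu, batches, device="cuda"):
    """One large GPU -> CPU -> GPU round trip instead of one per mini-batch."""
    big = torch.cat(batches, dim=0)         # concatenate while still on the GPU
    out_cpu = opu.transform(big.cpu())      # one device -> host copy, one OPU call
    out = out_cpu.to(device)                # one host -> device copy
    sizes = [b.shape[0] for b in batches]
    return list(out.split(sizes, dim=0))    # recover per-mini-batch outputs
```

The concatenation itself is cheap next to the bus crossings, so the savings grow with the number of mini-batches merged, up to whatever chunk size host memory and the OPU accept.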