Need help regarding training multimodal rnn architecture

I need help in figuring out the loss function for multimodal rnn architecture. I implemented the perplexity cost function, but seems like the model isn’t training. Please help.
link to github_repo