For the full experience, and links to everything referenced, visit our website:

FAIR proposes Unified Transformer

Recently, we have seen Transformers lead to a paradigm shift in AI and NLP research, and now even computer vision. Multi-modal research has recently employed Transformers in large vision/language pretraining frameworks such as ViLBERT and VLP. Models such as these are usually trained on only one or two pre-training tasks. Facebook AI Research (FAIR) proposes a multi-modal model it calls the Unified Transformer (UniT), a Transformer-based model jointly trained on 7 different tasks: object detection, VQA, SNLI-VE, MNLI, QNLI, QQP and SST-2. The architecture, which achieves results comparable to task-specific Transformer-based models with a significantly reduced parameter count, uses two Transformer encoders and one Transformer decoder. At a very high level, one Transformer encoder is responsible for encoding the image and the other for encoding the text. The decoder then attends over the concatenated encoder outputs using task-specific queries, and per-task output heads produce the final predictions.
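To make the high-level picture concrete, here is a minimal PyTorch sketch of a UniT-style layout: two modality encoders, a shared decoder whose task-specific queries attend over the concatenated encoder outputs, and one output head per task. All names, dimensions, and the example task list (`UniTSketch`, `task_specs`, the head sizes) are illustrative assumptions, not FAIR's actual implementation.

import torch
import torch.nn as nn

class UniTSketch(nn.Module):
    """Sketch of a UniT-style model: image encoder + text encoder,
    one shared decoder with task-specific queries and output heads.
    Dimensions and task specs are illustrative, not the paper's."""

    def __init__(self, d_model=256, n_heads=8, n_layers=2,
                 vocab_size=30522, task_specs=None):
        super().__init__()
        # task name -> (num_decoder_queries, num_output_classes); made-up examples
        task_specs = task_specs or {"vqa": (1, 3129), "detection": (100, 81)}

        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # one learned query set and one output head per task
        self.queries = nn.ParameterDict({
            t: nn.Parameter(torch.randn(q, d_model)) for t, (q, _) in task_specs.items()})
        self.heads = nn.ModuleDict({
            t: nn.Linear(d_model, c) for t, (_, c) in task_specs.items()})

    def forward(self, image_tokens, text_ids, task):
        # image_tokens: (B, N_img, d_model) pre-extracted visual features
        img = self.image_encoder(image_tokens)
        txt = self.text_encoder(self.text_embed(text_ids))
        memory = torch.cat([img, txt], dim=1)  # joint multi-modal memory
        q = self.queries[task].unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        out = self.decoder(q, memory)          # task queries attend to both modalities
        return self.heads[task](out)           # (B, num_queries, num_classes)

# Usage: the same parameters serve different tasks, switched by the task name.
model = UniTSketch()
img = torch.randn(2, 196, 256)           # e.g. backbone patch features
txt = torch.randint(0, 30522, (2, 16))   # token ids
logits = model(img, txt, task="vqa")     # (2, 1, 3129)

Because every task shares the encoders and decoder, only the small query sets and heads are task-specific, which is where the parameter savings over separate task-specific models come from.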