Neural module networks do not generalize well to new questions because their performance relies on a syntactic parser.



Instead of parsing questions into a universal dependency representation, this paper uses an LSTM to generate a sequence of functions that forms a program. The function modules are generic: unary modules, binary modules, and a scene module. The program generator is trained in a semi-supervised manner: it is first trained on a small number of ground-truth programs and then fine-tuned with the REINFORCE algorithm.
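
As a rough illustration of how a flat sequence of functions can act as a program, here is a toy sketch. It assumes the sequence is read in prefix order and that each function has a fixed arity (0 for scene, 1 for unary, 2 for binary). The scene, the function names, and the set-based modules are all hypothetical stand-ins; the modules in the paper are neural networks operating on image features, not Python lambdas.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy scene: each object is a dict of attributes.
SCENE = [
    {"id": 0, "color": "red", "shape": "cube"},
    {"id": 1, "color": "blue", "shape": "sphere"},
    {"id": 2, "color": "red", "shape": "sphere"},
]

# Each entry maps a function name to (arity, implementation).
# Arity 0 is the scene module, 1 is a unary module, 2 is a binary module.
MODULES: Dict[str, Tuple[int, Callable]] = {
    "scene": (0, lambda: list(SCENE)),
    "filter_red": (1, lambda objs: [o for o in objs if o["color"] == "red"]),
    "filter_sphere": (1, lambda objs: [o for o in objs if o["shape"] == "sphere"]),
    "count": (1, lambda objs: len(objs)),
    "greater_than": (2, lambda a, b: a > b),
}


def execute(program: List[str]):
    """Execute a program given as a prefix-order list of function names."""
    pos = 0

    def run():
        nonlocal pos
        name = program[pos]
        pos += 1
        arity, fn = MODULES[name]
        # Recursively evaluate as many sub-programs as the function needs.
        args = [run() for _ in range(arity)]
        return fn(*args)

    result = run()
    if pos != len(program):
        raise ValueError("program has trailing tokens")
    return result


# "How many red things are there?"
n_red = execute(["count", "filter_red", "scene"])

# "Are there more red things than red spheres?"
more_red = execute([
    "greater_than",
    "count", "filter_red", "scene",
    "count", "filter_sphere", "filter_red", "scene",
])
```

Because the arity of each function fixes the tree shape, the generator only has to emit a flat token sequence, which is what makes an LSTM decoder a natural fit.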



The strongly supervised model showed near-perfect performance on CLEVR, and the semi-supervised model outperformed all other baselines.


The generated programs were shown to attend to the correct regions of the image.

Johnson, Justin, et al. “Inferring and Executing Programs for Visual Reasoning.” ICCV. 2017.