This work presents a sparse-attention Transformer architecture for modeling documents that contain large tables. Tables are ubiquitous on the web, and are rich in information. However, more than 20% of relational tables on the web have 20 or more rows (Cafarella et al., 2008), and these large tables present a challenge for current Transformer models, which are typically limited to 512 tokens. Here we propose MATE, a novel Transformer architecture designed to model the structure of web tables. MATE uses sparse attention in a way that allows heads to efficiently attend to either rows or columns in a table. This architecture scales linearly in sequence length, in both speed and memory, and can handle documents containing more than 8000 tokens with current accelerators. MATE also has a more appropriate inductive bias for tabular data, and sets a new state of the art for three table reasoning datasets. For HybridQA (Chen et al., 2020b), a dataset that involves large documents containing tables, we improve the best prior result by 19 points.
@inproceedings{eisenschlos2021mate,
  abbr={EMNLP},
  bibtex_show={true},
  title={MATE: Multi-view Attention for Table Transformer Efficiency},
  author={Eisenschlos, Julian Martin and Gor, Maharshi and M{\"u}ller, Thomas and Cohen, William Weston},
  booktitle={Empirical Methods in Natural Language Processing},
  publisher={Association for Computational Linguistics},
  year={2021},
  location={Punta Cana},
  html={https://arxiv.org/pdf/2109.04312.pdf},
  selected={true}
}
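The row/column attention pattern described in the abstract can be illustrated with a toy mask over a flattened table. This is a minimal sketch under simplifying assumptions, not the actual MATE implementation: the helper `sparse_table_mask` is hypothetical, and the real model additionally handles free-text tokens and global attention.

```python
import numpy as np

def sparse_table_mask(row_ids, col_ids, head_type):
    """Boolean attention mask for one head over a flattened table.

    row_ids / col_ids give, for each token, the table row/column it
    belongs to. A "row" head may attend only within the same row, a
    "column" head only within the same column, so each token attends
    to O(row length + column length) tokens instead of all of them.
    """
    ids = np.asarray(row_ids if head_type == "row" else col_ids)
    # token i may attend to token j iff they share the row (or column) id
    return ids[:, None] == ids[None, :]

# A 2x2 table flattened row-major: tokens 0,1 are row 0; tokens 2,3 are row 1.
row_ids = [0, 0, 1, 1]
col_ids = [0, 1, 0, 1]
mask_row = sparse_table_mask(row_ids, col_ids, "row")
mask_col = sparse_table_mask(row_ids, col_ids, "col")
```

Restricting each head this way is what gives the linear scaling claimed above: the per-token attention cost depends on the table's row and column sizes rather than on the full sequence length.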
EMNLP 2021
Toward Deconfounding the Influence of Entity Demographics for Question Answering Accuracy
The goal of question answering (QA) is to answer any question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, an analysis of model accuracy reveals little evidence that accuracy is lower for questions about people of a particular gender or nationality; instead, there is more variation across professions (question topic). But QA’s lack of representation could itself hide evidence of bias, necessitating QA datasets that better represent global diversity.
@inproceedings{Gor:Webster:Boyd-Graber-2021,
  abbr={EMNLP},
  bibtex_show={true},
  title={Toward Deconfounding the Influence of Entity Demographics for Question Answering Accuracy},
  author={Gor, Maharshi and Webster, Kellie and Boyd-Graber, Jordan},
  booktitle={Empirical Methods in Natural Language Processing},
  publisher={Association for Computational Linguistics},
  month=nov,
  year={2021},
  location={Punta Cana},
  html={https://arxiv.org/pdf/2104.07571.pdf},
  selected={true}
}
2019
ICCV 2019
GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions
Despite the remarkable success of generative adversarial networks, their performance seems less impressive for diverse training sets, which require learning discontinuous mapping functions. Though multi-mode prior or multi-generator models have been proposed to alleviate this problem, such approaches may fail depending on the empirically chosen initial mode components. In contrast to such bottom-up approaches, we present GAN-Tree, which follows a hierarchical divisive strategy to address such discontinuous multi-modal data. Devoid of any assumption on the number of modes, GAN-Tree utilizes a novel mode-splitting algorithm to effectively split a parent mode into semantically cohesive child modes, facilitating unsupervised clustering. Further, it also enables incremental addition of new data modes to an already trained GAN-Tree by updating only a single branch of the tree structure. As compared to prior approaches, the proposed framework offers a higher degree of flexibility in choosing from a large variety of mutually exclusive and exhaustive sets of tree nodes, called GAN-Sets. Extensive experiments on synthetic and natural image datasets, including ImageNet, demonstrate the superiority of GAN-Tree over the prior state of the art.
@inproceedings{Kundu:Gor:Agrawal:Babu-ICCV2019,
  abbr={ICCV},
  bibtex_show={true},
  author={Kundu, Jogendra Nath and Gor, Maharshi and Agrawal, Dakshit and Babu, R. Venkatesh},
  title={GAN-Tree: An Incrementally Learned Hierarchical Generative Framework for Multi-Modal Data Distributions},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  month=oct,
  year={2019},
  location={Seoul, South Korea},
  html={https://openaccess.thecvf.com/content_ICCV_2019/papers/Kundu_GAN-Tree_An_Incrementally_Learned_Hierarchical_Generative_Framework_for_Multi-Modal_Data_ICCV_2019_paper.pdf},
  selected={true},
  code={https://github.com/maharshi95/GANTree}
}
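The divisive tree structure and the notion of a GAN-Set (a mutually exclusive, exhaustive selection of tree nodes) can be sketched with plain data structures. This is a toy illustration only: `GANNode` and its names are hypothetical, and the actual mode split is a learned bipartition of the latent prior, which is omitted here.

```python
class GANNode:
    """One node of the tree; in the paper each node carries a generator."""
    def __init__(self, name):
        self.name = name                  # hypothetical identifier
        self.left = self.right = None

    def split(self, left_name, right_name):
        # Divisive step: a leaf's mode is split into two child modes.
        self.left, self.right = GANNode(left_name), GANNode(right_name)
        return self.left, self.right

def gan_sets(node):
    """Enumerate every mutually exclusive, exhaustive node set (GAN-Set)."""
    if node.left is None:                 # a leaf covers exactly its own mode
        return [[node.name]]
    combos = []
    for l in gan_sets(node.left):
        for r in gan_sets(node.right):
            combos.append(l + r)          # cover the mode via descendants
    return [[node.name]] + combos         # ...or via the node itself

root = GANNode("root")
a, b = root.split("A", "B")
a.split("A0", "A1")
# gan_sets(root) -> [['root'], ['A', 'B'], ['A0', 'A1', 'B']]
```

Each returned set partitions the data distribution, which is what lets a user trade off between a single coarse generator (the root) and several specialized ones (deeper leaves).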
AAAI 2019
BiHMP-GAN: Bidirectional 3D Human Motion Prediction GAN
Human motion prediction models have applications in various fields of computer vision. Without taking into account the inherent stochasticity in the prediction of future pose dynamics, such methods often converge to an undesired deterministic mean of multiple probable outcomes. To address this, we propose a novel probabilistic generative approach called Bidirectional Human Motion Prediction GAN, or <em>BiHMP-GAN</em>. To generate multiple probable human-pose sequences conditioned on a given starting sequence, we introduce a random extrinsic factor <em>r</em>, drawn from a predefined prior distribution. Furthermore, to enforce a direct content loss on the predicted motion sequence and to avoid mode collapse, a novel bidirectional framework is incorporated by modifying the usual discriminator architecture. The discriminator is also trained to regress this extrinsic factor <em>r</em>, which is used alongside the intrinsic factor (the encoded starting pose sequence) to generate a particular pose sequence. To further regularize the training, we introduce a novel recursive prediction strategy. Despite the probabilistic framework, the enhanced discriminator architecture allows predictions of an intermediate part of a pose sequence to be used as conditioning for prediction of the latter part of the sequence. The bidirectional setup also provides a new way to evaluate prediction quality against a given test sequence. For a fair assessment of <em>BiHMP-GAN</em>, we report the performance of the generated motion sequences using (i) a critic model trained to discriminate between real and fake motion sequences, and (ii) an action classifier trained on real human motion dynamics. Both qualitative and quantitative evaluations of the model's probabilistic generations demonstrate the superiority of <em>BiHMP-GAN</em> over previously available methods.
@inproceedings{Kundu_Gor_Babu_2019,
  abbr={AAAI},
  bibtex_show={true},
  title={BiHMP-GAN: Bidirectional 3D Human Motion Prediction GAN},
  author={Kundu, Jogendra Nath and Gor, Maharshi and Babu, R. Venkatesh},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2019},
  month=jul,
  volume={33},
  number={01},
  pages={8553--8560},
  url={https://ojs.aaai.org/index.php/AAAI/article/view/4874},
  doi={10.1609/aaai.v33i01.33018553},
  html={https://arxiv.org/pdf/1812.02591.pdf},
  code={https://github.com/maharshi95/Pose2Vec}
}
WACV 2019
Unsupervised Feature Learning of Human Actions As Trajectories in Pose Embedding Manifold
An unsupervised human action modeling framework can provide useful pose-sequence representations, which can be utilized in a variety of pose analysis applications. In this work, we propose a novel temporal pose-sequence modeling framework, which can efficiently embed the dynamics of 3D human-skeleton joints into a latent space. In contrast to the end-to-end frameworks explored by previous works, we disentangle the task of individual pose representation learning from the task of learning actions as sequences of pose embeddings. In order to realize a continuous pose-embedding manifold along with better reconstructions, we propose an unsupervised manifold learning procedure named Encoder GAN (EnGAN). Further, we use the pose embeddings generated by EnGAN to model human actions using an RNN auto-encoder architecture, PoseRNN. We introduce a first-order gradient loss to explicitly enforce temporal regularity in the predicted motion sequence. A hierarchical feature fusion technique is also investigated for simultaneous modeling of local skeleton joints along with global pose variations. We demonstrate state-of-the-art transferability of the learned representation against other supervised and unsupervised motion embeddings for the task of fine-grained action recognition on the SBU interaction dataset. Further, we show the qualitative strengths of the proposed framework by visualizing skeleton pose reconstructions and interpolations in the pose-embedding space, and low-dimensional principal component projections of the reconstructed pose trajectories.
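The two-stage decomposition described above (embed each pose independently, then model the action as a trajectory of embeddings) can be sketched as follows. This is a simplified illustration, not the paper's method: `pose_encoder` is a hypothetical linear stand-in for EnGAN's learned encoder, and the sequence model (PoseRNN) is reduced to stacking the per-frame embeddings.

```python
import numpy as np

def pose_encoder(pose, W):
    # Hypothetical stand-in for EnGAN's encoder: one pose -> one embedding.
    return np.tanh(pose @ W)

def action_trajectory(poses, W):
    """Map a (T, J) pose sequence to a (T, D) embedding trajectory.

    A sequence model such as PoseRNN would then auto-encode this
    trajectory, keeping pose learning and sequence learning separate.
    """
    return np.stack([pose_encoder(p, W) for p in poses])

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3))        # 6 joint coordinates -> 3-d embedding
poses = rng.standard_normal((10, 6))   # T = 10 frames
traj = action_trajectory(poses, W)     # shape (10, 3): one point per frame
```

The design point is the disentanglement itself: because each frame is embedded independently, the sequence model only ever sees points on the pose manifold, which is what makes the learned trajectories reusable for downstream tasks like action recognition.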