Datasets

Here are some datasets that I played a major role in curating. This video is a good summary of how I think about dataset curation and how to do it.


mmmCAD: Multi-modal Modification of CAD


Small-scale data from Communicating Design Intent Using Drawing and Text. [~100 man-hours] One participant (the Designer) communicates with another (the Maker) to collaboratively re-create a precise 2D CAD design. The Designer is given a target design and must use drawing and language to communicate it to the Maker, who builds the design using a CAD interface.

Large-scale data from mrCAD: Multimodal Refinement of Computer-aided Designs. [~2000 man-hours] A scaled-up version of multi-modal communication about design refinements. mrCAD consists of 6,082 communication games, comprising 15,163 instruction-execution rounds, played between 1,092 pairs of human players.


LARC: Language-complete Abstract Reasoning Corpus


From Communicating Natural Programs to Humans and Machines. [~350 man-hours] One participant (the Describer) uses language to describe an abstract transformation of grids from the ARC corpus to another participant (the Builder). The Builder applies the transformation to a new input grid to produce an output grid. Access the dataset here.


DARC: A Recursive Decomposition Dataset of ARC Tasks


From ANPL: Towards Natural Programming with Interactive Decomposition. [~440 man-hours] A corpus of 227 ARC tasks, recursively decomposed and grounded as Python code. Access the dataset here.


DiffVL100


From DiffVL: Scaling Up Soft Body Manipulation using Vision-Language Driven Differentiable Physics. [~50 man-hours] 100 soft-body manipulation tasks inspired by real-life scenarios from online videos. Access the dataset here.