MAV3D (Make-A-Video3D) is a method for generating three-dimensional dynamic scenes from text descriptions. Our method employs a 4D dynamic Neural Radiance Field (NeRF) that is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and composited into any 3D environment. MAV3D requires no 3D or 4D data; the T2V model is trained solely on Text-Image pairs and unlabeled videos.