This is a matchmoved shot: tracking software analyzed the video footage and extracted 3D position data from it. Using that data, a second application was used to build and animate a 3D model, render it, and composite the render over the original footage, so that the model appears to actually be in the scene.
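For readers curious about what those two steps involve, here is a minimal Python sketch of the underlying idea. It is not what the software used for this shot does internally. The function names `project_point` and `over` are made up for illustration, and the camera is assumed to be an ideal pinhole with no rotation and no lens distortion.

```python
def project_point(point, cam_pos, focal_px, center):
    """Project a world-space 3D point into pixel coordinates.

    Assumes an ideal pinhole camera at cam_pos looking down +Z with no
    rotation; a real matchmove solve also recovers rotation and lens data.
    """
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    if z <= 0:
        raise ValueError("point is behind the camera")
    u = center[0] + focal_px * x / z
    v = center[1] - focal_px * y / z  # image y grows downward
    return (u, v)


def over(fg_rgb, fg_alpha, bg_rgb):
    """Straight-alpha 'over' composite: rendered pixel on top of footage pixel."""
    return tuple(fg_alpha * f + (1.0 - fg_alpha) * b
                 for f, b in zip(fg_rgb, bg_rgb))


if __name__ == "__main__":
    model_point = (1.0, 0.0, 10.0)  # a fixed point on the 3D model
    # Hypothetical tracked camera positions, one per frame.
    camera_track = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
    for frame, cam in enumerate(camera_track):
        print(frame, project_point(model_point, cam, 1000.0, (960.0, 540.0)))
    # Blend a half-transparent white render pixel over a black footage pixel.
    print(over((1.0, 1.0, 1.0), 0.5, (0.0, 0.0, 0.0)))
```

Because the model point is projected through the same camera motion that was recovered from the footage, it shifts in the image exactly as a real object at that position would, which is what makes the composite look anchored in the scene.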