Object recognition in radiance fields: A comparison of language embedding methods (Work in Progress)

Furtwangen University
Medieninformatik - Winter Semester 24/25

Supervisor: Prof. Dr. Uwe Hahne

Second supervisor: Prof. Dr. Thomas Schlegel


Abstract


HFU prototype scene

To avoid the difficulties of a real scene, such as depth of field, overexposure caused by direct sunlight hitting the camera lens, motion blur from camera movement, movement in the scene caused for example by wind, and shadows cast by the person holding the camera in some perspectives, a virtual scene was created. This made it possible to generate images free of depth of field and of light artifacts such as those caused by direct sunlight entering the lens. The landscape of Furtwangen was modeled in Blender, a program for 3D modeling, together with the BlenderGIS add-on, which generated the terrain as a mesh from geographic information system (GIS) data. The add-on also placed cubes at the positions where buildings were detected in the GIS data.

Image of the HFU Scene in Unity


Datasets

For the purpose of finetuning, it is necessary to generate datasets that could also be collected on the real Furtwangen campus. Within a virtual scene, thousands of images of an object can be generated from all viewing directions in a few seconds. This is not possible with real scenes, where considerably more time must be allowed to capture a building. Three datasets were therefore created, each reflecting a different approach to collecting images in a real scenario.

Scene
- 280 Images
- Directly from the camera path

Surround
- 280 Images
- Separate camera path around the individual buildings

Big-Surround
- 7000 Images
- Random placement in the vicinity of the building
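The Big-Surround strategy above (random camera placement in the vicinity of a building) can be sketched in a few lines. The following is a minimal, hypothetical Python sketch, not the generation code used in this work; the function name, radius bounds, and elevation range are illustrative assumptions:

```python
import numpy as np

def sample_surround_poses(center, n_views=7000, r_min=10.0, r_max=50.0, seed=0):
    """Sample random camera positions in an annulus around a building center
    and return (positions, look-at directions). Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    center = np.asarray(center, dtype=float)
    # Random azimuth, moderate elevation, random distance from the building.
    azimuth = rng.uniform(0.0, 2.0 * np.pi, n_views)
    elevation = rng.uniform(np.deg2rad(5.0), np.deg2rad(45.0), n_views)
    radius = rng.uniform(r_min, r_max, n_views)
    # Spherical-to-Cartesian offsets; each offset has norm `radius`.
    offsets = np.stack([
        radius * np.cos(elevation) * np.cos(azimuth),
        radius * np.cos(elevation) * np.sin(azimuth),
        radius * np.sin(elevation),
    ], axis=1)
    positions = center + offsets
    # Unit view directions pointing back at the building center.
    directions = center - positions
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return positions, directions

pos, look = sample_surround_poses(center=(0.0, 0.0, 5.0), n_views=100)
```

Each sampled pose keeps the building in view, so every rendered image contributes a labeled example of the target object.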

Results

The qualitative and metric evaluations show that the Mask R-CNN-based concept developed in this work is superior to the other techniques for this application.

F1-score values

Method             No Finetuning   Scene   Surround   Big-Surround
LERF lite          0.484           0.533   0.568      0.536
Feature Splatting  0.484           0.340   0.308      0.588
ResNet + SAM       0.105           0.432   0.271      0.513
Mask R-CNN         0.000           0.951   0.842      0.951

IoU values

Method             No Finetuning   Scene   Surround   Big-Surround
LERF lite          0.022           0.003   0.003      0.184
Feature Splatting  0.013           0.000   0.000      0.338
ResNet + SAM       0.000           0.000   0.000      0.002
Mask R-CNN         0.000           0.861   0.840      0.854
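For reference, the two metrics reported above can be computed from binary segmentation masks as follows. This is a minimal numpy sketch of the standard definitions, not the evaluation code used in this work:

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: define IoU as 1
    return np.logical_and(pred, gt).sum() / union

def f1_score(pred, gt):
    """F1 score (Dice coefficient) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0
    return 2.0 * tp / denom

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 0, 0], [0, 1, 1]])
# intersection = 2, union = 4: IoU = 0.5, F1 = 2*2/(3+3) ~ 0.667
```

Because F1 counts the intersection twice relative to the mask sizes, it is always at least as large as IoU, which is consistent with the gap between the two tables above.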

References

Kerbl, Bernhard; Kopanas, Georgios; Leimkühler, Thomas; Drettakis, George (2023): 3D Gaussian Splatting for Real-Time Radiance Field Rendering. In: ACM Transactions on Graphics. http://arxiv.org/pdf/2308.04079v1.

Kerr, Justin; Kim, Chung Min; Goldberg, Ken; Kanazawa, Angjoo; Tancik, Matthew (2023): LERF: Language Embedded Radiance Fields. http://arxiv.org/pdf/2303.09553v1.

Mildenhall, Ben; Srinivasan, Pratul P.; Tancik, Matthew; Barron, Jonathan T.; Ramamoorthi, Ravi; Ng, Ren (2020): NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. http://arxiv.org/pdf/2003.08934v2.

He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016): Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016), 770–778. https://doi.org/10.1109/CVPR.2016.90

Kirillov, Alexander; Mintun, Eric; Ravi, Nikhila; Mao, Hanzi; Rolland, Chloe; Gustafson, Laura et al. (2023): Segment Anything. http://arxiv.org/pdf/2304.02643.

Qiu, Ri-Zhao; Yang, Ge; Zeng, Weijia; Wang, Xiaolong (2024): Feature Splatting: Language-Driven Physics-Based Scene Synthesis and Editing. https://doi.org/10.48550/arXiv.2404.01223

He, Kaiming; Gkioxari, Georgia; Dollár, Piotr; Girshick, Ross (2018): Mask R-CNN. https://doi.org/10.48550/arXiv.1703.06870