MaixCAM2 MaixPy Using Depth-Anything for Monocular Depth Estimation

2025-06-09

Update history

Date	Version	Author	Update content
2025-06-09	v1.0	neucrack	Added support for Depth-Anything-V2 code and documentation

Introduction

Depth-Anything-V2 estimates a depth map from an ordinary image. Compared to Depth-Anything-V1, it recovers finer detail and handles difficult materials such as glass better. For more information, refer to the official open-source repository or the project homepage.

Supported Devices

Device	Supported
MaixCAM2	✅
MaixCAM	❌

Using Depth-Anything on MaixCAM2 MaixPy

MaixPy supports Depth-Anything-V2. Note that the first-generation MaixCAM hardware is not supported.

The default system image already includes the model, so you can run the following code directly:

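from maix import camera, display, image, nn, app

# Color map used to turn depth values into a pseudo-color image
cmap = image.CMap.TURBO
model = nn.DepthAnything(model="/root/models/depth_anything_v2_vits.mud", dual_buff=True)

# Open the camera with the resolution and pixel format the model expects
cam = camera.Camera(model.input_width(), model.input_height(), model.input_format())
disp = display.Display()

while not app.need_exit():
    img = cam.read()
    # Run depth estimation and get the result as a pseudo-color image
    res = model.get_depth_image(img, image.Fit.FIT_CONTAIN, cmap)
    # Only show frames for which a result is available
    if res:
        disp.show(res)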

Here, get_depth_image returns a pseudo-color image directly. The cmap argument selects the color map; all supported pseudo-color maps are listed in the maix.image.CMap API documentation.

If you need the model's raw output instead, call the get_depth method to obtain the original float32 data. The returned data can be converted to a NumPy array with maix.tensor.tensor_to_numpy_float32.
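The raw-output path can be sketched roughly as follows. This is a minimal sketch, not the official example: it assumes get_depth accepts the same image and fit arguments as get_depth_image (without cmap), and the helper names summarize_depth and run_on_device are made up for illustration.

```python
import numpy as np


def summarize_depth(depth):
    """Return (min, max, mean) of a relative depth map as Python floats."""
    depth = np.asarray(depth, dtype=np.float32)
    return float(depth.min()), float(depth.max()), float(depth.mean())


def run_on_device():
    # Imported lazily so summarize_depth can also be tried off-device.
    from maix import camera, image, nn, app, tensor

    model = nn.DepthAnything(model="/root/models/depth_anything_v2_vits.mud", dual_buff=True)
    cam = camera.Camera(model.input_width(), model.input_height(), model.input_format())

    while not app.need_exit():
        img = cam.read()
        # Assumption: get_depth takes (img, fit) like get_depth_image, minus cmap.
        res = model.get_depth(img, image.Fit.FIT_CONTAIN)
        if res is not None:
            depth = tensor.tensor_to_numpy_float32(res)
            print("min %.2f  max %.2f  mean %.2f" % summarize_depth(depth))
```

On the MaixCAM2, call run_on_device() to print the per-frame value range. Watching the minimum and maximum change as the scene changes is a quick way to see that the output is relative.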

Real-world performance on MaixCAM2: [demo media not included]

Notes

Note that Depth-Anything-V2 estimates depth from a single image, and its raw output is a relative value rather than an absolute distance. The value range depends on the image content. For example, when depth differences in the scene are small, the range might be 0.0~2.0; when there is greater variation, it might be 0.0~8.0. The final image is produced by normalizing these values to the range 0~255.
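To make the normalization step concrete, here is a small sketch that maps a raw relative depth array to 0~255. It assumes plain per-frame min-max scaling; the source only states that the final image is normalized, so the exact method used inside MaixPy may differ. The function name normalize_to_uint8 is hypothetical.

```python
import numpy as np


def normalize_to_uint8(depth):
    """Min-max scale a relative depth map to the 0..255 range (assumed method)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-6:
        # A flat depth map has no range to stretch; return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# The same raw value lands on different output levels depending on the frame's range.
narrow = normalize_to_uint8([0.0, 0.5, 2.0])  # frame with range 0.0~2.0
wide = normalize_to_uint8([0.0, 0.5, 8.0])    # frame with range 0.0~8.0
print(narrow.tolist(), wide.tolist())
```

In both frames the largest value becomes 255, while the raw value 0.5 ends up at a different level in each, which is why output values from different frames cannot be compared directly.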

Therefore, for videos or continuous image frames, the value assigned to the same depth can change from one frame to the next, because the range depends on the content of each frame. When you run continuous depth estimation on a camera feed, you may observe slight flickering in the output images.
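One possible way to reduce this flicker, if you work from the raw get_depth output, is to smooth the normalization range over time instead of recomputing it per frame. The sketch below uses an exponential moving average of the per-frame minimum and maximum. This is an illustrative technique, not something MaixPy is stated to do; the class name RangeSmoother and the alpha parameter are hypothetical.

```python
import numpy as np


class RangeSmoother:
    """Normalize depth frames with a min/max range smoothed across frames."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # 0 < alpha <= 1; smaller means a steadier range
        self.lo = None
        self.hi = None

    def update(self, depth):
        """Blend this frame's min/max into the running range and return it."""
        depth = np.asarray(depth, dtype=np.float32)
        lo, hi = float(depth.min()), float(depth.max())
        if self.lo is None:
            # First frame: start from its own range.
            self.lo, self.hi = lo, hi
        else:
            self.lo += self.alpha * (lo - self.lo)
            self.hi += self.alpha * (hi - self.hi)
        return self.lo, self.hi

    def normalize(self, depth):
        """Scale a frame to 0..255 using the smoothed range, clipping outliers."""
        lo, hi = self.update(depth)
        span = max(hi - lo, 1e-6)
        scaled = (np.asarray(depth, dtype=np.float32) - lo) / span
        return (np.clip(scaled, 0.0, 1.0) * 255.0).astype(np.uint8)
```

A smaller alpha gives a steadier picture but reacts more slowly when the scene really changes, and values outside the smoothed range are clipped, so the setting is a trade-off to tune for your application.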

More Input Resolutions

Because the model is computationally demanding, the default input resolution is 448x336. If you wish to use another resolution, you can download one of the models available in the MaixHub model library. If MaixHub does not have the resolution you need, you can convert the model yourself.

For MaixCAM2, refer to the model quantization documentation as well as huggingface.co/AXERA-TECH/Depth-Anything-V2 and github.com/AXERA-TECH/DepthAnythingV2.axera for model conversion. (Note: the input format in that project's config.json is BGR. It is recommended to change it to RGB, which is the default in MaixPy.)
