Why do robots seem clumsy? The question ties in with recent fantasy TV shows such as "It's Not That Easy to Transform a Monster into a Human" and "The Iron Man." A robot's two most important parts are its brain and its body. Simply put, a robot is clumsy because it lacks human-like intelligence and human-like physical capabilities. This article focuses on the first part: the brain.

Making a robot intelligent requires training on massive amounts of data. A robot's models include both a VLM (vision-language model) and an LLM (large language model). Why does a robot need a VLM? Because a language model has no eyes: it can "understand" language but cannot see the world. For example, if you say, "Please pick up the water glass on the left side of the table," the robot must "see" before it can act. A vision model alone is not enough either: it can recognize objects but cannot understand human language or intention. A VLM is the fusion of brain and eyes. Human commands (language) + environmental perception (vision) → unified translation into action plans.

The autonomous driving we are used to today is, in effect, a VLM as well. It simply needs far less data to learn from. Humanoid robots, after all, mimic humans, and the diversity and complexity of their application scenarios put them at the next level of sophistication.

When it comes to VLM training, however, there is still a significant gap between the amount of data required and the amount actually available. Currently, this data is generated mainly through motion capture and VR teleoperation. These collection methods are extremely costly and inefficient, and the volume of data they yield is insufficient.

These specialized collection methods also tend to generalize poorly. Robot training often takes place in clean, controlled environments: a few common objects (bottles, cups, building blocks) placed on a table. In reality, a cup might be translucent, reflective, or partially hidden under a tissue. Home and factory environments are full of distractions (clutter, noise, people moving around). Training data lacks this "long tail" of cases, so robots turn clumsy as soon as the environment changes.
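
The long-tail gap can be made concrete by comparing the scene conditions in a training set with those a robot meets in deployment. The sketch below is only an illustration: the condition labels, the counts, the `min_examples` threshold, and the function names are all hypothetical and do not come from any real dataset or library.

```python
from collections import Counter

# Hypothetical scene-condition labels, invented purely for illustration.
# Training data: many clean tabletop scenes, almost no hard cases.
training_scenes = (
    ["opaque_cup"] * 90
    + ["bottle"] * 80
    + ["building_block"] * 70
    + ["translucent_cup"] * 2
)

# Conditions encountered in a (hypothetical) home or factory deployment.
deployment_scenes = [
    "opaque_cup",
    "translucent_cup",
    "reflective_cup",
    "cup_under_tissue",
    "bottle",
    "cluttered_table",
    "person_walking_by",
]


def underrepresented(train, deploy, min_examples=5):
    """Return deployment conditions with fewer than min_examples training samples."""
    counts = Counter(train)  # missing labels count as zero
    return sorted({c for c in deploy if counts[c] < min_examples})


def coverage(train, deploy, min_examples=5):
    """Fraction of distinct deployment conditions that training covers adequately."""
    distinct = set(deploy)
    missing = underrepresented(train, deploy, min_examples)
    return 1 - len(missing) / len(distinct)


if __name__ == "__main__":
    print("Underrepresented:", underrepresented(training_scenes, deployment_scenes))
    print(f"Coverage: {coverage(training_scenes, deployment_scenes):.0%}")
```

In this toy setup the training set is large, yet most of the deployment conditions are missing or barely represented. That is the sense in which more clean tabletop data does not close the gap.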
