Why do robots seem clumsy? The question ties into recent fantasy TV shows like "It's Not That Easy to Transform a Monster into a Human" and "The Iron Man." A robot's two most important parts are its brain and its body. Simply put, a robot is clumsy because it lacks human-like intelligence and human-like physical capabilities. This article focuses on the first part: the brain.

Making a robot intelligent requires training on massive amounts of data. A robot's brain combines a VLM (vision-language model) with an LLM (large language model). Why does a robot need a VLM? Because a language model has no eyes: it can "understand" text, but it cannot see the world. For example, if you say, "Please pick up the water glass on the left side of the table," the robot must "see" before it can act. A vision model alone is not enough either; it can recognize objects, but it cannot understand human language or intention. A VLM is the fusion of brain and eyes: human commands (language) + environmental perception (vision) → unified translation into action plans.

The autonomous driving we are used to today is, in effect, an application of the same idea. It just requires much less data to learn from. After all, humanoid robots mimic humans, and the diversity and complexity of their application scenarios represent the next level of sophistication.

When it comes to VLM training, however, there is still a significant gap between the amount of data required and the amount actually available. Currently, this data is generated primarily through motion capture and VR teleoperation. Both methods are extremely costly and inefficient, and the volume of data they produce is insufficient. Data collected this way also tends to generalize poorly. Robot training often takes place in clean, controlled environments: a few common objects (bottles, cups, building blocks) placed on a table. In reality, a cup might be translucent, reflective, or even partially obscured by a tissue, and home and factory environments are full of distractions (clutter, noise, people moving around). Training data lacks this "long tail" of situations, so robots become "unskilled" as soon as the environment changes.
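To make the "language + vision → action plan" idea concrete, here is a minimal toy sketch. It is not how a real VLM works internally: the `Detection` class, the `plan_pick` function, and the action names are all hypothetical, and hand-written rules stand in for what a trained model would learn from data. It only illustrates why both inputs are needed to resolve "the water glass on the left."

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Detection:
    """One object reported by a hypothetical vision module."""
    label: str
    x: float  # horizontal position on the table: 0.0 = far left, 1.0 = far right


def plan_pick(command: str, detections: list[Detection]) -> list[str]:
    """Toy grounding: combine the spoken command with what the camera saw."""
    text = command.lower()

    # Language side: which of the visible objects does the command mention?
    candidates = [d for d in detections if d.label in text]
    if not candidates:
        return []  # The command names nothing the "eyes" can see.

    # Vision side: resolve the spatial word against object positions.
    if "left" in text:
        target = min(candidates, key=lambda d: d.x)
    elif "right" in text:
        target = max(candidates, key=lambda d: d.x)
    else:
        target = candidates[0]

    # Unified output: a simple, made-up action plan.
    return [f"move_to(x={target.x:.2f})", f"grasp({target.label})", "lift()"]


if __name__ == "__main__":
    scene = [Detection("bottle", 0.1), Detection("glass", 0.3), Detection("glass", 0.8)]
    print(plan_pick("Please pick up the water glass on the left side of the table", scene))
```

With language alone, the sketch could not tell the two glasses apart; with vision alone, it would not know a glass was wanted at all. The long-tail problem described above shows up here too: a translucent or partly hidden glass that the vision module fails to report simply never appears in `detections`, and the plan comes back empty.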
Risk and Disclaimer: The content shared by the author represents only their personal views and does not reflect the position of CoinWorldNet (币界网). CoinWorldNet does not guarantee the truthfulness, accuracy, or originality of the content. This article does not constitute an offer, solicitation, invitation, recommendation, or advice to buy or sell any investment products or make any investment decisions.