Someday, you may want your home robot to carry a load of dirty clothes downstairs and deposit them in the washing machine in the far-left corner of the basement. The robot will need to combine your instructions with its visual observations to determine what it should do to complete this task.
For an AI agent, this is easier said than done. Current approaches often use multiple hand-crafted machine-learning models to tackle different parts of the task, which requires a great deal of human effort and expertise to build. These methods, which use visual representations to directly make navigation decisions, demand massive amounts of visual training data, which are often hard to come by.
To overcome these challenges, researchers from MIT and the MIT-IBM Watson AI Lab devised a navigation method that converts visual observations into pieces of language, which are then fed into a single large language model that carries out every step of the multistep navigation task.
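To make the idea concrete, here is a minimal sketch of that loop, assuming a captioner that turns each camera view into text and an LLM that picks the next action. The names caption_observation, query_llm, and navigate are hypothetical stand-ins for illustration, not the researchers' actual code.

```python
# A minimal sketch of the vision-to-language navigation loop described above.
# The captioner and LLM are hypothetical stand-ins: in practice each would be
# backed by a vision-language model and a large language model, respectively.

def caption_observation(image) -> str:
    """Hypothetical: a vision-language model would describe the current view."""
    return "A staircase leads down; a doorway is visible on the left."


def query_llm(prompt: str) -> str:
    """Hypothetical: a large language model chooses the next action."""
    return "stop"


def navigate(instruction: str, get_observation, max_steps: int = 10) -> list:
    """Convert each visual observation to text, then let one LLM plan actions."""
    history = []
    for _ in range(max_steps):
        caption = caption_observation(get_observation())
        prompt = (
            f"Instruction: {instruction}\n"
            f"Actions so far: {history}\n"
            f"Current view: {caption}\n"
            "Next action (e.g. 'go downstairs', 'turn left', 'stop'):"
        )
        action = query_llm(prompt)
        history.append(action)
        if action == "stop":
            break
    return history


# Example: the laundry task from the opening paragraph.
actions = navigate(
    "Carry the laundry downstairs and put it in the washing machine "
    "in the far-left corner of the basement.",
    get_observation=lambda: None,  # stand-in for a camera frame
)
print(actions)
```

The key design point this sketch illustrates is that once every observation is expressed as text, a single language model can handle perception summaries, history, and action selection in one prompt, rather than coordinating several task-specific models.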
…