A Computer Use Agent (CUA) primarily needs three categories of data to act reliably: visual state, user intent, and action history. Visual state comes from screen captures, which provide raw pixels from which the CUA detects actionable UI components. User intent is typically a natural-language instruction or a structured command specifying what needs to be done, such as “open the settings panel” or “save the file as PDF.” Action history helps the CUA track what it has already attempted, avoiding repeated clicks or infinite loops when an interface behaves unexpectedly.
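As a rough illustration, these three inputs can be modeled as a single observation structure the agent updates on every step. The field names and the `Action` shape below are assumptions made for the sketch, not a standard CUA schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    # One step the agent has already taken, e.g. a click on "Save as PDF".
    kind: str               # "click", "type", "scroll", ...
    target: str             # human-readable description of the target element
    succeeded: bool = True

@dataclass
class Observation:
    screenshot_png: bytes             # visual state: raw pixels of the screen
    instruction: str                  # user intent, e.g. "save the file as PDF"
    history: List[Action] = field(default_factory=list)  # what was already attempted

    def already_tried(self, kind: str, target: str) -> bool:
        # Simple loop guard: avoid repeating an identical action on the same target.
        return any(a.kind == kind and a.target == target for a in self.history)
```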
The CUA also needs contextual metadata about the environment. This may include OS-level accessibility hints, window geometry, cursor coordinates, keyboard state, or application boundaries. These signals ensure that when the CUA issues an action (click, drag, type, scroll), the action lands on the intended target. For example, if the CUA knows a window is partially off-screen, it can reposition it before interacting. When paired with OCR, this metadata helps disambiguate similar-looking UI components by reading text labels or surrounding instructions.
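Window geometry alone is often enough to decide whether a target is reachable before clicking. The following is a minimal, self-contained sketch; the `Rect` type and screen size are assumptions, and the actual move and click calls would come from whatever automation layer the CUA uses:

```python
from dataclasses import dataclass

@dataclass
class Rect:
    x: int
    y: int
    width: int
    height: int

def reposition_offset(window: Rect, screen_w: int, screen_h: int) -> tuple[int, int]:
    # How far the window must move so that it lies fully on-screen; (0, 0) if no move is needed.
    dx, dy = 0, 0
    if window.x < 0:
        dx = -window.x
    elif window.x + window.width > screen_w:
        dx = screen_w - (window.x + window.width)
    if window.y < 0:
        dy = -window.y
    elif window.y + window.height > screen_h:
        dy = screen_h - (window.y + window.height)
    return dx, dy

# A partially off-screen window: the agent would move it by (dx, dy) before clicking.
window = Rect(x=-150, y=40, width=800, height=600)
dx, dy = reposition_offset(window, screen_w=1920, screen_h=1080)
print(dx, dy)  # 150 0
```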
For more advanced use cases, developers often enrich CUA inputs with domain-specific data or vectorized representations. For example, embeddings of UI elements or application terminology can be stored in Milvus or Zilliz Cloud and used to match previously seen interface states. This is useful when a CUA must operate in enterprise systems with dynamic layouts or inconsistent naming. By retrieving similar UI states or action sequences through vector search, the CUA can choose more reliable next steps without hard-coded rules.
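A minimal sketch of that retrieval step with the pymilvus client is shown below; the collection name, vector dimension, and placeholder random embeddings are assumptions, and in practice the vectors would come from a UI-element or screenshot encoder:

```python
import random
from pymilvus import MilvusClient

# Milvus Lite local file; a Milvus server or Zilliz Cloud URI can be passed instead.
client = MilvusClient("cua_ui_states.db")

DIM = 128  # would match the output dimension of the UI-state encoder

client.create_collection(collection_name="ui_states", dimension=DIM)

# Store embeddings of previously seen interface states along with the action that worked there.
seen_states = [
    {"id": i, "vector": [random.random() for _ in range(DIM)], "next_action": f"action_{i}"}
    for i in range(100)
]
client.insert(collection_name="ui_states", data=seen_states)

# At run time, embed the current screen and retrieve the most similar past states.
current_state = [random.random() for _ in range(DIM)]
hits = client.search(
    collection_name="ui_states",
    data=[current_state],
    limit=3,
    output_fields=["next_action"],
)
for hit in hits[0]:
    print(hit["distance"], hit["entity"]["next_action"])
```

The retrieved `next_action` values would then be treated as candidates for the agent's next step rather than hard-coded rules.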