building such a system would consume a large amount of data and computational resources. Another challenge arises: what if we want to incorporate modalities beyond language and images, such as video or voice? Would it be necessary to train an entirely new multi-modal model every time a new modality or function is added?
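Problem statement: could a system like ChatGPT also help with image understanding and generation? Such a system would need enormous compute resources and data. A second question is whether a new multi-modality model has to be trained for every new modality or function.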
We answer the above questions by proposing a system named Visual ChatGPT. Instead of training a new multimodal ChatGPT from scratch, we build Visual ChatGPT directly based on ChatGPT and incorporate a variety of VFMs. To bridge the gap between ChatGPT and these VFMs, we propose a Prompt Manager which supports the following functions:
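To address this, the authors propose a system called Visual ChatGPT. Instead of training a new multimodal model from scratch, it combines ChatGPT with a set of VFMs. To bridge the gap between the VFMs and ChatGPT, the authors propose a Prompt Manager.
- It tells ChatGPT explicitly what each VFM can do and which input-output format it expects.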
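One way to picture this first function is as a step that renders each VFM's capability and input-output format into plain text for the language model. The sketch below is only an illustration under that assumption: `ToolSpec`, `render_tool_prompt`, and the two example tool entries are hypothetical names and descriptions, not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical text description of one VFM, as shown to the language model."""
    name: str
    usage: str
    inputs: str
    outputs: str


def render_tool_prompt(tools: List[ToolSpec]) -> str:
    """Render capabilities and input-output formats as one plain-text block."""
    lines = ["You can use the following tools:"]
    for tool in tools:
        lines.append(
            f"> {tool.name}: {tool.usage} "
            f"Input: {tool.inputs}. Output: {tool.outputs}."
        )
    return "\n".join(lines)


# Illustrative entries only; the wording is an assumption.
TOOLS = [
    ToolSpec("Image Captioning", "Describes what is in an image.",
             "an image file path", "a text caption"),
    ToolSpec("Edge Detection", "Extracts an edge map from an image.",
             "an image file path", "the file path of the edge image"),
]

tool_prompt = render_tool_prompt(TOOLS)
print(tool_prompt)
```

Keeping the descriptions as data rather than hard-coded prompt text means a new VFM can be registered by adding one entry, without retraining anything.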
Let $S = \{(Q_1, A_1), (Q_2, A_2), \ldots, (Q_N, A_N)\}$ be a dialogue system with $N$ question-answer pairs. To get the response $A_i$ in the $i$-th round of conversation, a series of VFMs and intermediate outputs $A_i^{(j)}$ from those models are involved, where $j$ denotes the output from the $j$-th VFM ($F$) in the $i$-th round. More concretely, the Prompt Manager $M$ repeatedly modifies the format of $A_i^{(j)}$ to meet the input format of each $F$. In the end, the system outputs $A_i^{(j)}$ once it is denoted as the final response, and no more VFMs are executed. Eq. (1) provides a formal definition of Visual ChatGPT:
$S$ is a set of $N$ question-answer pairs. The Prompt Manager $M$ keeps modifying the format of $A_i^{(j)}$ so that it matches the input format of each VFM ($F$). In the end, the system outputs $A_i^{(j)}$ as the final response. See Eq. (1).
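Eq. (1) itself is not reproduced in these notes. Reading the components listed below as the arguments passed to ChatGPT, each one processed by the Prompt Manager $M$, the equation plausibly has the following shape. This is a reconstruction from the definitions here and should be checked against the paper before it is quoted.

```latex
A_i^{(j+1)} = \mathrm{ChatGPT}\Big(
    M(P),\; M(F),\; M(H_{<i}),\; M(Q_i),\;
    M\big(R_i^{(<j)}\big),\; M\big(F(A_i^{(j)})\big)
\Big)
```

Read left to right: the system principle, the VFM descriptions, the dialogue history, the current query, the reasoning history so far, and the latest VFM output together determine the next intermediate answer.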
– System Principle $P$: The System Principle provides basic rules for Visual ChatGPT.
– Visual Foundation Model $F$: One core of Visual ChatGPT is the combination of various VFMs.
– History of Dialogue $H_{<i}$: We define the dialogue history of the $i$-th round of conversation as the string concatenation of the previous question-answer pairs.
– User query $Q_i$: In Visual ChatGPT, query is a general term, since it can include both linguistic and visual queries.
– History of Reasoning $R_i^{(<j)}$: To solve a complex question, Visual ChatGPT may require the collaboration of multiple VFMs.
– Intermediate Answer $A^{(j)}$: When handling a complex query, Visual ChatGPT will try to obtain the final answer step by step by invoking different VFMs logically, thus producing multiple intermediate answers.
– Prompt Manager $M$: A prompt manager is designed to convert all the visual signals into language so that the ChatGPT model can understand them.
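The components above can be read as the state of a loop: the language model is asked again and again, with the reasoning history growing by one VFM call per step, until it marks an answer as final. The sketch below is a toy version under loud assumptions: `stub_llm` is a hand-written policy standing in for ChatGPT, `fake_caption` stands in for a real VFM, and the prompt layout in `build_prompt` is invented for illustration. None of it is the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

# P: hypothetical wording of a system principle.
SYSTEM_PRINCIPLE = "Use tools for visual tasks; refer to images by file name."


def fake_caption(image_path: str) -> str:
    """Stand-in for a captioning VFM; returns placeholder text."""
    return f"a placeholder caption for {image_path}"


# F: the available VFMs, keyed by the name the model uses to call them.
VFMS: Dict[str, Callable[[str], str]] = {"caption": fake_caption}


def build_prompt(history: List[Tuple[str, str]], query: str,
                 reasoning: List[Tuple[str, str, str]]) -> str:
    """Role of M: flatten P, F, H_{<i}, Q_i and R_i^{(<j)} into one text prompt."""
    parts = [f"Principle: {SYSTEM_PRINCIPLE}",
             "Tools: " + ", ".join(sorted(VFMS))]
    parts += [f"Human: {q}\nAI: {a}" for q, a in history]
    parts.append(f"Human: {query}")
    parts += [f"Action: {tool}({arg})\nObservation: {obs}"
              for tool, arg, obs in reasoning]
    return "\n".join(parts)


def stub_llm(prompt: str) -> Tuple[str, str]:
    """Toy policy in place of ChatGPT: call the captioner once, then answer."""
    lines = prompt.splitlines()
    observations = [l for l in lines if l.startswith("Observation: ")]
    if observations:
        return "final", observations[-1][len("Observation: "):]
    query_line = [l for l in lines if l.startswith("Human: ")][-1]
    return "caption", query_line.split()[-1]  # last token is the image path


def answer(query: str, history: List[Tuple[str, str]], max_steps: int = 5) -> str:
    """Run one dialogue round: iterate until the model marks an answer final."""
    reasoning: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        action, arg = stub_llm(build_prompt(history, query, reasoning))
        if action == "final":
            return arg
        observation = VFMS[action](arg)  # intermediate answer from one VFM
        reasoning.append((action, arg, observation))
    raise RuntimeError("no final answer within max_steps")


history: List[Tuple[str, str]] = []
reply = answer("Describe image/example.png", history)
history.append(("Describe image/example.png", reply))
print(reply)
```

The `max_steps` cap is a design choice added here so that a model which never emits a final answer cannot loop forever; the notes do not say how the original system bounds the number of VFM calls.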