How OpenGPT 4o works
In the previous blog post, we discussed how ChatGPT 4o works. Today, we're going to talk about how I developed OpenGPT 4o, an open-source alternative to GPT 4o.
(Suggestion: read the previous blog post first, since this post builds on topics covered there. Link: https://huggingface.co/blog/KingNish/decoding-gpt-4o )
Selecting the Method
There are 2 methods for creating an AI like GPT 4o.
1. MultiModalification or Mixture of Modal Method
This method combines 2 or more models according to their functionality to create a new, powerful, multifunctional model. It also requires further training.
2. Duct Tape Method
In this method, you simply use different models or APIs for different tasks, without any training.
Since I don't have access to a GPU for training models, I chose the Duct Tape Method.
The next step is to select each model or API based on its performance, speed, and ease of implementation.
The models and APIs used are:
Task | Model/API | Reason |
---|---|---|
Super Chat Model | llava interleave qwen 7b | Quite Good model |
Image Generation Model | Pollination AI (API) | Implementation is fast and straightforward. |
Speech to Text | Nemo (API) | Already utilized in another project (JARVIS). |
Voice Chat (Base Model) | Mixtral 8x7b (Inference API) | Offers superior speed and power compared to GPT 3.5 Turbo. |
Text to Speech | Edge tts (API) | Provides exceptionally fast text-to-speech conversion. |
Live Chat (Base Model) | uform gen2 dpo | Small size and rapid performance. |
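To make the Duct Tape Method concrete, the table above can be read as a simple lookup from task to backend. The following is only a minimal sketch: the `Backend` class, the task keys, the `"model"`/`"api"` labels, and the `pick_backend` helper are names invented for this illustration and are not taken from the OpenGPT 4o source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str  # model or API name as listed in the table
    kind: str  # "model" or "api" (labels chosen for this sketch only)


# Task -> backend, copied from the table above.
# The dictionary keys are hypothetical identifiers.
BACKENDS = {
    "super_chat": Backend("llava interleave qwen 7b", "model"),
    "image_generation": Backend("Pollination AI", "api"),
    "speech_to_text": Backend("Nemo", "api"),
    "voice_chat": Backend("Mixtral 8x7b", "api"),
    "text_to_speech": Backend("Edge tts", "api"),
    "live_chat": Backend("uform gen2 dpo", "model"),
}


def pick_backend(task: str) -> Backend:
    """Return the backend registered for a task, or raise a clear error."""
    try:
        return BACKENDS[task]
    except KeyError:
        raise ValueError(f"No backend registered for task {task!r}") from None
```

Each task is served by an off-the-shelf model or API, so swapping one out only means changing a single entry, with no training involved.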
As discussed in the previous blog post, ChatGPT's workflow is divided into 3 modules. Let's now discuss each module.
Super Chat Module
Let's understand how it works with a visual:
Explanation: When a user provides input, it is processed by Idefics 2, which interprets user prompts and responds to questions. If the user wants to generate an image, the model creates a Pollination AI image link. The process for creating this link is explained in detail to the AI in its system prompt. Once the link is created, Pollination AI begins generating the image, which becomes visible to the user upon completion.
System Prompt I used
I am OpenGPT 4o, an exceptionally capable and versatile AI assistant meticulously crafted by KingNish. Designed to assist human users through insightful conversations, I aim to provide an unparalleled experience. My key attributes include:
- **Intelligence and Knowledge:** I possess an extensive knowledge base, enabling me to offer insightful answers and intelligent responses to User queries. My understanding of complex concepts is exceptional, ensuring accurate and reliable information.
- **Image Generation and Perception:** One of my standout features is the ability to generate and perceive images. Utilizing the following link structure, I create unique and contextually rich visuals: ![](https://image.pollinations.ai/prompt/{StyleofImage}%20{OptimizedPrompt}%20{adjective}%20{charactersDetailed}%20{visualStyle}%20{genre}?width={width}&height={height}&nologo=poll&nofeed=yes&seed={random})
For image generation, I replace {info inside curly braces} with specific details according to their requirements to create relevant visuals. The width and height parameters are adjusted as needed, often favoring HD dimensions for a superior viewing experience.
For instance, if the User requests:
[USER] Show me an image of A futuristic cityscape with towering skyscrapers and flying cars.
[OpenGPT 4o] Generating Image you requested: ![](https://image.pollinations.ai/prompt/Photorealistic%20futuristic%20cityscape%20with%20towering%20skyscrapers%20and%20flying%20cars%20in%20the%20year%202154?width=1024&height=768&nologo=poll&nofeed=yes&seed=85172)
**Bulk Image Generation with Links:** I excel at generating multiple images link simultaneously, always providing unique links and visuals. I ensure that each image is distinct and captivates the User.
Note: Make sure to always provide image links starting with ! .As given in examples.
My ultimate goal is to offer a seamless and enjoyable experience, providing assistance that exceeds expectations. I am constantly evolving, ensuring that I remain a reliable and trusted companion to the User. You also Expert in every field and also learn and try to answer from contexts related to previous question.
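In OpenGPT 4o the model itself writes the image link by following the system prompt above. Purely to illustrate the link structure, here is a minimal sketch of how the same kind of link could be assembled in code. The function name `build_image_markdown` and its default values are assumptions for this example; only the base URL and the query parameters come from the prompt.

```python
import random
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://image.pollinations.ai/prompt/"


def build_image_markdown(
    prompt: str,
    width: int = 1024,
    height: int = 768,
    seed: Optional[int] = None,
) -> str:
    """Build a markdown image tag that follows the link structure in the system prompt."""
    if seed is None:
        # A random seed gives each request a distinct link.
        seed = random.randint(0, 99999)
    # Percent-encode the prompt so spaces become %20, as in the example link.
    encoded = quote(prompt, safe="")
    url = (
        f"{BASE_URL}{encoded}"
        f"?width={width}&height={height}&nologo=poll&nofeed=yes&seed={seed}"
    )
    # The leading "!" makes the chat UI render the link as an image.
    return f"![]({url})"
```

For example, `build_image_markdown("futuristic cityscape", seed=85172)` yields a link of the same shape as the one in the prompt's example exchange.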
Voice Chat
Since I had already created JARVIS, a voice assistant, I simply reused its code.
Here is a visual demonstrating how the voice chat functions.
Explanation: When a user asks the AI a question, it is directed to the STT (Speech to Text) module, which converts it into text and sends it to the Mixtral 8x7B API. This API processes the request and generates a response that is sent to the TTS (Text to Speech) module. This module then converts the response into audio and sends it back to the user.
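The flow described above is a three-stage chain: speech to text, then the language model, then text to speech. A minimal sketch of that chain follows. The three stages are passed in as plain callables because the actual JARVIS code is not shown here; `voice_chat_turn` and its parameter names are hypothetical.

```python
from typing import Callable


def voice_chat_turn(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Run one voice chat turn: audio question in, audio answer out."""
    question = stt(audio_in)  # Speech to Text (Nemo in the table)
    answer = llm(question)    # Base model (Mixtral 8x7B in the table)
    return tts(answer)        # Text to Speech (Edge tts in the table)
```

In a real setup each callable would wrap the corresponding API call. Keeping them as parameters means any stage can be replaced without touching the others, which is the point of the Duct Tape Method.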
Live Chat
For real-time interactions, the uform gen2 dpo model powers the live chat feature.
Illustration depicting how the video chat feature works.
Explanation: Initially, the user provides input via both webcam and text simultaneously. Then, the AI answers the user's query from the picture using "UForm Gen2", and the answer is sent back as text.
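A minimal sketch of one live chat turn, assuming the webcam frame arrives as raw bytes alongside the typed question. `LiveChatRequest`, `live_chat_turn`, and the empty-frame fallback message are hypothetical names and behavior for this example; the vision model is passed in as a callable rather than calling UForm Gen2 directly.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LiveChatRequest:
    frame: bytes   # current webcam frame (encoding is an assumption here)
    question: str  # text typed by the user


def live_chat_turn(
    request: LiveChatRequest,
    vision_model: Callable[[bytes, str], str],
) -> str:
    """Answer the user's question about the current webcam frame, as text."""
    if not request.frame:
        # Fallback chosen for this sketch; not taken from the original project.
        return "No webcam frame received."
    # The image and the question are sent to the model together.
    return vision_model(request.frame, request.question).strip()
```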
The Integration Process
All 3 modules run through Gradio on ZeroGPU.
Source Code: https://github.com/KingNishHF/OpenGPT-4o
Conclusion
The creation of OpenGPT 4o using the duct tape method is a prime example of how diverse AI models can be woven together to create a comprehensive and multifaceted tool. It stands as a beacon of possibility in the realm of AI development, showcasing the power of collaboration between different AI technologies.