OpenAI Announces ChatGPT Voice and Image Features

OpenAI recently announced new voice and image features for ChatGPT. A new backend model, GPT-4V, will handle image inputs, and an updated DALL-E model will be integrated to generate images. In addition, users of the mobile ChatGPT app will be able to hold voice conversations with the chatbot.

OpenAI announced that the newest version of their image generation AI, DALL-E 3, was in "research preview" and would be available to ChatGPT Plus and Enterprise users in the coming month. Its integration with ChatGPT means that users can more easily create prompts with help from the chatbot. The ability to understand image input is supported by a multimodal version of the underlying GPT model called GPT-4 Vision (GPT-4V). The voice feature uses OpenAI's Whisper automatic speech recognition (ASR) model to transcribe user voice input, and a new text-to-speech (TTS) model converts ChatGPT's text output into the user's choice of five available voices. OpenAI is deploying the new features gradually, citing safety concerns, and has conducted beta testing and "red teaming" to explore and mitigate risks. According to OpenAI:

Large multimodal models introduce different limitations and expand the risk surface compared to text-based language models. GPT-4V possesses the limitations and capabilities of each modality (text and vision), while at the same time presenting novel capabilities emerging from the intersection of said modalities and from the intelligence and reasoning afforded by large scale models.
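The speech-in/speech-out flow described above (speech recognition, chat completion, speech synthesis) can be approximated with OpenAI's public developer API. The sketch below is illustrative only: the "whisper-1", "gpt-4", and "tts-1" model names, the "alloy" voice, and the file names are assumptions about the developer-facing API, not a description of what the ChatGPT mobile app uses internally.

# Minimal sketch of a voice round trip using OpenAI's public API.
# Model names, voice name, and file paths are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# 1. Speech to text: transcribe the user's recorded question with Whisper.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text to text: send the transcript to a chat model.
chat = client.chat.completions.create(
    model="gpt-4",  # assumed model name for illustration
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Text to speech: synthesize the reply in one of the available voices.
speech = client.audio.speech.create(
    model="tts-1",   # assumed TTS model name
    voice="alloy",   # API voice names differ from the app's five voices
    input=reply,
)
speech.stream_to_file("reply.mp3")

Chaining three separate requests in this way adds latency at each stage, a point echoed in the user feedback quoted below.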

OpenAI published a paper describing their testing efforts with GPT-4V. They used the model in a tool called Be My AI, which aids vision-impaired people by describing the contents of images. OpenAI ran a pilot program with 200 beta testers from March to August 2023, then expanded it to 16,000 users in September 2023. They also ran a developer alpha program in which more than 1,000 developers had access to the model over three months; the goal was to "gain additional feedback and insight into the real ways people interact with GPT-4V."

The paper summarizes OpenAI's evaluation of the model's behavior in several areas, such as its refusal to generate harmful content, its refusal to identify people in images, its ability to break CAPTCHAs, and its resistance to image-based "jailbreaks." OpenAI also engaged "red teams" to test the model's abilities in scientific domains, such as understanding images in publications, as well as its ability to provide medical advice given medical images such as CT scans. The paper specifically notes that "we do not consider the current version of GPT-4V to be fit for performing any medical function."

Several users discussed the new features in a thread on Hacker News. One user pointed out some limitations of the voice feature:

Voice has the potential to be awesome. This demo is really underwhelming to me because of the multi-second latency between the query and response, just like every other lame voice assistant. It doesn't have to be this way! [Determining] when the user is done talking is tough. What's needed is a speech conversation turn-taking dataset and model; that's missing from off the shelf speech recognition systems.
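To illustrate the turn-taking problem the commenter describes, the hypothetical sketch below implements the simplest possible endpointer: treat a fixed window of low-energy audio as "the user is done talking." The threshold, frame size, and silence window are arbitrary illustrative values; production systems use trained voice-activity or turn-taking models rather than this heuristic.

import numpy as np

def naive_endpoint(frames, frame_ms=30, energy_threshold=0.01, silence_ms=800):
    """Return the frame index at which the user is assumed to have finished
    speaking: the first point preceded by silence_ms of low-energy audio.
    Returns None if no such point is found."""
    needed = silence_ms // frame_ms
    silent_run = 0
    for i, frame in enumerate(frames):
        rms = np.sqrt(np.mean(np.square(frame)))
        silent_run = silent_run + 1 if rms < energy_threshold else 0
        if silent_run >= needed:
            return i
    return None

# Synthetic check: one second of "speech" (noise) followed by one second
# of silence, framed at 30 ms / 16 kHz.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 0.1, 16000), np.zeros(16000)])
frame_len = 480
frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
print(naive_endpoint(frames))  # fires ~0.8 s after the speech actually ends

By construction, a detector like this cannot respond until the full silence window has elapsed, which puts a hard floor on response latency; a model trained on conversational turn-taking, as the commenter suggests, could in principle respond sooner.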

Several of OpenAI's partners have been releasing products that use the new features. Spotify recently announced Voice Translation for some of their podcasts, which uses "OpenAI’s newly released voice generation technology" to generate a translation that mimics the original speaker. Microsoft's CEO of Advertising and Web Services, Mikhail Parakhin, announced on X (formerly Twitter) that DALL-E 3 was being rolled out to Bing's image generation tool. OpenAI also announced on X that it would be making ChatGPT's "Browse with Bing" feature generally available soon. This feature gives the bot access to information that was published on the web after the model was trained.
