Computer Vision Content on InfoQ
-
Nexa AI Unveils Omnivision: a Compact Vision-Language Model for Edge AI
Nexa AI unveiled Omnivision, a compact vision-language model tailored for edge devices. By reducing the number of image tokens ninefold, from 729 to 81, Omnivision lowers latency and computational requirements while maintaining strong performance on tasks like visual question answering and image captioning.
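One way to get a 9x token reduction is to merge each 3x3 neighborhood of patch embeddings into a single token before projecting into the language model's space. The PyTorch sketch below illustrates that idea; the dimensions and reshaping scheme are illustrative assumptions, not necessarily Nexa AI's exact projector design.

```python
import torch

def compress_vision_tokens(patch_embeds: torch.Tensor) -> torch.Tensor:
    """Illustrative 9x token compression: (B, 729, D) -> (B, 81, 9*D).

    Assumes the 729 tokens form a 27x27 grid; each 3x3 neighborhood is
    concatenated into one token, which a projector MLP would then map
    into the language model's embedding space.
    """
    b, n, d = patch_embeds.shape           # n = 729 = 27 * 27
    g = int(n ** 0.5)                      # 27
    x = patch_embeds.view(b, g, g, d)
    # Group into 3x3 blocks: (B, 9, 3, 9, 3, D) -> (B, 9, 9, 3, 3, D)
    x = x.view(b, g // 3, 3, g // 3, 3, d).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (g // 3) ** 2, 9 * d)   # (B, 81, 9*D)

tokens = torch.randn(1, 729, 1152)              # hypothetical encoder output
print(compress_vision_tokens(tokens).shape)     # torch.Size([1, 81, 10368])
```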
-
LLaVA-CoT Shows How to Achieve Structured, Autonomous Reasoning in Vision Language Models
Chinese researchers fine-tuned Llama-3.2-11B to improve its ability to solve multimodal reasoning problems, going beyond direct-response and chain-of-thought (CoT) approaches to reason step by step in a structured way. Named LLaVA-CoT, the new model outperforms its base model and proves better than larger models, including Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
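LLaVA-CoT structures each response into four tagged stages: summary, caption, reasoning, and conclusion. A small sketch of parsing such staged output, assuming the tag format described in the paper:

```python
import re

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Extract each stage from a <STAGE>...</STAGE> tagged response."""
    out = {}
    for stage in STAGES:
        m = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        out[stage.lower()] = m.group(1).strip() if m else None
    return out

response = (
    "<SUMMARY>I will count the objects in the image.</SUMMARY>"
    "<CAPTION>The image shows three red apples on a table.</CAPTION>"
    "<REASONING>Counting each distinct apple gives 1, 2, 3.</REASONING>"
    "<CONCLUSION>There are 3 apples.</CONCLUSION>"
)
print(parse_staged_response(response)["conclusion"])  # There are 3 apples.
```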
-
OpenAI Developer Day 2024 (SF) Announces Real-Time API, Vision Fine-Tuning, and More
On October 1, 2024, OpenAI's San Francisco DevDay introduced several new capabilities, including a Real-Time API enabling low-latency voice interactions and function calling. Newly announced model distillation and vision fine-tuning features let developers customize models for diverse applications. Upcoming events in London and Singapore will expand on these announcements.
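Vision fine-tuning uses the same chat-style JSONL training format as text fine-tuning, with images supplied as image_url content parts. A hedged sketch of building one training record (the URL, question, and answer are placeholders):

```python
import json

# One JSONL training record for vision fine-tuning: a user turn mixing
# text and an image, plus the assistant answer the model should learn.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What traffic sign is shown?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sign.jpg"}},
            ],
        },
        {"role": "assistant", "content": "A stop sign."},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```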
-
Apple Open-Sources Multimodal AI Model 4M-21
Researchers at Apple and the Swiss Federal Institute of Technology Lausanne (EPFL) have open-sourced 4M-21, a single any-to-any AI model that can handle 21 input and output modalities. 4M-21 performs well "out of the box" on several vision benchmarks and is available under the Apache 2.0 license.
-
Google Trains User Interface and Infographics Understanding AI Model ScreenAI
Google Research recently developed ScreenAI, a multimodal AI model for understanding infographics and user interfaces. ScreenAI is based on the PaLI architecture and achieves state-of-the-art performance on several tasks.
-
Nvidia Announces Robotics-Oriented AI Foundational Model
At its recent GTC 2024 event, Nvidia announced a new foundational model to build intelligent humanoid robots. Dubbed GR00T, short for Generalist Robot 00 Technology, the model will understand natural language and be able to observe human actions and emulate human movements.
-
Apple Researchers Detail Method to Combine Different LLMs to Achieve State-of-the-Art Performance
Many large language models (LLMs), both closed and open source, have become available recently, leading to the creation of combined models known as multimodal LLMs (MLLMs). Yet few of them disclose the design choices that went into creating them, say Apple researchers, who distilled principles and lessons for designing state-of-the-art (SOTA) multimodal LLMs.
-
Vesuvius Challenge Winners Use AI to Read Ancient Scroll
The Vesuvius Challenge recently announced the winners of their 2023 Grand Prize. The winning team used an ensemble of AI models to read text from a scroll of papyrus that was buried in volcanic ash nearly 2,000 years ago.
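Ensembling in this setting typically means averaging the per-pixel ink probabilities produced by several independently trained models before thresholding. A generic NumPy sketch of that idea (not the winning team's actual code):

```python
import numpy as np

def ensemble_ink_map(prob_maps: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    """Average per-pixel ink probabilities from several models, then threshold.

    prob_maps: list of (H, W) arrays in [0, 1], one per model.
    Returns a binary (H, W) mask of predicted ink.
    """
    mean_probs = np.mean(np.stack(prob_maps), axis=0)
    return mean_probs > threshold

# Three hypothetical model outputs over a 4x4 papyrus patch
maps = [np.random.rand(4, 4) for _ in range(3)]
print(ensemble_ink_map(maps).astype(int))
```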
-
NVIDIA Introduces Metropolis Microservices for Jetson to Run AI Apps at the Edge
NVIDIA has expanded Metropolis Microservices, its cloud-based AI solution, to run on the NVIDIA Jetson embedded IoT platform, including support for video streaming and AI-based perception.
-
Meta Announces Generative AI Models Emu Video and Emu Edit
Meta AI Research announced two new generative AI models: Emu Video, which can generate short videos given a text prompt, and Emu Edit, which can edit images given text-based instructions. Both models are based on Meta's Emu foundation model and exhibit state-of-the-art performance on several benchmarks.
-
Combating AI-Generated Fake Images with JavaScript Libraries, by Kate Sills at QCon San Francisco
At the recent QCon San Francisco conference, Kate Sills gave a talk about combating AI-generated fake images using existing JavaScript libraries. She advocated using cryptographic timestamping to prove when photos were taken and digital signatures to verify that an image came from a legitimate source.
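The talk centered on JavaScript libraries; the core signing idea translates directly to a minimal Python sketch using the cryptography package, where the capture device signs the image bytes and anyone with the public key can later verify them:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# At capture time: the camera (or capture app) signs the raw image bytes.
private_key = Ed25519PrivateKey.generate()
image_bytes = b"...raw image data..."
signature = private_key.sign(image_bytes)

# Later: anyone with the public key can check the image was not altered.
public_key = private_key.public_key()
try:
    public_key.verify(signature, image_bytes)
    print("Image signature verified")
except InvalidSignature:
    print("Image was modified or not signed by this key")
```

Trusted timestamping additionally requires submitting a hash of the image to a timestamp authority, which is beyond this sketch.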
-
Berkeley Open-Sources AI Image-Editing Model InstructPix2Pix
Researchers from the Berkeley Artificial Intelligence Research (BAIR) Lab have open-sourced InstructPix2Pix, a deep-learning model that follows human instructions to edit images. InstructPix2Pix was trained on synthetic data and outperforms a baseline AI image-editing model.
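The released checkpoint can be run through the Hugging Face diffusers pipeline; a minimal usage sketch (the input image path, instruction, and generation parameters are placeholders):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load the released InstructPix2Pix checkpoint from the Hugging Face Hub
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB")

# Edit the image by following a plain-language instruction
edited = pipe(
    "make it look like a watercolor painting",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited.save("edited.jpg")
```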
-
Apple Extends Core ML, Create ML, and Vision Frameworks for iOS 17
At its recent WWDC 2023 developer conference, Apple presented a number of extensions and updates to its machine learning and vision ecosystem, including updates to its Core ML framework, new features for the Create ML modeling tool, and new vision APIs for image segmentation, animal body pose detection, and 3D human body pose estimation.
-
Voxel51 Open-Sources Computer Vision Dataset Assistant VoxelGPT - Q&A with Jason Corso
Voxel51 recently open-sourced VoxelGPT, an AI assistant that interfaces with GPT-3.5 to produce Python code for querying computer vision datasets. InfoQ spoke with Jason Corso, co-founder and CSO of Voxel51, who shared the lessons and insights his team gained while developing VoxelGPT.
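VoxelGPT translates natural-language questions into FiftyOne view code. A hand-written example of the kind of query it generates (the dataset name and field names are illustrative):

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Load an existing FiftyOne dataset (name is a placeholder)
dataset = fo.load_dataset("my-detection-dataset")

# "Show me samples with at least one low-confidence 'person' prediction"
view = dataset.filter_labels(
    "predictions", (F("label") == "person") & (F("confidence") < 0.4)
)
print(view.count())
```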
-
Meta Open-Sources Computer Vision Foundation Model DINOv2
Meta AI Research open-sourced DINOv2, a foundation model for computer vision (CV) tasks. DINOv2 is pretrained on a curated dataset of 142M images and can be used as a backbone for several tasks, including image classification, video action recognition, semantic segmentation, and depth estimation.
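The pretrained backbones can be pulled via torch.hub and used as frozen feature extractors. A minimal sketch with the ViT-S/14 variant, attaching a linear head for a hypothetical 10-class downstream task:

```python
import torch

# Load the pretrained ViT-S/14 backbone from the DINOv2 repo
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

# DINOv2 expects image sides divisible by the 14-pixel patch size
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = backbone(x)   # (1, 384) global embedding for ViT-S/14
print(features.shape)

# A downstream head, e.g. linear classification over the frozen features
head = torch.nn.Linear(384, 10)
logits = head(features)
```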