Computer Vision Content on InfoQ
-
Nexa AI Unveils Omnivision: a Compact Vision-Language Model for Edge AI
Nexa AI unveiled Omnivision, a compact vision-language model tailored for edge devices. By reducing the number of image tokens ninefold, from 729 to 81, Omnivision lowers latency and computational requirements while maintaining strong performance on tasks like visual question answering and image captioning.
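One way to get a 9x token reduction is to merge each 3x3 neighborhood of patch embeddings into a single token before projecting into the language model's space. The PyTorch sketch below illustrates that idea; the dimensions and reshaping scheme are illustrative assumptions, not necessarily Nexa AI's exact projector design.

```python
import torch

def compress_vision_tokens(patch_embeds: torch.Tensor) -> torch.Tensor:
    """Illustrative 9x token compression: (B, 729, D) -> (B, 81, 9*D).

    Assumes the 729 tokens form a 27x27 grid; each 3x3 neighborhood is
    concatenated into one token, which a projector MLP would then map
    into the language model's embedding space.
    """
    b, n, d = patch_embeds.shape           # n = 729 = 27 * 27
    g = int(n ** 0.5)                      # 27
    x = patch_embeds.view(b, g, g, d)
    # Group into 3x3 blocks: (B, 9, 3, 9, 3, D) -> (B, 9, 9, 3, 3, D)
    x = x.view(b, g // 3, 3, g // 3, 3, d).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, (g // 3) ** 2, 9 * d)   # (B, 81, 9*D)

tokens = torch.randn(1, 729, 1152)              # hypothetical encoder output
print(compress_vision_tokens(tokens).shape)     # torch.Size([1, 81, 10368])
```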
-
LLaVA-CoT Shows How to Achieve Structured, Autonomous Reasoning in Vision Language Models
Chinese researchers fine-tuned Llama-3.2-11B to improve its ability to solve multimodal reasoning problems, going beyond direct-response and chain-of-thought (CoT) approaches to reason step by step in a structured way. Named LLaVA-CoT, the new model outperforms its base model and proves better than larger models, including Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
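LLaVA-CoT structures each response into four tagged stages: summary, caption, reasoning, and conclusion. A small sketch of parsing such staged output, assuming the tag format described in the paper:

```python
import re

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Extract each stage from a <STAGE>...</STAGE> tagged response."""
    out = {}
    for stage in STAGES:
        m = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        out[stage.lower()] = m.group(1).strip() if m else None
    return out

response = (
    "<SUMMARY>I will count the objects in the image.</SUMMARY>"
    "<CAPTION>The image shows three red apples on a table.</CAPTION>"
    "<REASONING>Counting each distinct apple gives 1, 2, 3.</REASONING>"
    "<CONCLUSION>There are 3 apples.</CONCLUSION>"
)
print(parse_staged_response(response)["conclusion"])  # There are 3 apples.
```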
-
OpenAI Developer Day 2024 (SF) Announces Real-Time API, Vision Fine-Tuning, and More
On October 1, 2024, OpenAI's San Francisco DevDay introduced several new capabilities, including a Real-Time API enabling low-latency voice interactions and function calling. Newly announced model distillation and vision fine-tuning features let developers customize models for diverse applications. Upcoming events in London and Singapore will expand on these announcements.
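Vision fine-tuning uses the same chat-style JSONL training format as text fine-tuning, with images supplied as image_url content parts. A hedged sketch of building one training record (the URL, question, and answer are placeholders):

```python
import json

# One JSONL training record for vision fine-tuning: a user turn mixing
# text and an image, plus the assistant answer the model should learn.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What traffic sign is shown?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sign.jpg"}},
            ],
        },
        {"role": "assistant", "content": "A stop sign."},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```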
-
Apple Open-Sources Multimodal AI Model 4M-21
Researchers at Apple and the Swiss Federal Institute of Technology Lausanne (EPFL) have open-sourced 4M-21, a single any-to-any AI model that can handle 21 input and output modalities. 4M-21 performs well "out of the box" on several vision benchmarks and is available under the Apache 2.0 license.
-
Google Trains User Interface and Infographics Understanding AI Model ScreenAI
Google Research recently developed ScreenAI, a multimodal AI model for understanding infographics and user interfaces. ScreenAI is based on the PaLI architecture and achieves state-of-the-art performance on several tasks.
-
Nvidia Announces Robotics-Oriented AI Foundational Model
At its recent GTC 2024 event, Nvidia announced a new foundational model to build intelligent humanoid robots. Dubbed GR00T, short for Generalist Robot 00 Technology, the model will understand natural language and be able to observe human actions and emulate human movements.
-
Apple Researchers Detail Method to Combine Different LLMs to Achieve State-of-the-Art Performance
Many large language models (LLMs), both closed and open source, have become available recently, leading to the creation of combined models known as multimodal LLMs (MLLMs). Yet few of them disclose the design choices that went into creating them, say Apple researchers, who distilled principles and lessons for designing state-of-the-art (SOTA) multimodal LLMs.
-
Vesuvius Challenge Winners Use AI to Read Ancient Scroll
The Vesuvius Challenge recently announced the winners of their 2023 Grand Prize. The winning team used an ensemble of AI models to read text from a scroll of papyrus that was buried in volcanic ash nearly 2,000 years ago.
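Ensembling in this setting typically means averaging the per-pixel ink probabilities produced by several independently trained models before thresholding. A generic NumPy sketch of that idea (not the winning team's actual code):

```python
import numpy as np

def ensemble_ink_map(prob_maps: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    """Average per-pixel ink probabilities from several models, then threshold.

    prob_maps: list of (H, W) arrays in [0, 1], one per model.
    Returns a binary (H, W) mask of predicted ink.
    """
    mean_probs = np.mean(np.stack(prob_maps), axis=0)
    return mean_probs > threshold

# Three hypothetical model outputs over a 4x4 papyrus patch
maps = [np.random.rand(4, 4) for _ in range(3)]
print(ensemble_ink_map(maps).astype(int))
```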
-
NVIDIA Introduces Metropolis Microservices for Jetson to Run AI Apps at the Edge
NVIDIA has expanded Metropolis Microservices, its cloud-based AI solution, to run on the NVIDIA Jetson embedded IoT platform, including support for video streaming and AI-based perception.
-
Meta Announces Generative AI Models Emu Video and Emu Edit
Meta AI Research announced two new generative AI models: Emu Video, which can generate short videos given a text prompt, and Emu Edit, which can edit images given text-based instructions. Both models are based on Meta's Emu foundation model and exhibit state-of-the-art performance on several benchmarks.
-
Combating AI-Generated Fake Images with JavaScript Libraries, by Kate Sills at QCon San Francisco
At the recent QCon San Francisco conference, Kate Sills gave a talk about combating AI-generated fake images using existing JavaScript libraries. She advocated using cryptographic timestamping to prove when photos were taken and digital signatures to verify that an image came from a legitimate source.
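The talk centered on JavaScript libraries; the core signing idea translates directly to a minimal Python sketch using the cryptography package, where the capture device signs the image bytes and anyone with the public key can later verify them:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# At capture time: the camera (or capture app) signs the raw image bytes.
private_key = Ed25519PrivateKey.generate()
image_bytes = b"...raw image data..."
signature = private_key.sign(image_bytes)

# Later: anyone with the public key can check the image was not altered.
public_key = private_key.public_key()
try:
    public_key.verify(signature, image_bytes)
    print("Image signature verified")
except InvalidSignature:
    print("Image was modified or not signed by this key")
```

Trusted timestamping additionally requires submitting a hash of the image to a timestamp authority, which is beyond this sketch.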
-
Berkeley Open-Sources AI Image-Editing Model InstructPix2Pix
Researchers from the Berkeley Artificial Intelligence Research (BAIR) Lab have open-sourced InstructPix2Pix, a deep-learning model that follows human instructions to edit images. InstructPix2Pix was trained on synthetic data and outperforms a baseline AI image-editing model.
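The released checkpoint can be run through the Hugging Face diffusers pipeline; a minimal usage sketch (the input image path, instruction, and generation parameters are placeholders):

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

# Load the released InstructPix2Pix checkpoint from the Hugging Face Hub
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB")

# Edit the image by following a plain-language instruction
edited = pipe(
    "make it look like a watercolor painting",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
).images[0]
edited.save("edited.jpg")
```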
-
Apple Extends Core ML, Create ML, and Vision Frameworks for iOS 17
At its recent WWDC 2023 developer conference, Apple presented a number of extensions and updates to its machine learning and vision ecosystem, including updates to its Core ML framework, new features for the Create ML modeling tool, and new vision APIs for image segmentation, animal body pose detection, and 3D human body pose estimation.
-
Voxel51 Open-Sources Computer Vision Dataset Assistant VoxelGPT - Q&A with Jason Corso
Voxel51 recently open-sourced VoxelGPT, an AI assistant that interfaces with GPT-3.5 to produce Python code for querying computer vision datasets. InfoQ spoke with Jason Corso, co-founder and CSO of Voxel51, who shared the lessons and insights his team gained while developing VoxelGPT.
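VoxelGPT translates natural-language questions into FiftyOne view code. A hand-written example of the kind of query it generates (the dataset name and field names are illustrative):

```python
import fiftyone as fo
from fiftyone import ViewField as F

# Load an existing FiftyOne dataset (name is a placeholder)
dataset = fo.load_dataset("my-detection-dataset")

# "Show me samples with at least one low-confidence 'person' prediction"
view = dataset.filter_labels(
    "predictions", (F("label") == "person") & (F("confidence") < 0.4)
)
print(view.count())
```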
-
Meta Open-Sources Computer Vision Foundation Model DINOv2
Meta AI Research open-sourced DINOv2, a foundation model for computer vision (CV) tasks. DINOv2 is pretrained on a curated dataset of 142M images and can be used as a backbone for several tasks, including image classification, video action recognition, semantic segmentation, and depth estimation.
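The pretrained backbones can be pulled via torch.hub and used as frozen feature extractors. A minimal sketch with the ViT-S/14 variant, attaching a linear head for a hypothetical 10-class downstream task:

```python
import torch

# Load the pretrained ViT-S/14 backbone from the DINOv2 repo
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

# DINOv2 expects image sides divisible by the 14-pixel patch size
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = backbone(x)   # (1, 384) global embedding for ViT-S/14
print(features.shape)

# A downstream head, e.g. linear classification over the frozen features
head = torch.nn.Linear(384, 10)
logits = head(features)
```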