Translating Visuals into Words: Image Captioning with AI
A basic project built with the W5100S-EVB-PICO board, using an image captioning model hosted on Replicate to generate descriptions for images.


Arducam HM0360 Camera Module x 1
Overview
In this project, we use two W5100S-EVB-PICO boards:
1. The first board is connected to an Arducam camera and to Ethernet, and serves the captured picture to a web page whenever it receives a web request (see the sketch after this list).
2. The second board fetches that image from the first PICO board over its own Ethernet connection, performs image-to-text captioning by passing the image's web address to the Replicate API, and displays the resulting caption on an SSD1306 OLED screen.
Both steps are described in more detail below.
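As an illustration of the first board's role, here is a minimal MicroPython-style sketch that serves a captured frame over HTTP. It assumes WIZnet's RP2040 MicroPython port (network.WIZNET5K on SPI0 of the W5100S-EVB-PICO) and a hypothetical capture_image() helper standing in for the Arducam driver; the project's actual camera code lives in the circuitpython-arducam repository linked at the end, so treat this only as an outline of the request/response flow and adjust the pins, IP settings, and content type to your setup.

```python
from machine import Pin, SPI
import network
import socket

def init_ethernet():
    # Default W5100S-EVB-PICO wiring: SPI0 with MOSI=GP19, MISO=GP16, SCK=GP18, CS=GP17, RST=GP20
    spi = SPI(0, 2_000_000, mosi=Pin(19), miso=Pin(16), sck=Pin(18))
    nic = network.WIZNET5K(spi, Pin(17), Pin(20))
    nic.active(True)
    # Example static configuration; change these addresses to match your network
    nic.ifconfig(("192.168.1.20", "255.255.255.0", "192.168.1.1", "8.8.8.8"))
    print("Serving image at http://%s/" % nic.ifconfig()[0])
    return nic

def serve_image(capture_image):
    # Answer every incoming HTTP request with the latest camera frame.
    addr = socket.getaddrinfo("0.0.0.0", 80)[0][-1]
    server = socket.socket()
    server.bind(addr)
    server.listen(1)
    while True:
        client, _ = server.accept()
        client.recv(1024)                # read and discard the HTTP request
        frame = capture_image()          # hypothetical Arducam helper returning image bytes
        client.send(b"HTTP/1.1 200 OK\r\n"
                    b"Content-Type: image/jpeg\r\n"   # adjust to the camera's output format
                    b"Content-Length: " + str(len(frame)).encode() + b"\r\n\r\n")
        client.send(frame)
        client.close()

# init_ethernet()
# serve_image(my_capture_function)   # plug in the Arducam capture routine here
```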
Model used for image captioning
BLIP-2 is a part of Salesforce's LAVIS project. BLIP-2 is a generic and efficient pre-training strategy that leverages the advancements of pretrained vision models and large language models (LLMs). BLIP-2 surpasses Flamingo in zero-shot VQAv2 (scoring 65.0 vs 56.3) and sets a new state-of-the-art in zero-shot captioning (achieving a CIDEr score of 121.6 on NoCaps compared to the previous best of 113.2). When paired with powerful LLMs such as OPT and FlanT5, BLIP-2 unveils new zero-shot instructed vision-to-language generation capabilities for a range of intriguing applications.
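The project talks to this model through Replicate's HTTP predictions API rather than running it locally. The sketch below shows one way the second board could do that from MicroPython, assuming the urequests module with TLS support and the common ssd1306.py driver. The API token, model version hash, image URL, I2C pins, and the "caption" input field are placeholders or assumptions; take the exact version ID and input schema from the BLIP-2 model page on Replicate.

```python
import time
import json
import urequests                      # MicroPython HTTP client
from machine import I2C, Pin
import ssd1306                        # common MicroPython SSD1306 driver (copied to the board)

REPLICATE_TOKEN = "r8_..."            # placeholder: your Replicate API token
BLIP2_VERSION = "<model-version-id>"  # placeholder: copy from the BLIP-2 page on Replicate
IMAGE_URL = "http://192.168.1.20/"    # example: the URL served by the first PICO board

HEADERS = {
    "Authorization": "Token " + REPLICATE_TOKEN,
    "Content-Type": "application/json",
}

def caption_image(image_url):
    # Create a prediction; the input fields here assume BLIP-2's captioning mode.
    body = json.dumps({"version": BLIP2_VERSION,
                       "input": {"image": image_url, "caption": True}})
    r = urequests.post("https://api.replicate.com/v1/predictions",
                       headers=HEADERS, data=body)
    prediction = r.json()
    r.close()

    # Poll the prediction URL until the model finishes.
    get_url = prediction["urls"]["get"]
    while prediction["status"] not in ("succeeded", "failed", "canceled"):
        time.sleep(1)
        r = urequests.get(get_url, headers=HEADERS)
        prediction = r.json()
        r.close()
    return prediction.get("output")

def show_caption(text):
    # Example I2C wiring (GP8 = SDA0, GP9 = SCL0); adjust to your OLED connection.
    i2c = I2C(0, sda=Pin(8), scl=Pin(9))
    oled = ssd1306.SSD1306_I2C(128, 64, i2c)
    oled.fill(0)
    # Wrap the caption into 16-character lines; a 128x64 display fits 8 such rows.
    for row, chunk in enumerate([text[i:i + 16] for i in range(0, len(text), 16)][:8]):
        oled.text(chunk, 0, row * 8)
    oled.show()

caption = caption_image(IMAGE_URL)
print("Caption:", caption)
show_caption(str(caption))
```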
Installation and How to Use
In progress.
Outputs
Potential Developments
The current focus of this project is captioning an image and displaying the resulting text. By adding a speaker in the future, however, it could evolve into an assistive application for the visually impaired, an anomaly-detection tool, or various other projects.
Contributions and Feedback
Contributions and feedback on the project are always welcome. Please use the GitHub issue tracker to report issues or create pull requests.
Original link
- circuitpython-arducam: I modified the code a bit to get a more stable connection.
- micropython-captioning: Code that uses the Replicate API for image-to-text (IMG2TXT) captioning.