Hands-On Vision RAG: Images, Tables & Text

Summary

This video introduces multimodal retrieval systems, which handle the images, text, and tables found in documents. It explains why documents that contain images and tables are difficult for text-only pipelines and call for a vision language model. The video showcases a new multimodal embedding approach from Cohere that generates embeddings directly, eliminating the need for patches and reducing memory requirements. It also discusses keeping the generated embeddings in a vector store and quantizing Large Language Model (LLM) weights to save storage. Finally, it shows how to integrate and test models from Cohere and Google Gemini to build a vision-based retrieval system.

Chapters

Introduction to Multimodal Retrieval Systems
Text-Based RAG System Setup
New Embedding Approach
Vector Store Embedding
Quantization of LLM Weights
API Key Requirements
Image Embedding Process
Query Generation and Answering
Local Model Setup

Introduction to Multimodal Retrieval Systems

Introduction to multimodal retrieval systems that process images, text, and tables in documents, in both cloud-based and local setups.

Text-Based RAG System Setup

Overview of a traditional text-based RAG (retrieval-augmented generation) setup and the challenges that arise when documents contain images and tables, which require a vision language model to process.

New Embedding Approach

Explanation of a new multimodal embedding approach from Cohere that generates embeddings directly, eliminating the need for patches and reducing memory requirements.
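To make the memory point concrete, here is a rough back-of-the-envelope sketch. The patch count and vector dimensions are illustrative assumptions, not figures from the video or from any specific model; the helper `embedding_bytes` is hypothetical.

```python
def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store `num_vectors` vectors of `dim` values each."""
    return num_vectors * dim * bytes_per_value


# Illustrative numbers only (assumptions, not taken from the video):
# a patch-based approach storing one 128-dim float32 vector per patch
# for 1,024 patches, versus a single 1,024-dim float32 vector per image.
patch_based = embedding_bytes(num_vectors=1024, dim=128)
single_vector = embedding_bytes(num_vectors=1, dim=1024)

print(f"patch-based:   {patch_based:,} bytes per image")
print(f"single vector: {single_vector:,} bytes per image")
print(f"ratio:         {patch_based // single_vector}x")
```

Under these assumed numbers, one vector per image needs far less storage than one vector per patch. The actual saving depends on the models being compared.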

Vector Store Embedding

Using a vector store to hold the generated embeddings, and the storage savings available from choosing different embedding vector sizes.
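As a minimal sketch of what a vector store does, the class below keeps embeddings in memory and ranks them by cosine similarity. `InMemoryVectorStore` and the three-dimensional toy vectors are hypothetical stand-ins; the video's actual store and embedding size are not specified here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Toy store: a list of (item_id, embedding) pairs searched by brute force."""

    def __init__(self) -> None:
        self._items = []

    def add(self, item_id: str, embedding: list[float]) -> None:
        self._items.append((item_id, embedding))

    def search(self, query: list[float], top_k: int = 1) -> list[tuple[str, float]]:
        scored = [(item_id, cosine_similarity(query, emb)) for item_id, emb in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add("page-1.png", [1.0, 0.0, 0.0])  # toy embeddings, not real model output
store.add("page-2.png", [0.0, 1.0, 0.0])
results = store.search([0.9, 0.1, 0.0], top_k=1)
print(results[0][0])
```

Shorter embedding vectors reduce the size of every stored entry, which is where the storage saving comes from. A production system would typically use a dedicated vector database rather than a Python list.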

Quantization of LLM Weights

Discussion on quantization of Large Language Model (LLM) weights for storage efficiency and reduced memory requirements compared to traditional approaches.
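The general idea of quantization can be sketched with a simple symmetric int8 scheme: each float is mapped to an integer in [-127, 127] plus one shared scale factor. This is a generic illustration, not necessarily the scheme used in the video; `quantize_int8` and `dequantize_int8` are hypothetical helper names.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # toy values standing in for real weights
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

print(quantized)
print(restored)
```

Stored as int8, each value takes 1 byte instead of the 4 bytes of a float32, at the cost of a small rounding error per value.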

API Key Requirements

Guidance on obtaining API keys from Cohere and Google Gemini so their models can be integrated and tested when building a vision-based retrieval system.
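One common way to handle such keys is to read them from environment variables rather than hard-coding them. The sketch below assumes the variable names `COHERE_API_KEY` and `GEMINI_API_KEY`, which are illustrative choices and not confirmed by the video.

```python
import os


def load_api_keys(env=None) -> dict:
    """Return the required API keys, failing early if any is missing."""
    env = os.environ if env is None else env
    # Hypothetical variable names; adjust to whatever your setup uses.
    names = ("COHERE_API_KEY", "GEMINI_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Example with an explicit mapping, so no real keys are needed to try it:
keys = load_api_keys({"COHERE_API_KEY": "demo-1", "GEMINI_API_KEY": "demo-2"})
print(sorted(keys))
```

Calling `load_api_keys()` with no argument reads from the real process environment.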

Image Embedding Process

Detailed steps for embedding document images: resizing each image, converting it to base64, and preparing it for retrieval in the system.
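The resize and base64 steps can be sketched with the standard library alone. The pixel budget of 1,500,000 is an assumed example value, and `fit_within` and `to_data_url` are hypothetical helpers. The actual pixel resampling would be done by an imaging library such as Pillow, which is left out here.

```python
import base64


def fit_within(width: int, height: int, max_pixels: int) -> tuple[int, int]:
    """Scale (width, height) down, keeping aspect ratio, so width * height <= max_pixels."""
    if width * height <= max_pixels:
        return width, height
    scale = (max_pixels / (width * height)) ** 0.5
    return max(1, int(width * scale)), max(1, int(height * scale))


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Target size for a 3000x2000 page under an assumed 1.5-megapixel budget.
new_size = fit_within(3000, 2000, max_pixels=1_500_000)
print(new_size)

# Placeholder bytes stand in for a real encoded image file.
data_url = to_data_url(b"abc")
print(data_url)
```

The resulting data URL is a plain string, which makes it convenient to pass to an embedding or vision model API that accepts base64-encoded images.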

Query Generation and Answering

Process of generating queries based on image retrieval and obtaining answers using vision language models with examples and explanations.
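The retrieve-then-answer flow can be outlined as below. Both `retrieve` and `ask_vlm` are hypothetical callables standing in for the retrieval step and the vision language model call; the stubs exist only so the sketch runs without any API access.

```python
from typing import Callable


def answer_question(
    question: str,
    retrieve: Callable[[str], str],
    ask_vlm: Callable[[str, str], str],
) -> str:
    """Fetch the most relevant image for the question, then ask a vision model about it."""
    image_data_url = retrieve(question)
    prompt = (
        "Answer the question using only the attached image.\n"
        f"Question: {question}"
    )
    return ask_vlm(prompt, image_data_url)


# Stubs in place of a real retriever and a real vision language model.
def fake_retrieve(question: str) -> str:
    return "data:image/png;base64,YWJj"


def fake_vlm(prompt: str, image_data_url: str) -> str:
    return f"(stub answer for: {prompt.splitlines()[-1]})"


answer = answer_question("What does the chart show?", fake_retrieve, fake_vlm)
print(answer)
```

Keeping the retriever and the model as injected callables makes it straightforward to swap a cloud model for a local one without changing the surrounding logic.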

Local Model Setup

Setting up a local vision-based retrieval system using open-source models and providing flexibility to choose between local and cloud-based solutions.
