Hands-On Vision RAG: Images, Tables & Text


Summary

This video introduces multimodal retrieval systems, which process the images, text, and tables found in documents. It explains why a text-only pipeline struggles when documents contain images and tables, and why a vision language model is then needed to process them. The video showcases a new multimodal embedding approach from Cohere that generates embeddings directly, eliminating the need for patches and reducing memory requirements. It also discusses storing the resulting embeddings in a vector store and quantizing Large Language Model (LLM) weights for storage optimization. Lastly, it provides guidance on integrating and testing models from Cohere and Google Gemini to build a vision-based retrieval system.


Introduction to Multimodal Retrieval Systems

Introduction to multimodal retrieval systems that process images, text, and tables in documents, in both cloud-based and local setups.

Text-Based RAG System Setup

Overview of a traditional text-based RAG system setup and the challenges that arise when documents contain images and tables, which require a vision language model for processing.

New Embedding Approach

Explanation of a new multimodal embedding approach from Cohere that generates embeddings directly, eliminating the need for patches and reducing memory requirements.

Vector Store Embedding

Storing the generated embeddings in a vector store, and the benefits of choosing among different embedding vector sizes for storage optimization.
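
To make the storage trade-off concrete, here is a minimal sketch of shortening an embedding and estimating raw index size. It assumes the embedding model was trained so that leading dimensions remain useful after truncation, which the summary does not confirm; the function names and the dimensions 1536 and 256 are illustrative, not taken from the video.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return head if norm == 0 else [x / norm for x in head]

def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for float32 vectors (4 bytes per value), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value

full = [3.0, 4.0, 12.0, 0.5]          # toy 4-dimensional embedding
short = truncate_embedding(full, 2)   # shorter vector, renormalized

# Assumed sizes: 1,000 page images at two hypothetical embedding widths.
large_index = index_size_bytes(1000, 1536)
small_index = index_size_bytes(1000, 256)
```

Under these assumptions the smaller width needs one sixth of the raw storage; whether retrieval quality holds up at that size has to be checked on real data.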

Quantization of LLM Weights

Discussion of quantizing Large Language Model (LLM) weights to improve storage efficiency and reduce memory requirements compared with traditional full-precision approaches.
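
As a rough illustration of the idea, the sketch below applies symmetric 8-bit quantization to a short list of floating-point values. This is a generic textbook scheme chosen for illustration, not necessarily the one discussed in the video, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]      # toy values standing in for real weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each value now fits in one byte instead of the four bytes of a float32, at the cost of a rounding error of at most half the scale factor per value.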

API Key Requirements

Guidance on obtaining API keys from Cohere and Google Gemini to integrate and test their models when building a vision-based retrieval system.
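
One common way to handle such keys is to read them from environment variables rather than hard-coding them. The sketch below assumes the variable names COHERE_API_KEY and GEMINI_API_KEY, which are a convention chosen here and are not confirmed by the video.

```python
import os

# Assumed variable names; adjust to whatever your setup uses.
REQUIRED_KEYS = ("COHERE_API_KEY", "GEMINI_API_KEY")

def load_api_keys(env=None):
    """Return the required keys, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing API keys: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Calling load_api_keys() once at startup surfaces a missing key before any request is sent.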

Image Embedding Process

Detailed steps for embedding the images in a document: resizing each image, converting it to base64, and preparing it for retrieval in the system.
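
The resizing and base64 steps can be sketched with the standard library alone. The pixel budget and helper names below are hypothetical, and the actual pixel resampling (for example with an imaging library) is left out; this only computes the target size and wraps already-encoded image bytes in a data URL.

```python
import base64
import math

def scaled_size(width, height, max_pixels):
    """Shrink dimensions proportionally so width * height fits within max_pixels."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))

def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Example: a 2000x1000 page scan under an assumed budget of 500,000 pixels.
target = scaled_size(2000, 1000, 500_000)
```

Keeping the aspect ratio fixed while capping the pixel count bounds the payload size without distorting tables or charts.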

Query Generation and Answering

Process of querying the system, retrieving the relevant images, and obtaining answers from a vision language model, with examples and explanations.
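
The retrieval half of this step can be sketched as a nearest-neighbor search over stored image embeddings. The code below assumes cosine similarity as the ranking measure and uses toy two-dimensional vectors; in a real pipeline the vectors would come from the embedding model, and the top-ranked image would then be sent to the vision language model together with the question.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, image_vecs, k=1):
    """Return indices of the k stored images most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(image_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]

# Toy index of three image embeddings and one query embedding.
image_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k([1.0, 0.1], image_vecs, k=2)
```

A vector store performs the same ranking at scale, usually with an approximate index instead of a full scan.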

Local Model Setup

Setting up a local vision-based retrieval system with open-source models, leaving the choice between local and cloud-based solutions open.
