#93 question: qdrant store

Open
opened 5 months ago by fszontagh · 7 comments

Question: we create vectors using embeddings to store products in Qdrant.

We get the products using the webshop's API, create vectors, then store the vectors with their payload in the Qdrant server. Which JSON keys from the webshop API are converted to vectors? Only the description?

claude commented 5 months ago
Collaborator

@fszontagh I'm starting work on this question now! I'll search through the codebase to find where product vectors are created and stored in Qdrant, and identify which JSON fields from the webshop API are being converted to vectors.

claude commented 5 months ago
Collaborator

@fszontagh Great question! Here's the detailed answer about which JSON keys are converted to vectors:

What gets converted to vectors?

The vector embeddings are NOT created from just the description. The system creates a comprehensive text representation that includes multiple fields from the product data. This text is then converted into a vector using OpenAI's text-embedding-3-large model (3072 dimensions).

Product Fields Included in Vector Embeddings

Based on the createProductText() function in supabase/functions/_shared/qdrant-client.ts (lines 511-621), here are ALL the JSON keys that get converted into the text representation before vectorization:

Core Product Information

  1. Title/Name - product.title or product.name
  2. Description - product.description (HTML cleaned, entities decoded)
  3. Short Description - product.short_description (WooCommerce, only if different from main description)
  4. SKU - product.sku (formatted as "SKU: {value}")

Categorization & Classification

  1. Vendor/Brand - product.vendor (formatted as "Vendor: {value}")
  2. Product Type - product.product_type (formatted as "Type: {value}")
  3. Categories - product.categories array (formatted as "Categories: cat1, cat2, ...")
  4. Tags - product.tags array (formatted as "Tags: tag1, tag2, ...")

Variants & Attributes

  1. Variants - product.variants array including:

    • variant.title (if not "Default Title")
    • variant.option1, variant.option2, variant.option3
    • Formatted as "Variants: option1, option2, ..."
  2. Attributes - product.attributes array (WooCommerce):

    • attribute.name and attribute.options
    • Formatted as "Name: options | Name2: options2"

SEO & Metadata

  1. Meta Description - product.meta_description (ShopRenter/SEO, only if different from main description)

Context Information

  1. Price - product.price (formatted as "Price: {value}")
  2. Stock Status - product.stock_status or product.status (formatted as "Stock: {status}" or "Status: {status}")

How It Works

The system:

  1. Collects all these fields from the webshop API response
  2. Cleans HTML content using cleanHtmlContent() (handles HTML entities, removes tags, preserves newlines)
  3. Joins all parts with | separator to create a comprehensive text representation
  4. Generates embedding using OpenRouter API with openai/text-embedding-3-large model
  5. Stores the vector (3072-dimensional) in Qdrant along with the original payload data
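
Step 2 can be illustrated with a minimal sketch. This is not the project's cleanHtmlContent(): the name cleanHtmlSketch, the small entity table, and the choice of which tags become newlines are all assumptions made for illustration.

```typescript
// Hypothetical stand-in for cleanHtmlContent(); the real implementation may differ.
const ENTITIES: Record<string, string> = {
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'",
  '&nbsp;': ' ',
};

function cleanHtmlSketch(html: string): string {
  return html
    // Turn line breaks and closing block tags into newlines so paragraphs stay separated.
    .replace(/<br\s*\/?>|<\/p>|<\/div>|<\/li>/gi, '\n')
    // Drop every remaining tag.
    .replace(/<[^>]+>/g, '')
    // Decode the handful of entities listed above (single pass, after tag removal).
    .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (m) => ENTITIES[m] ?? m)
    // Collapse runs of spaces/tabs, trim each line, and drop empty lines.
    .split('\n')
    .map((line) => line.replace(/[ \t]+/g, ' ').trim())
    .filter((line) => line.length > 0)
    .join('\n');
}

const cleaned = cleanHtmlSketch('<p>Soft &amp; clean</p><p>4x55&nbsp;db</p>');
```

Entities are decoded after tags are stripped so that escaped text such as `&lt;b&gt;` survives as literal characters instead of being removed as a tag.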

Example Text Representation

For a typical product, the text might look like:

Blue Running Shoes | Experience ultimate comfort with our premium running shoes... | SKU: SHOE-BLUE-001 | Vendor: Nike | Type: Footwear | Categories: Shoes, Sports, Running | Tags: athletic, comfortable, breathable | Variants: Size 8, Size 9, Size 10, Color Blue, Color Red | Price: 89.99 | Stock: in stock

This comprehensive approach ensures that semantic search can find products based on any of these attributes, not just the description!
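
To make the joining step concrete, here is a reduced sketch of how such a text could be assembled. The ProductSketch interface and buildProductTextSketch() are hypothetical names covering only a subset of the fields listed above; the actual createProductText() handles more fields and edge cases.

```typescript
// Hypothetical product shape; field names follow the list above.
interface ProductSketch {
  title?: string;
  name?: string;
  description?: string;
  sku?: string;
  vendor?: string;
  categories?: string[];
  tags?: string[];
  price?: string | number;
}

function buildProductTextSketch(p: ProductSketch): string {
  const parts: string[] = [];
  const title = p.title || p.name;
  if (title) parts.push(title);
  if (p.description) parts.push(p.description);
  if (p.sku) parts.push(`SKU: ${p.sku}`);
  if (p.vendor) parts.push(`Vendor: ${p.vendor}`);
  if (p.categories?.length) parts.push(`Categories: ${p.categories.join(', ')}`);
  if (p.tags?.length) parts.push(`Tags: ${p.tags.join(', ')}`);
  if (p.price !== undefined && p.price !== '') parts.push(`Price: ${p.price}`);
  // Empty or missing fields are skipped, so no dangling separators appear.
  return parts.join(' | ');
}

const exampleText = buildProductTextSketch({
  title: 'Blue Running Shoes',
  sku: 'SHOE-BLUE-001',
  vendor: 'Nike',
  categories: ['Shoes', 'Sports'],
  price: 89.99,
});
```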

Code Reference

You can see the full implementation in:

  • Text creation: supabase/functions/_shared/qdrant-client.ts - createProductText() function (lines 511-621)
  • Usage example: supabase/functions/woocommerce-sync/index.ts - where it creates product texts before generating embeddings

The same pattern applies to orders (createOrderText()) and customers (createCustomerText()), which also include comprehensive information beyond just basic fields.

Szontágh Ferenc commented 5 months ago
Owner

We have the ShopRenter MCP server, for example. When an LLM searches for "popsítörlő", no results are found, even though popsitörlő exists in Qdrant.

This is the json representation of the tool call parameters:

{
  "shop_id": "d164a10f-580c-4500-9850-fed3079dd6af",
  "name": "popsitörlő",
  "limit": 5
}

This is the response:

[
  {
    "type": "text",
    "text": "{\"count\":0,\"limit\":5,\"source\":\"qdrant\",\"products\":[]}"
  }
]

This product exists in Qdrant with name: Soft clean 4x55 db popsitörlő, but for testing purposes the description is just lorem ipsum. So in theory the MCP should find the product by its name.

claude commented 5 months ago
Collaborator

@fszontagh I'm investigating the search issue now! Let me check the MCP server implementation to understand why searching for "popsítörlő" isn't finding the product "Soft clean 4x55 db popsitörlő" even though it exists in Qdrant. I'll look at:

  1. The ShopRenter MCP server search implementation
  2. How the search query is being processed
  3. The Qdrant search/query mechanism
  4. Any encoding or normalization issues with Hungarian characters
claude commented 5 months ago
Collaborator

@fszontagh I've found the issue! Here's what's happening:

Root Cause

When searching for products by name (like "popsítörlő"), the MCP server is using a hybrid approach that has a significant limitation:

How Product Search Currently Works

Looking at supabase/functions/_shared/mcp-qdrant-helpers.ts (lines 80-143):

  1. Step 1: Call scrollPoints() to retrieve products from Qdrant

    • Only filters by indexed fields: store_id, sku, status, price
    • The name field is NOT indexed in Qdrant (see the collection-creation code in qdrant-client.ts)
  2. Step 2: Apply client-side filtering for name:

    if (filters?.name) {
      const nameLower = filters.name.toLowerCase();
      products = products.filter((p: any) =>
        (p.name || p.title || '').toLowerCase().includes(nameLower)
      );
    }
    
  3. Step 3: Limit results to the requested limit (default: 10, max: 20)

The Problem

On line 123, scrollPoints() is called with the same limit that is used for the final result (5 in your case), so the name filter only ever sees that many products:

const result = await scrollPoints(collectionName, qdrantFilter, limit);

So when you search for "popsítörlő" with limit: 5, the code:

  1. Retrieves only 5 products from Qdrant (matching store_id filter)
  2. Then tries to find "popsítörlő" within those 5 products
  3. If "popsítörlő" is not in those first 5 products, it returns 0 results

This is why you're getting:

{"count":0,"limit":5,"source":"qdrant","products":[]}
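
The effect can be reproduced with a small in-memory sketch. Nothing here touches Qdrant: stored, fakeScroll() and searchByName() are illustrative stand-ins that mimic the "fetch `limit` points, then filter by name" order described above.

```typescript
// In-memory stand-in for a Qdrant collection; names are illustrative only.
interface ProductPayload { name: string }

const stored: ProductPayload[] = [
  { name: 'Product A' }, { name: 'Product B' }, { name: 'Product C' },
  { name: 'Product D' }, { name: 'Product E' }, { name: 'Product F' },
  { name: 'Soft clean 4x55 db popsitörlő' },
];

// Mimics scrollPoints(): returns the first `limit` points, unaware of the name filter.
function fakeScroll(limit: number): ProductPayload[] {
  return stored.slice(0, limit);
}

// Same order of operations as the current helper: scroll first, filter afterwards.
function searchByName(name: string, limit: number): ProductPayload[] {
  const needle = name.toLowerCase();
  return fakeScroll(limit)
    .filter((p) => p.name.toLowerCase().includes(needle))
    .slice(0, limit);
}

const withLimit5 = searchByName('popsitörlő', 5);   // the product is 7th, so it is never fetched
const withLimit10 = searchByName('popsitörlő', 10); // the product is fetched, then matched
```

With a limit of 5 the matching product is never part of the scrolled batch, so the filter has nothing to match; with a limit of 10 the same query finds it.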

Why This Happens

The product name is stored in the payload of each Qdrant point, but it's NOT an indexed field. Only these fields are indexed:

  • store_id (keyword)
  • product_id (keyword)
  • platform (keyword)
  • status (keyword)
  • price (float)
  • sku (keyword)

The product name is embedded in the vector (used for semantic search), but not searchable as a keyword filter.

Solutions

There are several ways to fix this:

Option 1: Use Semantic Vector Search (Recommended)

Instead of using scrollPoints + client-side filtering, use semantic vector search with searchPoints():

  1. Generate an embedding for the search query "popsítörlő"
  2. Use searchPoints() to find semantically similar products
  3. This should find "Soft clean 4x55 db popsitörlő" because the name is in the embedded text

Pros: More accurate, finds variations and related products, uses Qdrant's full power

Cons: Requires embedding generation (API call), slightly slower
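
As a rough sketch of the request such a search would send, the helper below builds a body for Qdrant's REST search endpoint (POST /collections/{name}/points/search). buildSearchBody() is a hypothetical helper, not the project's searchPoints(), whose signature is not shown in this thread; the three-element vector is a placeholder for the 3072-dimensional query embedding.

```typescript
// Shape of a Qdrant REST search body (POST /collections/{name}/points/search).
interface QdrantSearchBody {
  vector: number[];
  filter: { must: Array<{ key: string; match: { value: string } }> };
  limit: number;
  with_payload: boolean;
}

// Hypothetical helper. The query embedding is assumed to come from the same
// model that embedded the products; vectors from different models are not comparable.
function buildSearchBody(queryVector: number[], storeId: string, limit: number): QdrantSearchBody {
  return {
    vector: queryVector,
    // Keep the store_id filter so results stay within one shop.
    filter: { must: [{ key: 'store_id', match: { value: storeId } }] },
    limit,
    // Ask Qdrant to return the stored payload (name, sku, price, ...) with each hit.
    with_payload: true,
  };
}

// Placeholder vector; a real query embedding would have 3072 entries.
const body = buildSearchBody([0.1, 0.2, 0.3], 'd164a10f-580c-4500-9850-fed3079dd6af', 5);
```

Unlike the scroll-then-filter path, the limit here applies to the ranked results, so the best matches are returned regardless of their position in the collection.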

Option 2: Increase Scroll Limit

When filtering by name, retrieve many more products before filtering:

// Instead of:
const result = await scrollPoints(collectionName, qdrantFilter, limit);

// Do:
const scrollLimit = filters?.name ? 1000 : limit;  // Get more if filtering by name
const result = await scrollPoints(collectionName, qdrantFilter, scrollLimit);

Pros: Simple fix, no API calls needed

Cons: Inefficient, loads many products into memory, won't scale well
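
A variation on this option is to page through the collection and stop as soon as enough matches are found, instead of fetching a fixed 1000 points up front. The sketch below assumes a page shape loosely modelled on a scroll response (a batch of points plus an offset for the next page); ScrollPage, ScrollFn and findByName() are hypothetical, and the scroll function is synchronous here only to keep the example self-contained, whereas a real call would be awaited.

```typescript
// Hypothetical page shape: a batch of points plus the offset of the next page (null when exhausted).
interface ScrollPage<T> { points: T[]; nextOffset: number | null }
type ScrollFn<T> = (offset: number | null, pageSize: number) => ScrollPage<T>;

function findByName<T extends { name: string }>(
  scroll: ScrollFn<T>,
  name: string,
  limit: number,
  pageSize = 100,
  maxScanned = 1000, // hard cap so a rare name cannot trigger a full-collection scan
): T[] {
  const needle = name.toLowerCase();
  const matches: T[] = [];
  let offset: number | null = null;
  let scanned = 0;
  do {
    const page = scroll(offset, pageSize);
    scanned += page.points.length;
    for (const p of page.points) {
      if (p.name.toLowerCase().includes(needle)) matches.push(p);
      if (matches.length >= limit) return matches; // stop early once the limit is reached
    }
    offset = page.nextOffset;
  } while (offset !== null && scanned < maxScanned);
  return matches;
}

// Fake scroll over an in-memory list, for demonstration only.
const names = Array.from({ length: 250 }, (_, i) => ({ name: `Product ${i}` }));
names[230] = { name: 'Soft clean 4x55 db popsitörlő' };
const fakeScroll: ScrollFn<{ name: string }> = (offset, pageSize) => {
  const start = offset ?? 0;
  const end = start + pageSize;
  return { points: names.slice(start, end), nextOffset: end < names.length ? end : null };
};

const found = findByName(fakeScroll, 'popsitörlő', 5);
```

This still scans payloads client-side, so the scaling concern above remains; it only bounds the work and avoids loading everything at once.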

Option 3: Index the Name Field

Add name as an indexed field in Qdrant:

await createCollection(productsCollection, [
  { field: 'store_id', type: 'keyword' },
  { field: 'product_id', type: 'keyword' },
  { field: 'platform', type: 'keyword' },
  { field: 'status', type: 'keyword' },
  { field: 'price', type: 'float' },
  { field: 'sku', type: 'keyword' },
  { field: 'name', type: 'text' },  // <-- Add this
]);

Then use Qdrant's text matching:

if (filters?.name) {
  qdrantFilter.must.push({ 
    key: 'name', 
    match: { text: filters.name }  // Full-text search
  });
}

Pros: Efficient, scalable, uses Qdrant's indexing

Cons: Requires re-creating collections (data migration), may not handle Hungarian characters well
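
On the Hungarian-character concern: the search term appears in this thread both as "popsítörlő" and as "popsitörlő", and a plain substring comparison treats those as different strings. One possible mitigation, shown here as an assumption and not as something the codebase currently does, is to fold accents on both sides before comparing. foldForMatch() and nameMatches() are hypothetical helpers built on the standard String.prototype.normalize().

```typescript
// Fold case and strip combining diacritics so accented and unaccented spellings compare equal.
function foldForMatch(s: string): string {
  return s
    .normalize('NFD')                // split letters from their accents (e.g. ő -> o + U+030B)
    .replace(/[\u0300-\u036f]/g, '') // remove the combining marks
    .toLowerCase();
}

function nameMatches(productName: string, query: string): boolean {
  return foldForMatch(productName).includes(foldForMatch(query));
}

const matchesAccented = nameMatches('Soft clean 4x55 db popsitörlő', 'popsítörlő');
```

The trade-off is that o, ö and ő are distinct letters in Hungarian, so folding them together can produce false positives; whether that is acceptable depends on the catalogue.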

My Recommendation

For the best user experience, I recommend Option 1: Semantic Vector Search.

When a user searches by name, the system should:

  1. Generate an embedding for the search query
  2. Use vector similarity search to find matching products
  3. This should find "popsitörlő" even when the query contains minor typos, inflected forms, or related terms

This is actually the intended use case for Qdrant - semantic search is more powerful than keyword matching!

Would you like me to implement Option 1 (semantic search) or Option 2 (quick fix with increased limit)?

Szontágh Ferenc commented 5 months ago
Owner

Option 1, but create a new issue for it and assign it to the claude user.

claude commented 5 months ago
Collaborator

@fszontagh Done! I've created issue #94 with the following details:

Title: feat: Implement semantic vector search for product name queries in MCP server

Assignment: Assigned to claude user ✅

Label: enhancement ✅

Summary: The new issue contains:

  • Complete problem description (why "popsítörlő" search fails)
  • Detailed solution approach (semantic vector search)
  • Step-by-step implementation guide
  • List of benefits
  • Files to modify
  • Acceptance criteria for testing

The issue is ready for implementation and will solve the search problem you encountered!

You can view it here: Issue #94

No Milestone
No assignee
2 Participants