#92 enh: products descriptions clean-up

Open
Opened 5 months ago by fszontagh · 2 comments
Szontágh Ferenc commented 5 months ago

In the Supabase edge functions where products are synchronized with Qdrant, the descriptions contain HTML special characters and HTML tags.

We have to remove this formatting; only plain text may be stored in Qdrant.

Remove duplicated newlines. Replace HTML line breaks (<br/>) with a single newline, etc. We have to keep the paragraphs but remove all text formatting, so the descriptions stay clean and take up less space.

If you deploy an edge function to Supabase, verify that the deployed script is not empty after deployment.

claude commented 5 months ago
Collaborator

@fszontagh I'm starting work on this issue now!

I'll implement HTML cleanup for product descriptions in the Qdrant synchronization edge functions. My approach:

  1. Search for edge functions that sync products to Qdrant
  2. Create a utility function to clean HTML from descriptions (remove tags, special chars, normalize newlines)
  3. Apply the cleanup to product descriptions before storing in Qdrant
  4. Deploy and verify the edge functions are not empty after deployment

Starting analysis now...

claude commented 5 months ago
Collaborator

✅ Issue Completed

I've successfully implemented HTML cleanup for product descriptions in Qdrant synchronization.

Changes Made

  1. Created html-cleaner.ts utility (supabase/functions/_shared/html-cleaner.ts)

    • Comprehensive HTML tag removal
    • HTML entity decoding (&nbsp;, &amp;, &lt;, etc.)
    • Converts <br/>, <br>, </p>, </div> tags to newlines
    • Preserves paragraph structure
    • Removes duplicate newlines (max 2 consecutive for paragraphs)
    • Normalizes all whitespace
  2. Updated qdrant-client.ts

    • Integrated cleanHtmlContent() function into createProductText()
    • Applied to description, short_description, and meta_description fields
    • All product text stored in Qdrant is now plain text without HTML formatting
  3. Deployed Edge Functions

    • shopify-sync (v7)
    • woocommerce-sync (v15)
    • shoprenter-sync (v30)
    • shoprenter-scheduled-sync (v28)
    • mcp-shopify (v6)
    • mcp-woocommerce (v7)
    • mcp-shoprenter (v19)
  4. Verification

    • All deployed Edge Functions are active and non-empty
    • Functions successfully uploaded with new html-cleaner.ts dependency
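
As a rough illustration of step 2 above, the sketch below shows one way a cleaner could be applied to the three description fields before the text is assembled. The actual `createProductText()` in `qdrant-client.ts` is not shown in this thread, so the `ProductLike` shape, the `name` field, the `stripTags` stand-in, and the `createProductTextSketch` name are all assumptions made for illustration.

```typescript
// Hypothetical sketch: none of these names are taken from the real qdrant-client.ts.
type Cleaner = (html: string) => string;

interface ProductLike {
  name: string; // assumed field, not mentioned in the issue
  description?: string | null;
  short_description?: string | null;
  meta_description?: string | null;
}

// Stand-in cleaner so the sketch runs on its own.
// The real cleanHtmlContent() is described as doing considerably more.
const stripTags: Cleaner = (html) =>
  html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();

function createProductTextSketch(
  product: ProductLike,
  clean: Cleaner = stripTags,
): string {
  const fields = [
    product.description,
    product.short_description,
    product.meta_description,
  ];
  // Clean each field that is present and drop the ones that end up empty.
  const cleaned = fields
    .map((value) => (value ? clean(value) : ""))
    .filter((value) => value.length > 0);
  return [product.name, ...cleaned].join("\n\n");
}
```

Passing the cleaner in as a parameter is only a convenience for the sketch; it makes it easy to swap the stand-in for the shared utility.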

Example Transformations

Before:

<p>Product with <strong>bold</strong> text&nbsp;&amp; entities</p><br/><p>Second paragraph</p>

After:

Product with bold text & entities

Second paragraph
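
The transformation above can be reproduced with a small regex-based pipeline. This is a minimal sketch, not the contents of `html-cleaner.ts` (which is not included in this thread): the function name `cleanHtmlSketch` and the short entity table are assumptions, and a regex approach like this is not a full HTML parser.

```typescript
// Hypothetical sketch; the real html-cleaner.ts may differ.
const BASIC_ENTITIES: Record<string, string> = {
  "&nbsp;": " ",
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
};

function cleanHtmlSketch(html: string): string {
  return html
    // Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    .replace(/\r\n?/g, "\n")
    // <br>, <br/> and <br /> become a single newline.
    .replace(/<br\s*\/?>/gi, "\n")
    // Closing </p> and </div> end a paragraph.
    .replace(/<\/(p|div)>/gi, "\n\n")
    // Drop every remaining tag.
    .replace(/<[^>]+>/g, "")
    // Decode a few named entities in one pass, after tag removal,
    // so that "&lt;b&gt;" survives as the literal text "<b>".
    .replace(/&(?:nbsp|amp|lt|gt|quot|#39);/g, (m) => BASIC_ENTITIES[m] ?? m)
    // Collapse runs of spaces and tabs, then trim spaces around newlines.
    .replace(/[ \t]+/g, " ")
    .replace(/ *\n */g, "\n")
    // Allow at most two consecutive newlines (one blank line between paragraphs).
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

Entities are decoded only after the tags are stripped, so encoded markup in a description is kept as text instead of being removed as a tag.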

Technical Details

The HTML cleaner:

  • Handles block-level tags (<p>, <div>, <h1>-<h6>, <ul>, <ol>, <li>)
  • Converts list items to bullet points (• )
  • Decodes 40+ common HTML entities
  • Handles numeric entities (&#xxx; and &#xXXX;)
  • Normalizes line endings (Windows/Mac → Unix)
  • Trims leading/trailing whitespace
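
Two of the items above, numeric entities and list bullets, could look roughly like this. Both helpers are hypothetical sketches with made-up names (`decodeNumericEntities`, `listItemsToBullets`); the committed implementation is not shown in this thread and may handle these cases differently.

```typescript
// Hypothetical helpers, not taken from html-cleaner.ts.

// Decode decimal (&#169;) and hexadecimal (&#xA9;) character references.
function decodeNumericEntities(text: string): string {
  return text.replace(/&#(x[0-9a-f]+|\d+);/gi, (match, body: string) => {
    const isHex = body[0] === "x" || body[0] === "X";
    const codePoint = parseInt(isHex ? body.slice(1) : body, isHex ? 16 : 10);
    // Leave references to invalid code points untouched.
    if (!Number.isFinite(codePoint) || codePoint === 0 || codePoint > 0x10ffff) {
      return match;
    }
    // Lone surrogates are not valid characters on their own.
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

// Turn each <li> into a "• " line and replace the list wrappers with newlines.
function listItemsToBullets(html: string): string {
  return html
    .replace(/<li(?:\s[^>]*)?>/gi, "\n• ")
    .replace(/<\/li>/gi, "")
    .replace(/<\/?(?:ul|ol)(?:\s[^>]*)?>/gi, "\n");
}
```

In a full pipeline the bullet conversion would run before the generic tag stripping, and the extra newlines it leaves behind would be collapsed by the later newline normalization.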

Committed: dec7b6b Pushed: ✅ origin/main

No milestone
No assignee
2 participants