#107 bug: qdrant sync payload cleanup

Open
opened 4 months ago by fszontagh · 3 comments

There is a cleanup method in the Qdrant sync, but I still see HTML special characters (entities) in the Qdrant payload. For example, this is a product description from a Shoprenter shop as stored in Qdrant:

```
&lt;p&gt;Acilisis magna etiam tempor orci. Tincidunt ornare massa eget egestas purus viverra. Enim facilisis gravida neque convallis a. Odio euismod lacinia at quis risus sed. Ultrices mi tempus imperdiet nulla malesuada pellentesque elit. Ullamcorper dignissim cras tincidunt lobortis feugiat vivamus. Tincidunt praesent semper feugiat nibh sed pulvinar proin gravida. Commodo nulla facilisi nullam vehicula ipsum a arcu cursus. Sed risus pretium quam vulputate dignissim suspendisse in. Lectus arcu bibendum at varius vel pharetra vel turpis nunc. Faucibus purus in massa tempor nec. Pulvinar etiam non quam lacus suspendisse. Feugiat pretium nibh ipsum consequat nisl. &lt;/p&gt; &lt;p&gt;Maecenas accumsan lacus vel facilisis volutpat est velit. Morbi tristique senectus et netus et malesuada. Semper auctor neque vitae tempus quam pellentesque nec nam aliquam. A cras semper auctor neque vitae tempus quam pellentesque nec. Quisque egestas diam in arcu cursus euismod quis. Tortor condimentum lacinia quis vel eros. Mauris commodo quis imperdiet massa tincidunt nunc pulvinar sapien et. Lacinia quis vel eros donec ac. Et netus et malesuada fames ac turpis. Neque sodales ut etiam sit amet nisl purus in mollis. Turpis egestas integer eget aliquet nibh. Tempor orci eu lobortis elementum nibh tellus molestie. Vitae sapien pellentesque habitant morbi tristique. &lt;/p&gt;
```

We have to keep the newlines, but only one at a time: every HTML `<br/>` should be replaced with a newline, and duplicated newlines should then be collapsed into a single newline.
claude commented 4 months ago
Collaborator

@fszontagh I'm starting work on this issue now! I'll:

  1. Find the qdrant sync code and the existing cleanup method
  2. Fix the HTML entity decoding (convert &lt; to <, &gt; to >, etc.)
  3. Replace HTML <br/> tags with newlines
  4. Remove duplicate newlines (keep only single newlines)
  5. Clean up HTML tags while preserving text content
  6. Test the changes to ensure proper cleanup

I'll investigate the codebase and implement the fix.

claude commented 4 months ago
Collaborator

✅ Fix Completed

I've successfully fixed the HTML entity decoding issue in the Qdrant sync payload cleanup.

Changes Made

Root Cause: The cleanHtmlContent() function in html-cleaner.ts was decoding HTML entities AFTER removing HTML tags. This meant that encoded HTML like &lt;p&gt; wasn't being converted to <p> before the tag removal step, so it remained as raw text.

Solution:

  1. Moved HTML entity decoding to FIRST step - Now &lt;p&gt; gets decoded to <p> before processing
  2. Changed newline handling - Updated from max 2 consecutive newlines to single newlines only (as per requirement)
  3. Updated documentation - Clarified the processing order and behavior
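
To illustrate why the ordering matters, here is a minimal sketch. The helpers `decodeEntities` and `stripTags` are hypothetical stand-ins, not the actual code in `html-cleaner.ts`, and the entity table is deliberately tiny.

```typescript
// Hypothetical helpers for illustration only; the real cleaner may differ.
const ENTITY_MAP: Record<string, string> = {
  "&lt;": "<",
  "&gt;": ">",
  "&amp;": "&",
};

// Decode a small, fixed set of named entities in a single pass,
// so that "&amp;lt;" becomes "&lt;" and is not decoded twice.
function decodeEntities(input: string): string {
  return input.replace(/&(?:lt|gt|amp);/g, (m) => ENTITY_MAP[m] ?? m);
}

// Remove anything that looks like an HTML tag.
function stripTags(input: string): string {
  return input.replace(/<[^>]*>/g, "");
}

const encoded = "&lt;p&gt;Hello&lt;/p&gt;";

// Old order: there are no literal "<" characters yet, so stripTags is a
// no-op, and the later decode step leaves markup in the output.
const oldOrder = decodeEntities(stripTags(encoded)); // "<p>Hello</p>"

// New order: entities become real tags first, then the tags are removed.
const newOrder = stripTags(decodeEntities(encoded)); // "Hello"

console.log({ oldOrder, newOrder });
```

One trade-off of decoding first: legitimately escaped text such as `a &lt; b` is decoded as well, so a simple regex-based tag stripper may treat it as markup.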

Code Changes

File: supabase/functions/_shared/html-cleaner.ts

  • ✅ Reordered processing steps: decode entities → replace tags with newlines → remove HTML tags
  • ✅ Updated newline deduplication regex from /\n{3,}/g to /\n{2,}/g
  • ✅ Added test case for encoded HTML entities
  • ✅ Updated documentation comments
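
The effect of the regex change can be seen on a small input. Only the two patterns come from the change above; the replacement strings (`"\n\n"` and `"\n"`) are assumptions about how they are used.

```typescript
const text = "Line one\n\n\nLine two\n\nLine three";

// Previous pattern: only runs of 3+ newlines match, capped here at two,
// so blank lines between paragraphs survive (assumed replacement "\n\n").
const capAtTwo = text.replace(/\n{3,}/g, "\n\n");
// "Line one\n\nLine two\n\nLine three"

// New pattern: any run of 2+ newlines collapses to exactly one
// (assumed replacement "\n").
const capAtOne = text.replace(/\n{2,}/g, "\n");
// "Line one\nLine two\nLine three"

console.log(JSON.stringify(capAtTwo));
console.log(JSON.stringify(capAtOne));
```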

Deployed Edge Functions

All affected Edge Functions have been redeployed with the fix:

  • woocommerce-sync
  • shopify-sync
  • shoprenter-sync
  • mcp-woocommerce
  • mcp-shopify
  • mcp-shoprenter

Expected Result

Product descriptions in Qdrant will now be properly cleaned:

Before:

&lt;p&gt;Acilisis magna etiam tempor orci.&lt;/p&gt; &lt;p&gt;Maecenas accumsan...&lt;/p&gt;

After:

Acilisis magna etiam tempor orci.
Maecenas accumsan...

The fix ensures:

  • HTML entities (&lt;, &gt;, &nbsp;, etc.) are properly decoded
  • HTML tags are removed
  • <br/> tags are converted to newlines
  • Duplicate newlines are removed (keeping only single newlines)
  • Text is clean and readable for embeddings
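
The guarantees listed above can be sketched end to end as follows. `cleanHtmlSketch` is a hypothetical function written for this illustration; the entity table, the set of block-level tags, and the whitespace trimming are all assumptions, and the real `cleanHtmlContent()` may handle more cases.

```typescript
// Hypothetical end-to-end sketch; not the actual cleanHtmlContent() code.
function cleanHtmlSketch(input: string): string {
  const entities: Record<string, string> = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&#39;": "'",
    "&nbsp;": " ",
    "&amp;": "&",
  };
  return input
    // 1. Decode entities first so encoded tags become real tags.
    .replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, (m) => entities[m] ?? m)
    // 2. Turn <br> variants and closing block-level tags into newlines.
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/(?:p|div|li|h[1-6])>/gi, "\n")
    // 3. Remove all remaining tags, keeping their text content.
    .replace(/<[^>]*>/g, "")
    // 4. Drop spaces around line breaks, then collapse newline runs to one.
    .replace(/[ \t]*\n[ \t]*/g, "\n")
    .replace(/\n{2,}/g, "\n")
    .trim();
}

const before =
  "&lt;p&gt;Acilisis magna etiam tempor orci.&lt;/p&gt; &lt;p&gt;Maecenas accumsan...&lt;/p&gt;";

console.log(cleanHtmlSketch(before));
// Acilisis magna etiam tempor orci.
// Maecenas accumsan...
```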

Git Commit

Committed and pushed: 32a8d0e with issue reference #107

claude commented 4 months ago
Collaborator

@fszontagh The fix is now live in production!

The next sync will use the corrected cleanup logic. Any product descriptions synced to Qdrant from now on will be cleaned of HTML entities and tags.

If you'd like to re-sync existing products to apply the fix to already-stored data, you can trigger a manual sync for your stores through the sync endpoints.
