bug: qdrant sync payload cleanup

There is a cleanup method in the Qdrant sync, but I still see HTML special characters in the Qdrant payload. For example, this is a product description from a ShopRenter shop as stored in Qdrant:

&lt;p&gt;Acilisis magna etiam tempor orci. Tincidunt ornare massa eget egestas purus viverra. Enim facilisis gravida neque convallis a. Odio euismod lacinia at quis risus sed. Ultrices mi tempus imperdiet nulla malesuada pellentesque elit. Ullamcorper dignissim cras tincidunt lobortis feugiat vivamus. Tincidunt praesent semper feugiat nibh sed pulvinar proin gravida. Commodo nulla facilisi nullam vehicula ipsum a arcu cursus. Sed risus pretium quam vulputate dignissim suspendisse in. Lectus arcu bibendum at varius vel pharetra vel turpis nunc. Faucibus purus in massa tempor nec. Pulvinar etiam non quam lacus suspendisse. Feugiat pretium nibh ipsum consequat nisl. &lt;/p&gt; &lt;p&gt;Maecenas accumsan lacus vel facilisis volutpat est velit. Morbi tristique senectus et netus et malesuada. Semper auctor neque vitae tempus quam pellentesque nec nam aliquam. A cras semper auctor neque vitae tempus quam pellentesque nec. Quisque egestas diam in arcu cursus euismod quis. Tortor condimentum lacinia quis vel eros. Mauris commodo quis imperdiet massa tincidunt nunc pulvinar sapien et. Lacinia quis vel eros donec ac. Et netus et malesuada fames ac turpis. Neque sodales ut etiam sit amet nisl purus in mollis. Turpis egestas integer eget aliquet nibh. Tempor orci eu lobortis elementum nibh tellus molestie. Vitae sapien pellentesque habitant morbi tristique. &lt;/p&gt;

We have to keep the newlines, but only single ones: if the HTML contains <br>, replace it with a newline, then remove duplicated newlines.

@fszontagh I'm starting work on this issue now! I'll:

Find the qdrant sync code and the existing cleanup method
Fix the HTML entity decoding (convert &lt; to <, &gt; to >, etc.)
Replace HTML <br> tags with newlines
Remove duplicate newlines (keep only single newlines)
Clean up HTML tags while preserving text content
Test the changes to ensure proper cleanup

I'll investigate the codebase and implement the fix.

claude referenced this issue in a commit 4 months ago

fix: decode HTML entities before cleaning tags in Qdrant sync #107...

✅ Fix Completed

I've successfully fixed the HTML entity decoding issue in the Qdrant sync payload cleanup.

Changes Made

Root Cause: The cleanHtmlContent() function in html-cleaner.ts was decoding HTML entities AFTER removing HTML tags. This meant that encoded HTML like &lt;p&gt; wasn't being converted to <p> before the tag removal step, so it remained as raw text.

Solution:

Moved HTML entity decoding to FIRST step - Now &lt;p&gt; gets decoded to <p> before processing
Changed newline handling - Updated from max 2 consecutive newlines to single newlines only (as per requirement)
Updated documentation - Clarified the processing order and behavior
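
To illustrate why the order matters, here is a minimal sketch, not the actual contents of html-cleaner.ts. All helper names (decodeEntities, stripTags, normalizeWhitespace, cleanDecodeLast, cleanDecodeFirst) are hypothetical, and the entity list is deliberately small; the real cleanHtmlContent() may handle more cases.

```typescript
// Hypothetical helpers for illustration only; not the real html-cleaner.ts.

// Decode a small set of named entities. `&amp;` is handled last so that
// `&amp;lt;` becomes the literal text `&lt;` rather than `<`.
function decodeEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&nbsp;/g, " ") // assumption: map to a regular space
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Turn <br> and closing block tags into newlines, then drop remaining tags.
function stripTags(text: string): string {
  return text
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/(p|div|li|h[1-6])>/gi, "\n")
    .replace(/<[^>]+>/g, "");
}

// Trim every line and drop empty ones, so only single newlines remain.
function normalizeWhitespace(text: string): string {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

// Old order: tags are stripped while still encoded, so the regexes match nothing.
function cleanDecodeLast(input: string): string {
  return normalizeWhitespace(decodeEntities(stripTags(input)));
}

// New order: decode first, so the tag regexes see real markup.
function cleanDecodeFirst(input: string): string {
  return normalizeWhitespace(stripTags(decodeEntities(input)));
}

const sample =
  "&lt;p&gt;First paragraph.&lt;/p&gt; &lt;p&gt;Second paragraph.&lt;/p&gt;";

console.log(JSON.stringify(cleanDecodeLast(sample)));
console.log(JSON.stringify(cleanDecodeFirst(sample)));
```

In this sketch the decode-last variant leaves the markup in the output as literal text, while the decode-first variant yields the two paragraphs separated by a single newline.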

Code Changes

File: supabase/functions/_shared/html-cleaner.ts

✅ Reordered processing steps: decode entities → replace tags with newlines → remove HTML tags
✅ Updated newline deduplication regex from /\n{3,}/g to /\n{2,}/g
✅ Added test case for encoded HTML entities
✅ Updated documentation comments
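
The effect of the regex change can be checked in isolation. This is a small sketch; the replacement strings ("\n\n" for the old pattern, "\n" for the new one) are assumptions inferred from the "max 2 consecutive newlines" and "single newlines only" descriptions above, since the thread quotes only the patterns.

```typescript
const text = "Line one\n\n\nLine two\n\nLine three";

// Previous pattern: runs of 3+ newlines shrink to 2, so blank lines survive.
const previous = text.replace(/\n{3,}/g, "\n\n");

// Updated pattern: any run of 2+ newlines shrinks to exactly one.
const updated = text.replace(/\n{2,}/g, "\n");

console.log(JSON.stringify(previous));
console.log(JSON.stringify(updated));
```

Note that either pattern only matches directly adjacent newline characters; a line containing only spaces between two newlines would need to be trimmed before this step to be collapsed.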

Deployed Edge Functions

All affected Edge Functions have been redeployed with the fix:

✅ woocommerce-sync
✅ shopify-sync
✅ shoprenter-sync
✅ mcp-woocommerce
✅ mcp-shopify
✅ mcp-shoprenter

Expected Result

Product descriptions in Qdrant will now be properly cleaned:

Before:

&lt;p&gt;Acilisis magna etiam tempor orci.&lt;/p&gt; &lt;p&gt;Maecenas accumsan...&lt;/p&gt;

After:

Acilisis magna etiam tempor orci.
Maecenas accumsan...

The fix ensures:

HTML entities (&lt;, &gt;, &nbsp;, etc.) are properly decoded
HTML tags are removed
<br> tags are converted to newlines
Duplicate newlines are removed (keeping only single newlines)
Text is clean and readable for embeddings

Git Commit

Committed and pushed: 32a8d0e with issue reference #107

@fszontagh The fix is now live in production!

Next sync will use the corrected cleanup logic. Any new product descriptions synced to Qdrant will be properly cleaned without HTML entities.

If you'd like to re-sync existing products to apply the fix to already-stored data, you can trigger a manual sync for your stores through the sync endpoints.
