There is a cleanup method in the Qdrant sync, but I still see HTML special characters in the Qdrant payload. For example, this is a product description from a ShopRenter shop as stored in Qdrant:
```
<p>Acilisis magna etiam tempor orci. Tincidunt ornare massa eget egestas purus viverra. Enim facilisis gravida neque convallis a. Odio euismod lacinia at quis risus sed. Ultrices mi tempus imperdiet nulla malesuada pellentesque elit. Ullamcorper dignissim cras tincidunt lobortis feugiat vivamus. Tincidunt praesent semper feugiat nibh sed pulvinar proin gravida. Commodo nulla facilisi nullam vehicula ipsum a arcu cursus. Sed risus pretium quam vulputate dignissim suspendisse in. Lectus arcu bibendum at varius vel pharetra vel turpis nunc. Faucibus purus in massa tempor nec. Pulvinar etiam non quam lacus suspendisse. Feugiat pretium nibh ipsum consequat nisl. </p> <p>Maecenas accumsan lacus vel facilisis volutpat est velit. Morbi tristique senectus et netus et malesuada. Semper auctor neque vitae tempus quam pellentesque nec nam aliquam. A cras semper auctor neque vitae tempus quam pellentesque nec. Quisque egestas diam in arcu cursus euismod quis. Tortor condimentum lacinia quis vel eros. Mauris commodo quis imperdiet massa tincidunt nunc pulvinar sapien et. Lacinia quis vel eros donec ac. Et netus et malesuada fames ac turpis. Neque sodales ut etiam sit amet nisl purus in mollis. Turpis egestas integer eget aliquet nibh. Tempor orci eu lobortis elementum nibh tellus molestie. Vitae sapien pellentesque habitant morbi tristique. </p>
```
We need to keep newlines, but only single ones: every HTML `<br/>` should be replaced with a newline, and duplicated newlines should be collapsed into one.
@fszontagh I'm starting work on this issue now! I'll:
1. Find the qdrant sync code and the existing cleanup method
2. Fix the HTML entity decoding (convert `&lt;` to `<`, `&gt;` to `>`, etc.)
3. Replace HTML `<br/>` tags with newlines
4. Remove duplicate newlines (keep only single newlines)
5. Clean up HTML tags while preserving text content
6. Test the changes to ensure proper cleanup
I'll investigate the codebase and implement the fix.
## ✅ Fix Completed
I've successfully fixed the HTML entity decoding issue in the Qdrant sync payload cleanup.
### Changes Made
**Root Cause**: The `cleanHtmlContent()` function in `html-cleaner.ts` was decoding HTML entities AFTER removing HTML tags. Encoded markup such as `&lt;p&gt;` contains no literal tags, so the tag-removal step skipped it; the later decoding step then turned it into `<p>`, which stayed in the output as raw text.
**Solution**:
1. **Moved HTML entity decoding to FIRST step** - Now `&lt;p&gt;` gets decoded to `<p>` before tag processing
2. **Changed newline handling** - Updated from max 2 consecutive newlines to single newlines only (as per requirement)
3. **Updated documentation** - Clarified the processing order and behavior
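
To illustrate why the order matters, here is a minimal sketch. The helpers `decodeEntities` and `stripTags` and the small entity table are hypothetical stand-ins written for this example, not the actual contents of `html-cleaner.ts`.

```typescript
// Hypothetical helpers for illustration only; not the project's real code.
const ENTITIES: Record<string, string> = {
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&nbsp;": " ",
  "&amp;": "&",
};

function decodeEntities(text: string): string {
  // Single pass, so "&amp;lt;" becomes "&lt;" rather than "<".
  return text.replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, (m) => ENTITIES[m] ?? m);
}

function stripTags(text: string): string {
  // Removes anything that looks like <...>.
  return text.replace(/<[^>]+>/g, "");
}

const encoded = "&lt;p&gt;Hello&lt;/p&gt;";

// Old order: there are no literal tags to strip yet, so decoding re-creates them.
const stripThenDecode = decodeEntities(stripTags(encoded));

// New order: entities become real tags first, then the tags are removed.
const decodeThenStrip = stripTags(decodeEntities(encoded));

console.log(stripThenDecode); // "<p>Hello</p>"
console.log(decodeThenStrip); // "Hello"
```

One trade-off of decoding first, at least with a simple tag pattern like the one in this sketch: text such as `a &lt; b &gt; c` decodes to `a < b > c`, and `< b >` is then removed as if it were a tag.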
### Code Changes
File: `supabase/functions/_shared/html-cleaner.ts`
- ✅ Reordered processing steps: decode entities → replace tags with newlines → remove HTML tags
- ✅ Updated newline deduplication regex from `/\n{3,}/g` to `/\n{2,}/g`
- ✅ Added test case for encoded HTML entities
- ✅ Updated documentation comments
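
The effect of the regex change can be sketched as follows. Only the two patterns come from the change list above; the replacement strings (`"\n\n"` and `"\n"`) are assumptions inferred from the "max 2 consecutive newlines" and "single newlines only" descriptions.

```typescript
const sample = "Line one\n\n\nLine two\n\nLine three";

// Previous pattern: runs of 3+ newlines shrink to one blank line (2 newlines).
// The "\n\n" replacement is an assumption, not taken from the source.
const oldResult = sample.replace(/\n{3,}/g, "\n\n");

// Updated pattern: any run of 2+ newlines shrinks to a single newline.
// The "\n" replacement is likewise an assumption.
const newResult = sample.replace(/\n{2,}/g, "\n");

console.log(JSON.stringify(oldResult)); // "Line one\n\nLine two\n\nLine three"
console.log(JSON.stringify(newResult)); // "Line one\nLine two\nLine three"
```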
### Deployed Edge Functions
All affected Edge Functions have been redeployed with the fix:
- ✅ `woocommerce-sync`
- ✅ `shopify-sync`
- ✅ `shoprenter-sync`
- ✅ `mcp-woocommerce`
- ✅ `mcp-shopify`
- ✅ `mcp-shoprenter`
### Expected Result
Product descriptions in Qdrant will now be properly cleaned:
**Before:**
```
<p>Acilisis magna etiam tempor orci.</p> <p>Maecenas accumsan...</p>
```
**After:**
```
Acilisis magna etiam tempor orci.
Maecenas accumsan...
```
The fix ensures:
- HTML entities (`&lt;`, `&gt;`, `&nbsp;`, etc.) are properly decoded
- HTML tags are removed
- `<br/>` tags are converted to newlines
- Duplicate newlines are removed (keeping only single newlines)
- Text is clean and readable for embeddings
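
Taken together, the guarantees above could be sketched roughly like this. The function `cleanHtmlSketch` is a hypothetical illustration, not the real `cleanHtmlContent()`. The entity table, the set of block-level closing tags, and the trimming of spaces around newlines are all assumptions made for the example.

```typescript
// Illustrative sketch only; the real cleanHtmlContent() may differ.
function cleanHtmlSketch(input: string): string {
  // Assumed minimal entity table; a real implementation likely covers more.
  const entities: Record<string, string> = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&#39;": "'",
    "&nbsp;": " ",
    "&amp;": "&",
  };

  return input
    // 1. Decode entities first so encoded markup becomes real tags.
    .replace(/&(?:lt|gt|quot|#39|nbsp|amp);/g, (m) => entities[m] ?? m)
    // 2. Turn <br>, <br/> and closing block tags into newlines (tag list assumed).
    .replace(/<br\s*\/?>|<\/(?:p|div|li|h[1-6])>/gi, "\n")
    // 3. Remove every remaining tag, keeping the text content.
    .replace(/<[^>]+>/g, "")
    // 4. Drop spaces and tabs around newlines (an assumption of this sketch).
    .replace(/[ \t]*\n[ \t]*/g, "\n")
    // 5. Collapse runs of newlines into a single newline.
    .replace(/\n{2,}/g, "\n")
    .trim();
}

const stored =
  "&lt;p&gt;First paragraph. &lt;/p&gt; &lt;p&gt;Second line&lt;br/&gt;&lt;br/&gt;third&nbsp;line.&lt;/p&gt;";

console.log(cleanHtmlSketch(stored));
// First paragraph.
// Second line
// third line.
```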
### Git Commit
Committed and pushed: `32a8d0e` with issue reference #107
@fszontagh The fix is now live in production!
**The next sync will use the corrected cleanup logic.** Any new product descriptions synced to Qdrant will be properly cleaned, without HTML entities.
If you'd like to re-sync existing products to apply the fix to already-stored data, you can trigger a manual sync for your stores through the sync endpoints.