Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PRO] Improve the way we remove duplicates during the import #1006

Open
vytisbulkevicius opened this issue Nov 25, 2024 · 32 comments · May be fixed by #1034
Open

[PRO] Improve the way we remove duplicates during the import #1006

vytisbulkevicius opened this issue Nov 25, 2024 · 32 comments · May be fixed by #1034
Assignees
Labels
enhancement Request to improve or optimize an existing feature or functionality in the project
Milestone

Comments

@vytisbulkevicius
Copy link
Contributor

What problem does this address?

The way we look for duplicates is not ideal, we compare by title of the feed AND link of an item, this works if a single Feed is used but if you are using multiple feeds URLs will be different and it can be the same item.
We have this option:

Which to be honest I don't know what is doing, I think it should be enabled by default and documentation to which it points is a code snippet I added a few months ago as before it didn't make sense at all and was pointing to a code snippet used for shortcode/widget usage:
Image

I think what we can do is to rename it to remove duplicates by title and make it work as this code snippet:
https://github.com/Codeinwp/feedzy-rss-feeds-pro/issues/754#issuecomment-2382295616

What is your proposed solution?

No response

Will this feature require documentation? (Optional)

None

@vytisbulkevicius vytisbulkevicius added the enhancement Request to improve or optimize an existing feature or functionality in the project label Nov 25, 2024
@vytisbulkevicius vytisbulkevicius added this to the Feedzy 5.0 milestone Dec 2, 2024
@HardeepAsrani
Copy link
Member

@vytisbulkevicius This is not the best approach either because if we compare by the item title and the user is using two different feed URLs, it will remove unique items if they share the same title.

This is an edge case and we shouldn't be bothered about it apart from the documentation. The reason here being that whatever change we make, it will create more new edge cases. The current implementation that we have is quite reasonable.

@vytisbulkevicius
Copy link
Contributor Author

@HardeepAsrani can you share then what the Remove Duplicates Toggle does and why isn't it enabled by default?

Duplicate items is very common issue our customers/users have so I think we should consider improving the logic even if my suggestion is not the best (I understand that the same title doesn't always mean they are duplicates)

@vytisbulkevicius
Copy link
Contributor Author

Such a toggle can be used that if it's enabled, we then remove checking for duplicated title as well which I still think is good, we can mention what it does. I think it's better to skip some item from being imported than to see duplicates on a website

@selul
Copy link
Contributor

selul commented Dec 5, 2024

in the future we can offer a customization per job where people can define the duplicate criteria, i.e they can select title/url, etc as elements to compound the unique key.

@vytisbulkevicius vytisbulkevicius changed the title Improve the way we remove duplicates during the import [PRO] Improve the way we remove duplicates during the import Dec 5, 2024
@girishpanchal30
Copy link
Contributor

@vytisbulkevicius @selul @HardeepAsrani Currently, we identify duplicate items based on the item_url, which we believe is the correct approach.

i.e they can select title/url, etc as elements to compound the unique key.

Using the title field might lead to more issues, especially when users perform actions like translations or rewrites. It could become difficult to determine whether the title is a duplicate.

If you have any other suggestions in mind, please let me know.

Thank you!

@vytisbulkevicius
Copy link
Contributor Author

@girishpanchal30,

the idea was to give that flexibility to the end user, by default we keep it working as it is now but next to this toggle to remove duplicate we give an option to add fields like Title, Content, URL and if user thinks it's better for them to remove duplicates using such criteria they can choose them.

Do you see this as complicated and causing risks?

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius

It can be challenging to identify duplicate items in imported posts when users have utilized our translation or rewrite services for titles, and content fields.

For example, if a user translates a title, it becomes hard to find duplicates in previous imports because the original title is no longer available.

I hope this makes sense.

@vytisbulkevicius
Copy link
Contributor Author

@girishpanchal30,

Understood. I thought we don't compare with new/rewritten title but save the original one.

And the way we currently compare by link is that we have it stored (the original link) and compare by that? If we want to do by title we should also store original title the same way?

@selul
Copy link
Contributor

selul commented Dec 13, 2024

@girishpanchal30 I suggest the following:

  1. Being able to customize the uniqueness of the key, i.e having an input where user can add the fields that will be considered when building the unique key, i.e title, certain custom tag, etc. BY default it will be the current way we build the duplication key.
  2. We save the key value to the post imported, in this way any action we add on the titles wont affect the previous values.

@girishpanchal30
Copy link
Contributor

@selul, I understand your suggestion, and it seems possible to implement.

I need to add a new text box field next to the "Remove Duplicates" field, and this will be for PRO users only, correct?

Image

And users can enter any magic tag in this field, right?

@selul
Copy link
Contributor

selul commented Dec 13, 2024

That's correct.

And the computed value of each item will be saved with the import as a meta, we can use sanitize_key to create the value based item values so we can later identify them. We can use a fixed length for the key like 256 chars.

When checking future imports, we should check if there is a post with similar keys.

Please have in mind if the field is not saved/set, the previous mechanism you used for duplication should still apply.

@girishpanchal30 girishpanchal30 linked a pull request Dec 17, 2024 that will close this issue
6 tasks
@girishpanchal30
Copy link
Contributor

I have implemented the suggestion in this PR. Please review it and let me know if you encounter any issues.

@girishpanchal30 girishpanchal30 linked a pull request Dec 17, 2024 that will close this issue
6 tasks
@vytisbulkevicius
Copy link
Contributor Author

@girishpanchal30,

Tested and feature mostly works.
The feature is visible but locked with the Free plugin, and content is not imported ✅
If I choose [#item_title] for duplicate - it works as expected, items with the same title are not imported. Same for [#item_content] ✅

Potential issue:

Also, I'm not sure about whether you can add multiple tags only a single tag?
If I can add multiple tags and I add like this:
Image

Does it mean that items will be treated as duplicates ONLY IF both title and content are identical or enough for ONE OF THEM to be identical? [My expectation is AND condition, not OR if we allow multiple tags]

Or it shouldn't work at all and I can only add a single tag?

Because if I add both tags [i#tem_title] and [#item_content] duplicates are imported.

You can test with this feed:
https://approveskin.s4-tastewp.com/feed/ (both most recent items have same title and content)

@vytisbulkevicius
Copy link
Contributor Author

Also, how the Toggle correlates with the new section?
Image

Should it be enabled/disabled or it doesn't matter and magic tag functionality works on top of default removal of duplicates (if enabled)?

@girishpanchal30
Copy link
Contributor

Should it be enabled/disabled or it doesn't matter and magic tag functionality works on top of default removal of duplicates (if enabled)?

@vytisbulkevicius I’ve disabled the input field when the toggle is not enabled. Would you prefer to hide or show the entire section in this case?

Thanks!

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius

Also, I'm not sure about whether you can add multiple tags only a single tag?

It only works with a single tag, not multiple.

Because if I add both tags [i#tem_title] and [#item_content] duplicates are imported.

Using multiple tags doesn't work properly; it falls back to the old method for checking duplicate items.

Adding multiple tags might cause issues. For example:

Using multiple tags like [#item_content], [#item_title] will fetch the content for both tags. However, as per @selul note, only a 256-character string will be created from the content.

In this case, if [#item_content] is lengthy, the 256-character string might not include [#item_title].

@selul
Copy link
Contributor

selul commented Dec 18, 2024

@girishpanchal30 I think this is ok; if this is the only drawback, we can allow multiple tags having in mind that the limit of the key will be max 256. Please make the limit length customizable via some filters also.

@vytisbulkevicius
Copy link
Contributor Author

@girishpanchal30,

Should it be enabled/disabled or it doesn't matter and magic tag functionality works on top of default removal of duplicates (if enabled)?

@vytisbulkevicius I’ve disabled the input field when the toggle is not enabled. Would you prefer to hide or show the entire section in this case?

Good, I think it makes sense to have the input disabled. I tested and looks good to me.

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius @selul I've added support for multiple tags and introduced a new filter that allows users to modify the character limit.

add_filter(
	'feedzy_mark_duplicate_content_limit',
	function () {
		return 500;
	}
);

Thanks

@vytisbulkevicius
Copy link
Contributor Author

@girishpanchal30,

Unfortunately, after the recent changes it doesn't work at all for me.
All items are imported no matter if I use [#item_title] or [#item_content] or both tags at the same time.

You can check it here:

Admin area URL: https://boringpaste.s3-tastewp.com/wp-admin
Username: bvytis
Password: aFBAzZIycW8

Import: https://boringpaste.s3-tastewp.com/wp-admin/post.php?post=9&action=edit

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius I've checked with the local setup, and everything seems to be working fine.

Image

It seems there may be an issue with the long meta key. I'll check it and let you know.

Thanks

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius I have reset the import job in your test instance, re-run it, and it appears to be working fine.
I’ve also made some changes and pushed the updated code. Please test with the latest build ZIP.

@vytisbulkevicius
Copy link
Contributor Author

@girishpanchal30,

I don't know what I'm doing wrong, but it still doesn't work. I tried again on the test instance I shared above, and it still doesn't work.

I also created a new fresh one and same thing happens:

URL: https://clean-centipede-4cc67e.vertisite.cloud
Username: cicijicali7341
Password: sm7o4zQuriVWxZHFDBnt

Screencast: https://vertis.d.pr/v/xfOTSF

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius I’ve updated the meta key flag, and it seems to be working now. Please test it with the latest build zip.

@vytisbulkevicius
Copy link
Contributor Author

@girishpanchal30,

Yes, it works now!

  1. One more thing that I noticed is that if I add multiple tags:
Image

And if only one of them is the same (titles are the same but contents are not), such ones are still marked as duplicates. I think we should treat this as AND not OR. So I want to find duplicates that both hame same title and content not one of them.

  1. Also, we show 10 items were imported even though a duplicate was found and only 9 were imported:
Image

@girishpanchal30
Copy link
Contributor

  1. One more thing that I noticed is that if I add multiple tags:

I’ve already used the AND condition, but after checking the feed, the title and content seem the same.

Image

  1. Also, we show 10 items were imported even though a duplicate was found and only 9 were imported:

Fixed, please check with the latest build zip.

@vytisbulkevicius
Copy link
Contributor Author

vytisbulkevicius commented Dec 20, 2024

@girishpanchal30,
This was the slight difference that I made when importing to make them different:
Image

I will check again by modifying some words.

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius Also please increase the limit char if content is long.

@vytisbulkevicius
Copy link
Contributor Author

vytisbulkevicius commented Dec 20, 2024

@girishpanchal30,

Yes, it worked fine after I changed the content, not just added that additional dot in the end.

However, I'm not sure why this happens but with the latest build I'm getting weird behavior, there is an error in the input field and I can't overwrite it (happens on any fresh installation, so shouldn't be cache):
Image

Image

Error is:

If the error is related to some plugin, you can disable all plugins in TasteWP.com dashboard.

Warning: Undefined variable $mark_duplicate_tag in /s2-stupendouspush/wordpress/wp-content/plugins/feedzy-rss-feeds/includes/views/import-metabox-edit.php on line 823

@girishpanchal30
Copy link
Contributor

@vytisbulkevicius Resolved! The issue occurred after I rebased the branch code.

@vytisbulkevicius
Copy link
Contributor Author

works great now, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Request to improve or optimize an existing feature or functionality in the project
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants