Unable to scrape URI - Medium posts #589
Comments
I figured out that it was failing because of a redirect loop: the requests need cookies. I tried to set them in mg-webscraper, but it sends cookies only on the first request, not on the redirects that follow. As a hacky workaround, I used a Burp proxy with match and replace, and edited index.js in tinyreq/lib/ to route the requests through the proxy.
Since HTTPS requests would fail, I used this command to ignore those.
With this in place, it ran successfully and there were no issues.
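The cookie problem described above (cookies sent on the first request but not on the redirects) can also be approached without a proxy by following redirects one hop at a time and attaching the cookie to each request. The following is a minimal sketch only, assuming Node 18+ with the global `fetch`; `fetchWithCookie` and its parameters are hypothetical names and are not part of mg-webscraper or tinyreq.

```javascript
// Hypothetical helper, not part of mg-webscraper or tinyreq.
// Follows redirects manually so the cookie header is sent on every hop.
// fetchImpl defaults to the global fetch (assumes Node 18+); it is injectable for testing.
async function fetchWithCookie(startUrl, cookie, fetchImpl = fetch, maxHops = 10) {
    let currentUrl = startUrl;

    for (let hop = 0; hop <= maxHops; hop++) {
        const res = await fetchImpl(currentUrl, {
            redirect: 'manual', // do not let fetch follow redirects on its own
            headers: {
                'user-agent': 'Crawler/1.0',
                cookie
            }
        });

        const location = res.headers.get('location');
        const isRedirect = res.status >= 300 && res.status < 400 && location;

        if (!isRedirect) {
            // Final response reached: report where we ended up
            return {url: currentUrl, response: res};
        }

        // Location may be relative, so resolve it against the current URL
        currentUrl = new URL(location, currentUrl).toString();
    }

    // Guards against the redirect loop described in this thread
    throw new Error(`Too many redirects (more than ${maxHops}) starting from ${startUrl}`);
}
```

Usage would look like `const {url, response} = await fetchWithCookie(postUrl, 'your cookie string here');`. The hop limit means a missing or rejected cookie surfaces as an error instead of an endless loop.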
I was able to work around this by making the following changes in migrate/packages/mg-webscraper/lib/WebScraper.js, lines 109 to 114 in 4fb9144:
const reqOpts = {
    // Request the post from the custom domain instead of the medium.com profile URL
    url: url.replace('https://medium.com/@someSlug', 'https://your.customdomain.com'),
    headers: {
        'user-agent': 'Crawler/1.0',
        // Cookie string for the site, sent with the request
        'cookie': 'your cookie string here'
    }
};
This will not work if the Medium author publishes their articles in multiple publications rather than on their own domain.
This tool fails when a Medium user's posts are added to a publication that has a custom domain.
For example, https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc redirects to https://baos.pub/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc, which is a Medium publication with its own custom domain.
The tool fails for all such posts.
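To illustrate the case above, here is a small sketch of how a redirect that leaves medium.com for a publication's custom domain could be detected from the `Location` header. `resolveRedirect` is a hypothetical helper used only for illustration; it is not part of the tool and relies only on the standard `URL` class.

```javascript
// Hypothetical helper: resolves a redirect target and reports whether it
// moves from medium.com to a different (custom) domain.
function resolveRedirect(originalUrl, locationHeader) {
    const from = new URL(originalUrl);
    // Resolve relative Location values against the original URL
    const to = new URL(locationHeader, from);

    const isMedium = host => host === 'medium.com' || host.endsWith('.medium.com');

    return {
        target: to.toString(),
        leavesMedium: isMedium(from.hostname) && !isMedium(to.hostname)
    };
}

const info = resolveRedirect(
    'https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc',
    'https://baos.pub/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc'
);
console.log(info.leavesMedium, info.target);
```

A scraper could use such a check to decide whether to keep sending the same headers to the new host or to report the post as hosted on a custom domain.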