Unable to scrape URI - Medium posts #589

Open
kmskrishna opened this issue Nov 12, 2022 · 3 comments

@kmskrishna

This tool fails when a Medium user's posts belong to a publication that uses a custom domain.

For example, https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc redirects to https://baos.pub/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc, which belongs to a Medium publication with its own custom domain.

This tool fails for all such posts.

@kmskrishna
Author

I figured out that it was failing because of a redirect loop through this URL:

https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9

The request needs cookies. I tried to set them in mg-webscraper, but it sends cookies only on the first request, not on the redirects that follow.
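The idea of sending the cookie on every hop, not only the first request, can be sketched as follows. This is a minimal illustration, not how mg-webscraper or tinyreq work: it assumes Node 18+ with the built-in `fetch`, and the names `fetchWithCookies`, `fetchImpl`, and `maxHops` are hypothetical. It also assumes that `redirect: "manual"` exposes the 3xx status and `Location` header, as Node's `fetch` does.

```javascript
// Sketch: follow redirects by hand so the Cookie header is sent on every hop.
// Assumption: Node 18+ (global fetch). All names here are illustrative.
async function fetchWithCookies(startUrl, cookie, fetchImpl = fetch, maxHops = 10) {
  let url = startUrl;
  for (let hop = 0; hop <= maxHops; hop++) {
    const res = await fetchImpl(url, {
      redirect: "manual", // stop fetch from following redirects on its own
      headers: { "user-agent": "Crawler/1.0", cookie },
    });
    const location = res.headers.get("location");
    if (res.status >= 300 && res.status < 400 && location) {
      // Resolve relative Location values against the current URL.
      url = new URL(location, url).toString();
      continue;
    }
    return { url, response: res };
  }
  throw new Error(`Too many redirects starting from ${startUrl}`);
}
```

The `fetchImpl` parameter exists only so the function can be exercised with a stub; in normal use it can be left out.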

So, as a hacky workaround, I used a Burp proxy with a match-and-replace rule that swaps User-Agent: Crawler/1.0 for Cookie: XXX, where XXX is my Medium cookies.

I also edited index.js in tinyreq/lib/ to route the requests through the Burp proxy:

// Route all outgoing HTTP(S) requests through the local Burp proxy.
const proxy = require("node-global-proxy").default;

proxy.setConfig({
    http: "http://127.0.0.1:8080",
    https: "http://127.0.0.1:8080",
});
proxy.start();

Since HTTPS requests would otherwise fail certificate verification behind the proxy, I ran the tool with TLS verification disabled:

NODE_TLS_REJECT_UNAUTHORIZED='0' yarn dev medium folder.zip

With this setup the tool ran successfully, without any issues.

@jknight12882

I was able to work around this by making the following changes at

// Before
const reqOpts = {
    url: url,
    headers: {
        'user-agent': 'Crawler/1.0'
    }
};

// After
const reqOpts = {
    url: url.replace('https://medium.com/@someSlug', 'https://your.customdomain.com'),
    headers: {
        'user-agent': 'Crawler/1.0',
        'cookie': 'your cookie string here',
    }
};

@kmskrishna
Author

This will not work if the Medium author publishes articles in several different publications rather than on a single custom domain of their own.
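One possible way to avoid hard-coding a single domain is to read the target from the redirect itself. The global-identity URL quoted earlier in this thread carries the custom-domain address in its `redirectUrl` query parameter. The sketch below assumes that every such redirect has this shape; `resolveGlobalIdentity` is a hypothetical helper, and the request to the resulting URL may still need the cookie header.

```javascript
// Sketch: recover the custom-domain URL from a Medium global-identity redirect.
// Assumption: the redirect looks like
//   https://medium.com/m/global-identity?redirectUrl=<encoded target>
function resolveGlobalIdentity(location) {
  const parsed = new URL(location);
  const isGlobalIdentity =
    parsed.hostname === "medium.com" && parsed.pathname === "/m/global-identity";
  if (isGlobalIdentity) {
    // searchParams.get() percent-decodes the value for us.
    return parsed.searchParams.get("redirectUrl") || location;
  }
  return location; // not a global-identity hop, leave it unchanged
}

// Usage with the URL from this thread:
const target = resolveGlobalIdentity(
  "https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9"
);
console.log(target);
```

Because the domain comes from each redirect, this would not depend on which publication a given article belongs to.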
