Unable to scrape URI - Medium posts #589

kmskrishna · 2022-11-12T14:50:34Z

This tool fails to scrape Medium posts that have been added to a publication with a custom domain.

For example, https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc redirects to https://baos.pub/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc, which belongs to a Medium publication with its own custom domain.

The tool fails for all such posts.

kmskrishna · 2022-11-13T07:51:35Z

I figured out that it was failing because of a redirect loop.

https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9
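For reference, the redirectUrl query parameter in that URL carries the percent-encoded custom-domain address. Below is a minimal sketch of decoding it with Node's built-in URL class. The helper name extractRedirectTarget is hypothetical and not part of mg-webscraper.

```javascript
// Sketch: pull the real destination out of a Medium global-identity redirect URL.
function extractRedirectTarget(redirectPageUrl) {
    const parsed = new URL(redirectPageUrl);
    // searchParams.get() returns the percent-decoded value, or null if the parameter is absent
    return parsed.searchParams.get('redirectUrl');
}

const loopUrl = 'https://medium.com/m/global-identity?redirectUrl=https%3A%2F%2Fbaos.pub%2Fdont-start-these-3-brilliant-books-on-a-holiday-ed84c3b143c9';
console.log(extractRedirectTarget(loopUrl));
```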

The request needs cookies. I tried setting them in mg-webscraper, but it sends cookies only on the first request, not on the redirects.

So, as a hacky solution, I used a Burp proxy with a match-and-replace rule that swaps User-Agent: Crawler/1.0 for Cookie: XXX, using my own Medium cookies.

I also edited index.js in tinyreq/lib/ to route the requests through the Burp proxy:

// Route all outgoing HTTP and HTTPS requests through the local Burp listener
const proxy = require("node-global-proxy").default;

proxy.setConfig({
    http: "http://127.0.0.1:8080",
    https: "http://127.0.0.1:8080",
});
proxy.start();

Since HTTPS requests through the proxy would fail certificate validation, I ran the tool with TLS verification disabled:

NODE_TLS_REJECT_UNAUTHORIZED='0' yarn dev medium folder.zip

With this setup the tool ran successfully, with no issues.
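The cookie problem described above (cookies sent on the first request but not on the redirects) could in principle also be handled in code rather than through a proxy. The following is only a sketch under assumptions: it uses the global fetch shipped with Node 18+ instead of tinyreq, and fetchWithCookies is a hypothetical helper, not something mg-webscraper provides. It follows redirects by hand so the Cookie header is re-sent on every hop, and gives up after a fixed number of hops so a redirect loop cannot run forever.

```javascript
// Sketch only: follow redirects manually so the Cookie header is sent on every hop.
// fetchImpl defaults to the global fetch (Node 18+); it is injectable so it can be stubbed.
async function fetchWithCookies(url, cookie, fetchImpl = fetch, maxHops = 10) {
    let current = url;
    for (let hop = 0; hop <= maxHops; hop++) {
        const res = await fetchImpl(current, {
            redirect: 'manual', // do not let fetch follow redirects itself
            headers: {
                'user-agent': 'Crawler/1.0',
                cookie: cookie
            }
        });
        const location = res.headers.get('location');
        if (res.status >= 300 && res.status < 400 && location) {
            // Resolve relative Location values against the current URL
            current = new URL(location, current).toString();
            continue;
        }
        return { finalUrl: current, response: res };
    }
    // Reached only when every request up to the hop limit was a redirect
    throw new Error(`Too many redirects while fetching ${url}`);
}
```

Whether Medium accepts the same cookie string on the custom domain is not confirmed in this thread, so treat this as a starting point to test rather than a verified fix.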

jknight12882 · 2022-11-16T18:49:29Z

I was able to work around this by making the following changes at

migrate/packages/mg-webscraper/lib/WebScraper.js

Lines 109 to 114 in 4fb9144

    
           const reqOpts = { 
        
               url: url, 
        
               headers: { 
        
                   'user-agent': 'Crawler/1.0' 
        
               } 
        
           };

const reqOpts = {
    url: url.replace('https://medium.com/@someSlug', 'https://your.customdomain.com'),
    headers: {
        'user-agent': 'Crawler/1.0',
        'cookie': 'your cookie string here',
    }
};
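The same workaround can be wrapped in a small helper so the placeholder values are passed in rather than hard-coded. This is a sketch, not the actual WebScraper.js code: buildReqOpts and its option names (authorPrefix, customDomain, cookie) are hypothetical, and the example values are taken from the URLs in the first comment.

```javascript
// Sketch: build request options, rewriting the author's medium.com prefix to a custom domain.
// authorPrefix, customDomain and cookie are placeholders supplied by the caller.
function buildReqOpts(url, { authorPrefix, customDomain, cookie }) {
    // Only rewrite URLs that actually start with the author's medium.com prefix
    const rewritten = url.startsWith(authorPrefix)
        ? customDomain + url.slice(authorPrefix.length)
        : url;
    return {
        url: rewritten,
        headers: {
            'user-agent': 'Crawler/1.0',
            cookie: cookie
        }
    };
}

const opts = buildReqOpts(
    'https://medium.com/@anangsha/i-owe-an-apology-to-mostly-harmless-c5503f1f2acc',
    {
        authorPrefix: 'https://medium.com/@anangsha',
        customDomain: 'https://baos.pub',
        cookie: 'your cookie string here'
    }
);
console.log(opts.url);
```

Because it rewrites a single prefix to a single domain, it only helps when all of an author's posts live on one custom domain.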

kmskrishna · 2022-12-02T12:35:49Z

This will not work if the Medium author publishes articles across multiple publications rather than on their own domain.
