Web Scraping with Deno  –  DOM + GraphQL


DQL is a web scraping module for Deno and Deno Deploy that lets you run GraphQL queries against the DOM tree of a remote webpage or HTML document fragment. It is a fork of DenoQL with heavy refactoring and several additional features:

  • Compatibility with the Deno Deploy architecture
  • Ability to pass variables alongside all queries
  • New state-management class with additional methods
  • Modular project structure (as opposed to a mostly single-file design)
  • Improved types and schema structure

Note: This is a work-in-progress and there is still a lot to be done.


useQuery

The module's primary export is the useQuery function:

import { useQuery } from "https://deno.land/x/dql/mod.ts";

const data = await useQuery(`query { ... }`);

QueryOptions

You can also provide a QueryOptions object as the second argument of useQuery, to further control the behavior of your query requests. All properties are optional.

const data = await useQuery(`query { ... }`, {
  concurrency: 8, // passed directly to PQueue initializer
  fetch_options: { // passed directly to Fetch API requests
    headers: {
      "Authorization": "Bearer <YOUR_TOKEN>", // placeholder; never commit a real token
    },
  },
  variables: {}, // variables defined in your queries
  operationName: "", // when using multiple queries
});
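
The variables and operationName options are easiest to see together. The sketch below builds a document with two named operations and an options object that selects one of them. The operation names (PageTitle, LinkCount), the example URL, and the variable names are hypothetical; the fields (page, title, count) are the ones used in the examples in this README. It assumes DQL follows standard GraphQL semantics, where operationName is required once a document contains more than one operation.

```typescript
// Two named operations in a single document. Operation names are
// hypothetical; the fields mirror those used elsewhere in this README.
const queryDocument = `
  query PageTitle($url: String) {
    page(url: $url) {
      title
    }
  }

  query LinkCount($url: String, $selector: String = "a") {
    page(url: $url) {
      total: count(selector: $selector)
    }
  }`;

// Options that pick which operation to run and supply its variables.
// $selector is omitted here, so its default ("a") would apply.
const queryOptions = {
  operationName: "LinkCount",
  variables: { url: "https://example.com" },
};

// With the module imported, the call would look like:
//   const data = await useQuery(queryDocument, queryOptions);
```

Keeping the document and the options as separate values makes it straightforward to reuse one document for several requests by changing only operationName and variables.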

createServer

With Deno Deploy, you can deploy DQL with a GraphQL Playground in only 2 lines of code:

import { createServer } from "https://deno.land/x/dql/mod.ts";

createServer(80, { endpoint: "https://dql.deno.dev" });

🛝 Try the GraphQL Playground at dql.deno.dev
🦕 View the source code in the Deno Playground

Command Line Usage (CLI)

deno run -A --unstable https://deno.land/x/dql/serve.ts

Custom port (default is 8080)

deno run -A https://deno.land/x/dql/serve.ts --port 3000

Note: the Deno CLI must be installed first.


💻 Examples

🚛 Junkyard Scraper · Deno Playground 🦕

import { useQuery } from "https://deno.land/x/dql/mod.ts";
import { serve } from "https://deno.land/std@0.147.0/http/server.ts";

serve(async (_req: Request) =>
  await useQuery(
    `
  query Junkyard (
    $url: String
    $itemSelector: String = "table > tbody > tr"
  ) {
    vehicles: page(url: $url) {
      totalCount: count(selector: $itemSelector)
      nodes: queryAll(selector: $itemSelector) {
        id: index
        vin:   text(selector: "td:nth-child(7)", trim: true)
        sku:   text(selector: "td:nth-child(6)", trim: true)
        year:  text(selector: "td:nth-child(1)", trim: true)
        model: text(selector: "td:nth-child(2) > .notranslate", trim: true)
        aisle: text(selector: "td:nth-child(3)", trim: true)
        store: text(selector: "td:nth-child(4)", trim: true)
        color: text(selector: "td:nth-child(5)", trim: true)
        date:  attr(selector: "td:nth-child(8)", name: "data-value")
        image: src(selector: "td > a > img")
      }
    }
  }`,
    {
      variables: {
        "url": "http://nvpap.deno.dev/action=getVehicles&makes=BMW",
      },
    },
  )
    .then((data) => JSON.stringify(data, null, 2))
    .then((json) =>
      new Response(json, {
        headers: { "content-type": "application/json;charset=utf-8" },
      })
    )
);

📝 HackerNews Scraper · Deno Playground 🦕

import { useQuery } from "https://deno.land/x/dql/mod.ts";
import { serve } from "https://deno.land/std@0.147.0/http/server.ts";

serve(async (_req: Request) =>
  await useQuery(`
  query HackerNews (
    $url: String = "http://news.ycombinator.com"
    $rowSelector: String = "tr.athing"
  ) {
    page(url: $url) {
      title
      totalCount: count(selector: $rowSelector)
      nodes: queryAll(selector: $rowSelector) {
        rank: text(selector: "td span.rank", trim: true)
        title: text(selector: "td.title a", trim: true)
        site: text(selector: "span.sitestr", trim: true)
        url: href(selector: "td.title a")
        attrs: next {
          score: text(selector: "span.score", trim: true)
          user: text(selector: "a.hnuser", trim: true)
          date: attr(selector: "span.age", name: "title")
        }
      }
    }
  }`)
    .then((data) => JSON.stringify(data, null, 2))
    .then((json) =>
      new Response(json, {
        headers: { "content-type": "application/json;charset=utf-8" },
      })
    )
);
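
Both examples above end with the same two steps: serialize the query result, then wrap it in a Response with a JSON content type. As a minimal sketch, those steps could be factored into one helper. The name jsonResponse and its optional status parameter are hypothetical and not part of DQL; the helper relies only on the standard Response constructor.

```typescript
// Hypothetical helper (not part of DQL): serializes any value as
// pretty-printed JSON and wraps it in a Response, using the same
// content-type header as the examples above.
function jsonResponse(data: unknown, status = 200): Response {
  return new Response(JSON.stringify(data, null, 2), {
    status,
    headers: { "content-type": "application/json;charset=utf-8" },
  });
}

// Stand-in for a query result, used here only to exercise the helper.
const exampleResponse = jsonResponse({ page: { title: "Example" } });
```

With such a helper, the tail of each handler could shrink to a single .then(jsonResponse) call after useQuery.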

License

MIT © Nicholas Berlette, based on DenoQL.