Azure Exporter for Scrapy

Scrapy feed export storage backend for Azure Storage.

Requirements

  • Python 3.8+

Installation

pip install git+https://github.com/scrapy-plugins/scrapy-feedexporter-azure-storage

Usage

  • Add this storage backend to the FEED_STORAGES Scrapy setting. For example:

    # settings.py
    FEED_STORAGES = {'azure': 'scrapy_azure_exporter.AzureFeedStorage'}
  • Configure authentication via any of the following settings:

    • AZURE_CONNECTION_STRING
    • AZURE_ACCOUNT_URL_WITH_SAS_TOKEN
    • AZURE_ACCOUNT_URL and AZURE_ACCOUNT_KEY (both must be set when using this method)

    For example,

    AZURE_ACCOUNT_URL = "https://<your-storage-account-name>.blob.core.windows.net/"
    AZURE_ACCOUNT_KEY = "Account key for the Azure account"
  • In the FEEDS Scrapy setting, specify the Azure URI to which the feed should be exported:

    FEEDS = {
        "azure://<account_name>.blob.core.windows.net/<container_name>/<file_name.extension>": {
            "format": "json"
        }
    }
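The URI pattern above can also be built programmatically, which may help when the account or container name comes from configuration. The following is a minimal sketch: the helper azure_feed_uri and the values "myaccount", "scraped-data" and "items.json" are hypothetical and are not part of this package.

```python
# settings.py (sketch)

def azure_feed_uri(account_name: str, container_name: str, file_name: str) -> str:
    """Build a feed URI following the pattern
    azure://<account_name>.blob.core.windows.net/<container_name>/<file_name.extension>.

    Hypothetical helper for illustration; it is not provided by the package.
    """
    return f"azure://{account_name}.blob.core.windows.net/{container_name}/{file_name}"


# Placeholder values; replace them with your own account, container and file name.
FEEDS = {
    azure_feed_uri("myaccount", "scraped-data", "items.json"): {
        "format": "json",
    }
}
```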

Write mode and blob type

The overwrite feed option is False by default when using this feed export storage backend. An extra feed option is also provided, blob_type, which can be "BlockBlob" (default) or "AppendBlob". See Understanding blob types. The feed options overwrite and blob_type can be combined to set the write mode of the feed export:

  • overwrite=False and blob_type="BlockBlob": creates the blob if it does not exist, and fails if it already exists.
  • overwrite=False and blob_type="AppendBlob": appends to the blob if it exists and is an AppendBlob, and creates it otherwise.
  • overwrite=True: overwrites the blob if it already exists. The blob_type must match that of the target blob.
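The three combinations above can be summarized as a small lookup. This is only a sketch of the rules as described in this section: describe_write_mode and the mode names it returns are hypothetical and do not exist in the package.

```python
def describe_write_mode(overwrite: bool = False, blob_type: str = "BlockBlob") -> str:
    """Return a label for the write mode implied by the two feed options.

    Hypothetical helper that mirrors the list above; the labels are
    illustrative names, not values used by the storage backend.
    """
    if blob_type not in ("BlockBlob", "AppendBlob"):
        raise ValueError(f"unsupported blob_type: {blob_type!r}")
    if overwrite:
        # The blob is replaced; blob_type must match the existing blob.
        return "overwrite"
    if blob_type == "AppendBlob":
        # Appends to an existing AppendBlob, or creates a new one.
        return "append-or-create"
    # Default combination: create the blob, fail if it already exists.
    return "create-or-fail"


# Both options are feed options, so they are set per feed in FEEDS.
FEEDS = {
    "azure://<account_name>.blob.core.windows.net/<container_name>/<file_name.extension>": {
        "format": "jsonlines",
        "overwrite": False,
        "blob_type": "AppendBlob",
    }
}
```

The sketch uses the jsonlines format on the assumption that a line-oriented format is the safer choice when appending across runs, since appending a second JSON array to an existing json feed would likely leave the file unparsable as a single document.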

Media pipeline usage

The Azure pipeline lets Scrapy media pipelines store their files in Azure Blob Storage.

To enable it, add the pipeline to the ITEM_PIPELINES setting:

ITEM_PIPELINES = {
    "scrapy_azure_exporter.AzureFilesPipeline": 1,
}

Azurite usage

You can use Azurite as a storage emulator for Azure Blob Storage to test your application locally. To do so, register the storage backend under the azurite scheme in FEED_STORAGES, either alongside or instead of the azure entry:

# settings.py
FEED_STORAGES = {'azurite': 'scrapy_azure_exporter.AzureFeedStorage'}

Then add the Azurite URI to the FEEDS setting:

FEEDS = {
    "azurite://<ip>:<port>/<account_name>/<container_name>/[<file_name.extension>]": {
        # ...
    }
}

Finally, run your Scrapy project as you usually would with FilesPipeline or ImagesPipeline.
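As an illustration of the Azurite URI pattern, the sketch below fills in the placeholders for a local emulator. It assumes an Azurite instance running with its default settings (Blob service on 127.0.0.1:10000, account name devstoreaccount1). The helper azurite_feed_uri and the container and file names are hypothetical, and authentication settings are left out.

```python
# settings.py (sketch for local testing)

def azurite_feed_uri(
    container_name: str,
    file_name: str,
    host: str = "127.0.0.1",                 # assumed: Azurite default host
    port: int = 10000,                       # assumed: Azurite default Blob port
    account_name: str = "devstoreaccount1",  # assumed: Azurite default account
) -> str:
    """Build a URI following the pattern
    azurite://<ip>:<port>/<account_name>/<container_name>/<file_name.extension>.

    Hypothetical helper for illustration; it is not provided by the package.
    """
    return f"azurite://{host}:{port}/{account_name}/{container_name}/{file_name}"


FEED_STORAGES = {"azurite": "scrapy_azure_exporter.AzureFeedStorage"}

# Placeholder container and file name.
FEEDS = {
    azurite_feed_uri("test-container", "items.json"): {
        "format": "json",
    }
}
```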