YouTube Bot Detection

Overview

If Astro is meant to eventually collect and arbitrate YouTube user data for bot identification, it would be prudent to first identify what information is available via the YouTube Data API.

The data

A full list of YouTube channel attributes accessible via the Data API can be found in the API reference for the `channels` resource.

The fields I'd like to highlight here are as follows:

API Field	Attribute	Description
`etag`	N/A	A hash identifying the current state of the resource (useful for caching)
`id`	Channel ID	The unique identifier of the channel
`snippet.title`	Channel title	A title string that can be modified by the channel owner
`snippet.description`	Channel description	A description set by the channel owner
`snippet.publishedAt`	Channel creation date	The date and time the channel was created
`snippet.customUrl`	Custom URL	A custom URL based on the channel's handle (`youtube.com/@<handle>`)
`snippet.thumbnails`	Channel thumbnails	A set of thumbnail links in various resolutions
`snippet.country`	Channel country	The country with which the channel is associated
`snippet.localized`	Channel localization	Variant channel titles/descriptions for other languages/countries
`contentDetails`	Channel content details	Encapsulates information about the channel's content
`statistics`	Channel statistics	Includes views, number of published videos, and subscriber count
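As a rough sketch of how these fields could be pulled, the snippet below builds a request URL for the Data API's `channels` endpoint and flattens one returned channel item into a plain dictionary. The helper names (`build_channel_request`, `flatten_channel`) and the output keys are my own placeholders, not part of the API, and the API key is something you would supply yourself. No network call is made here; the URL can be passed to whatever HTTP client you prefer.

```python
from urllib.parse import urlencode

# Public endpoint for the YouTube Data API v3 `channels` resource.
API_URL = "https://www.googleapis.com/youtube/v3/channels"


def build_channel_request(channel_id: str, api_key: str) -> str:
    """Build the GET URL for a single channel (hypothetical helper)."""
    params = {
        # Each `part` maps to a top-level block of the channel resource.
        "part": "snippet,contentDetails,statistics",
        "id": channel_id,
        "key": api_key,
    }
    return f"{API_URL}?{urlencode(params)}"


def _to_int(value):
    """The API returns counts as strings; convert them, keeping None as None."""
    return int(value) if value is not None else None


def flatten_channel(item: dict) -> dict:
    """Pick out the highlighted fields from one entry of the response's `items` list."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "id": item.get("id"),
        "etag": item.get("etag"),
        "title": snippet.get("title"),
        "description": snippet.get("description"),
        "published_at": snippet.get("publishedAt"),
        "custom_url": snippet.get("customUrl"),
        "country": snippet.get("country"),
        "view_count": _to_int(stats.get("viewCount")),
        "subscriber_count": _to_int(stats.get("subscriberCount")),
        "video_count": _to_int(stats.get("videoCount")),
    }
```

Fields such as `country` or `customUrl` may be missing for a given channel, so the flattening step uses `.get()` throughout and leaves absent values as `None` rather than failing.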

I'm curious about how these channel attributes might differ between bot and non-bot YouTube channels.

The identification strategy

This strategy is still very early in development.

I've recently become interested in the concept of "data embedding". It seems to be strongly associated with machine learning techniques as a means to simplify complex datasets into something more amenable to the performance demands of modern ML infrastructure (i.e. numerical data). The idea of encoding text into numerical representations isn't novel, but representing this data as a vector in a multidimensional space seems promising for the identification of bot accounts.

My theory is that if I can gather a sufficient amount of data from each YouTube channel (big 'if' there), I may be able to represent the data from each channel as a vector, and in so doing provide a means to associate common 'types' of channels. If the collected data is sufficient, I may for example see that suspected bot channels are represented as vectors of similar magnitude which point in similar directions.
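To make the idea a little more concrete, here is a minimal sketch of turning a channel into a vector and comparing two channels by direction. It uses a handful of hand-picked numeric features rather than a learned embedding, so it is a stand-in for the real thing, not the method itself. The feature choice, the input dictionary keys, and the function names are all assumptions for illustration.

```python
import math


def channel_to_vector(channel: dict) -> list[float]:
    """Map a channel dict to a small numeric vector (hypothetical feature set)."""
    return [
        # log1p compresses the huge range of counts so one feature doesn't dominate.
        math.log1p(channel.get("subscriber_count") or 0),
        math.log1p(channel.get("video_count") or 0),
        math.log1p(channel.get("view_count") or 0),
        math.log1p(len(channel.get("description") or "")),
        # Binary flag: did the owner set a country at all?
        1.0 if channel.get("country") else 0.0,
    ]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)
```

Cosine similarity only captures whether two vectors point the same way. If magnitude also matters, as suggested above, the vector norms (or a Euclidean distance) would need to be compared alongside it.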

However, I do not know how demanding the computation of this data may be. My worry is that meaningful work of this sort comes with some big infrastructure requirements, which is a demand I'm not personally prepared to meet with my current setup. Regardless, I think writing a POC of this method would be a good starting point for determining its viability as a bot identification strategy.
