
YouTube Bot Detection

Austin Cullar edited this page Oct 5, 2024 · 3 revisions

[ IN PROGRESS ]

Overview

If Astro is meant to eventually collect and arbitrate YouTube user data for bot identification, it would be prudent to first identify what information is available via the YouTube Data API.

The data

A full list of the YouTube channel attributes accessible via the Data API can be found in the API reference for the channel resource.

The fields I'd like to highlight here are as follows:

| API Field | Attribute | Description |
| --- | --- | --- |
| etag | N/A | A hash identifying the state of the resource (useful for caching) |
| id | Channel ID | The unique ID of the channel |
| snippet.title | Channel title | A title string that can be modified by the channel owner |
| snippet.description | Channel description | A description set by the channel owner |
| snippet.publishedAt | Channel creation date | The date and time the channel was created |
| snippet.customUrl | Custom URL | A custom URL based on the channel's handle (youtube.com/@<handle>) |
| snippet.thumbnails | Channel thumbnails | A set of thumbnail links in various resolutions |
| snippet.country | Channel country | The country with which the channel is associated |
| snippet.localized | Channel localization | The channel title and description localized for a given language |
| contentDetails | Channel content details | Encapsulates information about the channel's content, such as its related playlists |
| statistics | Channel statistics | Includes view count, number of published videos, and subscriber count |

I'm curious about how these channel attributes might differ between bot and non-bot YouTube channels.
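As a starting point for that comparison, here is a minimal sketch of how these fields might be requested and flattened into one record per channel. It uses only the Python standard library and the public `channels` endpoint with the `snippet`, `contentDetails`, and `statistics` parts. The function names (`build_channels_url`, `flatten_channel`, `fetch_channels`) and the keys of the flattened record are my own placeholders, not part of Astro or the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/youtube/v3/channels"


def build_channels_url(channel_ids, api_key):
    """Build a channels.list request URL for the given channel IDs."""
    params = {
        "part": "snippet,contentDetails,statistics",
        "id": ",".join(channel_ids),
        "key": api_key,
    }
    return f"{API_URL}?{urlencode(params)}"


def flatten_channel(item):
    """Reduce one channel resource to a flat dict (key names are hypothetical)."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "id": item.get("id"),
        "title": snippet.get("title", ""),
        "description": snippet.get("description", ""),
        "published_at": snippet.get("publishedAt"),
        "custom_url": snippet.get("customUrl"),
        "country": snippet.get("country"),
        # The API serializes counts as strings; a missing count is treated as 0.
        "view_count": int(stats.get("viewCount", 0)),
        "video_count": int(stats.get("videoCount", 0)),
        "subscriber_count": int(stats.get("subscriberCount", 0)),
    }


def fetch_channels(channel_ids, api_key):
    """Call the API (requires a valid key) and return flattened channel records."""
    with urlopen(build_channels_url(channel_ids, api_key)) as response:
        payload = json.load(response)
    return [flatten_channel(item) for item in payload.get("items", [])]
```

Keeping the URL construction and the flattening separate from the network call means the parsing can be exercised against saved responses without spending API quota.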

The identification strategy

This strategy is still very early in development.

I've recently become interested in the concept of "data embedding". It seems to be strongly associated with machine learning, as a means of reducing complex data (text, for example) to the numerical form that machine learning models operate on. The idea of encoding text into numerical representations doesn't seem novel, but representing each record as a vector in a multidimensional space seems promising for the identification of bot accounts.

My theory is that if I can gather a sufficient amount of data from each YouTube channel (big 'if' there), I may be able to represent each channel's data as a vector, and in so doing provide a means to group channels of a common 'type'. If the collected data is sufficient, I may for example see that suspected bot channels are represented as vectors of similar magnitude which point in similar directions, while ordinary channels land elsewhere in the space.
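To make the idea concrete, here is a rough sketch of what comparing channel vectors could look like. It is not a learned embedding: it hand-picks a few numeric features, which is only a stand-in for whatever representation a real model would produce. The input is assumed to be a flat dict per channel with hypothetical keys (`view_count`, `video_count`, `subscriber_count`, `published_at`, `description`, `custom_url`), and the feature choice itself is an untested guess.

```python
import math
from datetime import datetime, timezone


def channel_to_vector(channel, now):
    """Map a flat channel dict to a small feature vector (features are a guess)."""
    created = datetime.fromisoformat(channel["published_at"].replace("Z", "+00:00"))
    age_days = max((now - created).days, 0)
    return [
        math.log1p(channel["view_count"]),       # log scale tames huge counts
        math.log1p(channel["video_count"]),
        math.log1p(channel["subscriber_count"]),
        math.log1p(age_days),
        math.log1p(len(channel["description"])),
        1.0 if channel["custom_url"] else 0.0,
    ]


def magnitude(vector):
    """Euclidean length of the vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(a, b):
    """Compare direction only: 1.0 means the vectors point the same way."""
    norm_product = magnitude(a) * magnitude(b)
    if norm_product == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm_product
```

Cosine similarity captures the "similar directions" half of the theory, and `magnitude` the other half. Whether suspected bots actually end up close together under either measure is exactly what a POC would need to show.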

However, I do not know how computationally demanding this approach may be. My worry is that meaningful work of this sort comes with significant infrastructure requirements, which I'm not prepared to meet with my current setup. Regardless, I think writing a proof of concept (POC) of this method would be a good starting point for determining its viability as a bot identification strategy.
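One part of the cost can be estimated before writing any POC: storing the vectors and comparing every pair of them. The sketch below is simple arithmetic under assumed inputs. The function name `estimate_cost`, the 4-byte value size, and any example channel counts or dimensions are placeholders, and the estimate deliberately ignores the cost of producing the vectors in the first place, which is likely the more demanding step if a text-embedding model is involved.

```python
def estimate_cost(num_channels, dims, bytes_per_value=4):
    """Back-of-envelope cost of storing vectors and comparing all pairs."""
    storage_bytes = num_channels * dims * bytes_per_value
    pair_count = num_channels * (num_channels - 1) // 2
    return {
        "storage_bytes": storage_bytes,
        "pair_count": pair_count,
        # One multiply-add per dimension for each pairwise dot product.
        "multiply_adds": pair_count * dims,
    }
```

For a hypothetical 10,000 channels with 384-dimensional vectors, this works out to about 15 MB of storage and roughly 50 million pairs, which suggests the comparison step alone should fit on a single ordinary machine at that scale. The pair count grows quadratically, so the picture changes quickly as the number of channels increases.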
