YouTube Bot Detection
If Astro is meant to eventually collect and arbitrate YouTube user data for bot identification, it would be prudent to first identify what information is available via the YouTube Data API.
A full list of YouTube channel attributes accessible via the Data API can be found in the API's reference documentation for the `channels` resource.
The fields I'd like to highlight here are as follows:
API Field | Attribute | Description |
---|---|---|
etag | N/A | A hash identifying the current state of the returned data (useful for caching) |
id | Channel ID | The unique ID of the channel (an opaque string assigned by YouTube) |
snippet.title | Channel title | A title string that can be modified by the channel owner |
snippet.description | Channel description | A description set by the channel owner |
snippet.publishedAt | Channel creation date | The date and time the channel was created |
snippet.customUrl | Custom URL | A custom URL based on the channel's handle (youtube.com/@<handle>) |
snippet.thumbnails | Channel thumbnails | A set of thumbnail links in various resolutions |
snippet.country | Channel country | The country with which the channel is associated |
snippet.localized | Channel localization | Variant channel titles/descriptions for other languages/countries |
contentDetails | Channel content details | Encapsulates information about the channel's content (a top-level part, not nested under snippet) |
statistics | Channel statistics | Includes view count, number of published videos, and subscriber count (a top-level part, not nested under snippet) |
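
As a rough illustration of how these fields could be pulled out of a response, here is a minimal sketch. It assumes the `channels.list` endpoint with the `snippet`, `contentDetails`, and `statistics` parts (in the API response, `contentDetails` and `statistics` sit alongside `snippet` rather than inside it). The helper names `build_channels_url` and `extract_fields` are hypothetical and not part of Astro. No network request is made here; the URL would still need to be fetched with a valid API key.

```python
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/youtube/v3/channels"


def build_channels_url(channel_id: str, api_key: str) -> str:
    """Build a channels.list request URL for a single channel ID."""
    params = {
        "part": "snippet,contentDetails,statistics",
        "id": channel_id,
        "key": api_key,
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_fields(item: dict) -> dict:
    """Flatten the fields of interest out of one channel resource."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    return {
        "id": item.get("id"),
        "title": snippet.get("title"),
        "description": snippet.get("description"),
        "published_at": snippet.get("publishedAt"),
        "custom_url": snippet.get("customUrl"),
        "country": snippet.get("country"),
        # The API serializes counts as strings; a missing count
        # (e.g. a hidden subscriber count) falls back to 0 here.
        "view_count": int(stats.get("viewCount", 0)),
        "subscriber_count": int(stats.get("subscriberCount", 0)),
        "video_count": int(stats.get("videoCount", 0)),
    }
```

Flattening into a plain dict keeps later steps independent of the API's nesting, which may be convenient if other data sources are added.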
I'm curious about how these channel attributes might differ between bot and non-bot YouTube channels.
This strategy is still very early in development.
I've recently become interested in the concept of "data embedding". It seems to be strongly associated with machine learning techniques as a means to simplify complex datasets into something more amenable to the performance demands of modern ML infrastructure (i.e. numerical data). The idea of encoding text into numerical representations isn't novel, but representing this data as a vector in a multidimensional space seems promising for the identification of bot accounts.
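
To make the idea of a numerical representation concrete, here is a minimal sketch that turns one channel record into a fixed-length vector. This is hand-picked feature engineering, not a learned embedding, and every choice in it is an assumption made for illustration: the function name `channel_to_vector`, the input keys (`view_count`, `published_at`, and so on), the six features, and the log scaling.

```python
import math
from datetime import datetime, timezone
from typing import List


def channel_to_vector(channel: dict, now: datetime) -> List[float]:
    """Encode a channel record as a list of floats (hypothetical feature set)."""
    # publishedAt-style timestamps end in "Z"; swap it for an explicit UTC offset
    # so datetime.fromisoformat accepts it on older Python versions.
    published = datetime.fromisoformat(channel["published_at"].replace("Z", "+00:00"))
    age_days = (now - published).total_seconds() / 86400
    description = channel.get("description") or ""
    return [
        # log1p compresses counts that span many orders of magnitude
        math.log1p(channel["view_count"]),
        math.log1p(channel["subscriber_count"]),
        math.log1p(channel["video_count"]),
        math.log1p(max(age_days, 0.0)),
        float(len(description)),
        1.0 if channel.get("country") else 0.0,  # whether a country is set
    ]


example = channel_to_vector(
    {
        "published_at": "2024-01-01T00:00:00Z",
        "view_count": 120,
        "subscriber_count": 3,
        "video_count": 1,
        "description": "",
        "country": None,
    },
    now=datetime(2024, 3, 1, tzinfo=timezone.utc),
)
```

Passing `now` explicitly keeps the output reproducible. Free-text fields such as the title and description would need a real text-embedding step to contribute more than their length.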
My theory is that if I can gather a sufficient amount of data from each YouTube channel (big 'if' there), I may be able to represent the data from each channel as a vector, and in so doing provide a means to associate common 'types' of channels. If the collected data is sufficient, I may for example see that suspected bot channels are represented as vectors of similar magnitude which point in similar directions.
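
One way to check whether two channel vectors have "similar magnitude" and "point in similar directions" is to compare their Euclidean norms and their cosine similarity. The sketch below uses only the standard library. The function names and the sample vectors are made up for illustration and are not real channel data.

```python
import math
from typing import Sequence


def magnitude(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 1.0 means the same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a, norm_b = magnitude(a), magnitude(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # direction is undefined for a zero vector
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors: two suspected bot channels and one ordinary channel.
bot_a = [0.1, 0.0, 0.2, 3.0]
bot_b = [0.2, 0.0, 0.1, 2.9]
regular = [9.5, 7.0, 4.0, 8.0]

bot_pair_similarity = cosine_similarity(bot_a, bot_b)
bot_vs_regular_similarity = cosine_similarity(bot_a, regular)
magnitude_gap = abs(magnitude(bot_a) - magnitude(bot_b))
```

If the theory holds, pairs of bot channels would tend to score closer to 1.0 than bot/non-bot pairs. Whether that happens in practice depends entirely on which features go into the vectors.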
However, I do not know how demanding the computation of this data may be. My worry is that meaningful work of this sort comes with some big infrastructure requirements, which is a demand I'm not personally prepared to meet with my current setup. Regardless, I think writing a POC of this method would be a good starting point for determining its viability as a bot identification strategy.