Nov 13, 2022

YouTube RSS processing

As I mentioned in the RSS feeds post, it is entirely possible to get an RSS feed from a YouTube channel. The problem is that a channel sometimes posts videos on several different topics & you only want to hear about one of them (or a few; the processing principles are the same).

For the sake of a target let's pick the Numberphile channel. Go to the main channel page & press F12 to open the browser's developer tools, then search the page source for channelid & you'll find that it's "UCoxcjq-8xIDTYp3uz647V5A". We'll need that in a minute.

Open Huginn & create a new agent (type is "RSS agent"). Call it what you like (I used "Numberphile") & edit the json to:

{
  "expected_update_period_in_days": "1",
  "clean": "false",
  "url": "https://www.youtube.com/feeds/videos.xml?channel_id=UCoxcjq-8xIDTYp3uz647V5A"
}
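
The feed URL in that config follows a simple pattern, so the same trick works for any channel once you have its ID. As a minimal sketch (in Python rather than anything Huginn-specific; the helper name channel_feed_url is made up for illustration), you could build it like this:

```python
from urllib.parse import urlencode

# Base address of YouTube's per-channel feed, as used in the agent config above.
FEED_BASE = "https://www.youtube.com/feeds/videos.xml"

def channel_feed_url(channel_id: str) -> str:
    """Return the feed URL for the given YouTube channel ID."""
    return f"{FEED_BASE}?{urlencode({'channel_id': channel_id})}"

# Numberphile's channel ID, found via F12 as described above.
print(channel_feed_url("UCoxcjq-8xIDTYp3uz647V5A"))
```

Paste the printed URL into the "url" option of the agent.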

The problem is that if you do a dry run with the json above, you get an output like this...

[
  {
    "id": "yt:video:p-HN_ICaCyM",
    "url": "https://www.youtube.com/watch?v=p-HN_ICaCyM",
    "urls": [
      "https://www.youtube.com/watch?v=p-HN_ICaCyM"
    ],
    "links": [
      {
        "href": "https://www.youtube.com/watch?v=p-HN_ICaCyM",
        "rel": "alternate"
      }
    ],
    "title": "The Troublemaker Number - Numberphile",
    "description": null,
    "content": "Dr Harini Desiraju discusses Somos Sequences and a number which breaks a streak.\nMore links & stuff in full description below ↓↓↓\n\nDr Harini Desiraju is a postdoctoral fellow at The University of Sydney. This video was recorded at MSRI.\n\nLike sequences - see these videos with Neil Sloane: http://bit.ly/Sloane_Numberphile\n\nNumberphile is supported by the Mathematical Sciences Research Institute (MSRI): http://bit.ly/MSRINumberphile\n\nWe are also supported by Science Sandbox, a Simons Foundation initiative dedicated to engaging everyone with the process of science. https://www.simonsfoundation.org/outreach/science-sandbox/\n\nAnd support from The Akamai Foundation - dedicated to encouraging the next generation of technology innovators and equitable access to STEM education - https://www.akamai.com/company/corporate-responsibility/akamai-foundation\n\nNUMBERPHILE\nWebsite: http://www.numberphile.com/\nNumberphile on Facebook: http://www.facebook.com/numberphile\nNumberphile tweets: https://twitter.com/numberphile\nSubscribe: http://bit.ly/Numberphile_Sub\n\nVideos by Brady Haran\n\nPatreon: http://www.patreon.com/numberphile\n\nNumberphile T-Shirts and Merch: https://teespring.com/stores/numberphile\n\nBrady's videos subreddit: http://www.reddit.com/r/BradyHaran/\n\nBrady's latest videos across all channels: http://www.bradyharanblog.com/\n\nSign up for (occasional) emails: http://eepurl.com/YdjL9",
    "image": null,
    "enclosure": null,
    "authors": [
      "Numberphile (https://www.youtube.com/channel/UCoxcjq-8xIDTYp3uz647V5A)"
    ],
    "categories": [

    ],
    "date_published": "2022-05-23T14:13:57+00:00",
    "last_updated": "2022-05-29T03:55:28+00:00"
  },

All that I want in my RSS reader is the video URL, the title & the first line of the description, which is actually in the "content" field & only runs as far as the first "\n" in the string.
So, we need to process things a bit...

Huginn uses a "trigger" agent to filter events generated by an agent that feeds events to it. So, create one of those & set the "source" to the agent that we just created. For the sake of an example filter, let's say that we are only interested in videos about prime numbers, so we'll match on those & pass only the matching events on to the next step, using this json for the trigger agent.

{
  "expected_receive_period_in_days": "1",
  "keep_event": "true",
  "rules": [
    {
      "type": "regex",
      "value": ".*Prime.*",
      "path": "title"
    }
  ],
  "message": "Looks like your pattern matched in '{{value}}'!"
}

In the output snippet above you can see that the title is "The Troublemaker Number - Numberphile", so that wouldn't match. (Incidentally, the regex reads as "match 'Prime', with any number of characters (.*), including none at all, before or after it".)
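
If you want to try a pattern before putting it in the trigger agent, a few lines of Python will do as a rough stand-in. This is only a sketch: Python's re module is not what Huginn uses internally, so any flags or case handling that Huginn applies are not modelled here, and the second title is made up for illustration.

```python
import re

# Same pattern as the "value" in the trigger agent rule above.
PATTERN = re.compile(r".*Prime.*")

def title_matches(title: str) -> bool:
    """Return True if the pattern is found anywhere in the title."""
    return PATTERN.search(title) is not None

titles = [
    "The Troublemaker Number - Numberphile",  # from the dry run output
    "Some Prime Number Video - Numberphile",  # made-up title for illustration
]
for title in titles:
    print(title_matches(title), "-", title)
```

In Python at least, search already looks anywhere in the string, so a plain "Prime" would give the same results as ".*Prime.*" here.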

Now we need a data output agent to produce the RSS feed for us. Create one & tell it to receive data from the trigger agent. Edit the json to something like this:

{
  "secrets": [
    "Prime Numbers Only"
  ],
  "expected_receive_period_in_days": 2,
  "template": {
    "title": "Youtube channel Feed",
    "description": "Simple notification of new youtube content",
    "item": {
      "title": "{{title}}",
      "description": "{{content | split: '\\n' | first }}",
      "link": "{{url}}",
      "Last Updated": "{{last_updated}}"
    }
  },
  "ns_media": "true"
}

So, our RSS output copies the title of the video & the link that plays it. For the description, it reads the content part of the original feed, breaks it into chunks at the newline character ('\n') & returns only the first chunk. Note that the split filter needs ' characters around the string that we split on & that the "\" character needs to be escaped with another "\" so that it survives the json & gets seen.
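
To see what the double backslash is doing, here is a small Python sketch of the two steps. The first part only shows what a json parser makes of the "description" line; the second part mimics the split-then-first result on the start of the content string from the dry run. The function name first_line is made up, and nothing here runs Huginn's actual template engine.

```python
import json

# The "description" option exactly as typed into the agent's json editor.
# (Raw string, so Python leaves the backslashes alone.)
options_json = r'''{"description": "{{content | split: '\\n' | first }}"}'''

# After json decoding, the doubled backslash has become a single one.
template = json.loads(options_json)["description"]
print(template)  # {{content | split: '\n' | first }}

def first_line(content: str) -> str:
    # Mimics the split-then-first step: keep everything before the first newline.
    return content.split("\n")[0]

content = (
    "Dr Harini Desiraju discusses Somos Sequences and a number which breaks a streak."
    "\nMore links & stuff in full description below"
)
print(first_line(content))
```

With a single backslash in the json, the parser would turn it into a real newline character inside the template before the split filter ever saw it, which is why the extra "\" is there.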

You can, of course, mess with the filtering & content of your RSS output to suit your needs. Mine are pretty simple. :)

Andy