May 25, 2025

Huginn - unequal matches (& adding images to the RSS output)

Scraping RSS feeds for useful summaries is a great use of Huginn, but there is a bit of a problem when items in the feed don't all carry the same number of images: one item may have a single image, another may have several. The scraper expects a single set of data items for each event that's extracted, & gets grumpy if things don't match.

I walked through iterating a redlib RSS feed to get the list of comments in this post.
Since then I've been able to add a few new tricks to the armoury for extracting data from RSS items...

First: Reading Nested Data

The trick is that the XPath used in Huginn's data extraction can be used in the values extracted & not just in the path to the values! So, '../@text' gives you the @text attribute from the parent of the outline element that carries the @xmlUrl.

{
  "expected_update_period_in_days": "365",
  "url": "https://path.to.my.opml",
  "type": "xml",
  "mode": "on_change",
  "single_array": "true",
  "extract": {
    "opml_group": {
      "xpath": "//outline/outline",
      "value": "concat(../@text,'~', @xmlUrl)"
    }
  }
}

The '~' in the concat() can be changed to any separator you fancy. Without a separator, extraction like this would give you a single string with no way to split the 2 parts in your Liquid templates in the later agents.
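
To make the shape of the extracted values concrete, here is a rough Python sketch of what the xpath and concat() pair produces. It is not how Huginn does it internally: the sample OPML, the group names, the example.com URLs & the extract_groups helper are all made up for illustration, & the parent lookup is done by hand because Python's built-in ElementTree has no concat().

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML sample; group names and feed URLs are invented.
OPML = """<opml version="2.0">
  <body>
    <outline text="News">
      <outline text="Feed A" xmlUrl="https://example.com/a.xml"/>
      <outline text="Feed B" xmlUrl="https://example.com/b.xml"/>
    </outline>
    <outline text="Tech">
      <outline text="Feed C" xmlUrl="https://example.com/c.xml"/>
    </outline>
  </body>
</opml>"""


def extract_groups(opml_text, sep="~"):
    """Mimic xpath //outline/outline with value concat(../@text, sep, @xmlUrl)."""
    root = ET.fromstring(opml_text)
    results = []
    # Visit every outline, then its direct outline children, so each child
    # is an outline whose parent is also an outline.
    for parent in root.iter("outline"):
        for child in parent.findall("outline"):
            group = parent.get("text", "")   # stands in for ../@text
            url = child.get("xmlUrl", "")    # stands in for @xmlUrl
            results.append(f"{group}{sep}{url}")
    return results


entries = extract_groups(OPML)

# A later step can recover the two parts by splitting on the separator.
# maxsplit=1 keeps any "~" inside the URL intact.
pairs = [tuple(entry.split("~", 1)) for entry in entries]
```

The separator only has to be a character that never appears in the group name, which is why "~" is a reasonable pick.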

Second: Creating Arrays from a String for Use in Huginn Output

Using the technique above will get you a set of entries separated by "~". If those are URLs & you want to pass them on as an array then you've got to convert this delimited string into an array. Huginn's Liquid filters allow this: split breaks the string apart & as_object passes the result on as an array rather than as text.

Below is a section of an event formatting agent that adds a title to posts (typically scraped from Mastodon) which have no title, by using the first few words from the main text. It tidies up the content by removing HTML tags &, importantly for this discussion, it splits the "media" field (a ~ delimited list of image URLs) & returns an array that can be passed on for sharing later.

{
  "instructions": {
    "title": "{{title}}{% if title == nil %}{{content | strip_html | truncatewords: 8, '' }}{%endif%}{% if title == '' %}{{content | strip_html | truncatewords: 8, '' }}{%endif%}",
    "content": "{{content | strip_html }}",
    "image": "{{ media | split: '~' | as_object}}"
  },
  "mode": "merge"
}
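
For anyone who finds the Liquid hard to read, the same logic can be sketched in Python. This is only an approximation under a few assumptions: strip_html & truncate_words here are simple stand-ins written for this example (not the real Liquid filters), format_event is a made-up name, & the sample event is invented.

```python
import re


def strip_html(text):
    # Rough stand-in for Liquid's strip_html: drop anything that looks like a tag.
    return re.sub(r"<[^>]*>", "", text)


def truncate_words(text, count, ellipsis=""):
    # Rough stand-in for Liquid's truncatewords: keep the first `count` words.
    words = text.split()
    if len(words) <= count:
        return " ".join(words)
    return " ".join(words[:count]) + ellipsis


def format_event(event):
    content = strip_html(event.get("content") or "")
    title = event.get("title")
    # Fall back to the first 8 words of the content when the title is nil or empty.
    if title is None or title == "":
        title = truncate_words(content, 8)
    media = event.get("media") or ""
    # Split the "~" delimited string; guard so an empty string gives an empty list.
    images = media.split("~") if media else []
    # Like "mode": "merge", keep the incoming fields and overlay the new ones.
    return {**event, "title": title, "content": content, "image": images}


# Hypothetical incoming event with no title and two images.
sample = {
    "title": "",
    "content": "<p>One two three four five six seven eight nine ten</p>",
    "media": "https://example.com/1.png~https://example.com/2.png",
}
formatted = format_event(sample)
```

The point to notice is that "image" ends up as a list, not a string, which is what the as_object filter is there to preserve in the Huginn version.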

When you get to the data output agent, don't forget that the image field needs to be presented as an array & not just text, i.e.

      "image": "{{image | as_object}}"

Andy