Huginn - unequal matches (& adding images to the RSS output)
Scraping RSS feeds for useful summaries is a great use of Huginn, but there is a bit of a problem when only some items in the feed have an image, or some have more than one. The scraper expects a single set of data items for each event it extracts, & gets grumpy if the counts don't match.
I walked through iterating a redlib RSS feed to get the list of comments in an earlier post.
Since then I've been able to add a few new tricks to the armoury for extracting data from RSS items...
First: Reading Nested Data
The trick is that XPath in a Huginn data extraction can be used in the value being extracted, & not just in the path that selects the nodes! So '../@text' gives you the @text attribute from the parent of the outline element that carries the @xmlUrl.
{
  "expected_update_period_in_days": "365",
  "url": "https://path.to.my.opml",
  "type": "xml",
  "mode": "on_change",
  "single_array": "true",
  "extract": {
    "opml_group": {
      "xpath": "//outline/outline",
      "value": "concat(../@text,'~', @xmlUrl)"
    }
  }
}
The '~' in the concat() can be changed to any separator you fancy. Without one, the extraction gives you a single string with no way to split the 2 parts apart in the Liquid templates of later agents.
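To see what strings this extraction produces, here is a minimal Python sketch that emulates it outside Huginn. The sample OPML, the function names & the example.com URLs are all made up for illustration, & because the standard-library ElementTree has no XPath concat(), the join is done in plain Python rather than in XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical OPML: feeds (with xmlUrl) nested inside named groups.
SAMPLE_OPML = """<opml version="2.0">
  <body>
    <outline text="News">
      <outline text="Example A" xmlUrl="https://example.com/a.xml"/>
      <outline text="Example B" xmlUrl="https://example.com/b.xml"/>
    </outline>
    <outline text="Tech">
      <outline text="Example C" xmlUrl="https://example.org/c.xml"/>
    </outline>
  </body>
</opml>"""


def extract_opml_groups(opml_text, sep="~"):
    """Emulate xpath //outline/outline with value concat(../@text, sep, @xmlUrl)."""
    root = ET.fromstring(opml_text)
    results = []
    # Visit every <outline>, then its direct <outline> children, so each
    # child is paired with its parent (the '..' in the XPath value).
    for parent in root.iter("outline"):
        for child in parent.findall("outline"):
            # A missing attribute becomes an empty string, as in XPath.
            results.append(parent.get("text", "") + sep + child.get("xmlUrl", ""))
    return results


def split_group(value, sep="~"):
    """Split one extracted string back into (group name, feed URL)."""
    group, _, url = value.partition(sep)
    return group, url


groups = extract_opml_groups(SAMPLE_OPML)
for item in groups:
    print(split_group(item))
```

Splitting on the first separator only means a '~' inside the URL survives, so the thing to avoid is a separator that can turn up in the group name.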
Second: Creating Arrays from a String for Use in Huginn Output
Using the technique above will get a set of entries separated by "~". If those are URLs & you want to pass them on as an array, then you've got to convert this delimited string into an array. Huginn has a nice option that allows this: Liquid's split filter followed by the as_object filter.
Below is a section of an Event Formatting Agent that adds a title to posts (typically scraped from Mastodon) which have no title, by using the first few words from the main text. It tidies up the content by removing HTML tags &, importantly for this discussion, it splits the "media" field (a ~ delimited list of image URLs) & returns an array that can be passed on for sharing later.
{
  "instructions": {
    "title": "{{title}}{% if title == nil or title == '' %}{{content | strip_html | truncatewords: 8, '' }}{% endif %}",
    "content": "{{content | strip_html }}",
    "image": "{{ media | split: '~' | as_object }}"
  },
  "mode": "merge"
}
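Roughly what that formatting step does to an event, sketched in Python so it can be tried outside Huginn. This is an illustration under assumptions, not Huginn's own code: the function names are made up, the sample event is invented, & strip_html here is a crude regex stand-in for the Liquid filter of the same name.

```python
import re


def strip_html(text):
    """Crude stand-in for Liquid's strip_html: drop anything that looks like a tag."""
    return re.sub(r"<[^>]*>", "", text)


def truncate_words(text, count=8):
    """Keep the first `count` whitespace-separated words, with no ellipsis."""
    return " ".join(text.split()[:count])


def format_event(event, sep="~"):
    """Emulate the formatter: fill in a missing title, clean content, split media."""
    content = strip_html(event.get("content") or "")
    # Fall back to the first few words when the title is missing or empty.
    title = event.get("title") or truncate_words(content)
    # Turn the delimited string into a real list, skipping empty pieces.
    media = event.get("media") or ""
    images = [url for url in media.split(sep) if url]
    # "merge" mode: keep the original fields & overlay the new ones.
    merged = dict(event)
    merged.update(title=title, content=content, image=images)
    return merged


sample = {
    "title": "",
    "content": "<p>One two three four five six seven eight nine ten</p>",
    "media": "https://example.com/1.png~https://example.com/2.png",
}
print(format_event(sample))
```

An event with no media at all comes out with an empty list rather than a list holding one empty string, which is usually what a later agent wants to loop over.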
When you get to the Data Output Agent, don't forget that the image field needs to be presented as an array & not just text, i.e.
"image": "{{image | as_object}}"