Universal Aggregator

The Universal Aggregator is a collection of tools designed to take a feed of data items and store them in a Maildir folder. This can be used to create a human-legible archive of messages, Twitter posts, RSS feeds, and other scraped data. It is a suite of Golang operator programs, plus some scrapers written in JavaScript which I do not use.

Container build using Ansible Bender and Podman

This uses Dynamic Ansible Bender playbooks.

cat <<EOF | python ~/org/cce/make-container-playbook.py
template: ~/org/cce/containers/simple.yml
friendly: Universal Aggregator
service_name: ua
cmd: ggs /data/ggsrc
build_roles:
- universal-aggregator-build
build_reqs:
- git
- make
- golang
- glib2-devel
- libxml2-devel
- python-devel
- python-pip
task_tags:
- comms
- universal-aggregators
EOF

ANSIBLE_ROLES_PATH=~/org/cce/roles ansible-bender build /home/rrix/org/cce/containers/ua/build.yml

The Dockerfile I originally wrote for this uses gosu to set the UID and GID to my user; instead, systemd/podman will just run this container as my user. The container can be started with /usr/bin/ggs ggsrc-path, where ggsrc-path is probably mounted in to the container for my own sake.
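Something like this invocation should do it; note that the image name, the /data mount point, and the Maildir path are assumptions on my part, not what the deployed unit necessarily uses:

```shell
# Run the container as my own user (keep-id maps my UID/GID into the
# container), with the ggsrc mounted read-only and the Maildir writable.
podman run --rm \
  --userns=keep-id \
  -v ~/Maildir/ggsrc:/data/ggsrc:ro \
  -v ~/Maildir:/home/rrix/Maildir \
  localhost/ua /usr/bin/ggs /data/ggsrc
```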

Installing UA is a simple Makefile-based affair:

- name: ua cloned and up to date on personal branch
  become: yes
  become_user: "{{local_account}}"
  git:
    repo: https://code.rix.si/upstreams/ua/
    dest: "{{build_dir}}"
    version: rrix

- name: compile universal aggregator
  become: yes
  become_user: "{{local_account}}"
  shell:
    chdir: "{{build_dir}}"
    cmd: make
    # the shell module's "creates" takes a single path, not a list;
    # ggs is one of the binaries the build produces
    creates: "{{build_dir}}/ggs/ggs"

- name: install universal aggregator
  shell:
    chdir: "{{build_dir}}"
    cmd: make install
    creates: /usr/local/bin/ua-inline

- name: make clean
  tags:
  - postbuild
  shell:
    chdir: "{{build_dir}}"
    cmd: make clean

- name: jq installed
  dnf:
    state: installed
    name: jq

My UA configuration uses tweepy and a custom Twitter client to pull tweets in to my Maildir; make sure that ends up in the image even if I don't use it right now.

- name: tweets.py is installed
  template:
    src: tweets.py
    dest: /usr/local/bin/tweets.py

- name: tweets.py deps installed
  pip:
    state: present
    name:
    - tweepy
    - click
    - twitter-text-python

Need a way to inject the access tokens; template + ansible-vault probably!
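One likely shape for that, sketched here with a placeholder variable name: vault-encrypt each token, put the result in a vars file, and have the template task above fill it into the script.

```shell
# Produce a vaulted value suitable for pasting into a vars file.
# "twitter_consumer_key" is a placeholder name, not something the
# playbook defines today.
ansible-vault encrypt_string 'XXXX-the-real-token' --name 'twitter_consumer_key'

# tweets.py is already installed via the template module, so it could
# then reference {{ twitter_consumer_key }} directly.
```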

from __future__ import print_function
import tweepy
from email import utils
import time
import json
import click
from ttp import ttp
from ttp import utils as twutils

# Tokens are intentionally blank here; they need to be injected at
# deploy time (template + ansible-vault, per the note above).
auth = tweepy.OAuthHandler("", "")
auth.set_access_token("", "")
api = tweepy.API(auth)
parser = ttp.Parser()

@click.group()
def cli():
    pass

def make_2822_date(dt):
    tup = dt.timetuple()
    flt = time.mktime(tup)
    return utils.formatdate(flt)

def render_tweet(status, retweeter=None):
    parsed = parser.parse(status.full_text)
    body = u'<a href="https://twitter.com/{twuser}/status/{twid}">{twuser}</a>: '.format(
        twuser=status.user.screen_name,
        twid=str(status.id)
    )
    body += parsed.html
    urls = twutils.follow_shortlinks(parsed.urls)
    for small, rest in urls.items():
        body = body.replace(small, rest[-1])
    references = None

    if retweeter:
        body += u'<br/> Retweeted by <a href="https://twitter.com/@{twuser}/">{twname}</a>.'.format(twuser=retweeter.screen_name,
               twname=retweeter.name)

    if status.entities.get("media") and len(status.entities["media"]) > 0:
        for medium in (status.entities["media"]):
            body += u'<br/><img src="{twimg}"/>'.format(
                twimg=medium[u"media_url_https"]
            )

    if status.in_reply_to_status_id:
        body += u'<br/> <a href="https://twitter.com/{twuser}/status/{twid}">in reply to {twuser}</a>'.format(
            twuser=status.in_reply_to_screen_name,
            twid=status.in_reply_to_status_id_str
        )
        references = [status.in_reply_to_status_id_str]

    return {
        'author': status.author.name,
        'title': status.full_text,
        'id': status.user.screen_name + "_" + status.id_str,
        'date': make_2822_date(status.created_at),
        'body': body,
        'references': references,
        'authorEmail': status.user.screen_name + "@twitter.com"
    }

def process_tweet(status):
    if status._json.get("retweeted_status"):
        return render_tweet(status.retweeted_status, retweeter=status.user)
    else:
        return render_tweet(status)

@cli.command()
def home():
    tweets = api.home_timeline(tweet_mode="extended")
    for tweet in tweets:
        print(json.dumps(process_tweet(tweet)))

@cli.command()
@click.option('--owner', type=str)
@click.option('--slug', type=str)
def list(owner, slug):
    tweets = api.list_timeline(owner, slug, tweet_mode='extended')
    for tweet in tweets:
        print(json.dumps(process_tweet(tweet)))

if __name__ == '__main__':
    cli()
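The script emits one JSON item per line, which is the shape the rest of the pipeline consumes. My guess at the wiring, with an assumed Maildir path, looks like:

```shell
# Each output line is one JSON item; ua-inline fetches and inlines
# remote resources, and ua-maildir writes the result out as a message.
python3 /usr/local/bin/tweets.py home \
  | ua-inline \
  | ua-maildir ~/Maildir/.Twitter
```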

NEXT load tokens from file…

Usage

Universal Aggregator is composed of a number of components, starting with the "grey goo spawner" ggs, a Golang program designed to run commands at intervals while handling concurrency and work-sharing reasonably. In a really remarkable set of choices, ggs embeds a ggsrc file inside of a CONFIG_WRAPPER and then runs that like a shell script. Since the ggsrc file is, for all intents and purposes, a shell script, I can use this to my advantage to provide multiple paths for bringing data in to the system, for example by running the script with an alternate rss command defined. I want to use this fact to also provide multiple paths for bringing data out of the system. By defining functions differently depending on whether they are being run within ggs or Emacs, a system for verifying feeds and inspecting their state can be built within org-mode: a small piece of Hypermedia which presents the state of the feed alongside the feed itself.
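A minimal sketch of that dual-definition trick; the "rss" helper and the GGS_INSPECT flag are names I made up here, standing in for UA's real CONFIG_WRAPPER machinery:

```shell
#!/bin/sh
# The same feed list is sourced in two contexts; an environment flag
# selects which definition of the hypothetical "rss" helper is live.
if [ -n "$GGS_INSPECT" ]; then
  # inspection mode: just report the feed, for org-mode tooling
  rss() { printf '%s\t%s\t%s\n' "$1" "$2" "$3"; }
else
  # fetch mode: stand-in for the real download-and-deliver pipeline
  rss() { echo "fetch $2 -> $3"; }
fi

rss "Polygon - All" https://www.polygon.com/rss/index.xml Media
```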

The files are generated and written to my homeserver with the Arroyo Feed Cache.

# nix-shell -p ansible --run 'ansible -i inventory -m assemble -a "remote_src=no src=ggs dest=/home/rrix/Maildir/ggsrc" fontkeming.fail'
nix-shell -p ansible --run 'ansible -i inventory --become -m systemd -a "name=ua state=restarted" fontkeming.fail'

The files come from places like:

Over time, I think I'll slowly factor those snippets in to pages dedicated to each source; many of them exist already, the work just needs to be done. I need to spend more time thinking about how to make this more manageable. I want to build some prototypes for querying the org-roam database for things like the CCE loader table; maybe this could be shared with the CCE to make its loaders better.

And some overflow:

ggs/70-overflow.ggs

Lectronice's Tokipona Blog https://tokipona.lectronice.com/atom/d Media
Polygon - All https://www.polygon.com/rss/index.xml Media
VICE US - undefined US https://waypoint.vice.com/en_us/rss Media
Kotaku https://kotaku.com/rss Media
Privacy Enhancing Tech Symposium Papers https://content.sciendo.com/journalnewarticlerss/journals/popets/popets-overview.xml Tech
Privacy Enhancing Tech Symposium https://www.youtube.com/feeds/videos.xml?channel_id=UC-m6oi7a-8LffTk64J3tq-w Videos
My SongKick feeds http://acousti.co/feeds/upcoming/songkick-670 Art
NWS Seattle Area Forecast Discussion https://afd.fontkeming.fail/AFDSEW.xml News
King County Metro Alerts https://kcmetro-rss.buttslol.net/D/40 News