Memex Ingester Pipeline

Overview

Memex is a data storage and information retrieval system, built around the idea of ingesting data from a number of sources and storing it in a datastore that makes that data easy to query.

The current stack is, in essence, an ElasticSearch database paired with a set of custom ingesters. The ingesters will probably be written from scratch rather than built on a platform like Logstash, since the facts I am looking at are less stream-based than what Logstash is designed to handle. For sources with default Logstash providers, like Twitter, I could leverage those if the memory trade-off of running a full Java application is deemed worthwhile.

Ingester Architecture

Because ElasticSearch exposes a simple HTTP API, ingesters can be written in whatever language is most appropriate for them. This means I could build an Org-mode parser that runs as an Emacs batch job, a Matrix log parser that runs as a JavaScript process on my server, or a Twitter parser that uses a simple Python Twitter streaming library to push my tweets, and tweets mentioning me, into ElasticSearch.
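
Because each ingester only needs to speak HTTP, the per-source code can stay small. Below is a minimal sketch of what a Python ingester might look like, assuming an ElasticSearch instance at localhost:9200 and a hypothetical tweet fact-type; the field names anticipate the Required and Optional Fields sections below.

    # A minimal ingester sketch, assuming an ElasticSearch instance reachable
    # over HTTP at localhost:9200 and a hypothetical "tweet" fact-type.
    import requests

    ES_URL = "http://localhost:9200"

    def ingest_fact(fact_type, fact):
        """POST one fact document into the index named after its fact-type."""
        # ElasticSearch 6+/7+ expose documents at /<index>/_doc; older
        # versions use /<index>/<type> instead.
        resp = requests.post(f"{ES_URL}/{fact_type}/_doc", json=fact)
        resp.raise_for_status()
        return resp.json()["_id"]

    if __name__ == "__main__":
        ingest_fact("tweet", {
            "uri": "https://twitter.com/example/status/1",
            "name": "@example - an example tweet",
            "description": "Example tweet pushed by the Twitter ingester.",
        })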

Each ingester will have its own fact-type, which translates into the ElasticSearch index in which the fact is stored. It is important to note that ElasticSearch is not to be considered the source of truth; I won't be uploading image blobs into ElasticSearch, just pointers to where they reside on my file store's file system. The presentation layer is responsible for providing the actual content to me, be it as a Matrix Content Repository upload, or as a temporary HTTPS URL served from that machine's apache2 instance and presented in the mobile application which generated the query results.
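
Since a fact-type is just an index name, a small setup step could declare a shared mapping up front. A sketch, assuming ElasticSearch 7.x; the photo index name and the specific field types chosen are assumptions, not part of the design:

    # Sketch of creating an index for a hypothetical "photo" fact-type,
    # assuming ElasticSearch 7.x (mappings without a document type).
    import requests

    ES_URL = "http://localhost:9200"

    mapping = {
        "mappings": {
            "properties": {
                "uri":         {"type": "keyword"},  # pointer to the source of truth
                "name":        {"type": "text"},
                "image":       {"type": "keyword"},
                "description": {"type": "text"},
                "captured-at": {"type": "date"},
                "updated-at":  {"type": "date"},
                "geo":         {"type": "keyword"},  # raw geo URI string, see below
            }
        }
    }

    requests.put(f"{ES_URL}/photo", json=mapping).raise_for_status()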

This is intentionally vague; the idea is that facts are generic enough to be useful, and the meat is in the query system and the ability to link facts together. That is where this turns into a "fun" NLP problem which I'm going to punt on for a while; instead, I will probably issue multiple queries by hand, chaining the facts to get the information I need.

Required Fields

To make clients easy to standardize, a number of fields are defined for presenting the information a fact contains; each fact can carry its own set of metadata, and each fact-type will probably have its own schema so that certain types of media are easier to standardize.

uri

Simply put, a URI points to where a fact lives. Since a fact can be, for example, the ownership of a physical item, the URI can be either a URL or an Emacs org-mode style link as generated by org-store-link[1]; HTTP or HTTPS URLs are preferred, but not required.

In the case of physical facts, the URI can be a human-readable string or a fact://<INDEX>/<UUID> pointer to another fact, which is then interpreted as a parent-child relationship. A prime use case here is "where is this item stored?": walking the fact tree discovers the direct physical location in which an object exists, as sketched below.
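
A sketch of that walk, assuming facts are fetched back out of ElasticSearch by index (fact-type) and document id; the helper names are hypothetical:

    # Hypothetical sketch of walking a chain of fact:// parent pointers to
    # answer "where is this item stored?". Assumes the uri of a physical
    # fact either names a location directly or points at its parent fact.
    import requests

    ES_URL = "http://localhost:9200"

    def get_fact(index, uuid):
        """Fetch a single fact document by index (fact-type) and id."""
        resp = requests.get(f"{ES_URL}/{index}/_doc/{uuid}")
        resp.raise_for_status()
        return resp.json()["_source"]

    def resolve_location(fact):
        """Follow fact:// parent pointers until a non-fact URI is reached."""
        uri = fact.get("uri", "")
        while uri.startswith("fact://"):
            index, uuid = uri[len("fact://"):].split("/", 1)
            fact = get_fact(index, uuid)
            uri = fact.get("uri", "")
        return uri  # e.g. "third shelf, blue box" or an org-mode style link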

name

This is the human-readable name of a given fact; it can be the file name of a digital artifact, but should be more meaningful than that, for example "%ARTIST - %TRACK - %ALBUM" for audio files or "%AUTHOR - %TITLE" for documents.

image

This is one of the following (a sketch of resolving each scheme into something displayable follows the list):

  • A fully qualified HTTP or HTTPS URI to an image which represents the fact (cover art, a product photo of the object, etc.).
  • A fact://<INDEX>/<UUID> URI pointing to another fact, such as a photo fact whose URI is displayable.
  • An mxc://<MXID> pointer to an object in the Matrix Content Repository; this is fully qualified with the server_name of the homeserver, as it would be for a matrix.org upload.[2][3]
  • A file://<PATH> URI matching the uri property above.
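
A hypothetical sketch of how a client might resolve each of these schemes into something it can actually render; the Matrix media endpoint shape and the recursion through fact:// pointers are assumptions about the presentation layer, not part of the schema. get_fact is assumed to fetch a fact by (index, uuid), e.g. as in the fact-tree walk sketched earlier.

    # Hypothetical sketch of a client resolving the image field into a
    # displayable URL.
    from urllib.parse import urlparse

    def displayable_image(image_uri, get_fact):
        """Map an image URI onto something a client can actually render."""
        scheme = urlparse(image_uri).scheme
        if scheme in ("http", "https"):
            return image_uri                          # already displayable
        if scheme == "fact":
            index, uuid = image_uri[len("fact://"):].split("/", 1)
            return displayable_image(get_fact(index, uuid)["uri"], get_fact)
        if scheme == "mxc":
            # Matrix media can be downloaded through a homeserver's media repo.
            server, media_id = image_uri[len("mxc://"):].split("/", 1)
            return f"https://{server}/_matrix/media/r0/download/{server}/{media_id}"
        if scheme == "file":
            # Left to the presentation layer, e.g. a temporary HTTPS URL
            # served by the file store's apache2 instance.
            return image_uri
        raise ValueError(f"unsupported image scheme: {scheme}")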

Optional Fields

Beyond the Required Fields presented above, clients are expected to be able to render the following optional fields.

description

A human-readable description of the fact. Pretty straightforward. This could come from photograph metadata or album synopses found online, or be entered by hand.

captured-at

The initial datetime at which a fact was captured. PUT requests should never set this; POST requests should always set it. The value can be any time-shaped object which ElasticSearch can parse.[4]

updated-at

The datetime at which the fact was last changed. This should be set by both PUT and POST requests; again, the value can be any time-shaped object which ElasticSearch can parse.[4]
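
A sketch of that convention, assuming facts are created with POST (auto-generated id) and replaced with PUT (explicit id) against the ElasticSearch document API:

    # Sketch of the captured-at / updated-at convention, assuming POST for
    # creation and PUT for replacement via the ElasticSearch document API.
    import datetime
    import requests

    ES_URL = "http://localhost:9200"

    def now():
        return datetime.datetime.utcnow().isoformat() + "Z"

    def create_fact(fact_type, fact):
        """POST a new fact: sets both captured-at and updated-at."""
        fact["captured-at"] = fact["updated-at"] = now()
        resp = requests.post(f"{ES_URL}/{fact_type}/_doc", json=fact)
        resp.raise_for_status()
        return resp.json()["_id"]

    def update_fact(fact_type, fact_id, fact):
        """PUT an existing fact: refreshes updated-at, never sets captured-at.

        PUT replaces the whole document, so the caller passes the previously
        stored captured-at through verbatim rather than a new value.
        """
        fact["updated-at"] = now()
        resp = requests.put(f"{ES_URL}/{fact_type}/_doc/{fact_id}", json=fact)
        resp.raise_for_status()
        return resp.json()["result"]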

geo

A geo URI[5] which would allow the fact to be placed on a map and queried based on location.
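
RFC 5870 geo URIs look like geo:48.2010,16.3695,183. Below is a sketch of pulling the coordinates back out, which is what placing the fact on a map (or indexing the point as an ElasticSearch geo_point for location queries) would need:

    # Sketch of parsing an RFC 5870 geo URI such as "geo:48.2010,16.3695,183"
    # into a latitude/longitude pair, ignoring optional ";" parameters.
    def parse_geo_uri(geo_uri):
        if not geo_uri.startswith("geo:"):
            raise ValueError("not a geo URI")
        coords = geo_uri[len("geo:"):].split(";", 1)[0]  # drop parameters
        parts = coords.split(",")                        # altitude, if present, is ignored
        return {"lat": float(parts[0]), "lon": float(parts[1])}

    parse_geo_uri("geo:48.2010,16.3695,183")  # -> {"lat": 48.201, "lon": 16.3695}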

Footnotes


[1] (describe-function 'org-store-link)

[3] This is provided as a means of allowing a user of a theoretical Matrix.org fact-capture client to upload an image as part of fact generation, without having to manually transfer the image to somewhere it can be referenced by the Image DB.