Memex Ingester Pipeline
Overview
Memex is a data storage and information retrieval system, built around the idea of ingesting data from a number of sources and storing it in a datastore that makes it easy to query.
The current stack is, in essence, an ElasticSearch database paired with a set of custom ingesters. For now, the ingesters will probably be written from scratch rather than leveraging a platform like Logstash, as the facts I am looking at are less stream-based than what Logstash is designed to handle. For sources with default Logstash providers, like Twitter, I could leverage those if the memory trade-off of running a full Java application is deemed worthwhile.
Ingester Architecture
Because ElasticSearch exposes a simple HTTP API, ingesters can be written in whatever language is most appropriate for them. This means that I could build an Org-mode parser that runs as an Emacs batch job, a Matrix log parser that runs as a JavaScript process on my server, or a Twitter parser that uses a simple Python twitter streaming library to push my tweets, and tweets mentioning me, into ElasticSearch.
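As a sketch of how small such an ingester can be: the endpoint, index name, and field names below are illustrative assumptions, not a fixed schema.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local ElasticSearch endpoint


def make_fact(fact_type, uri, name, **metadata):
    """Build a generic fact document; the fact-type doubles as the index name."""
    doc = {"uri": uri, "name": name}
    doc.update(metadata)
    return fact_type, doc


def ingest(fact_type, doc):
    """POST the fact into the index named after its fact-type."""
    req = urllib.request.Request(
        f"{ES_URL}/{fact_type}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Any language with an HTTP client can do the same, which is the whole appeal of this architecture.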
Each ingester will have its own fact-type, which translates into the ElasticSearch index in which the fact is stored. It is important to note that ElasticSearch is not to be considered the source of truth; I won't be uploading image blobs into ElasticSearch, just pointers to where they reside on my file store's file system. The presentation layer is responsible for delivering the actual content to me, be it as a Matrix Content Repository upload, or a temporary HTTPS URL served from that machine's apache2 instance and presented in the mobile application which generated the query results.
This is intentionally vague; the idea is that facts are generic enough to be useful, and the meat is in the query system: the ability to link facts together. That is where this turns into a "fun" NLP problem that I'm going to punt on for a while; instead, I will probably issue multiple queries by hand, chaining the facts together to get the information I need.
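The manual chaining could look something like this; the field names and query bodies are hypothetical sketches, not a settled schema.

```python
def match_query(field, value):
    """Build a basic ElasticSearch match query body."""
    return {"query": {"match": {field: value}}}


def chain(hit, link_field):
    """Take a linking field from one query's hit and turn it into the next
    query, chaining facts together by hand instead of via NLP."""
    return match_query("uri", hit[link_field])
```

The first query finds a fact by name; a field on that hit then seeds the second query.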
Required Fields
In order to make the clients easy to standardize, a number of fields are defined that make it easy to present the information a fact contains. Each fact can have its own set of metadata, and each fact-type will probably have its own schema so that certain types of media are easier to standardize.
uri
Simply put, a URI points to where a fact lives. Given that a fact can be, for example, the ownership of a physical item, URIs can either be URLs, or Emacs Org-mode style links as generated by org-store-link[1]; HTTP or HTTPS URLs are preferred, but not required.
In the case of physical facts, the URI can be a human-readable string or a fact://<INDEX>/<UUID> pointer to another fact, which is then interpreted as a parent-child relationship. A prime use-case here would be "where is this item stored?", allowing us to walk the fact tree to discover a direct physical location in which an object exists.
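Walking such a parent-child chain can be sketched as follows; the lookup callback and field names are assumptions, not a fixed API.

```python
FACT_SCHEME = "fact://"


def walk_to_location(lookup, fact):
    """Follow fact:// parent pointers in the uri field until we reach a fact
    whose uri is a plain human-readable location, collecting names as we go."""
    chain = [fact["name"]]
    uri = fact.get("uri", "")
    while uri.startswith(FACT_SCHEME):
        index, uuid = uri[len(FACT_SCHEME):].split("/", 1)
        fact = lookup(index, uuid)  # e.g. an ElasticSearch GET by index and id
        chain.append(fact["name"])
        uri = fact.get("uri", "")
    return chain
```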
name
This is the human-readable name of a given fact; it can be the file name of a digital artifact, but should be more meaningful than that, for example "%ARTIST - %TRACK - %ALBUM" for audio files, or "%AUTHOR - %TITLE" for documents.
image
This is one of:
- A fully qualified HTTP or HTTPS URI to an image which represents the fact (cover art, a product photo of an object, etc.).
- A fact://<INDEX>/<UUID> URI pointing to another fact, such as a photo fact whose URI is displayable.
- An mxc://<MXID> pointer to an object in the Matrix Content Repository; this is fully qualified with the server_name of the homeserver, as it would be for a matrix.org upload.[2][3]
- A file://<PATH> URI matching the uri property above.
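A client could dispatch on the URI scheme to decide how to render the image; the category names below are my own, used only for illustration.

```python
from urllib.parse import urlparse


def image_source(image_uri):
    """Classify an image URI by scheme so the presentation layer knows
    how to fetch it."""
    scheme = urlparse(image_uri).scheme
    return {
        "http": "remote",          # fetch directly over HTTP(S)
        "https": "remote",
        "fact": "fact-pointer",    # resolve the referenced fact first
        "mxc": "matrix-content",   # fetch via the Matrix Content Repository
        "file": "local-file",      # serve from the file store
    }.get(scheme, "unknown")
```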
Optional Fields
Beyond the required fields presented above, clients are expected to be able to render the following optional fields.
description
A human-readable description of the fact. Pretty straightforward. This could come from photograph metadata or album synopses found online, or be entered by hand.
captured-at
The initial datetime at which a fact was captured. PUT requests should never set this; POST requests should always set it. This should be any time-shaped object which ElasticSearch can parse.[4]
updated-at
The datetime at which the fact was last changed. This should be set by both PUT and POST requests. This should be any time-shaped object which ElasticSearch can parse.[4]
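An ingester could enforce these two timestamp rules like so; this is a sketch, with a `creating` flag standing in for the POST-versus-PUT distinction.

```python
from datetime import datetime, timezone


def stamp(doc, *, creating):
    """Apply the timestamp rules: updated-at is set on every write,
    captured-at only when the fact is first created (POST)."""
    now = datetime.now(timezone.utc).isoformat()
    stamped = dict(doc)
    stamped["updated-at"] = now
    if creating:
        stamped["captured-at"] = now  # a PUT must never touch this field
    return stamped
```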
geo
A geo URI[5] which would allow the fact to be placed on a map and queried by location.
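Extracting coordinates from such a URI is straightforward; this minimal sketch ignores the optional altitude and parameters that the geo URI scheme (RFC 5870) allows.

```python
def parse_geo(uri):
    """Pull latitude and longitude out of a geo: URI such as
    geo:37.786971,-122.399677 (any altitude or ;params are ignored)."""
    if not uri.startswith("geo:"):
        raise ValueError("not a geo URI")
    coords = uri[len("geo:"):].split(";", 1)[0].split(",")
    return float(coords[0]), float(coords[1])
```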