Memex Ingester Pipeline
- Ingester Architecture
Memex is a data storage and information retrieval system, built around the idea of ingesting data from a number of sources and storing it in a datastore which allows that data to be queried easily.
The current stack looks, in essence, like an ElasticSearch database, paired with a set of custom
ingesters; currently, the ingesters are probably going to be written from scratch, instead of
leveraging a platform like Logstash, as the facts I am looking at are less stream-based than what
Logstash is designed to provide. For things with default Logstash providers, like
Because ElasticSearch exposes a simple HTTP API, ingesters can be written in the language most
appropriate for them. This means that I could build an
Org-mode parser that runs as an Emacs batch
uses a simple Python Twitter streaming library to push my tweets and tweets mentioning me into
Each ingester will have its own fact-type, which translates into the ElasticSearch index in which
the fact is stored. It is important to note that ElasticSearch is not to be considered the
source of truth; I won't be uploading image blobs into ElasticSearch, just pointing to where they
reside on my file store's file system. The presentation layer is responsible for providing the blob
to me, be it as a Matrix Content Repository upload, or as a temporary HTTPS URL served from said
apache2 instance and presented in the mobile application which generated the query.
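As a minimal sketch of what an ingester's write path could look like, the snippet below builds (but does not send) the HTTP request for one fact. It assumes a recent ElasticSearch version where documents are indexed by POSTing to `/<index>/_doc`; the node URL, the `photo` fact-type, the file path and the `uri` field name are all placeholders of mine, not part of the design above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local ElasticSearch node


def build_index_request(fact_type, fact, es_url=ES_URL):
    """Build (but do not send) an HTTP request that indexes one fact.

    The fact-type doubles as the index name. POSTing to /<index>/_doc
    asks ElasticSearch to generate the document ID itself.
    """
    body = json.dumps(fact).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{fact_type}/_doc",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical photo fact: only a pointer to the file is stored, not the blob.
request = build_index_request("photo", {"uri": "file:///srv/photos/example.jpg"})
# urllib.request.urlopen(request) would send it to a running cluster.
```

Because the request is plain HTTP plus JSON, the same shape should be reproducible from any language an ingester happens to be written in.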
This is intentionally vague; the idea is that facts are generic enough to be useful, and the meat is in the query system: the ability to link facts together. That is where this turns into a "fun" NLP problem that I'm actually going to punt on for a while. Instead, I will probably issue multiple queries, chaining the facts by hand to get the information I need.
To keep the clients easy to standardize, a number of fields are defined so that the information a fact contains is easy to present; each fact can have its own set of metadata, and each fact-type will probably have its own schema to make certain types of media easier to standardize.
A URI simply points to where a fact lives. Given that a fact can be, for example, the
ownership of a physical item, the URIs can either be URLs or Emacs org-mode style links as
produced by org-store-link [1]; HTTP or HTTPS URLs are preferred, but not required.
In the case of physical facts, the URI can be a human-readable string or a
fact://<INDEX>/<UUID> pointer to
another fact, which is then interpreted as a parent-child relationship. A prime use-case here would
be "where is this item stored?", allowing us to walk the fact tree to discover a direct physical
location in which an object exists.
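The "where is this item stored?" walk could be sketched roughly as follows. The in-memory `FACTS` table stands in for ElasticSearch lookups, and the index names, IDs and the `name` and `uri` field names are invented for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory stand-in for ElasticSearch: (index, uuid) -> fact.
FACTS = {
    ("item", "a1"): {"name": "Soldering iron", "uri": "fact://container/b2"},
    ("container", "b2"): {"name": "Toolbox", "uri": "fact://room/c3"},
    ("room", "c3"): {"name": "Garage", "uri": "Garage, north wall"},
}


def parse_fact_uri(uri):
    """Return (index, uuid) for a fact://<INDEX>/<UUID> URI, else None."""
    parts = urlsplit(uri)
    if parts.scheme != "fact" or not parts.netloc:
        return None
    uuid = parts.path.strip("/")
    return (parts.netloc, uuid) if uuid else None


def locate(key, facts=FACTS):
    """Follow parent pointers until a fact's URI is not another fact."""
    chain, seen = [], set()
    while key is not None and key not in seen:
        seen.add(key)  # guard against accidental cycles
        fact = facts[key]
        chain.append(fact["name"])
        key = parse_fact_uri(fact["uri"])
    return chain


path = locate(("item", "a1"))
```

Here `path` lists the item followed by each parent in turn; the walk stops at the first fact whose URI is a plain human-readable string.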
This is the human-readable name of a given fact; it can be the file name of a digital artifact, but should be more meaningful than that, for example "%ARTIST - %TRACK - %ALBUM" for audio files, or "%AUTHOR - %TITLE" for documents.
This is one of:
- A fully qualified HTTP or HTTPS URI to an image which represents the fact (cover art,
product photo of an object, etc.).
- A fact://<INDEX>/<UUID> URI pointing to another fact, such as a photo fact whose URI is displayable.
- An mxc://<MXID> pointer to an object in the Matrix Content Repository; this is fully qualified with the
server_name of the homeserver, as it would be for a matrix.org upload. [2][3]
- A file://<PATH> URI matching the
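A client has to decide how to render each of these forms. One way to do the first step, dispatching on the URI scheme alone, is sketched below; the kind labels (`web`, `fact`, `matrix`, `file`) and the function name are my own and not part of any schema.

```python
from urllib.parse import urlsplit

# Hypothetical labels for the image reference forms listed above.
IMAGE_KINDS = {
    "http": "web",
    "https": "web",
    "fact": "fact",
    "mxc": "matrix",
    "file": "file",
}


def image_kind(image):
    """Name the kind of image reference, based only on its URI scheme."""
    scheme = urlsplit(image).scheme
    if scheme not in IMAGE_KINDS:
        raise ValueError(f"unsupported image reference: {image!r}")
    return IMAGE_KINDS[scheme]
```

A presentation layer could then fetch `web` images directly, resolve `fact` references with another query, and hand `matrix` references to the homeserver.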
Past the Required Fields presented above, clients are expected to be able to render the following optional fields.
A human-readable description of the fact. Pretty straightforward. This could come from photograph metadata, online album synopses, or be entered by hand.
Initial datetime at which a fact was captured.
PUT requests should never set this,
should always set this. This should be any time-shaped object which ElasticSearch can parse. [4]
The datetime at which the fact was last changed. This should be set by both
requests. This should be any time-shaped object which ElasticSearch can parse. [4]
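The two timestamp rules can be sketched as a small helper. It assumes the ingester knows whether it is capturing a fact for the first time or updating an existing one, and it uses `created` and `updated` as hypothetical field names. ISO 8601 strings are used because ElasticSearch's default date parsing accepts them.

```python
from datetime import datetime, timezone
from typing import Optional


def stamp(fact: dict, creating: bool, now: Optional[datetime] = None) -> dict:
    """Return a copy of the fact with timestamps applied.

    "created" and "updated" are hypothetical field names.
    """
    moment = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    stamped = dict(fact)
    if creating:
        stamped["created"] = moment  # set once, when the fact is first captured
    else:
        stamped.pop("created", None)  # an update never sends a capture time
    stamped["updated"] = moment  # set on every write
    return stamped
```

Passing `now` explicitly keeps the helper easy to test; an ingester would normally leave it out and take the current UTC time.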
A geo URI [5] which allows the fact to be placed on a map and queried based on location.
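To make a geo URI queryable by location, an ingester would likely need to turn it into something ElasticSearch can treat as a point. The sketch below assumes the `geo:<lat>,<lon>` form from RFC 5870 (optionally followed by an altitude and `;`-separated parameters, both ignored here) and returns the `lat`/`lon` object shape that ElasticSearch accepts for `geo_point` fields. The function name is my own.

```python
def parse_geo_uri(uri: str) -> dict:
    """Convert a geo: URI into a {"lat": ..., "lon": ...} mapping."""
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() != "geo":
        raise ValueError(f"not a geo URI: {uri!r}")
    # Drop parameters such as ";u=35", then split latitude, longitude, altitude.
    coords = rest.split(";", 1)[0].split(",")
    if len(coords) not in (2, 3):
        raise ValueError(f"expected lat,lon[,alt] in {uri!r}")
    lat, lon = float(coords[0]), float(coords[1])
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range in {uri!r}")
    return {"lat": lat, "lon": lon}


point = parse_geo_uri("geo:37.786971,-122.399677;u=35")
```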
This is provided so that a user capturing facts with a theoretical Matrix.org client can upload an image as part of fact generation, without having to manually transfer the image to somewhere it can be referenced by the Image DB.