#+TITLE: Memex Client Specifications
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>
* Overview
This document lays out the ways in which I will be interacting with my Memex, possibly with crudely
drawn interface descriptions.
* Chat Interface (Matrix)
I'd like a chat-style interface to be the primary system with which to interact with the Memex; this
allows me to form natural-language queries and send them to the system, in a way that is easy for me
to integrate with my other systems, such as the CCE[fn:1]. In essence, this boils down to a pretty
simple NLP system which needs to be able to extract a few parts of speech out of a query.
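A first pass at that extraction could be as small as a pair of regular expressions; a sketch in Python, where the =extract_query= helper and its output shape are hypothetical, not existing code:
#+BEGIN_SRC python
import re

def extract_query(utterance):
    """Pull the fact index and a narrowing phrase out of a query.

    "What photos did I take on January 17th?" ->
    {"index": "photos", "narrower": ("on", "January 17th")}
    """
    # "What <index> ..." names the fact index to search.
    index = re.search(r"^What (\w+)", utterance)
    # Narrowers are prepositional phrases: "on <date>", "in <place>",
    # "during <fact>".
    narrower = re.search(r"\b(on|in|during) ([^?]+)", utterance)
    return {
        "index": index.group(1) if index else None,
        "narrower": narrower.groups() if narrower else None,
    }

print(extract_query("What photos did I take on January 17th?"))
#+END_SRC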
** Sample Flows
(=op= is the Operator of the bot, and =bot= is the bot)
*** Retrieving Information
#+BEGIN_EXAMPLE
<op> What photos did I take on January 17th?
<bot> I found 27 photos you took on January 17th, 2016.
<bot> I am uploading thumbnails now. Feel free to tell me to stop when you see what you are looking for.
<bot> =IMG_2011.jpg=
<bot> My friends and I at dinner fact://photos/ef066a9b-be38-4013-af4a-f4ed2faa410e
<bot> =IMG_2012.jpg=
<bot> Birthday Selfie! fact://photos/2a9b5f93-cb1d-4096-910f-38c3afb6cb9f
<bot> =IMG_2013.jpg=
<op> stop
<bot> fact://photos/7535a26e-11a0-44e9-ac30-0fff2e212abe
<bot> If you want more information about a fact, paste the description or the item's URI back to me.
<op> My friends and I at dinner (in Oakland, CA), fact://photos/ef066a9b-be38-4013-af4a-f4ed2faa410e
<bot> This fact was captured in Oakland, CA at 21:13 on January 17th 2016.
<bot> It is tagged with "Carl Johnson", "Jane Doe", "John Doo", "Ocean Waves", "Roxie Theatre"
#+END_EXAMPLE
The key bits here are =What photos=, which references the =photos= fact index, and "on January 17th"
with an implied year of "last year", which is the parameter on which to narrow the query. Other
"narrowers" could be "in Phoenix, Arizona" and eventually more complex chained queries like "during
fact://event/ef2cdcbd-449f-4903-9119-69779e63c6c0", referencing an Event fact's start and end times.
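To give a sense of where such a query bottoms out, here is one possible translation of the date narrower into an ElasticSearch range query, assuming the =captured-at= field described in the ingester pipeline document; the endpoint and field names are assumptions, not a fixed wire format.
#+BEGIN_SRC python
import requests

# "on January 17th" with the implied year filled in, as a range query
# against the =photos= index; =captured-at= is the shared field defined
# in the ingester pipeline document.
query = {
    "query": {
        "range": {
            "captured-at": {
                "gte": "2016-01-17T00:00:00",
                "lt": "2016-01-18T00:00:00",
            }
        }
    }
}

# Assumes ElasticSearch on localhost; the index name is the fact-type.
response = requests.get("http://localhost:9200/photos/_search", json=query)
print(response.json()["hits"]["total"])
#+END_SRC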
*** Adding a fact
#+BEGIN_EXAMPLE
<op> I purchased "Site Reliability Engineering" EBook from O'Reilly
<bot> Okay, can you upload a picture of it?
<op> =IMG_2014.jpg=
<bot> Thanks, I have found the following information about the ebook, is it correct?
<bot> Title: Site Reliability Engineering
<bot> Author: John Doe
<bot> ISBN: 1234567890
<op> Correct.
<bot> Great, how much did it cost?
<op> 34.99
<bot> Alright, I've created fact://ebook/1560bff1-ae7a-4117-85b8-7c9caa294cb7
#+END_EXAMPLE
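The end state of that flow might be a document shaped like the sketch below; the field names past the shared ones, the URL, and the =mxc= pointer are all placeholders, since the ebook schema is not pinned down yet.
#+BEGIN_SRC python
import uuid
import requests

# The document the bot would index at the end of the flow above. The
# mapping type name "fact" is arbitrary, and everything past the shared
# =uri=/=name=/=image= fields is a placeholder schema.
fact_id = str(uuid.uuid4())
ebook_fact = {
    "uri": "https://shop.oreilly.com/",           # placeholder URL
    "name": "John Doe - Site Reliability Engineering",
    "image": "mxc://example.com/abc123",          # the uploaded IMG_2014.jpg
    "isbn": "1234567890",
    "price": 34.99,
}
requests.put("http://localhost:9200/ebook/fact/%s" % fact_id, json=ebook_fact)
#+END_SRC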
* Web Application
* Tasker Scripts
* Command Line query interface
* Footnotes
[fn:1] http://doc.rix.si/cce/cce.html

#+TITLE: Memex Ingester Pipeline
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>
* Overview
Memex is a data storage and information retrieval system, built around the idea of ingesting data
from a number of sources and storing it in a datastore which allows for easy querying of said data.
The current stack looks, in essence, like an ElasticSearch database, paired with a set of custom
ingesters; currently, the ingesters are probably going to be written from scratch, instead of
leveraging a platform like Logstash, as the facts I am looking at are less stream-based than what
Logstash is designed to provide. For things with default Logstash providers, like =twitter=, I could
leverage these if the memory trade-off of running a full Java application is deemed to be worthwhile.
* Ingester Architecture
Because ElasticSearch exposes a simple HTTP API, ingesters can be written in the language most
appropriate for them. This means that I could build an =Org-mode= parser that runs as an Emacs batch
job, or a Matrix log parser that runs as a JavaScript process on my server, or a Twitter parser that
uses a simple Python twitter streaming library to push my tweets and tweets mentioning me into
ElasticSearch.
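Whatever the language, every ingester reduces to the same shape: produce a fact document, then hand it to ElasticSearch over HTTP. A minimal Python skeleton, where =fetch_tweets= is a stand-in for whichever streaming library ends up being used:
#+BEGIN_SRC python
import requests

ELASTICSEARCH = "http://localhost:9200"  # assumed datastore location

def fetch_tweets():
    # Stand-in for a real streaming library; yields canned data so the
    # skeleton runs end to end.
    yield {"id": "1", "text": "hello memex"}

def ingest(fact_type, fact):
    """POST one fact into the index named after its fact-type."""
    return requests.post("%s/%s/fact" % (ELASTICSEARCH, fact_type), json=fact)

for tweet in fetch_tweets():
    ingest("tweet", {
        "uri": "https://twitter.com/i/web/status/%s" % tweet["id"],
        "name": tweet["text"],
    })
#+END_SRC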
Each ingester will have its own fact-type, which translates into the ElasticSearch index in which
the fact is stored. It is important to note that ElasticSearch is *not* to be considered the
source of truth; I won't be uploading image blobs into ElasticSearch, just pointing to where they
reside on my file store's file system. The presentation layer is responsible for providing that
to me, be it as a Matrix Content Repository upload, or a temporary HTTPS URL served from said
machine's =apache2= instance and presented in the mobile application which generated the query
results.
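Concretely, a photo fact's document stays tiny no matter how large the photo is; something like the following, with the file-store path being illustrative:
#+BEGIN_SRC python
# The ElasticSearch document for a photo fact: a pointer to the blob on
# the file store, never the blob itself. The path is illustrative.
photo_fact = {
    "uri": "file:///srv/filestore/photos/2016/01/IMG_2011.jpg",
    "name": "My friends and I at dinner",
    "image": "file:///srv/filestore/photos/2016/01/IMG_2011.jpg",
}
#+END_SRC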
This is intentionally vague; the idea is that facts are generic enough to be useful, and the meat is
in the query system, the ability to link facts together, which is where this turns into a "fun" NLP
problem. I'm going to punt on that for a while; instead I will probably issue multiple queries by
hand, chaining the facts together to get the information I need.
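Chaining by hand is not too painful; "during fact://event/..." from the client spec decomposes into two queries issued back to back, roughly like this sketch (the =start= and =end= field names on event facts are assumptions):
#+BEGIN_SRC python
import requests

ES = "http://localhost:9200"

# "What photos did I take during <event>?", chained by hand: fetch the
# event fact, then range-query the photos index with its bounds.
event = requests.get(
    "%s/event/fact/ef2cdcbd-449f-4903-9119-69779e63c6c0" % ES
).json()["_source"]

photos = requests.get("%s/photos/_search" % ES, json={
    "query": {"range": {"captured-at": {
        "gte": event["start"],  # assumed field names on event facts
        "lte": event["end"],
    }}},
}).json()
#+END_SRC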
** Required Fields
:PROPERTIES:
:ID: 407fd88b-f5c8-46de-b77d-6e9583714347
:END:
In order to make the clients easy to standardize, a number of fields are defined to make it easy to
present the information a fact contains; each fact can have its own set of metadata, and each
fact-type will probably have its own schema to make certain types of media easier to standardize.
*** =uri=
A URI points to where a fact lives, simply. Given that a Fact can be, for example, the
ownership of a physical item, the URIs can either be URLs, or Emacs org-mode style links as
generated by =org-store-link=[fn:1]; HTTP or HTTPS URLs are preferred, but not required.
In the case of physical facts, the URI can be a human-readable string or a =fact://<INDEX>/<UUID>= pointer to
another fact, which is then interpreted as a parent-child relationship. A prime use-case here would
be "where is this item stored?", allowing us to walk the fact tree to discover a direct physical
location in which an object exists.
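Since those parent pointers eventually bottom out at a real place, answering "where is this item stored?" is just a loop; a sketch, assuming facts are fetched straight from ElasticSearch by index and UUID:
#+BEGIN_SRC python
import re
import requests

FACT_URI = re.compile(r"^fact://(?P<index>[^/]+)/(?P<uuid>.+)$")

def locate(fact):
    """Follow =uri= parent pointers until they bottom out somewhere real."""
    while True:
        match = FACT_URI.match(fact["uri"])
        if match is None:
            return fact["uri"]  # a human-readable place: the end of the chain
        resp = requests.get("http://localhost:9200/%s/fact/%s"
                            % (match.group("index"), match.group("uuid")))
        fact = resp.json()["_source"]

# e.g. USB mouse -> desk drawer fact -> "office closet, second shelf"
#+END_SRC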
*** =name=
This is the human-readable name of a given fact; it can be the file name of a digital artifact, but
should be more meaningful than this, for example "%ARTIST - %TRACK - %ALBUM" for audio files, or
"%AUTHOR - %TITLE" for documents.
*** =image=
This is one of:
- A fully qualified HTTP or HTTPS URI to an image which represents the fact (Cover art,
product photo of object, etc).
- A =fact://<INDEX>/<UUID>= URI pointing to another fact, such as a photo fact whose URI is displayable.
- A =mxc://<MXID>= pointer to an object in the Matrix Content Repository; this is fully qualified
with the =server_name= of the homeserver, as it would be for a [[matrix.org]] upload.[fn:2][fn:3]
- A =file://<PATH>= URI matching the =uri= property above.
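Putting the three required fields together for a physical fact, say a book living in a storage bin, might look like this (all values illustrative):
#+BEGIN_SRC python
# A physical fact with all three required fields: =uri= is a parent
# pointer to the bin it is stored in, =image= an HTTPS product photo.
# Every value here is illustrative.
book_fact = {
    "uri": "fact://object/00000000-0000-0000-0000-000000000000",
    "name": "John Doe - Site Reliability Engineering",
    "image": "https://example.com/covers/sre.jpg",
}
#+END_SRC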
** Optional Fields
Past the [[id:407fd88b-f5c8-46de-b77d-6e9583714347][Required Fields]] presented above, clients are expected to be able to render the following
optional fields.
*** =description=
A human readable description of the fact. Pretty straightforward. This could come from photograph
metadata, from album synopses online, or be entered by hand.
*** =captured-at=
Initial datetime at which a fact was captured. =PUT= requests should never set this; =POST= requests
should always set this. This should be any time-shaped object which ElasticSearch can parse.[fn:4]
*** =updated-at=
The datetime at which the fact was last changed. This should be set by both =PUT= and =POST=
requests. This should be any time-shaped object which ElasticSearch can parse.[fn:4]
*** =geo=
A geo URI[fn:5] which would allow the fact to be placed on a map and queried based on location.
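Taken together, the shared fields suggest a baseline mapping that every fact index could start from; a sketch assuming an ElasticSearch 5.x-style mapping with a single arbitrary =fact= type:
#+BEGIN_SRC python
import requests

# Baseline mapping for the shared fields; each fact-type's index would
# extend this with its own schema. The geo URI is kept verbatim in a
# keyword field; parsing it into a geo_point is a possible refinement.
common_mapping = {
    "mappings": {
        "fact": {  # arbitrary mapping type name
            "properties": {
                "uri":         {"type": "keyword"},
                "name":        {"type": "text"},
                "image":       {"type": "keyword"},
                "description": {"type": "text"},
                "captured-at": {"type": "date"},
                "updated-at":  {"type": "date"},
                "geo":         {"type": "keyword"},
            }
        }
    }
}
requests.put("http://localhost:9200/photos", json=common_mapping)
#+END_SRC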
* Footnotes
[fn:1] (describe-function 'org-store-link)
[fn:2] https://matrix.org/docs/spec/client_server/r0.2.0.html#id43
[fn:3] This is provided as a means to allow a user of a theoretical Matrix.org client capturing
facts to upload an image as part of the fact generation, without having to manually transfer an
image into a place where it can be referenced by the Image DB.
[fn:4] https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html
[fn:5] https://en.wikipedia.org/wiki/Geo_URI_scheme

#+TITLE: Memex
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>
* Memex: Extracting Information from Data
Memex is my external memory, a search engine for the facts surrounding my life.
** Use Cases
- What calendar events do I have today?
- What photos did I take while I was in San Diego?
- What restaurant did I go to the last time I was in Tempe?
- Who was I talking to about Ear Plugs on Twitter?
- What page was I looking at that had
- Where did I store my USB mouse?
- Which e-books do I need to strip the DRM from?
- Who have I sent emails to in the last month?
- When was the last time I ate Burmese food?
- Where was I when I took a picture of a truck?
- Which calendar event did this page of notes come from?
- Which calendar event was I at when I wrote this org-mode entry?
* Architecture
This thing is pretty simple, right now: a backend that takes objects and shoves them into two
stores.
#+BEGIN_SRC dot :file images/arch.png
digraph {
  client -> backend
  backend -> elasticsearch
  backend -> postgres
  ingester -> backend
}
#+END_SRC
#+RESULTS:
[[file:images/arch.png]]
I am not sure whether I want to have this running PostgreSQL; that feels very much like
overkill. The biggest unknown is being able to do recursive graph queries with ElasticSearch, but
I'm not even sure if that is a use case I need. In the short/medium term, I imagine I will stick
with just using ElasticSearch, and save the blobs to somewhere I can backfill them from, probably
an S3 bucket.
** Fact Types
Facts can be roughly defined as:
- the time, date and location in which an Event takes place
- the location of a digital artifact, along with metadata to identify the artifact
- the location of a physical artifact, along with metadata surrounding the artifact
The initial fact-types:
- =object= is a bit obtuse, but is a physical "thing"
- =event= is a time period, two datetimes with metadata attached to them.
- =location= is probably too abstract; it feels like an attribute attached to other facts, but there
might be value in presenting it directly.
- =statement= is a sentence or paragraph of either written text, recorded voice, or thoughts; linked
statements form a conversation or train of thought depending on the source.
- =page= is either an OCR'd document, or the text of an HTML page.
- =photo= is, well, a photo. A JPEG or PNG with metadata attached.
- =file= is any other file, with whatever metadata can be extracted from it.
- =person= is self-explanatory. Basically a vCard, plus arbitrary key-values otherwise.
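For instance, an =event= fact might carry little more than its two datetimes and the shared fields (the =start= and =end= names are illustrative, matching the sketch in the ingester pipeline document):
#+BEGIN_SRC python
# An =event= fact: two datetimes plus metadata. The =start=/=end= names
# and all values are illustrative.
event_fact = {
    "name": "Dinner at the Roxie",
    "start": "2016-01-17T19:00:00",
    "end": "2016-01-17T22:00:00",
    "geo": "geo:37.7648,-122.4220",
}
#+END_SRC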