#+TITLE: Memex Client Specifications
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>

* Overview

This document lays out the ways in which I will be interacting with my Memex, possibly with crudely
drawn interface descriptions.

* Chat Interface (Matrix)

I'd like a chat-style interface to be the primary system with which to interact with the Memex; this
allows me to form natural-language queries and send them to the system, in a way that is easy for me
to integrate with my other systems, such as the CCE[fn:1]. In essence, this boils down to a pretty
simple NLP system which needs to be able to extract a few parts of speech out of a query.

** Sample Flows

(=op= is the Operator of the bot, and =bot= is the bot)

*** Retrieving Information

#+BEGIN_EXAMPLE
<op> What photos did I take on January 17th?
<bot> I found 27 photos you took on January 17th, 2016.
<bot> I am uploading thumbnails now. Feel free to tell me to stop when you see what you are looking for.
<bot> =IMG_2011.jpg=
<bot> My friends and I at dinner fact://photos/ef066a9b-be38-4013-af4a-f4ed2faa410e
<bot> =IMG_2012.jpg=
<bot> Birthday Selfie! fact://photos/2a9b5f93-cb1d-4096-910f-38c3afb6cb9f
<bot> =IMG_2013.jpg=
<op> stop
<bot> fact://photos/7535a26e-11a0-44e9-ac30-0fff2e212abe
<bot> If you want more information about a fact, paste the description or the item's URI back to me.
<op> My friends and I at dinner (in Oakland, CA), fact://photos/ef066a9b-be38-4013-af4a-f4ed2faa410e
<bot> This fact was captured in Oakland, CA at 21:13 on January 17th 2016.
<bot> It is tagged with "Carl Johnson", "Jane Doe", "John Doo", "Ocean Waves", "Roxie Theatre"
#+END_EXAMPLE

The key bits here are =What photos=, which references the =photos= fact index, and "on January 17th",
with an implied year of "last year", which is the parameter on which to narrow the query. Other
"narrowers" could be "in Phoenix, Arizona" and, eventually, more complex chained queries like "during
fact://event/ef2cdcbd-449f-4903-9119-69779e63c6c0", referencing an Event fact's start and end times.
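
What follows is a minimal sketch of that extraction, using only the Python standard library and
handling just the two narrowers discussed above (a fact index and a date). The index names and the
function itself are illustrative assumptions, not part of this spec; a real bot would sit behind the
Matrix client and likely lean on a proper NLP library.

#+BEGIN_SRC python
# Deliberately naive parser for queries like
# "What photos did I take on January 17th?" -- a sketch, not the real NLP layer.
import re
from datetime import date

KNOWN_INDICES = {"photos", "events", "ebooks"}  # hypothetical fact indices

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

def parse_query(text, default_year=2016):
    """Extract the fact index and an optional date narrower from a query."""
    index = next((word for word in re.findall(r"[a-z]+", text.lower())
                  if word in KNOWN_INDICES), None)

    narrower = None
    match = re.search(r"on ({}) (\d{{1,2}})(?:st|nd|rd|th)?".format("|".join(MONTHS)),
                      text, re.IGNORECASE)
    if match:
        narrower = date(default_year, MONTHS.index(match.group(1).lower()) + 1,
                        int(match.group(2)))
    return index, narrower

print(parse_query("What photos did I take on January 17th?"))
# => ('photos', datetime.date(2016, 1, 17))
#+END_SRC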

*** Adding a fact

#+BEGIN_EXAMPLE
<op> I purchased "Site Reliability Engineering" EBook from O'Reilly
<bot> Okay, can you upload a picture of it?
<op> =IMG_2014.jpg=
<bot> Thanks, I have found the following information about the ebook, is it correct?
<bot> Title: Site Reliability Engineering
<bot> Author: John Doe
<bot> ISBN: 1234567890
<op> Correct.
<bot> Great, how much did it cost?
<op> 34.99
<bot> Alright, I've created fact://ebook/1560bff1-ae7a-4117-85b8-7c9caa294cb7
#+END_EXAMPLE
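
As a sketch of how the bot's side of this flow could be driven, the generator below acts as a tiny
dialogue state machine: each =yield= is a bot message, and the value sent back in is the operator's
reply. The function and field names are made up for illustration, and the metadata lookup from the
uploaded photo is elided entirely; this is not a specified implementation.

#+BEGIN_SRC python
# Sketch of the "Adding a fact" dialogue as a generator-driven state machine.
import uuid

def add_fact_dialog(initial_statement):
    fact = {"name": initial_statement, "index": "ebook"}   # assumed fact index
    reply = yield "Okay, can you upload a picture of it?"
    fact["image"] = reply                                   # e.g. an mxc:// upload
    reply = yield ("Thanks, I have found the following information about "
                   "the ebook, is it correct?")
    if reply.lower().startswith("correct"):
        reply = yield "Great, how much did it cost?"
        fact["cost"] = float(reply)
    yield "Alright, I've created fact://{}/{}".format(fact["index"], uuid.uuid4())

# Driving the dialogue by hand:
dialog = add_fact_dialog('I purchased "Site Reliability Engineering" EBook')
print(next(dialog))                  # bot asks for a picture
print(dialog.send("IMG_2014.jpg"))   # bot presents the extracted metadata
print(dialog.send("Correct."))       # bot asks for the price
print(dialog.send("34.99"))          # bot reports the new fact:// URI
#+END_SRC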

* Web Application

* Tasker Scripts

* Command Line Query Interface

* Footnotes

[fn:1] http://doc.rix.si/cce/cce.html

#+TITLE: Memex Ingester Pipeline
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>

* Overview

Memex is a data storage and information retrieval system, built around the idea of ingesting data
from a number of sources and storing it in a datastore which allows for easy queries of said data.

The current stack looks, in essence, like an ElasticSearch database paired with a set of custom
ingesters; currently, the ingesters are probably going to be written from scratch, instead of
leveraging a platform like Logstash, as the facts I am looking at are less stream-based than what
Logstash is designed to handle. For things with default Logstash providers, like =twitter=, I could
leverage those if the memory trade-off of running a full Java application is deemed worthwhile.

* Ingester Architecture

Because ElasticSearch exposes a simple HTTP API, ingesters can be written in the language most
appropriate for them. This means that I could build an =Org-mode= parser that runs as an Emacs batch
job, or a Matrix log parser that runs as a JavaScript process on my server, or a Twitter parser that
uses a simple Python Twitter streaming library to push my tweets, and tweets mentioning me, into
ElasticSearch.
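
As a concrete, if minimal, sketch of that shape, the snippet below shows what a Python ingester
pushing a single fact over the HTTP API might look like. The endpoint, index name, mapping-type path
segment, and the example field values are all assumptions made for illustration; the field names
anticipate the fields described below.

#+BEGIN_SRC python
# Minimal ingester sketch: push one fact document into an ElasticSearch index
# over the plain HTTP API. Endpoint, index, and values are assumptions.
import json
from datetime import datetime, timezone

import requests  # pip install requests

ES = "http://localhost:9200"   # assumed ElasticSearch endpoint
INDEX = "photos"               # one index per fact-type

def ingest_fact(fact, index=INDEX):
    # A POST (create) sets captured-at; updates via PUT would leave it alone.
    fact.setdefault("captured-at", datetime.now(timezone.utc).isoformat())
    fact.setdefault("updated-at", fact["captured-at"])
    resp = requests.post("{}/{}/fact/".format(ES, index),  # ES picks the document id
                         data=json.dumps(fact),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return "fact://{}/{}".format(index, resp.json()["_id"])

print(ingest_fact({
    "uri": "file:///srv/photos/2016/01/IMG_2011.jpg",
    "name": "IMG_2011.jpg",
    "description": "My friends and I at dinner",
}))
#+END_SRC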

Each ingester will have its own fact-type, which translates into the ElasticSearch index in which
the fact is stored. It is important to note that ElasticSearch is *not* to be considered the
source of truth; I won't be uploading image blobs into ElasticSearch, just pointing to where on
my file store's file system they reside. The presentation layer is responsible for providing that
to me, be it as a Matrix Content Repository upload, or a temporary HTTPS URL served from said
machine's =apache2= instance and presented in the mobile application which generated the query
results.

This is intentionally vague; the idea is that facts are generic enough to be useful, and the meat is
in the query system: the ability to link facts together. That is where this turns into a "fun" NLP
problem, which I'm actually going to punt on for a while; instead, I will probably issue multiple
queries, chaining the facts together by hand to get the information I need.

** Required Fields
:PROPERTIES:
:ID: 407fd88b-f5c8-46de-b77d-6e9583714347
:END:

In order to make the clients easy to standardize, a number of fields are defined to make it easy to
present the information a fact contains; each fact can have its own set of metadata, and each
fact-type will probably have its own schema to make certain types of media easier to standardize.

*** =uri=

A URI simply points to where a fact lives. Given that a Fact can represent, for example, the
ownership of a physical item, the URIs can be either URLs or Emacs org-mode style links as
generated by =org-store-link=[fn:1]; HTTP or HTTPS URLs are preferred, but not required.

In the case of physical facts, the URI can be a human-readable string or a =fact://<INDEX>/<UUID>= pointer to
another fact, which is then interpreted as a parent-child relationship. A prime use-case here would
be "where is this item stored?", allowing us to walk the fact tree to discover a direct physical
location in which an object exists.
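
Here is a sketch of what that walk could look like, assuming facts are fetched back out of
ElasticSearch by index and id; the endpoint layout and helper names are illustrative assumptions
rather than part of the spec.

#+BEGIN_SRC python
# Sketch: follow a chain of fact:// parent pointers in the =uri= field until we
# reach a human-readable location (or any non-fact URI).
import re

import requests  # pip install requests

ES = "http://localhost:9200"   # assumed ElasticSearch endpoint
FACT_URI = re.compile(r"^fact://(?P<index>[^/]+)/(?P<id>.+)$")

def get_fact(index, fact_id):
    resp = requests.get("{}/{}/fact/{}".format(ES, index, fact_id))
    resp.raise_for_status()
    return resp.json()["_source"]

def physical_location(fact_uri, max_depth=10):
    """Walk parent fact:// pointers until the uri stops being a fact:// link."""
    for _ in range(max_depth):
        match = FACT_URI.match(fact_uri)
        if match is None:
            return fact_uri            # e.g. "top shelf of the office closet"
        fact_uri = get_fact(match.group("index"), match.group("id"))["uri"]
    raise RuntimeError("fact tree deeper than expected: " + fact_uri)

# physical_location("fact://object/<UUID>")
#+END_SRC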

*** =name=

This is the human-readable name of a given fact; it can be the file name of a digital artifact, but
should be more meaningful than this, for example "%ARTIST - %TRACK - %ALBUM" for audio files, or
"%AUTHOR - %TITLE" for documents.

*** =image=

This is one of the following (a small resolution sketch appears after the list):
- A fully qualified HTTP or HTTPS URI to an image which represents the fact (cover art,
  product photo of an object, etc.).
- A =fact://<INDEX>/<UUID>= URI pointing to another fact, such as a photo fact whose URI is displayable.
- An =mxc://<MXID>= pointer to an object in the Matrix Content Repository; this is fully qualified
  with the =server_name= of the homeserver, as it would be for a [[matrix.org]] upload.[fn:2][fn:3]
- A =file://<PATH>= URI matching the =uri= property above.
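
The dispatch a client would do over those four cases is simple enough to sketch directly; the
homeserver URL, the helper names, and the media download path are assumptions for illustration (the
download endpoint follows the r0 content repository layout referenced in the footnotes).

#+BEGIN_SRC python
# Sketch: turn the =image= field into something a client can actually display.
HOMESERVER = "https://matrix.org"   # assumed homeserver for mxc:// content

def resolve_image(image, get_fact):
    """Return a displayable URL for a fact's =image= field.

    `get_fact` is expected to map a fact:// URI to that fact's document."""
    if image.startswith(("http://", "https://", "file://")):
        return image
    if image.startswith("fact://"):
        return resolve_image(get_fact(image)["uri"], get_fact)
    if image.startswith("mxc://"):
        server_name, media_id = image[len("mxc://"):].split("/", 1)
        return "{}/_matrix/media/r0/download/{}/{}".format(
            HOMESERVER, server_name, media_id)
    raise ValueError("unrecognised image reference: " + image)
#+END_SRC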

** Optional Fields

Past the [[id:407fd88b-f5c8-46de-b77d-6e9583714347][Required Fields]] presented above, clients are expected to be able to render the following
optional fields.

*** =description=

A human-readable description of the fact. Pretty straightforward. This could come from photograph
metadata or album synopses online, or be entered by hand.

*** =captured-at=

The initial datetime at which a fact was captured. =PUT= requests should never set this; =POST=
requests should always set it. This should be any time-shaped object which ElasticSearch can parse.[fn:4]

*** =updated-at=

The datetime at which the fact was last changed. This should be set by both =PUT= and =POST=
requests. This should be any time-shaped object which ElasticSearch can parse.[fn:4]

*** =geo=

A geo URI[fn:5] which would allow the fact to be placed on a map and queried based on location.
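
Pulling the required and optional fields together, a complete fact document might look like the
example below; every value here is made up for illustration, and only the field names come from
this spec.

#+BEGIN_SRC python
# An illustrative fact document combining the required and optional fields above.
example_fact = {
    # required fields
    "uri":   "file:///srv/photos/2016/01/IMG_2011.jpg",
    "name":  "IMG_2011.jpg",
    "image": "file:///srv/photos/2016/01/IMG_2011.jpg",   # matches =uri=, per the list above
    # optional fields
    "description": "My friends and I at dinner",
    "captured-at": "2016-01-17T21:13:00-08:00",   # set when the fact was first POSTed
    "updated-at":  "2016-01-18T09:02:11-08:00",   # bumped on every PUT or POST
    "geo":         "geo:37.8044,-122.2711",       # a geo URI, roughly Oakland, CA
}
#+END_SRC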

* Footnotes

[fn:1] (define-function 'org-store-link)

[fn:2] https://matrix.org/docs/spec/client_server/r0.2.0.html#id43

[fn:3] This is provided as a means to allow a user, using a theoretical Matrix.org client to capture
facts, to upload an image as part of the fact generation, without having to manually transfer an
image to where it can be referenced by the Image DB.

[fn:4] https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html

[fn:5] https://en.wikipedia.org/wiki/Geo_URI_scheme

#+TITLE: Memex
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>

* Memex: Extracting Information from Data

Memex is my external memory, a search engine for the facts surrounding my life.

** Use Cases

- What calendar events do I have today?
- What photos did I take while I was in San Diego?
- What restaurant did I go to the last time I was in Tempe?
- Who was I talking to about Ear Plugs on Twitter?
- What page was I looking at that had
- Where did I store my USB mouse?
- Which e-books do I need to strip the DRM from?
- Who have I sent emails to in the last month?
- When was the last time I ate Burmese food?
- Where was I when I took a picture of a truck?
- Which calendar event did this page of notes come from?
- Which calendar event was I at when I wrote this org-mode entry?

* Architecture

This thing is pretty simple right now; it's a simple backend that takes objects and shoves them
into two stores.

#+BEGIN_SRC dot :file images/arch.png
digraph {
  client -> backend
  backend -> elasticsearch
  backend -> postgres
  ingester -> backend
}
#+END_SRC

#+results:
[[file:images/arch.png]]

I am not sure whether I want to have this running PostgreSQL; that feels very much like
overkill. The biggest unknown is being able to do recursive graph queries with ElasticSearch, but
I'm not even sure if that is a use-case I need. In the short/medium term, I imagine I will stick with
just using ElasticSearch, and save the blobs to somewhere I can backfill them from, probably an S3
bucket.
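
If I do go the S3 route, the split would look roughly like the sketch below: the blob lands in the
bucket, and ElasticSearch only gets a fact that points at it. The bucket name, endpoint, field
names, and helper are all assumptions made for illustration.

#+BEGIN_SRC python
# Sketch of the "blob store + fact index" split: the file goes to an S3 bucket,
# and ElasticSearch only stores a fact pointing at it.
import json
import os
from datetime import datetime, timezone

import boto3     # pip install boto3
import requests  # pip install requests

ES = "http://localhost:9200"   # assumed ElasticSearch endpoint
BUCKET = "memex-blobs"         # assumed S3 bucket

def store_blob_fact(path, index="file"):
    key = os.path.basename(path)
    boto3.client("s3").upload_file(path, BUCKET, key)   # blob to the backfill store

    fact = {
        "uri": "s3://{}/{}".format(BUCKET, key),
        "name": key,
        "captured-at": datetime.now(timezone.utc).isoformat(),
    }
    resp = requests.post("{}/{}/fact/".format(ES, index),
                         data=json.dumps(fact),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return "fact://{}/{}".format(index, resp.json()["_id"])
#+END_SRC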

** Fact Types

Facts can be roughly defined as:
- the time, date, and location at which an Event takes place
- the location of a digital artifact, along with metadata to identify the artifact
- the location of a physical artifact, along with metadata surrounding the artifact

- =object= is a bit obtuse, but is a physical "thing".
- =event= is a time period: two datetimes with metadata attached to them.
- =location= is probably too abstract; it feels like an attribute attached to other facts, but there
  might be value in presenting it directly.
- =statement= is a sentence or paragraph of written text, recorded voice, or thoughts; linked
  statements form a conversation or a train of thought, depending on the source.
- =page= is either an OCR'd document or the text of an HTML page.
- =photo= is, well, a photo: a JPEG or PNG with metadata attached.
- =file= is any other file, with whatever metadata can be extracted from it.
- =person= is self-explanatory. Basically looking at a vCard, and arbitrary key-values otherwise.