Ryan Rix 6 years ago
parent 155f2e8e94
commit 74c435112b

Binary file not shown.


@ -0,0 +1,67 @@
#+TITLE: Melchior: The First Wiseman
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>
* Extracting Information from Data
Melchior is an external memory, a search engine for the facts surrounding your life. You deploy an
instance of Melchior, wire up a set of ingester agents to pull information from your surroundings,
and then use Melchior to recall information, organize it, and present it for yourself and people
around you.
For now, Melchior doesn't have actual releases; it is pre-alpha software, and the data model,
frontend, and backend can change at the whim of the author. If, in spite of these warnings, you want
to play with it, you can get it from [[https://home.rix.si/git/melchior/melchior][the author's Gogs instance]].
The README document is [[./technical.org][here]]. Melchior is named after the first of the Magi, the biblical wise men,
and also after one of the computers controlling Tokyo-3 in the documentary Neon Genesis Evangelion.
** Use Cases
- [X] What page was I looking at that had a reference to Melchior?
- [ ] What calendar events do I have today?
- [ ] What photos did I take while I was in San Diego?
- [ ] What restaurant did I go to the last time I was in Tempe?
- [ ] Who was I talking to about Ear Plugs on Twitter?
- [ ] Where did I store my USB mouse?
- [ ] Which e-books do I need to strip the DRM from?
- [ ] Who have I sent emails to in the last month?
- [ ] When was the last time I ate Burmese food?
- [ ] Where was I when I took a picture of a truck?
- [ ] Which calendar event did this page of notes come from?
- [ ] Which calendar event was I at when I wrote this org-mode entry?
* Rationale
I am a forgetful person. I also, at any given time, have a pile of projects, plans, and mostly
unformed thoughts in my head that get pushed out fairly easily. This suite of software is designed
to provide a simple way to pull those things out of my brain and represent them in a way that I can
use and find and develop upon.
In the past, and until this suite is useful enough to preempt it, I have been a heavy user of
Emacs's Org-mode software, an incredibly powerful outline and journaling system. However, as I try
to fit more and more information into the system, using simple scripts and ingestion systems like
[[https://github.com/novoid/Memacs][Memacs]], Emacs grinds to a halt and querying the system becomes essentially impossible to
maintain. Moving more and more of this style of query out of Emacs and into a dedicated store is
an attempt to mold my system even more towards how my brain works.
I don't plan on ever hosting "shared" instances of Melchior, nor has the software been designed to make
this simple. I don't want your data, I don't want you to give me or others your personal lives. Any
plan for "monetizing" will either come through sponsorship of feature development, or through
efforts to make it simple to self-host Melchior. As such, I've chosen to license the code under a
strong copyleft license, the GNU Affero General Public License, which requires anyone hosting this
software for others to use to provide the source code to people they are hosting it for. Facilities
to enable this easily will be built in to the code at some point in the future, but do keep that in
mind if you decide to host a copy of Melchior for others.
* Legal Biz
Melchior is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
General Public License as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
Melchior is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General
Public License for more details.
You should have received a copy of the GNU Affero General Public License along with Melchior. If not,
see <http://www.gnu.org/licenses/>.

@ -1,62 +0,0 @@
#+TITLE: Memex
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>
* Memex: Extracting Information from Data
Memex is my external memory, a search engine for the facts surrounding my life.
** Use Cases
- What calendar events do I have today
- What photos did I take while I was in San Diego?
- What restaurant did I go to the last time I was in Tempe?
- Who was I talking to about Ear Plugs on Twitter?
- What page was I looking at that had
- Where did I store my USB mouse?
- Which e-books do I need to strip the DRM from?
- Who have I sent emails to in the last month?
- When was the last time I ate Burmese food
- Where was I when I took a picture of a truck
- Which calendar event did this page of notes come from
- Which calendar event was I at when I wrote this org-mode entry
* Architecture
This thing is pretty simple, right now; it's a simple backend that takes objects and shoves them in
to two stores.
#+BEGIN_SRC dot :file images/arch.png
digraph {
client -> backend
backend -> elasticsearch
backend -> postgres
ingester -> backend
}
#+END_SRC
#+results:
[[file:images/arch.png]]
I am not sure whether I want to have this running PostgreSQL; that feels very much like
overkill. The biggest unknown is being able to do recursive graph queries with Elasticsearch, but
I'm not even sure if that is a use case I need. In the short/medium term, I imagine I will stick with
just using Elasticsearch, and save the blobs to somewhere I can backfill them from, probably an S3
bucket.
** Fact Types
Facts can be roughly defined as:
- time, date and location in which an Event takes place
- location of a digital artifact along with metadata to identify the artifact
- location of a physical artifact along with metadata surrounding the artifact
- =object= is a bit obtuse, but is a physical "thing"
- =event= is a time period, two datetimes with metadata attached to them.
- =location= is probably too abstract; it feels like an attribute attached to other facts, but there
might be value in presenting it directly.
- =statement= is a sentence or paragraph of either written text, recorded voice, or thoughts; linked
statements form a conversation or train of thought depending on the source.
- =page= is either an OCRd document, or the text of an HTML page.
- =photo= is, well, a photo. A jpeg or png with metadata attached.
- =file= is any other file, with whatever metadata can be extracted from them.
- =person= is self-explanatory. Basically looking at a vcard and arbitrary key-values otherwise.

@ -0,0 +1,136 @@
#+TITLE: Melchior: Technical Documentation
#+AUTHOR: Ryan Rix <ryan@whatthefuck.computer>
* Usage
TODO Write me.
* Backend Architecture
This thing is pretty simple right now: it's a backend that takes objects and shoves them into
PostgreSQL. The backend is written in Python with Flask as a minimum viable prototype, soon to be
rewritten in Clojure.
#+BEGIN_SRC dot :file images/arch.png
digraph {
client -> backend
backend -> postgres
ingester -> backend
}
#+END_SRC
#+results:
[[file:images/arch.png]]
Postgres was chosen as the data store due to its combination of interesting data types, reliability,
and full-text search APIs.
The Data Model is ... sort of silly. There's a =raw_entities= table with minimal denormalized
columns, and a jsonb column with the raw data in it. I'm trying to be careful about putting indices
on this table; instead, interfaces should be built on [[https://www.postgresql.org/docs/9.5/static/sql-creatematerializedview.html][materialized views]] that depend on the type of
data being pulled out.
Individual instances in the =raw_entities= table are referred to as Facts within documentation,
code, and interfaces. Users are expected to publish the truth, and verifying this is left as an
exercise to the reader.
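As a rough illustration of the shape described above, here is a minimal Python sketch of one =raw_entities= row and an in-memory stand-in for a per-type materialized view. The field names (=fact_type=, =created_at=, =data=) and the =page_view= helper are assumptions made for this example; the actual schema is not shown in this document.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class RawEntity:
    """One row of raw_entities, i.e. one Fact. Field names are hypothetical."""

    fact_type: str        # assumed denormalized column, e.g. "page"
    created_at: datetime  # assumed denormalized column
    data: Dict[str, Any] = field(default_factory=dict)  # stands in for the jsonb column

    def to_row(self) -> tuple:
        """Flatten into values an INSERT could bind; the jsonb value goes as JSON text."""
        return (
            self.fact_type,
            self.created_at.isoformat(),
            json.dumps(self.data, sort_keys=True),
        )


def page_view(entities: List[RawEntity]) -> List[Dict[str, Any]]:
    """In-memory stand-in for a materialized view over facts of type "page"."""
    return [
        {
            "url": e.data.get("url"),
            "title": e.data.get("title"),
            "created_at": e.created_at,
        }
        for e in entities
        if e.fact_type == "page"
    ]
```

The point of the split is that the raw row stays schemaless, while each view pulls out only the keys that one interface cares about.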
** Fact Types
Facts can be roughly defined as:
- time, date and location in which an Event takes place
- location of a digital artifact along with metadata to identify the artifact
- location of a physical artifact along with metadata surrounding the artifact
- =object= is a bit obtuse, but is a physical "thing"
- =event= is a time period, two datetimes with metadata attached to them.
- =location= is probably too abstract; it feels like an attribute attached to other facts, but there
might be value in presenting it directly.
- =statement= is a sentence or paragraph of either written text, recorded voice, or thoughts; linked
statements form a conversation or train of thought depending on the source.
- =page= is either an OCR'd document, or the text of an HTML page.
- =photo= is, well, a photo. A JPEG or PNG with metadata attached.
- =file= is any other file, with whatever metadata can be extracted from it.
- =person= is self-explanatory. Basically looking at a vcard and arbitrary key-values otherwise.
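The type names listed above could be captured as an enumeration, with =event= modelled as two datetimes plus metadata. This is only a sketch: the =FactType= and =Event= names and the =contains= helper are invented for illustration and are not taken from the Melchior code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any, Dict


class FactType(Enum):
    """The fact types named in the list above."""

    OBJECT = "object"
    EVENT = "event"
    LOCATION = "location"
    STATEMENT = "statement"
    PAGE = "page"
    PHOTO = "photo"
    FILE = "file"
    PERSON = "person"


@dataclass
class Event:
    """A time period: two datetimes with metadata attached to them."""

    start: datetime
    end: datetime
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject periods that run backwards.
        if self.end < self.start:
            raise ValueError("event ends before it starts")

    def contains(self, moment: datetime) -> bool:
        """True if moment falls inside the event, inclusive on both ends."""
        return self.start <= moment <= self.end
```

A check like =contains= is the kind of primitive a query such as "which calendar event was I at when I wrote this entry" would need.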
* Frontends
** Ingesters
Ingesters could come in many forms. The simplest is probably =memex.ingesters.simple=, which we can
use as an example for future designs.
Current ingesters:
- Simple URL importer
- Qutebrowser web history
Others I want to build:
- RSS feeds
- Custom CSV/Line importers with a simple DSL:
  - Process GPS logs from phone's Tasker module or custom application
- Walk filesystem, pull metadata from the files
  - MP3 tags
  - EXIF tags
- Index Maildir files
- Chat Logs
- Events
  - Org-mode
  - =.ics= files
- Contacts
- Tweets
- Other web browser histories
Ingesters should be idempotent, so that they can be run repeatedly by a scheduler such as cron or
systemd timers without creating duplicate facts.
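One way to get idempotency is to key each fact on a hash of its content, so that re-running an ingester over the same input changes nothing. The sketch below assumes this approach and uses a plain dict in place of the backend; =fact_id= and =ingest= are hypothetical names, not part of the existing ingesters.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def fact_id(fact: Dict[str, Any]) -> str:
    """Derive a stable ID from the fact's content, independent of key order."""
    canonical = json.dumps(fact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def ingest(store: Dict[str, Dict[str, Any]], facts: Iterable[Dict[str, Any]]) -> int:
    """Upsert facts into store keyed by content hash; return how many were new."""
    added = 0
    for fact in facts:
        key = fact_id(fact)
        if key not in store:
            added += 1
        store[key] = fact  # rewriting identical content leaves the store unchanged
    return added
```

Running =ingest= twice over the same browser history adds the facts once and reports zero new facts the second time, which is what makes it safe to schedule.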
Of course, users will also be able to use a read-write query interface to input facts themselves,
turning this into a general-purpose information storage, search and retrieval system.
** Query Frontends
For now, I'm only going to build/support three frontends:
- Command line search tool
- Simple web-frontend that could be wrapped in Apache Cordova
- Read-only publishing frontend
The read-only publishing frontend makes Facts which the user has marked with certain configurable
tags available as a read-only interface, designed to publish subsets of Photos, GPS logs (run/bike
times, "check-ins"), Documents (blog posts, short-form updates), and entire strings of Facts (users
could publish a long-form research post, along with snapshots of the resources they collected in
writing it), as HTML, RSS, etc.
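The tag-based selection described above can be sketched as a simple filter. This assumes each Fact carries a =tags= list and that the set of public tags comes from configuration; both are guesses at an interface that is not specified here, and =publishable= is a made-up name.

```python
from typing import Any, Dict, Iterable, List, Set


def publishable(
    facts: Iterable[Dict[str, Any]], public_tags: Set[str]
) -> List[Dict[str, Any]]:
    """Return only the facts carrying at least one tag configured as public."""
    return [f for f in facts if public_tags & set(f.get("tags", []))]
```

Facts with no tags at all are never published, so nothing leaks by default.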
Long-term I have this crazy stupid vision for a Matrix bot which will supersede these, and make them
dumb interfaces using Matrix as an RPC to a central botserver. The bot would support e2e encryption,
meaning that new devices couldn't see the query history of old devices, and it would require a
verification stage for new devices. Plain-English queries would be deconstructed into a search
dialect and the results could be sent in-line, with the raw document attached as invisible JSON
metadata, allowing rich clients to be built.
Even crazier would be allowing other Memex instances to query yours, building a decentralized
network of Memex instances. Obviously a security model will have to be put in place for this to
happen, but it'd be pretty nifty.
** Streaming Frontends
Streaming frontends work by "tailing" either the Raw Entities table or one of the materialized views
to "forward" those facts to other systems. Use cases include syndicating short-form update facts to
Twitter, Facebook, etc., uploading photos to Flickr, and doing other "post-processing" on them. This
is conceptual, and the interface for this has yet to be defined.
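Since the interface is undefined, the following is purely a thought experiment about what "tailing" might look like: remember the highest row ID already forwarded, and on each pass send only the rows above it. The monotonically increasing =id= field, the =forward_new= name and the =sink= callback are all assumptions made for this sketch.

```python
from typing import Any, Callable, Dict, List


def forward_new(
    rows: List[Dict[str, Any]],
    cursor: int,
    sink: Callable[[Dict[str, Any]], None],
) -> int:
    """Send every row with an id greater than cursor to sink; return the new cursor."""
    for row in sorted(rows, key=lambda r: r["id"]):
        if row["id"] > cursor:
            sink(row)
            cursor = row["id"]
    return cursor
```

Persisting the returned cursor between runs means a fact is forwarded to the downstream system at most once per cursor, even if the tailer is restarted.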
* Hacking
There's a test suite; run =make test= with a PostgreSQL instance running. If you're adding code, or
a module, please include tests.
* License
Memex is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
General Public License as published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.
Memex is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General
Public License for more details.
You should have received a copy of the GNU Affero General Public License along with Memex. If not,
see <http://www.gnu.org/licenses/>.