arcology-elixir/arcology_page.org

#+TITLE: Arcology Page Module
#+ROAM_TAGS: Project Arcology
#+ROAM_ALIAS: Arcology.Page
#+ARCOLOGY_KEY: arcology/page

The [[file:arcology_roam.org][Arcology Roam Models]] provide Ecto support for the [[file:arcology_db.org][arcology-db]], including =Arcology.Roam.File=, which provides associations to the full data model but is not *designed* to be programmed against. Since these are read-only concerns, a "smarter" structure can be built without having to worry about moving between the facile interface and the data models. This structure exposes the relations in easy-to-use forms: the titles are a list of strings instead of [[file:arcology_roam.org][Arcology.Roam.Titles]], y'know, stuff like that. I do wonder how to implement some sort of "proxy" interface like this which can create better views of data while (somehow) supporting write-backs.

#+begin_src elixir :tangle lib/arcology/page.ex :noweb yes
defmodule Arcology.Page do
  alias Arcology.Roam.{Keyword, Reference, Tag, Title}
  require Logger

  defstruct [
    :file, :file_path, :route, :key,
    :keywords, :backlinks, :reference, :tags, :titles,
    :html, :html_status, :backlinks_html, :backlinks_status
  ]

  <<page_from_file>>

  <<page_resolve_route>>
  <<page_resolve_path>>

  <<page_html>>
  <<page_rewrite_local>>
end
#+end_src

* Creating =Page= Structures

Getting an =Arcology.Page= from an [[file:arcology_roam.org][Arcology.Roam.File]] is done with =Arcology.Page.from_file/1=. I tried as much as possible to use the entities already preloaded on the =File= object; backlinks are loaded by a second query in =[[file:arcology_roam.org][Arcology.Roam.Link]].files/1=. It would be really nice to make this function accept a =File= object, but adding features to this is multiplicative until I re-implement this in a way that can be extended more easily.

#+begin_src elixir :noweb-ref page_from_file
@doc "return struct from Arcology.Roam.File"
def from_file(%Arcology.Roam.File{} = file) do
  # file_name = File.get_name(file)
  preloaded = file |> Arcology.Roam.File.preloads()
  %Arcology.Page{
    file: preloaded,
    key: Keyword.from_file(preloaded, "ARCOLOGY_KEY"),
    keywords: Keyword.from_file(preloaded),
    backlinks: Arcology.Roam.Link.files(to: preloaded),
    reference: preloaded.reference |> Reference.to_map,
    tags: preloaded.tags |> Tag.process_tags_sexp,
    # tags: preloaded.tags |> Tag.merge_tags |> Map.get(file_name),
    titles: preloaded.titles |> Title.to_list,
  }
  |> Arcology.Page.resolve_path
  |> Arcology.Page.resolve_route
end
#+end_src

These functions break an =ARCOLOGY_KEY= into its site and page constituents. This is used by the page router.

#+begin_src elixir :noweb-ref page_resolve_route
def resolve_route(%Arcology.Page{} = page) do
  %Arcology.Page{ page | route: split_route(page.key) }
end

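# A page without an ARCOLOGY_KEY has no route.
def split_route(nil), do: []

# "site/some/path" becomes [site: "site", path: "some/path"]; a key without
# a "/" fails the match below and raises a MatchError.
def split_route(key) do
  [site, path] = String.split(key, "/", parts: 2)
  [site: site, path: path]
end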
#+end_src
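
The splitting behaviour can be tried on its own. This is a minimal sketch, not the module's code: =RouteSketch= is a made-up name, and the clause for a key without a slash is my assumption about what a forgiving version might do.

#+begin_src elixir
defmodule RouteSketch do
  # Hypothetical stand-in for Arcology.Page.split_route/1, for illustration only.
  def split_route(nil), do: []

  def split_route(key) when is_binary(key) do
    case String.split(key, "/", parts: 2) do
      [site, path] -> [site: site, path: path]
      # Assumption: a key with no "/" names a site with an empty path.
      [site] -> [site: site, path: ""]
    end
  end
end

IO.inspect(RouteSketch.split_route("arcology/page"))
# => [site: "arcology", path: "page"]
IO.inspect(RouteSketch.split_route("arcology/notes/elixir"))
# => [site: "arcology", path: "notes/elixir"]
#+end_src

Because of =parts: 2=, only the first slash separates site from path, so nested paths survive intact.
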
=resolve_path/1= sticks the full file path into a =Page=, based on the dynamic configuration entry in [[file:phoenix.org][Arcology Phoenix]].

#+begin_src elixir :noweb-ref page_resolve_path
def resolve_path(%Arcology.Page{} = page) do
  %Arcology.Page{page | file_path: Arcology.Roam.File.get_name(page.file)}
end
#+end_src

Tests for the Arcology Pages use this page itself, and probably need to be updated when the syntax or structure of the project changes. That's fine. I'm going to sort of "paper over" the [[file:arcology_roam.org][Arcology Roam Models]], I think, and focus on testing this module, and the other things which use the Ecto models. At the end of the day, the Ecto code is mostly automatically generated and boiler-plate. Obviously, I need to test more than the "loading to Page model" code-paths, but I think I will do those in the Phoenix layers rather than the Ecto layers. The pages in this project *are* the test suite in the same way the code is, but the tests also use the metadata of the pages themselves, so they're going to be really sensitive to the overall project architecture, and this will probably be considered a mistake later on...[fn:1]

#+begin_src elixir :tangle test/arcology/page_test.exs
defmodule ArcologyPageTestFromFile do
  use ExUnit.Case

  setup do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(Arcology.Repo)
  end

  test "from_file simple loading" do
    this_page =
      "arcology_page.org"
      |> Path.expand()
      |> Arcology.Roam.File.get()
      |> Arcology.Page.from_file()

    assert this_page.key == "arcology/page"
    assert length(this_page.keywords) == 1
    assert this_page.tags |> Enum.at(0) == "Arcology"
    assert length(this_page.backlinks) == 5
    assert length(this_page.titles) == 2

    assert this_page.route[:site] == "arcology"
    assert this_page.route[:path] == "page"

    assert this_page.file_path == "/home/rrix/org/arcology/arcology_page.org"

    file = this_page.file |> Arcology.Roam.File.preloads()
    assert length(file.links_from) == 13
  end
end
#+end_src

* HTML Rendering

Here is where I start to hit a problem: I need to build a URL rewriter, and an interface for it. Pandoc does not rewrite URLs when it compiles the documents, and I chose to do this myself in Elixir in spite of Elixir's bad reputation for string processing. It's happening in regular expressions and using IO Lists under the hood, so it's not unreasonable to do it this way.[fn:2]

I'm not really sure this should live here as opposed to in a view module or something. It seems weird to be rendering HTML so far from the edge, but it's "kind of" by design, after all. I guess I could push it all the way out to the edge and use Phoenix's view template caching instead, but golly.

But I'm still at an impasse on how to structure all of this -- I need a router that lives in its own module, most likely, so that I can encapsulate the complexity involved in having subdomain-based routing in production, and not with local development. I guess I could fake it with DNS and only have a single router, relying on my VPN DNS for development resolution ... probably bad ideas, but largely feasible!

I still have only built the "local rewriter" in the existing MVP, which I pull in here.

This code uses [[https://github.com/melpon/memoize][memoize]] to cache the result of a [[https://github.com/marcelotto/panpipe/][Panpipe]] call because it's a fairly expensive process involving an external Linux process, dark arts, lazy evaluation, and a functional Pandoc installation. =Arcology.Page.resolve_html= returns a Page with the page HTML included, memoizing the HTML output with the [[file:arcology_roam.org][Arcology.Roam.File]] hash for invalidation. =Arcology.Page.resolve_backlinks_html= returns a Page with the page's backlinks, but does *not* do proper cache invalidation right now; it needs to cache the hash of all the files included in the backlink template, not the hash of the "target" file.

#+begin_src elixir :noweb-ref page_html
use Memoize

defmemo compiled_html(path, _hash) do
  Panpipe.pandoc(
    input: path,
    to: :html,
    from: :org
  )
end

def resolve_html(%Arcology.Page{} = page) do
  page = pre_process_page_for_pandoc(page)
  case res = compiled_html(page.file_path, Arcology.Roam.File.get_hash(page.file)) do
    {:ok, html} -> %Arcology.Page{page | html_status: :raw, html: html}
    {:error, _} -> res
  end
end

# not sure entirely what the args should be here yet.
defmemo compiled_backlinks(links, _hash) do
  preloaded = links |> Arcology.Repo.preload(from_file: :titles)
  content =
    for link <- preloaded do
      path = Arcology.Roam.File.get_name(link.from_file)
      title = Enum.at(link.from_file.titles, 0).title
      content = Arcology.Roam.Link.get_content(link)

      """
      ,*** in [[file:#{path}][#{title}]]
      ,#+begin_quote
      #{content}
      ,#+end_quote
      """
    end
    |> Enum.join("\n")

  Panpipe.pandoc(content,
    from: :org,
    to: :html,
    metadata: "pagetitle=''",
    standalone: true
  )
end

def collect_link_hashes(%Arcology.Page{backlinks: backlinks}) do
  backlinks
  # returns a file
  |> Enum.map(&Map.get(&1, :from_file))
  |> Enum.map(&Arcology.Roam.File.get_hash(&1))
  |> MapSet.new()
  |> MapSet.to_list()
end

def resolve_backlinks_html(%Arcology.Page{} = page) do
  backlink_hashes = collect_link_hashes(page)

  case res = compiled_backlinks(page.backlinks, backlink_hashes) do
    {:ok, html} -> %Arcology.Page{page | backlinks_status: :raw, backlinks_html: html}
    {:error, _} -> res
  end
end
#+end_src
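
The hash argument is ignored by the function body but still takes part in the cache key, which is what gives the invalidation. Here is a minimal sketch of that idea without the memoize library: =MemoSketch= and its =render= callback are hypothetical names, and a plain =Agent= stands in for the real cache.

#+begin_src elixir
defmodule MemoSketch do
  # Hypothetical stand-in for `defmemo compiled_html(path, _hash)`: results are
  # stored under the full argument tuple, so a new hash means a new cache entry.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # `render` stands in for the expensive Pandoc call.
  def compiled(path, hash, render) when is_function(render, 1) do
    key = {path, hash}

    case Agent.get(__MODULE__, &Map.fetch(&1, key)) do
      {:ok, cached} ->
        cached

      :error ->
        result = render.(path)
        Agent.update(__MODULE__, &Map.put(&1, key, result))
        result
    end
  end
end

{:ok, _pid} = MemoSketch.start_link()
calls = :counters.new(1, [])

render = fn path ->
  :counters.add(calls, 1, 1)
  {:ok, "<p>rendered " <> path <> "</p>"}
end

MemoSketch.compiled("page.org", "hash-1", render)
MemoSketch.compiled("page.org", "hash-1", render)
MemoSketch.compiled("page.org", "hash-2", render)

IO.inspect(:counters.get(calls, 1))
# => 2
#+end_src

The second call is a cache hit and the third renders again because the hash changed. Note that this sketch never evicts the stale =hash-1= entry, and it caches whatever =render= returns, errors included.
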
=pre_process_page_for_pandoc/1= is a last-ditch effort to make changes to the Org source before rendering in Pandoc. It just calls into the =clean_up_org_fc/1= function, which tries to make my [[file:../spaced_repetition_study.org][SRS]] cards legible.

#+begin_src elixir :noweb-ref page_html
@doc "this works by returning a modified Page with a new file_path!"
def pre_process_page_for_pandoc(%Arcology.Page{} = page) do
  tmp_file_name = "/tmp/arcology-" <> (:crypto.hash(:sha256, page.key) |> Base.url_encode64()) <> ".org"
  File.open(page.file_path, [:read], fn file ->
    org_string = IO.read(file, :all) |> clean_up_org_fc()
    File.open(tmp_file_name, [:write], fn tmpfile ->
      IO.write(tmpfile, org_string)
    end)
  end)
  %Arcology.Page{page | file_path: tmp_file_name}
end
#+end_src
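
The temp file name is derived from the page key, so the same page always lands on the same path. A minimal sketch of that naming scheme follows, under my own assumptions: =TmpNameSketch= is a hypothetical module, it uses =System.tmp_dir!/0= rather than a hard-coded =/tmp=, and it uses the raising =File.read!/1= and =File.write!/2= so that failures are loud.

#+begin_src elixir
defmodule TmpNameSketch do
  # Hypothetical helper mirroring the naming scheme in
  # pre_process_page_for_pandoc/1, rooted at the system temp directory.
  def tmp_path(key) when is_binary(key) do
    # url_encode64 avoids "/" in the digest, which would break the file name.
    digest = :crypto.hash(:sha256, key) |> Base.url_encode64()
    Path.join(System.tmp_dir!(), "arcology-" <> digest <> ".org")
  end

  # Copy `source` to the temp path, passing the text through `transform`.
  def copy_transformed(source, key, transform) when is_function(transform, 1) do
    target = tmp_path(key)
    File.write!(target, source |> File.read!() |> transform.())
    target
  end
end

source = Path.join(System.tmp_dir!(), "arcology-sketch-input.org")
File.write!(source, "some org text\n")

target = TmpNameSketch.copy_transformed(source, "arcology/page", &String.upcase/1)
IO.inspect(File.read!(target))
# => "SOME ORG TEXT\n"
#+end_src

=String.upcase/1= is only a placeholder transform here; in the module above that role is played by =clean_up_org_fc/1=.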

The thing I am most interested in checking here is cache eviction, and that's gonna be a fucking pain in the ass, I guess. This implicitly tests the [[file:arcology_db.org][arcology-db]] codepaths that generate hashes, too. I'm sure that using =System.cmd= in tests to shell out to a fucking shell pipeline is a pattern that is fraught with chaos, but for now it'll do. As long as I'm reaching for the system hashing library, I might as well reach for a string-processing wrench while I'm in there! I do like these sorts of functional "reach in to the system" tests, validating against the actual state of the files on-disk wherever possible. Ultimately, the cost of adding these utilities to every development environment is not worth spending a lot of time worrying about.

#+begin_src elixir :tangle test/arcology/page_test.exs
defmodule ArcologyPageTestPandocCompiler do
  use ExUnit.Case

  setup do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(Arcology.Repo)
  end

  test "collect_link_hashes returns reasonable data" do
    this_page =
      "arcology_page.org"
      |> Path.expand()
      |> Arcology.Roam.File.get()
      |> Arcology.Page.from_file()

    hashes = Arcology.Page.collect_link_hashes(this_page)
    cmd = ~s(git ls-files | grep 'org$' | xargs sha1sum | awk '{print $1}')
    {cmd_out, 0} = System.cmd("bash", ["-c", cmd])
    system_hashes = cmd_out |> String.split("\n")

    assert Enum.all?(hashes, fn hash -> Enum.member?(system_hashes, hash) end)
  end
end
#+end_src

Now this provides the basic HTML -- it doesn't have the "smart" links in it, the ones based on =ARCOLOGY_KEY= keywords in the document rather than local file paths. The key is a string of the format =site/path=, where =site= is one of a number of simple mnemonics I use, which map one-to-one with domains I own.

=localize_urls/1= implements a simple state machine around =Page= objects: there is an =html_status= key in the =Page= which tracks whether the links have already been localized. The =is_binary= implementation of =localize_urls= runs a regexp search and replace, calling into =rewrite_local/2=, which does the actual rewrite. Right now this has to do a database query, but I intend to rewrite it to not need that, perhaps by passing in all of the keywords pulled from the database at once: a single query that can be cached between the possibly numerous calls to =rewrite_local/2=.

=rewrite_local/2= is the point where I will eventually swap in a "production" URL generator. This code creates a domain-less absolute URL of the form =/${ARCOLOGY_KEY}.html=; in the production case it'll be =${DOMAIN_FOR_KEY}/${path}.html=, where the domain comes from the one-to-one mapping mentioned above for the =site= part of the arcology key, and the rest is the =path= portion. I add =html= suffixes but I may not in the future; this is largely an aesthetic choice. Oh, and if there is not an =ARCOLOGY_KEY= for a linked org-mode file, a stub link is generated with a CSS class on it.
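
To make the two URL shapes concrete, here is a rough sketch of what a local and a "production" generator could look like. Everything in it is an assumption for illustration: the =UrlSketch= module, its function names, and especially the domain table, which uses placeholder =example.com= hosts rather than any real mapping.

#+begin_src elixir
defmodule UrlSketch do
  # Hypothetical site-to-domain table; the real mapping is not shown in this page.
  @domains %{
    "arcology" => "https://arcology.example.com",
    "garden" => "https://garden.example.com"
  }

  # Local development form: /${ARCOLOGY_KEY}.html
  def local_url(key) when is_binary(key), do: "/" <> key <> ".html"

  # Production form: ${DOMAIN_FOR_KEY}/${path}.html
  def production_url(key) when is_binary(key) do
    with [site, path] <- String.split(key, "/", parts: 2),
         {:ok, domain} <- Map.fetch(@domains, site) do
      {:ok, domain <> "/" <> path <> ".html"}
    else
      # Either the key has no "/" or the site has no known domain.
      _ -> :error
    end
  end
end

IO.inspect(UrlSketch.local_url("arcology/page"))
# => "/arcology/page.html"
IO.inspect(UrlSketch.production_url("arcology/page"))
# => {:ok, "https://arcology.example.com/page.html"}
IO.inspect(UrlSketch.production_url("unknown/page"))
# => :error
#+end_src

Returning =:error= for an unmapped site leaves room for the caller to emit the stub link with a CSS class.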

#+begin_src elixir :noweb-ref page_rewrite_local
def with_localized_html(%Arcology.Page{html_status: nil} = page), do: localize_urls(page)

@doc "This function changes the pandoc-output URLs in to site/key URLs for local wiki"
def localize_urls(html, relative_to) when is_binary(html) do
  Logger.debug("string")

  Arcology.Page.expand_link_paths(html, relative_to)
  |> Arcology.LinkRouter.Local.normalize_urls()
  |> Arcology.Page.clean_up_org_fc()
end

def localize_urls(%Arcology.Page{html_status: nil} = page) do
  Logger.debug("nil")
  page
  |> resolve_html
  |> resolve_backlinks_html
  |> localize_urls
end

def localize_urls(%Arcology.Page{html_status: :localized, backlinks_status: :localized} = page) do
  Logger.debug("pass")
  page
end

def localize_urls(%Arcology.Page{html: html, html_status: :raw} = page) when is_binary(html) do
  Logger.debug("localized")
  %Arcology.Page{page |
    html: html |> localize_urls(page.file_path),
    html_status: :localized,
    backlinks_html: page.backlinks_html |> localize_urls(page.file_path),
    backlinks_status: :localized
  }
end
#+end_src

=expand_link_paths/2= is responsible for re-writing links from relative file URIs to absolute paths for the link rewriter. The use of the sigil strings is a bit unfortunate; I chose to go this way because escaping quote marks is somehow less aesthetically pleasing to me. Sorry.

This works, for the most part; right now, the backlinks HTML can sometimes render incorrectly, where links in the content will not resolve properly. This is fine: you can click the title to click through, and then the links work. When this code works, it should be moved into the =memoize= calls for =resolve_html= and =resolve_backlinks_html= defined above.

#+begin_src elixir :noweb-ref page_rewrite_local
def expand_link_paths(html, relative_path) do
  Regex.replace(
    ~r/<a href="([~\.0-9a-zA-Z_\- \/]+\.org)">/,
    html,
    fn _match, path ->
      expanded_path =
        Path.expand(
          path,
          Path.dirname(relative_path)
        )
      ~s(<a href=") <> expanded_path <> ~s(">)
    end
  )
end
#+end_src

#+begin_src elixir :tangle test/arcology/page_test.exs
defmodule TestExpandLinkPaths do
  use ExUnit.Case

  setup do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(Arcology.Repo)
  end

  test "best-case expand_link_paths validations" do
    html = """
    <a href="bingus.org">bingus</a>
    <a href="../bangus.org">bangus</a>
    <a href="../beep/bongus.org">bongus</a>
    """

    relative_path = "/home/wonka/factory/"

    expanded = Arcology.Page.expand_link_paths(html, relative_path)

    assert expanded =~ "/home/wonka/factory/bingus.org"
    assert expanded =~ "/home/wonka/bangus.org"
    assert expanded =~ "/home/wonka/beep/bongus.org"
  end
end
#+end_src
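
The test passes a directory with a trailing slash, while =localize_urls= passes a file path; both work because of how =Path.dirname/1= treats them. A small sketch of the standard-library behaviour involved, using the same made-up paths as the test (the =page.org= file name is my own addition):

#+begin_src elixir
# A page's file path: dirname drops the file name.
page_path = "/home/wonka/factory/page.org"
base = Path.dirname(page_path)
IO.inspect(base)
# => "/home/wonka/factory"

# Relative link targets are resolved against that directory.
IO.inspect(Path.expand("bingus.org", base))
# => "/home/wonka/factory/bingus.org"
IO.inspect(Path.expand("../bangus.org", base))
# => "/home/wonka/bangus.org"

# A trailing slash behaves like a file inside that directory.
IO.inspect(Path.dirname("/home/wonka/factory/"))
# => "/home/wonka/factory"
#+end_src
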
=clean_up_org_fc/1= takes an org-mode input and removes all the [[file:../spaced_repetition_study.org][SRS]] metadata from it. [[file:../spaced_repetition_study.org][org-fc]] uses a specialized markup for "clozing" parts of the text for quizzing, and stores metadata in a drawer under the entry, which Pandoc renders by default. It would be nice to do something fancy with the clozes, but for now I just want to make it legible. This is a pretty awful soup of escapes and regular expressions though. The =@@html= syntax is used by Org to escape the HTML[fn:3].

#+begin_src elixir :noweb-ref page_rewrite_local
def clean_up_org_fc(input_org) do
  without_drawers = Regex.replace(
    ~r/:REVIEW_DATA:.*:END:/smU,
    input_org,
    &normalize_individual_org_fc(&1, &2)
  )

  without_clozes = Regex.replace(
    ~r/{{([^}]+)}({.*})?@([0-9])}/uU,
    without_drawers,
    &normalize_cloze(&1, &2, &3, &4)
  )

  without_clozes
end

def normalize_individual_org_fc(_match, _capture), do: ""

def normalize_cloze(_match, first, optional_hint, position) do
  ~s(@@html:<span class="cloze" data-cloze="#{position}" title="#{optional_hint}">#{first}</span>@@)
end
#+end_src
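
To see what the two substitutions do, here is a small standalone sketch run against a made-up card. =ClozeSketch= is a hypothetical copy for illustration, not the tangled code: it escapes the literal braces in the cloze pattern, quotes the =data-cloze= attribute, and the sample drawer contents are invented.

#+begin_src elixir
defmodule ClozeSketch do
  # Drop the org-fc review drawer entirely (lazy match across lines).
  def strip_drawers(org) do
    Regex.replace(~r/:REVIEW_DATA:.*:END:/smU, org, "")
  end

  # Turn {{text}{hint}@N} or {{text}@N} into an Org HTML export snippet.
  def mark_clozes(org) do
    Regex.replace(
      ~r/\{\{([^}]+)\}(\{.*\})?@([0-9])\}/uU,
      org,
      fn _match, text, hint, position ->
        ~s(@@html:<span class="cloze" data-cloze="#{position}" title="#{hint}">#{text}</span>@@)
      end
    )
  end
end

# Made-up card text; the drawer row is a placeholder, not real review data.
card = "The capital of France is {{Paris}{city}@0}.\n:REVIEW_DATA:\n| position | ease |\n:END:\n"

card
|> ClozeSketch.strip_drawers()
|> ClozeSketch.mark_clozes()
|> IO.puts()
#+end_src

The hint capture keeps its surrounding braces, so the title attribute for this card comes out as ={city}=; when a cloze has no hint, the optional group is unmatched and the title is empty.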

** NEXT paragraph anchors within text bodies

NAME keywords may do this, but make sure.

* Footnotes

[fn:1] [[file:../open_threads.org][open thread]] on whether this idea of [[file:../cce/literate_programming.org][Literate Programming]] meta-programming is good or not. It might defeat the purpose, making the tests really brittle and making me unwilling to move code around or re-structure the doc to be more accessible.

[fn:2] https://www.evanmiller.org/elixir-ram-and-the-template-of-doom.html

[fn:3] https://orgmode.org/manual/Quoting-HTML-tags.html