arroyo/arroyo-native-parser.org

The arroyo_rs Native Org Parser

Overview

arroyo-rs is an org-mode parser library for The Arcology Project. It exposes a simple interface to the Rust Orgize library, wrapped in pyo3 bindings to be accessible from Python. Most of the Org-mode parsers I've looked at so far lack features or parsing capabilities that Arroyo and the Arcology Project need, which would make them difficult or impossible to implement on top of those parsers.

This package exports two public functions to be used by higher level interfaces like a Django DB or SQLModel.

  • arroyo_rs.parse_file(path: str) -> Document: see The Parser and The Parser Document Types
  • arroyo_rs.htmlize_file(path: str, options: ExportOptions) -> str: the HTML exporter

    • COMING SOON arroyo_rs.htmlize_file(path: str, **kwargs) -> str: construct ExportOptions from kwargs
  • COMING SOON arroyo_rs.atomize_file(path: str, **kwargs) -> str: the Atom exporter

Package Definitions

Rust package definition

[package]
name = "arroyo-rs"
version = "0.0.1"
edition = "2021"

We have to configure the crate to spit out a C FFI library for pyo3.

[lib]
name = "arroyo_rs"
crate-type = ["cdylib", "rlib"]

I probably don't need all of these dependencies, and would like to trim them down now that I'm not using the Rust binary except to debug this.

[dependencies]
lexpr = "0.2.7"
# orgize = "0.9.0"
orgize.git = "https://code.rix.si/rrix/orgize"
orgize.features = ["syntect"]
anyhow = "1.0.75"
pyo3 = { version = "0.20.0", features = ["anyhow"] }
itertools = "0.11.0"
regex = "1.10.2"

Python package definition

Maturin is used for the build.

[project]
name = "arroyo"
version = "0.0.1"
description = "org-mode metadata extractor"
# license = "Hey Smell This"
readme = "README.org"
dependencies = ["click ~=8.1"]
requires-python = ">=3.10"
authors = [
    { name = "Ryan Rix", email = "code@whatthefuck.computer" }
]

[project.scripts]
arroyo = "arroyo.__main__:cli"

[build-system]
requires = ["maturin>=1.0,<2.0"]
build-backend = "maturin"

[tool.maturin]
module-name = "arroyo.arroyo_rs"
features = ["pyo3/extension-module"]
compatibility = "linux"

Nix package declarations

Development Shell

{ pkgs ? import <nixpkgs> {},
  python3 ? pkgs.python3
}:
let
  myPy = python3.withPackages( pp: with pp; [
    pip
    pytest
  ]);
in pkgs.mkShell {
  packages = with pkgs; [
    cargo
    rustc
    gcc

    rust-analyzer
    rustfmt
    clippy

    maturin
    myPy
    pyright
    black
  ];
  RUST_SRC_PATH = "${pkgs.rust.packages.stable.rustPlatform.rustLibSrc}";
  NIX_CONFIG = "builders =";

  shellHook = ''
    test -f venv/bin/python3 || python3 -m venv venv
    . venv/bin/activate
    maturin develop -j6
  '';
}

Python package built with maturin

{
  pkgs ? import <nixpkgs> {},
  lib  ? pkgs.lib,

  python3 ? pkgs.python3,
}:

python3.pkgs.buildPythonPackage rec {
  pname = "arroyo_rs";
  version = "0.0.1";
  format = "pyproject";

  src = ./.;

  nativeBuildInputs = with pkgs; [
    cargo rustc
  ] ++ (with pkgs.rustPlatform; [
    maturinBuildHook
    cargoSetupHook
  ]);

  propagatedBuildInputs = with pkgs; [
    python3.pkgs.click
  ];

  cargoDeps = pkgs.rustPlatform.importCargoLock {
    lockFile = ./Cargo.lock;

    outputHashes = {
      "orgize-0.9.0" = "sha256-Nn7+nQVBn2Gn0+uHvlY8NKSV/bPVEQK9sFPTzTAcDWY=";
    };
  };

  # when i `nix build' everything works; when i `nix-build' it fails
  # because it tries to install a `linux' wheel and a `manylinux'
  # wheel that conflict. no idea why since maturinBuildHook etc should
  # be disabling manylinux
  postBuild = "rm dist/*manylinux*.whl || true";

  meta = with lib; {
    description = "An Org-Mode parser library for the arcology project";
    homepage = "https://cce.whatthefuck.computer/arroyo";
    license = licenses.unfree;
    maintainers = with maintainers; [ rrix ];
  };
}

Put it all together and make it distributable with a Nix flake

Eventually this will include an app that will generate the DB, and commands which can generate the literate configurations like My NixOS configuration.

{
  description = "Arroyo Parser Exporter Library";

  inputs.nixpkgs.url = "git+ssh://gitea@last-bank:2222/rrix/nixpkgs?ref=nixos-23.11";
  inputs.flake-utils.url = "github:numtide/flake-utils";

  outputs = { self, nixpkgs, flake-utils }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = import nixpkgs {
          inherit system;
          config.allowUnfree = true;
        };

        python3 = pkgs.python3;
      in
        {
          devShells.default = pkgs.callPackage ./shell.nix { inherit python3; };

          packages = rec {
            arroyo_rs = pkgs.callPackage ./default.nix { inherit python3; };
            default = arroyo_rs;
          };

          apps = rec {
            arroyo_rs = flake-utils.lib.mkApp {
              drv = self.packages.${system}.arroyo_rs;
              exePath = "/bin/arroyo";
            };
            default = arroyo_rs;
          };
        }
    );
}

Dev support files

rustfmt configuration needs to be specified since it's not clever enough to just read the Rust edition out of the Cargo.toml defined above.

edition = "2021"

The Parser Document Types

Let's start by defining the types. We use pyo3 macro annotations so that these types are accessible and tactile in the Python interface, so it's a bit noisy, but things should be clear enough.

use pyo3::exceptions::PyException;
use pyo3::prelude::*;
use pyo3::pyclass;
use std::collections::HashMap;

use std::fmt;

pyo3::create_exception!(arroyo_rs, InvalidDocError, PyException);

The parser is a big state machine that iterates over every node in the document to spit out a tree shaped like:

  • Document

    • Keywords: list of [file, key, value]
    • Headings: list of

      • id, level, text, tags, refs, aliases from Org-roam metadata and
      • list of links containing:

        • information about the heading the link is in
        • information about the link destination
        • text of the link
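
For example, a document shaped like this (the IDs are made up for illustration):

```org
:PROPERTIES:
:ID:       deadbeef-0000
:END:
#+TITLE: An Example Page
#+FILETAGS: :Emacs:CCE:

* A Heading
:PROPERTIES:
:ID:       deadbeef-0001
:END:

Some text with a [[id:deadbeef-0002][link to another page]].
```

parses to a Document with two Keywords (TITLE and FILETAGS), a level-0 Heading carrying the file's ID, title, and filetags, and a level-1 Heading whose link list holds one id: link attributed to deadbeef-0001.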

Document

A document contains a list of headings, and a list of keywords.

#[derive(Debug, Clone, Default)]
#[pyclass(dict)]
pub struct Document {
    #[pyo3(get)]
    pub path: String,
    #[pyo3(get)]
    pub headings: Vec<Heading>,
    #[pyo3(get)]
    pub keywords: Vec<Keyword>,
}

It implements fmt::Display so that {} in formatters works well with it.

impl fmt::Display for Document {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "OrgDoc(from {} with {} headings and {} keywords)",
            self.path,
            self.headings.len(),
            self.keywords.len()
        )
    }
}

That formatter is also used in the Python __repr__ and __str__. This pattern is used in all of the types defined below as well, so we'll be less verbose later on.

#[pymethods]
impl Document {
    pub fn __repr__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Ok(slf.to_string())
    }

    pub fn __str__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Self::__repr__(slf)
    }

    pub fn collect_keywords(&self, keyword: String) -> PyResult<Vec<String>> {
        let kws: Vec<String> = self
            .keywords
            .iter()
            .filter(|kw| kw.keyword.to_uppercase() == keyword)
            .map(|kw| kw.value.clone())
            .collect();

        return Ok(kws);
    }
}
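
Note that collect_keywords uppercases the stored keyword but compares it against the argument as given, so callers are expected to pass the keyword already uppercased. A minimal sketch of the comparison rule (keyword_matches is a hypothetical helper, not part of the crate):

```rust
// Mirrors the comparison inside collect_keywords: the stored keyword
// is uppercased, the query is not, so only an uppercase query matches.
fn keyword_matches(stored: &str, query: &str) -> bool {
    stored.to_uppercase() == query
}
```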

Keyword

A Keyword is extracted from #+KEYWORD: value syntax keywords in the document. These three-value tuples of [file, keyword, value] are where metadata used by the Arroyo generators is defined.

#[derive(Debug, Clone)]
#[pyclass(dict)]
pub struct Keyword {
    #[pyo3(get)]
    pub file: String,
    #[pyo3(get)]
    pub keyword: String,
    #[pyo3(get)]
    pub value: String,
}

impl fmt::Display for Keyword {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Keyword(file={}, {}={})",
            self.file, self.keyword, self.value
        )
    }
}

#[pymethods]
impl Keyword {
    pub fn __repr__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Ok(slf.to_string())
    }

    pub fn __str__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Self::__repr__(slf)
    }
}

Heading

The fields in Heading are defined based on what is available in org-roam; these are the basic pieces of information used by the Arroyo Arcology Generator.

#[derive(Debug, Clone, Default)]
#[pyclass(dict)]
pub struct Heading {
    // note that some Headlines may not have an ID, but for purpose of
    // arcology we only care about ones with ID
    #[pyo3(get)]
    pub id: Option<String>,
    #[pyo3(get)]
    pub level: usize,
    #[pyo3(get)]
    pub text: String,
    #[pyo3(get)]
    pub properties: HashMap<String, String>,
    #[pyo3(get)]
    pub tags: Option<Vec<String>>,
    #[pyo3(get)]
    pub refs: Option<Vec<String>>,
    #[pyo3(get)]
    pub aliases: Option<Vec<String>>,
    #[pyo3(get)]
    pub attachments: Option<Vec<Attachment>>,

    #[pyo3(get)]
    pub links: Option<Vec<Link>>,
}

impl fmt::Display for Heading {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Heading(id={}, title={}, {} tags, {} refs, {} aliases, {} links, {} attachments, props {:?})",
            self.id.clone().unwrap_or("None".to_owned()),
            self.text,
            self.tags.clone().unwrap_or(vec![]).len(),
            self.refs.clone().unwrap_or(vec![]).len(),
            self.aliases.clone().unwrap_or(vec![]).len(),
            self.links.clone().unwrap_or(vec![]).len(),
            self.attachments.clone().unwrap_or(vec![]).len(),
            self.properties.clone(),
        )
    }
}

#[pymethods]
impl Heading {
    pub fn __repr__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Ok(slf.to_string())
    }

    pub fn __str__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Self::__repr__(slf)
    }
}

Link

A link knows where it is because it knows where it isn't. Just kidding, they have information about where they were and where they're going and the text inside the link.

#[derive(Debug, Clone, Default)]
#[pyclass(dict)]
pub struct Link {
    #[pyo3(get)]
    pub from_file: String,
    #[pyo3(get)]
    pub from_id: String,
    #[pyo3(get)]
    pub to: String,
    #[pyo3(get)]
    pub to_proto: Option<String>,
    #[pyo3(get)]
    pub text: Option<String>,
}

impl fmt::Display for Link {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Link({}#{} -> [[{}][{}]])",
            self.from_file,
            self.from_id.clone(),
            self.to,
            self.text.clone().unwrap_or("".to_owned())
        )
    }
}

#[pymethods]
impl Link {
    pub fn __repr__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Ok(slf.to_string())
    }

    pub fn __str__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Self::__repr__(slf)
    }
}

Attachment

Headings can have attachments, arbitrary files which may be linked to in a shorthand attachment: org link or referred to by relative path.

We'll probably have multiple Attachment types, with Images that can be post-processed into a cache file directory, pngcrushed, etc… or maybe that'll go in the arcology layer…

#[derive(Debug, Clone, Default)]
#[pyclass(dict)]
pub enum AttachmentType {
    #[default]
    File,
    Document,
    Image,
    Video,
}

#[derive(Debug, Clone, Default)]
#[pyclass(dict)]
pub struct Attachment {
    #[pyo3(get)]
    pub node_id: String,
    #[pyo3(get)]
    pub file_path: String,
    #[pyo3(get)]
    pub atype: AttachmentType,
}

impl fmt::Display for Attachment {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Attachment({} in {} is {:?})",
            self.file_path,
            self.node_id,
            self.atype
        )
    }
}

#[pymethods]
impl Attachment {
    pub fn __repr__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Ok(slf.to_string())
    }

    pub fn __str__(slf: PyRef<'_, Self>) -> PyResult<String> {
        Self::__repr__(slf)
    }
}

The Arroyo Org Parser

This code is pretty simple; there's just a lot of it. I have a fork of orgize, the Rust org-mode parser I'm using, to make it a little bit easier to work with.

use anyhow::Result;
use itertools::Itertools;
use lexpr;
use orgize::Event;
use orgize::Org;
use std::boxed::Box;
use std::collections::HashMap;
use std::{error::Error, fs};
// use std::collections::HashMap;

use crate::types::{Attachment, AttachmentType, Document, Heading, InvalidDocError, Keyword, Link};

The public interface

The public interface parses the document into an AST and then returns a Document, defined above, populated with the code below in Extracting Arroyo Keywords and Extracting Arroyo Headings.

pub fn parse_document(path: String) -> Result<Document> {
    let org = fs::read_to_string(&path)?;
    let org_tree = &Org::parse_custom(
        &org,
        &orgize::ParseConfig {
            // Need to pull these from environment or options...
            todo_keywords: (
                vec![
                    "NEXT".to_string(),
                    "INPROGRESS".to_string(),
                    "WAITING".to_string(),
                ],
                vec!["DONE".to_string(), "CANCELLED".to_string()],
            ),
            ..Default::default()
        },
    );
    let keywords = extract_metadata(path.clone(), org_tree)?;
    let headings = extract_headings(path.clone(), org_tree)?;

    Ok(Document {
        path,
        headings,
        keywords,
    })
}

Extracting Arroyo Keywords

orgize provides a helper that iterates over all the keywords in the document; these just get crammed into Keywords and returned.

pub fn extract_metadata(path: String, tree: &Org) -> Result<Vec<Keyword>> {
    Ok(tree
        .keywords()
        .map(|kw| Keyword {
            file: path.clone(),
            keyword: kw.key.to_string(),
            value: kw.value.to_string(),
        })
        .collect())
}

Extracting Arroyo Headings

This is the meat and potatoes of the parser; we'll step through it more carefully.

pub fn extract_headings(path: String, tree: &Org) -> Result<Vec<Heading>> {

There are some mutable variables at the top of this function which are used for state tracking of the iterator.

  • in_drawer tracks whether the parser is inside of a PROPERTIES drawer; PROPERTIES drawers under a heading are parsed and work properly, but orgize does not support file-level PROPERTIES drawer parsing so they have to be tracked and parsed manually. The parser gets an "enter" event for the drawer, and then the drawer's content itself is a text "enter" event that then has to be parsed. When we get an "exit" event this is set to false.
  • id_crumbs and cur_level are used to track where in the document the parser is so that tag inheritance and that sort of stuff works.

    • the crumbs are a list of IDs or Nones which are used to "walk" up and down the tree and ensure Links are attached to headings with IDs.
  • headings tracks the return objects, and it's seeded with a "level 0" heading representing the root of the document.
  • links are stored outside of the heading until the parser is complete.
  • inherited_tags is a list of lists of strings; the inner vector contains the list of tags for each header, starting at level 0 for FILETAGS entries. Combining this structure and cur_level allows the parser to perform tag inheritance by flattening the list, and by dropping everything "above" the current level when stepping to another header.
    let mut in_drawer: bool = false;
    let mut id_crumbs: Vec<Option<String>> = Vec::new();
    let mut cur_id: Option<String> = None;
    let mut cur_level: usize = 0;
    let mut headings: Vec<Heading> = Vec::new();
    headings.push(Heading::default());
    let mut links: HashMap<String, Vec<Link>> = HashMap::new();
    let mut inherited_tags: Vec<Vec<String>> = Vec::new();
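
The tag-inheritance mechanic can be sketched standalone (plain Rust, no orgize; effective_tags is a hypothetical helper mirroring the truncate-push-flatten dance the state machine does per heading):

```rust
// Each heading level keeps its own tag list. Stepping to a heading at
// `level` (1-based, as in orgize) drops the lists at that level and
// deeper, pushes the new heading's tags, and the effective tag set is
// the filetags plus the flattened remainder.
fn effective_tags(
    inherited: &mut Vec<Vec<String>>,
    filetags: &[String],
    level: usize,
    new_tags: Vec<String>,
) -> Vec<String> {
    inherited.truncate(level - 1);
    inherited.push(new_tags);
    [filetags.to_vec(), inherited.concat()].concat()
}
```

So a level-2 heading inherits its level-1 parent's tags, and a sibling level-1 heading starts fresh from the filetags.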

FILETAG parsing is a bit nasty to read, but basically the Keyword's value is a colon-separated list of strings; these are split, collected, and stored in the root heading.

    // file level metadata + filetags
    let file_metadata = extract_metadata(path.clone(), tree)?;
    let filetags = match file_metadata
        .iter()
        .find(|kw| kw.keyword.to_lowercase() == "filetags")
    {
        Some(kw) => kw
            .value
            .split(':')
            .map(|s| s.to_string())
            .filter(|s| !s.is_empty())
            .collect(),
        _ => Vec::<String>::new(),
    };
    headings[0].tags = Some(filetags.clone());
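
The splitting itself boils down to this (parse_filetags is an illustrative standalone version of the match arm above):

```rust
// ":Emacs:CCE:" -> ["Emacs", "CCE"]; the empty segments produced by
// the leading and trailing colons are filtered out.
fn parse_filetags(value: &str) -> Vec<String> {
    value
        .split(':')
        .map(|s| s.to_string())
        .filter(|s| !s.is_empty())
        .collect()
}
```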

The root heading's title is the value of the document's #+TITLE keyword:

    // Extract document title and apply to level 0 heading
    let doc_title = match file_metadata
        .iter()
        .find(|kw| kw.keyword.to_lowercase() == "title")
    {
        Some(kw) => kw.value.clone(),
        _ => String::from(""),
    };
    headings[0].text = doc_title;

And now we step in to the state machine. It iterates over each element, providing an Event::Start and Event::End for each element that the parser supports:

    // state machine go brrr
    tree.iter()
        .map(|event| {
            match event {

Heading parser

orgize::Element::Title is a heading. When the parser encounters a new heading it:

  • updates the cur_id and cur_level values and the ID "breadcrumbs"; links use these to "walk up" to the first parent with an ID.
  • extracts ROAM_REFS and ROAM_ALIASES using split_quoted_string, defined below.
  • updates the "tags breadcrumbs" table, that Vec<Vec<String>> defined above, and then flattens it into the list that will be stored in the heading.
  • stashes the heading in the return vector
                Event::Start(orgize::Element::Title(title)) => {
                    let tmp_properties = title.properties.clone().into_hash_map();
                    let mut export_properties: HashMap<String, String> = HashMap::new();
                    tmp_properties.iter().for_each(|(k, v)| {
                        export_properties.insert(k.to_string(), v.to_string());
                    });
                    cur_id = export_properties.get("ID").cloned();

                    id_crumbs.truncate(cur_level + 1);
                    id_crumbs.push(cur_id.clone());

                    let refs = export_properties
                        .get("ROAM_REFS")
                        .map(|s| split_quoted_string(s.to_string()).ok())
                        .unwrap_or(Some(vec![]));
                    let aliases = export_properties
                        .get("ROAM_ALIASES")
                        .map(|s| split_quoted_string(s.to_string()).ok())
                        .unwrap_or(Some(vec![]));
                    cur_level = title.level;

                    // reset the tags table
                    inherited_tags.truncate(cur_level - 1);
                    let new_tags: Vec<String> = title
                        .tags
                        .iter()
                        .map(|mbox| mbox.clone().to_string())
                        .collect();
                    inherited_tags.push(new_tags.clone());

                    let most_tags = inherited_tags.concat();
                    let all_tags: Vec<String> = [filetags.clone(), most_tags].concat();

                    let attach_tag = String::from("ATTACH");
                    let maybe_has_attach = new_tags.contains(&attach_tag);
                    let attachments = if cur_id.is_some() && maybe_has_attach {
                        match find_attach_dir(
                            &export_properties,
                            Path::new(&path),
                            cur_id.clone().unwrap(),
                        ) {
                            Some(dir) => {
                                Some(fetch_attachments(cur_id.clone().unwrap(), dir))
                            }
                            None => None,
                        }
                    } else {
                        None
                    };

                    let h = Heading {
                        id: cur_id.clone(),
                        level: cur_level,
                        text: title.raw.to_string(),
                        tags: match all_tags.len() {
                            0 => None,
                            _ => Some(all_tags),
                        },
                        properties: export_properties,
                        attachments,
                        refs,
                        aliases,
                        ..Default::default()
                    };
                    headings.push(h);
                    Ok(())
                }
NEXT I should be doing something like the inherited_tags stuff to track cur_id inheritance…

File-level Property Drawer parsing

Handling the file-level properties drawer is a bit of a pain. Some day I'll roll this into orgize itself so that these can be accessed via a PropertiesMap as in the heading parsing above, but I don't understand this library well enough to do that right now.

When entering a drawer, the parser sets that in_drawer state variable; this is a bit dodgy since in theory this could be a floating PROPERTIES drawer defined anywhere, but my org-mode docs are shaped reasonably enough that we'll cross that Rubicon when someone else uses this.

                Event::Start(orgize::Element::Drawer(drawer)) => {
                    in_drawer = drawer.name == "PROPERTIES" && headings[0].id.is_none();
                    Ok(())
                }

If the parser encounters a Text block while inside of a drawer, that text needs to be parsed, and then the keys and values are shoved into the root Heading.

The drawer is assumed to be a key/value list as in the PROPERTIES drawers; this relies on my fork of orgize which exposes parse_drawer_content. I think this should be able to use prop_drawer.get as in the code handling Headings above, and then these should be de-duplicated.

                Event::Start(orgize::Element::Text { value }) => {
                    if in_drawer {
                        // this is where we rely on forked orgize
                        let (_, prop_drawer): (_, orgize::elements::PropertiesMap) =
                            orgize::elements::Drawer::parse_drawer_content(value)
                                .expect("failed to parse properties drawer");
                        let properties = prop_drawer.into_hash_map();

                        // update cur_id and heading 0 ID since this is
                        // implied to be the first drawer, but it's kind
                        // of :yikes: to think about it like that! we
                        // could be genious enough to have a floating
                        // PROPERTIES drawer that would muck things up
                        cur_id = properties.get("ID").map(|s| s.to_string());
                        if cur_id.is_none() {
                            cur_id = properties.get("CUSTOM_ID").map(|s| s.to_string())
                        }

                        id_crumbs = vec![cur_id.clone()];
                        headings[0].id = cur_id.clone();

                        headings[0].aliases = properties
                            .get("ROAM_ALIASES")
                            .map(|s| split_quoted_string(s.to_string()).ok())
                            .unwrap_or(Some(vec![]));
                        headings[0].refs = properties
                            .get("ROAM_REFS")
                            .map(|s| split_quoted_string(s.to_string()).ok())
                            .unwrap_or(Some(vec![]));
                    }

                    if headings[0].id.is_none() {
                        return Err(InvalidDocError::new_err(format!(
                            "Root ID is None in {}",
                            path
                        )));
                    }

                    Ok(())
                }

When we exit the Drawer, the state value is cleared.

                Event::End(orgize::Element::Drawer(_drawer)) => {
                    in_drawer = false;
                    Ok(())
                }
NEXT fix orgize to expose file-level propertiesmap
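
The shape of the text being parsed inside that drawer is plain ":KEY: value" lines; a stdlib-only sketch of the parse (parse_drawer_text is illustrative, not the forked orgize API):

```rust
use std::collections::HashMap;

// Parse ":KEY: value" lines from a drawer's text content into a map,
// skipping the ":END:" terminator. split_once only splits at the
// first colon, so values that themselves contain colons (URLs, say)
// survive intact.
fn parse_drawer_text(text: &str) -> HashMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let rest = line.trim().strip_prefix(':')?;
            let (key, value) = rest.split_once(':')?;
            if key.is_empty() || key == "END" {
                return None;
            }
            Some((key.to_string(), value.trim().to_string()))
        })
        .collect()
}
```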

Link parsing

Look; I'm gonna be honest here. I don't remember why the links are stored outside the heading until the end of the document parsing. Some ownership bullshit, and the COW types, if I recall.

(maybe because they may have None IDs in the from_id?)

                // Stash links outside the match block in a HashMap shape
                // of heading id -> list of links; it would be nice if the
                // match block returned an Option<Link> but that doesn't
                // play well with the rest of the state machine
                Event::Start(orgize::Element::Link(link)) => {
                    let dest = link.path.to_string();
                    let (proto, stripped_dest): (Option<String>, String) =
                        match dest.split_once(':') {
                            Some((proto, stripped_dest)) => {
                                (Some(proto.to_string()), stripped_dest.to_string())
                            }
                            None => (None, dest.clone()),
                        };

                    let last_non_none = match id_crumbs.iter().rev().find_map(|x| x.clone()) {
                        Some(last_non_none) => last_non_none,
                        None => {
                            return Err(InvalidDocError::new_err(format!(
                                "no non-none ID in {}",
                                path
                            )));
                        }
                    };

                    let link_list = links.entry(last_non_none.clone()).or_insert(Vec::new());
                    link_list.push(Link {
                        from_file: path.clone().to_string(),
                        from_id: last_non_none.clone(),
                        to: stripped_dest.clone(),
                        to_proto: proto.clone(),
                        text: link.desc.clone().map(String::from),
                    });
                    Ok(())
                }
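
Two small mechanics in that arm are worth isolating: splitting the protocol off the destination, and walking the ID breadcrumbs from the current heading up to the nearest ancestor that actually has an ID. Standalone sketches (split_proto and nearest_id are hypothetical helpers):

```rust
// "id:deadbeef" -> (Some("id"), "deadbeef"); a bare path has no proto.
fn split_proto(dest: &str) -> (Option<String>, String) {
    match dest.split_once(':') {
        Some((proto, rest)) => (Some(proto.to_string()), rest.to_string()),
        None => (None, dest.to_string()),
    }
}

// Walk the breadcrumbs from the deepest entry upward, returning the
// first ID that is not None.
fn nearest_id(crumbs: &[Option<String>]) -> Option<String> {
    crumbs.iter().rev().find_map(|c| c.clone())
}
```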

NEXT Attachment and image caching

Cleaning up

Having populated all these variables, the headings have the links spliced back into them and are returned.

                _ => Ok(()),
            }
        })
        .fold_ok(true, |_, _result| true)?;

    return Ok(headings
        .iter()
        .map(|heading| {
            let mut h = heading.clone();
            h.links = links
                .get(&h.id.clone().unwrap_or(String::from("")))
                .cloned();
            h
        })
        .collect());
}

Org Attachment Extraction

Check some candidate paths, verify the attachment directory exists, and construct an Attachment for each file inside it.

use std::path::{Path, PathBuf};

fn find_attach_dir(
    props: &HashMap<String, String>,
    file_path: &Path,
    node_id: String,
) -> Option<PathBuf> {
    let maybe_prop_dir = vec![String::from("DIR"), String::from("ATTACH_DIR")]
        .iter()
        .filter_map(|prop| props.get(prop))
        .map(|as_str| Path::new(as_str))
        .filter(|path| path.exists())
        .next();

    if let Some(pd) = maybe_prop_dir {
        return Some(PathBuf::from(pd));
    }

    let filebuf = PathBuf::from(file_path);

    let from_id = vec![6, 2].iter().find_map(|at| {
        let mut the_rest = node_id.clone();
        let split = the_rest.split_off(*at);

        let data_dir = filebuf.parent()?.join("data");
        let attach_dir = data_dir.join(the_rest).join(split);
        if attach_dir.clone().exists() && attach_dir.is_dir() {
            return Some(attach_dir);
        } else {
            return None;
        }
    });

    from_id
}

fn fetch_attachments(node_id: String, attach_dir: PathBuf) -> Vec<Attachment> {
    let err_m = format!(
        "could not open attach_dir {:?}",
        attach_dir.to_string_lossy()
    );
    attach_dir
        .read_dir()
        .expect(&err_m)
        .filter_map(|entry| match entry {
            // XXX: Attachment::new() that does atype detection
            Ok(attach) => Some(Attachment {
                file_path: attach.path().to_string_lossy().to_string(),
                atype: AttachmentType::File,
                node_id: node_id.clone(),
            }),
            _ => None,
        })
        .collect()
}

But org-attach-dir checks some properties to calculate a directory, and org-attach-dir-from-id in org-attach.el has a list of functions that it tries in turn.

I want to reimplement this properly… it's the first non-null of:

org-attach-dir:

  • "DIR" property value
  • "ATTACH_DIR" property value

org-attach-dir-from-id substrings (this is not the default; I set org-attach-id-to-path-function-list in my Org Mode Installation):

  • XXXXXX/X+
  • XX/X+
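
The substring rule amounts to a split_off at 6 or 2 characters (id_to_attach_segments is an illustrative helper; the real code above also joins the segments under the file's sibling data/ directory and checks that the result exists):

```rust
// "deadbeef-0001" split at 6 -> ("deadbe", "ef-0001"); the candidate
// attach dir is then data/<prefix>/<rest>. split_off panics if `at`
// exceeds the ID length, which org-roam's UUIDs are long enough to avoid.
fn id_to_attach_segments(node_id: &str, at: usize) -> (String, String) {
    let mut prefix = node_id.to_string();
    let rest = prefix.split_off(at);
    (prefix, rest)
}
```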

split_quoted_string

Okay, so I reached for a big hammer here to extract ROAM_REFS and ROAM_ALIASES: org-roam quotes strings that have spaces in them, but leaves unquoted the ones that don't, so I Do Some Lazy Bullshit to dequote these.

While Orgize gives you each element sequentially, lexpr only iterates the first level; we could recurse here, but instead of that I just kind of half-ass it and step into the second-level object where I can extract the vector's values and punch out. Anyway, recursing this would be a pain in the ass since the vec's value is [Value] instead of a Value which can be iterated like its parent.

fn split_quoted_string(quoted_str: String) -> Result<Vec<String>, Box<dyn Error>> {
    let str_as_list = format!("[{}]", &quoted_str);
    let mut sexp =
        lexpr::parse::Parser::from_str_custom(&str_as_list, lexpr::parse::Options::elisp());
    let ret: Vec<String> = sexp
        .value_iter()
        .filter_map(|val| {
            let ret: Option<Vec<String>> = match val {
                Ok(lexpr::Value::Null) => None,
                Ok(lexpr::Value::Vector(values)) => Some(
                    values
                        .iter()
                        .map(|inner_val| match inner_val {
                            lexpr::Value::String(string) => string.to_string(),
                            lexpr::Value::Symbol(string) => string.to_string(),
                            others => todo!("lexpr roam_aliases or so {:?}", others),
                        })
                        .collect(),
                ),
                Ok(lexpr::Value::Symbol(sym)) => Some(vec![sym.to_string()]),
                Err(the_err) => todo!("XXX {:?}", the_err),
                val => todo!("??? {:?}", val),
            };
            ret
        })
        .flatten()
        .collect();

    Ok(ret)
}
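
For comparison, here's what the same dequoting looks like as a hand-rolled stdlib scanner instead of lexpr (split_quoted is an illustrative alternative, not what the crate uses):

```rust
// org-roam writes values like `foo "bar baz" qux`; yield
// ["foo", "bar baz", "qux"], treating whitespace inside quotes
// as part of the token.
fn split_quoted(input: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut cur = String::new();
    let mut in_quotes = false;
    for ch in input.chars() {
        match ch {
            '"' => in_quotes = !in_quotes,
            c if c.is_whitespace() && !in_quotes => {
                if !cur.is_empty() {
                    out.push(std::mem::take(&mut cur));
                }
            }
            c => cur.push(c),
        }
    }
    if !cur.is_empty() {
        out.push(cur);
    }
    out
}
```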

I wrote some simple unit tests for this below.

The Arroyo HTML exporter

This thing is a little bit complex, so I hope I've documented it well enough. In short, the feature-set of this HTML exporter is designed to match the requirements of The Arcology Project: Django Edition:

  • rewrite URLs from a HashMap passed in; higher-level things can map from IDs to URLs, files to URLs, and re-write the rest to 404s.
  • re-write org-fc clozes to be <span>'s (currently supported except for clozes with links in them)
  • drop org-fc drawers
  • code highlighting by building this on top of Orgize::export_html::SyntectHtmlHandler

stretch unimplemented:

  • tufte side-notes

crate::export_html::htmlize_file is the public entrypoint for this functionality, below.

Here's the top-matter:

use anyhow::Result;
use pyo3::prelude::*;
use std::collections::{HashMap, HashSet};
use std::convert::From;
use std::fs;
use std::io::{Error, Write};
use std::iter;
use std::marker::PhantomData;

use regex::Regex;

use orgize::export::{DefaultHtmlHandler, HtmlEscape, HtmlHandler, SyntectHtmlHandler};
use orgize::{Element, Org};

The Exporter is controlled by passing in a structure with a few configuration options.1 It's a boilerplate pyo3 class with a constructor attached to it:

  • link_retargets maps org IDs to public URLs to rewrite them for the web
  • ignore_tags is a list of tags which will cause the exporter to not include that heading or any of its children in the final document
  • limit_headings is a set of org IDs; if this is not empty, the Exporter will only export these headings. This will be called "subheading mode" or "limited mode"
  • include_subheadings will instruct subheading mode to also export child headings underneath the ones indicated by limit_headings. One hopes the interaction of these two options in the code below will make the semantics clear.

#[derive(Default, Debug, Clone)]
#[pyclass(dict)]
pub struct ExportOptions {
    /// id:{the_id} -> URL rewrites
    #[pyo3(get)]
    pub link_retargets: HashMap<String, String>,
    #[pyo3(get)]
    pub ignore_tags: HashSet<String>,
    #[pyo3(get)]
    pub limit_headings: HashSet<String>,
    #[pyo3(get)]
    pub include_subheadings: bool,
}

#[pymethods]
impl ExportOptions {
    #[new]
    fn new(
        link_retargets: HashMap<String, String>,
        ignore_tags: Vec<String>,
        limit_headings: Vec<String>,
        include_subheadings: Option<bool>,
    ) -> Self {
        let mut lh2 = HashSet::new();
        lh2.extend(limit_headings);
        let mut tags = HashSet::new();
        tags.extend(ignore_tags);
        ExportOptions {
            link_retargets,
            limit_headings: lh2,
            ignore_tags: tags,
            include_subheadings: include_subheadings.unwrap_or(false),
            ..Default::default()
        }
    }

    // add these later if i feel like it
    // pub fn __repr__(slf: PyRef<'_, Self>) -> PyResult<String> {
    //     Ok(slf.to_string())
    // }

    // pub fn __str__(slf: PyRef<'_, Self>) -> PyResult<String> {
    //     Self::__repr__(slf)
    // }
}

The Arroyo HTML handler takes another HTML handler in its constructor and wraps it, adding the features outlined above, implemented in the following code sections.

pub struct ArroyoHtmlHandler<E: From<Error>, H: HtmlHandler<E>> {
    pub options: ExportOptions,
    /// inner html handler
    pub inner: H,
    /// handler error type
    pub error_type: PhantomData<E>,
    /// file-property drawer state tracking
    current_drawer: Option<String>,
    in_public_heading: bool,
    ignore_deeper_than: usize,
    heading_breadcrumbs: Vec<String>,
}

impl<E: From<Error>, H: HtmlHandler<E>> ArroyoHtmlHandler<E, H> {
    pub fn new(options: ExportOptions, inner: H) -> Self {
        let iph = options.limit_headings.is_empty();
        ArroyoHtmlHandler {
            inner,
            options,
            in_public_heading: iph,
            ..Default::default()
        }
    }

    pub fn rewrite_link_from(&self, id: &String) -> String {
        match self.options.link_retargets.get(id) {
            Some(path) => HtmlEscape(&path).to_string(),
            _ => HtmlEscape(format!("/404?key={}", id)).to_string(),
        }
    }
}
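The rewrite itself is just a map lookup with a 404 fallback; in Python terms (a sketch with hypothetical names, mirroring rewrite_link_from above) it amounts to:

```python
import html

def rewrite_link_from(link_retargets: dict, node_id: str) -> str:
    """Look the org ID up in the retarget map; unknown IDs become 404 links."""
    target = link_retargets.get(node_id, f"/404?key={node_id}")
    return html.escape(target)

rewrite_link_from({"some-id": "/arroyo"}, "some-id")
```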

impl<E: From<Error>, H: HtmlHandler<E>> Default for ArroyoHtmlHandler<E, H> {
    fn default() -> Self {
        ArroyoHtmlHandler {
            inner: H::default(),
            error_type: PhantomData,
            current_drawer: None,
            in_public_heading: false,
            heading_breadcrumbs: vec![],
            ignore_deeper_than: 999, // please don't make a 1000 deep heading as the first one in the doc...

            options: ExportOptions::default(),
        }
    }
}

The Custom HTML Exporter Extensions

ArroyoHtmlHandler needs to implement HtmlHandler's start and end functions, which take elements to act on, much like the iterator for the parser works. We match on the elements that matter to us and let the rest fall through to the "inner" handler.

impl<E: From<Error>, H: HtmlHandler<E>> HtmlHandler<E> for ArroyoHtmlHandler<E, H> {
    fn start<W: Write>(&mut self, mut w: W, element: &Element) -> Result<(), E> {
        match &self.current_drawer {
            None => {}
            Some(drawer_name) => {
                if matches!(
                    drawer_name.as_str(),
                    "PROPERTIES" | "REVIEW_DATA" | "LOGBOOK"
                ) {
                    return Ok(());
                }
            }
        };
        // if !self.in_public_heading {
        //     return Ok(());
        // }
        match element {

Titles are checked for export/ignore tags and for whether the current heading sits underneath one that is already ignored; they emit an anchor element carrying the heading's ID, and task/priority information is rendered as CSS spans. This is also where the exporter checks that the heading is one of the "limited" subheadings. This arm does a lot of stuff.

            Element::Title(title) => {
                let properties = title.properties.clone().into_hash_map();
                let our_new_id = properties.get("ID");
                let our_level = title.level;

                if self.heading_breadcrumbs.len() < our_level {
                    let diff = our_level - self.heading_breadcrumbs.len();
                    self.heading_breadcrumbs
                        .extend(iter::repeat("SENTINEL".into()).take(diff));
                }
                self.heading_breadcrumbs.truncate(our_level);
                if let Some(id) = our_new_id {
                    self.heading_breadcrumbs.push(id.to_string());
                }

                let has_ignore_tag = title
                    .tags
                    .clone()
                    .into_iter()
                    .map(String::from)
                    .any(|v| self.options.ignore_tags.contains(&v));
                // if it's a new ignore tag or a subheading of one we already parsed
                self.in_public_heading = !has_ignore_tag && title.level <= self.ignore_deeper_than;
                // in limited mode, a heading is public only if it or an
                // ancestor appears in limit_headings
                self.in_public_heading = self.in_public_heading
                    && (self.options.limit_headings.is_empty()
                        || self
                            .options
                            .limit_headings
                            .intersection(&HashSet::from_iter(
                                self.heading_breadcrumbs.clone().into_iter(),
                            ))
                            .count()
                            != 0);
                if self.in_public_heading {
                    self.ignore_deeper_than = 999;
                }

                if has_ignore_tag {
                    self.ignore_deeper_than = title.level;
                }

                let mut parts = Vec::new();
                parts.push(format!("<h{}>", title.level.min(6)));

                match our_new_id {
                    Some(id) => parts.push(format!("<a class=\"id-anchor\" id=\"{}\"></a>", id)),
                    None => {}
                }

                match &title.keyword {
                    Some(kw) => parts.push(format!("<span class=\"task task-{}\"></span>", kw)),
                    None => {}
                }

                (title.tags)
                    .iter()
                    .for_each(|tag| parts.push(format!("<span class=\"tag tag-{}\"></span>", tag)));

                if self.in_public_heading {
                    write!(w, "{}", parts.join(""))?
                }
            }
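The breadcrumb bookkeeping is the subtle part of that arm: the vector is padded with sentinels up to the current level, truncated back down when the outline pops, and then intersected with limit_headings. A minimal Python sketch of the same logic (hypothetical helper names):

```python
def update_breadcrumbs(breadcrumbs, level, heading_id):
    """Pad with sentinels up to `level`, truncate back down, then record this
    heading's ID so descendants can see which ancestors they sit under."""
    if len(breadcrumbs) < level:
        breadcrumbs.extend(["SENTINEL"] * (level - len(breadcrumbs)))
    del breadcrumbs[level:]
    if heading_id is not None:
        breadcrumbs.append(heading_id)
    return breadcrumbs

def heading_is_limited_in(breadcrumbs, limit_headings):
    """An empty limit set means everything exports; otherwise a heading is
    public only if it or an ancestor appears in limit_headings."""
    return not limit_headings or bool(set(limit_headings) & set(breadcrumbs))
```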

Because orgize doesn't parse the file-level PROPERTIES drawer specially, property drawers are elided from the export here; org-fc's state drawers are, too.

            Element::Drawer(drawer) => {
                self.current_drawer = Some(drawer.name.to_string());
            }

Text handling is a bit weird in order to rewrite org-fc clozes, and it's still very much imperfect: it doesn't handle FC hints containing links or any other objects that break up the Text element… It's going to be hard to solve this with the current design, eugh. It might just be simpler to run this regex after the HTML export is complete, even if that's more oogly.

            Element::Text { value: before } => {
                let re = Regex::new(
                    r"(?x)
                          \{\{([^\}]+)}     # grab the answer
                          (?:\{([^\}]+?)})? # grab a hint
                          @[0-9]+}          # exclude the question number
                        ",
                )
                .unwrap();
                if self.in_public_heading {
                    let after =
                        re.replace_all(before, "<span class='fc-cloze' title='$2'>$1</span>");
                    if after.eq(before) {
                        self.inner.start(w, &Element::Text { value: after })?
                    } else {
                        write!(w, "{}", after)?
                    }
                }
            }
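For reference, the same cloze rewrite can be sketched with Python's re module (the pattern and replacement shape match the Rust above; org-fc clozes look like {{answer}{hint}@0}):

```python
import re

CLOZE = re.compile(
    r"""\{\{([^}]+)}      # grab the answer
        (?:\{([^}]+?)})?  # grab an optional hint
        @[0-9]+}          # drop the position marker
    """,
    re.VERBOSE,
)

def rewrite_clozes(text: str) -> str:
    # Python >= 3.5 substitutes unmatched groups (a missing hint) as ""
    return CLOZE.sub(r"<span class='fc-cloze' title='\2'>\1</span>", text)

rewrite_clozes("2 + 2 = {{4}{arithmetic}@0}")
```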

Link exporting is the most complicated part of this: it does a bunch of Bullshit to pull the link apart into a protocol:destination pair, a bunch of clones and type nudges I'd like to clean up, and a logical fork that can surely look cleaner than it does, to add an HTML class demarcating internal links. I'll fix this at some point!

            Element::Link(link) => {
                let string_path = link.path.to_string();
                let (proto, stripped_dest) = match string_path.split_once(':') {
                    Some((proto, stripped_dest)) => (proto, stripped_dest.into()),
                    None => ("", string_path.clone()),
                };
                let mut desc =
                    HtmlEscape(link.desc.clone().unwrap_or(link.path.clone())).to_string();

                if self.in_public_heading {
                    let image_maybe = [".png", ".svg", ".jpg", ".jpeg"];
                    if image_maybe
                        .iter()
                        .any(|&suffix| string_path.ends_with(suffix))
                    {
                        desc = format!("<img src=\"{}\" alt=\"{}\"/>", string_path, desc);
                        // emit the image link directly and skip the protocol
                        // handling below, so it isn't rendered twice
                        write!(w, "<a href=\"{}\">{}</a>", string_path, desc)?;
                        return Ok(());
                    }

                    match proto {
                        "id" => write!(
                            w,
                            "<a href=\"{}\">{}</a>",
                            self.rewrite_link_from(&stripped_dest),
                            desc,
                        )?,
                        "roam" => write!(
                            w,
                            "<a href=\"/404?key={}\">{}</a>",
                            HtmlEscape(&link.path),
                            desc,
                        )?,
                        _ => self.inner.start(w, &Element::Link(link.clone()))?,
                    }
                }
            }
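The protocol/destination split and the image-suffix check are straightforward; sketched in Python (hypothetical names, mirroring split_once(':') and the suffix list above):

```python
IMAGE_SUFFIXES = (".png", ".svg", ".jpg", ".jpeg")

def split_link(path: str):
    """'id:abc' -> ('id', 'abc'); plain paths keep an empty protocol,
    matching the Rust split_once(':') fallback."""
    if ":" in path:
        proto, dest = path.split(":", 1)
    else:
        proto, dest = "", path
    return proto, dest

def is_image(path: str) -> bool:
    return path.endswith(IMAGE_SUFFIXES)
```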

Source blocks should include any information necessary to re-tangle them.

            Element::SourceBlock(block) => {
                if self.in_public_heading {
                    let args = String::from(block.arguments.clone());
                    if !args.is_empty() {
                        write!(w, "<span class='babel-args'>{}</span>", args)?
                    }
                    self.inner.start(w, &Element::SourceBlock(block.clone()))?
                }
            }

Everything else is passed along to Syntect or the default HTML Handler.

            _ => {
                if self.in_public_heading {
                    self.inner.start(w, element)?
                }
            }
        }
        Ok(())
    }

    fn end<W: Write>(&mut self, w: W, element: &Element) -> Result<(), E> {
        match element {
            // reset the drawer state tracking
            Element::Drawer(_drawer) => {
                self.current_drawer = None;
            }
            _ => {
                if !self.in_public_heading {
                    return Ok(());
                }
                if self.current_drawer.is_some() {
                    return Ok(());
                }
                self.inner.end(w, element)?
            }
        }
        Ok(())
    }
}

And that's how we implement the export features. Calling it from Rust is pretty straightforward, and from Python it's even easier.

The API Interface

The exporter embeds another exporter with a code-formatting pass applied to it; I'd like to be able to pass a configuration to that some day…

// sure would be nice..... some day i'll understand lifetimes enough
// to write a function that goes path -> orgize::Org
// use crate::parse::orgize_document;

pub fn htmlize_file(path: String, options: ExportOptions) -> Result<String> {
    let mut handler = ArroyoHtmlHandler::new(options, SyntectHtmlHandler::new(DefaultHtmlHandler));
    let org = fs::read_to_string(&path)?;
    let org_tree = &Org::parse_custom(
        &org,
        &orgize::ParseConfig {
            // Need to pull these from environment or options...
            todo_keywords: (
                vec![
                    "NEXT".to_string(),
                    "INPROGRESS".to_string(),
                    "WAITING".to_string(),
                ],
                vec!["DONE".to_string(), "CANCELLED".to_string()],
            ),
            ..Default::default()
        },
    );

    let mut vec = vec![];
    org_tree.write_html_custom(&mut vec, &mut handler)?;
    Ok(String::from_utf8(vec)?)
}

Unfortunately the parse_custom invocation is repeated between several of these entrypoints (the parser, the HTML exporter, and the Atom exporter) because I'm not a good enough rust programmer yet to figure out the lifetimes of some strings and whatnot that get embedded in the orgize::Org object.

NEXT organize this thing's module hierarchy a bit

why no

export::html -> export::document
export::atom -> export::heading

Library definition and exports for the native Python library

pyo3 makes it pretty straightforward to export these functions as Python modules and handle the type coercion and whatnot:

use pyo3::prelude::*;

pub mod parse;
pub mod export_html;
// pub mod export_atom;
pub mod types;

#[pymodule]
fn arroyo_rs(py: Python, m: &PyModule) -> PyResult<()> {
    #[pyfn(m)]
    fn parse_file(path: String) -> PyResult<types::Document> {
        Ok(parse::parse_document(path)?)
    }

    #[pyfn(m)]
    fn htmlize_file(path: String, options: export_html::ExportOptions) -> PyResult<String> {
        Ok(export_html::htmlize_file(path, options)?)
    }

//    #[pyfn(m)]
//    fn atomize_file(path: String, options: export_html::ExportOptions) -> PyResult<String> {
//        Ok(export_atom::atomize_file(path, options)?)
//    }

    m.add_class::<types::Document>()?;
    m.add_class::<types::Heading>()?;
    m.add_class::<types::Keyword>()?;
    m.add_class::<types::Link>()?;
    m.add_class::<export_html::ExportOptions>()?;

    m.add("InvalidDocError", py.get_type::<types::InvalidDocError>())?;

    Ok(())
}

NEXT it would be cool if the htmlize_file call could take **kwargs and construct the ExportOptions itself.

That would make it easy to present the same interface for atomize_file.

WAITING add atomize_file to the pyfn's

Code Unit Tests

I wrote Cargo tests for split_quoted_string, and some simple parser tests.

#[cfg(test)]
mod tests {
    use std::assert_eq;

    use crate::parse::*;

    #[test]
    fn test_split_quoted_string() {
        assert_eq!(
            split_quoted_string(String::from("")).expect("Could not parse"),
            Vec::<String>::new()
        );

        assert_eq!(
            split_quoted_string(String::from("CCE")).expect("Could not parse"),
            vec!["CCE"]
        );

        assert_eq!(
            split_quoted_string(String::from("\"CCE but with spaces\"")).expect("Could not parse"),
            vec!["CCE but with spaces"]
        );

        assert_eq!(
            split_quoted_string(String::from("\"CCE but with spaces\" \"And 2\""))
                .expect("Could not parse"),
            vec!["CCE but with spaces", "And 2"]
        );

        assert_eq!(
            split_quoted_string(String::from("\"CCE but with spaces\" CCE"))
                .expect("Could not parse"),
            vec!["CCE but with spaces", "CCE"]
        );
    }

    // test for tag inheritance and filetags
    // test for keyword extraction
    // test for link parsing

    // A fairly complicated doc
    #[test]
    fn test_iter_doc_cce() {
        let doc = parse_document(String::from("/home/rrix/org/cce/cce.org"))
            .expect("Did not parse a doc");

        assert_eq!(doc.headings.len(), 4, "Expected four headings in doc");

        assert_eq!(
            doc.headings[0].text, "The Complete Computing Environment",
            "Title correctly set"
        );

        assert_eq!(
            doc.headings[1].text, "[[id:20211219T184243.333209][\"Hey Smell This\"]]",
            "Title correctly set"
        );

        assert!(
            doc.headings[0].links.as_ref().unwrap().len() > 0,
            "No links!"
        );

        assert_eq!(doc.keywords[0].keyword, "TITLE", "Title at da top");

        assert_eq!(
            doc.collect_keywords("ARCOLOGY_KEY".to_string())
                .expect("Expected keywords..."),
            vec!["cce/cce".to_string()],
            "Keyword collection works"
        );
    }
}

I wrote Python tests to make sure the pyo3 interfaces work, too:

import arroyo.arroyo_rs
import arroyo.models

def test_assert_logic():
    doc = arroyo.arroyo_rs.parse_file("./arroyo-native-parser.org")

    assert doc.keywords[0].file == "./arroyo-native-parser.org"
    assert doc.keywords[0].keyword == "TITLE"
    assert doc.keywords[0].value == "The arroyo_rs Native Org Parser"

    assert doc.headings[0].id == "20231023T115950.248543"
    assert doc.headings[0].level == 0
    assert len(doc.headings[0].text) > 0
    assert len(doc.headings[0].links) > 0
    assert len(doc.headings[0].aliases) == 2
    assert doc.headings[0].refs[0] == "https://code.rix.si/rrix/arroyo_rs"
    assert doc.headings[0].tags == ["Project"]
    assert doc.headings[-1].tags == ["Project", "Code"]

    assert doc.headings[0].links[0].from_file == "./arroyo-native-parser.org"
    assert doc.headings[0].links[0].from_id == "20231023T115950.248543"
    assert doc.headings[0].links[0].to == "id:1fb8fb45-fac5-4449-a347-d55118bb377e"
    assert doc.headings[0].links[0].to_proto == "id"
    assert doc.headings[0].links[0].text == "org-mode"

def test_relationships():
    assert arroyo.arroyo_rs.Heading is not None
    assert arroyo.arroyo_rs.Document is not None
    assert arroyo.arroyo_rs.Keyword is not None
    assert arroyo.arroyo_rs.Link is not None

# def test_sqlmodel_conversion():
#   native = arroyo.arroyo_rs.parse_file("./arroyo-native-parser.org")
#   taglinks = arroyo.models.Tag.from_native_doc(native)
#   headings = arroyo.models.Heading.from_native_doc(native)
#   keywords = arroyo.models.Keyword.from_native_doc(native)
#   document = arroyo.models.Document.from_native_doc(native)
#
#   assert(len(keywords) == 5)
#   assert(headings[0].node_id == '20231023T115950.248543')
#   assert(headings[1].node_id == None)
#   assert(headings[1].text == "Overview")

ID inheritance needs to be fixed: I should be doing something like the inherited_tags stuff to track cur_id inheritance…

Python Package   Code

Okay, with a Rust parser and exported types in place, Arroyo needs to store those in a database. Early designs built around the parser were pure Rust, but I found all of the Rust ORMs really a pain in the ass to use. I do hope that changes, and in theory things like SeaORM should be easy to use with the Document types as ORM entities, but I got pretty frustrated in trying to set that whole thing up, and anyways I would have had to rewrite the entire stack in Rust, which I am still not quite willing to do.

I'm still idly considering implementing the database layer as an Ecto module, too, and developing the site in Elixir, but I keep going back and forth on that, I find the tooling and developer experience story in Elixir to still be quite lacking, and deploying it with NixOS is always kind of a pain. Ultimately, Arroyo also needs to be developed as a library which can be used in my Literate Programming documents and to generate configuration files programmatically as in My NixOS configuration, which lends itself to being a Python library + click Application, so that's where we start.

Stub package interface

from .arroyo_rs import parse_file, InvalidDocError
from .arroyo_rs import htmlize_file, ExportOptions

Click command wrapper

Arroyo is likely to be used as a library, but there is a small click application included too which can be used to generate a database for testing, export an HTML file of a document, etc.

This is a stub, this module probably doesn't need to be runnable.

import os
import click
import glob
from typing import Optional

# from . import persist_one_file
from .arroyo_rs import parse_file, htmlize_file, ExportOptions
# from . import models
# from sqlmodel import Session

Files are collected with glob.glob and then filtered down to paths that are actually files. Each is parsed and, once the persistence layer is wired back up, will be persisted to the database by the (currently commented-out) code below; a single sqlite session is passed through the entire persistence pipeline so that there is only a single write-path to worry about.

@click.group()
def cli():
  pass

@cli.command()
@click.option("--source", "-s", help="Org source directory", default="~/org")
@click.option("--file-glob", "-g", help="File search glob", default="**/*.org")
def parse_files(source, file_glob):
  expanded_src = os.path.expanduser(source)
  files = glob.glob(file_glob, root_dir=expanded_src, recursive=True)
  files = map(lambda it: os.path.join(expanded_src, it), files)
  files = list(filter(os.path.isfile, files))

  docs = [
    parse_file(f)
    for f in files
  ]

  print(f"Parsed {len(files)} files.")
  print(f"Persisted {len(docs)} docs.")


@cli.command()
@click.option("--file", "-f", help="The file to export")
@click.option("--limit-headings", "-H", multiple=True, help="org ID to export")
@click.option("--include-subheadings", "-I", is_flag=True, help="when headings are specified, this will control whether to export child headings")
def export_document(file, limit_headings, include_subheadings):
  # in The Real World this is loaded from DB and generated.
  options = ExportOptions(
    link_retargets = {"currently_reading": "https://rix.si/hello-world"},
    limit_headings = limit_headings,
    include_subheadings = include_subheadings,
    ignore_tags = ["NOEXPORT", "noexport", "private"],
  )
  print(htmlize_file(file, options))

if __name__ == "__main__":
  cli()

NEXT Usage

See Arcology ingestfiles Command for examples of the parser interface

See Arcology's org_page HTTP Handler and the atom endpoint for examples of the exporter interface

Addendum: Making sure org-auto-tangle works with this buffer and MELPA languages

IDK why, but org-auto-tangle struggles with files that include types from external packages; I tried just exporting load-path to it, but that didn't work very well since the packages still need to be require'd. This code adds an Emacs hook which runs before org-babel-tangle, modifying the load-path and loading the libraries required by this document. Some Computer :( has to occur because org-auto-tangle's async invocation doesn't let me use lexical closures or anything like that, so some backquoting to inject the local variables into the lambda has to occur.

(with-eval-after-load "ob-tangle"
  (require 's)
  (let ((some-paths
         (list
          (seq-find (lambda (path) (s-contains-p "/rust-mode" path)) load-path)
          (seq-find (lambda (path) (s-contains-p "/nix-mode" path)) load-path)
          (seq-find (lambda (path) (s-contains-p "/magit-section" path)) load-path)
          (seq-find (lambda (path) (s-contains-p "/compat" path)) load-path)
          (seq-find (lambda (path) (s-contains-p "/dash" path)) load-path)
          )))
    (add-to-list 'org-babel-pre-tangle-hook
                 `(lambda ()
                    (dolist (path '(,@some-paths))
                      (add-to-list 'load-path path))
                    (require 'rust-mode)
                    (require 'nix-mode)))))

Future Work

NEXT remove the old elisp logic

NEXT [#C] source code block extraction & babel execution?

how wild would it be to have a little wasm environment in the rust to execute code blocks? at the very least i want to build a tangler and some Python modules to populate Arroyo code generators.


1

I'll still need to add a way to rewrite missing links into 404/stub pages; for now they are just left as-is, but this is fine.