Compare commits

...

2 Commits

Author SHA1 Message Date
Ryan Rix e3864b8127 refactor start(Title) export, add some more metadata as invisible <span>s 2024-02-17 16:41:08 -08:00
Ryan Rix 22a69abb89 remove Atom Export 2024-02-17 16:40:47 -08:00
3 changed files with 59 additions and 1001 deletions

View File

@ -894,7 +894,7 @@ The Exporter is controlled by passing in a structure with a few configuration op
- =link_retargets= maps org IDs to public URLs to rewrite them for the web
- =ignore_tags= is a list of tags which will cause the exporter to not include that heading or any of its children in the final document
- =limit_headings= is a set of org IDs; if this is not empty, the Exporter will *only* export these headings. This will be called "subheading mode"
- =limit_headings= is a set of org IDs; if this is not empty, the Exporter will *only* export these headings. This will be called "subheading mode" or "limited mode"
- =include_subheadings= will instruct subheading mode to also export child headings underneath the ones indicated by =limit_headings=. One hopes the interaction of these two options in the code below will make the semantics clear.
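As a reading aid, here is a hypothetical sketch of what that options structure could look like; the real =ExportOptions= definition lives elsewhere in this file and its exact field types may differ.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical sketch of the options structure described above; field
// types here are assumptions, not the actual ExportOptions definition.
#[derive(Default, Clone)]
pub struct ExportOptions {
    pub link_retargets: HashMap<String, String>, // org ID -> public URL
    pub ignore_tags: Vec<String>,                // tags that suppress a subtree
    pub limit_headings: HashSet<String>,         // non-empty => "subheading mode"
    pub include_subheadings: bool,               // also export children in subheading mode
}

fn main() {
    let mut opts = ExportOptions::default();
    opts.ignore_tags.push("noexport".to_string());
    opts.limit_headings.insert("20240112T121157.583809".to_string());
    // an empty limit_headings set means "export everything not ignored"
    assert!(!opts.limit_headings.is_empty());
    println!("subheading mode: {}", !opts.limit_headings.is_empty());
}
```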
#+begin_src rust
@ -1023,6 +1023,8 @@ impl<E: From<Error>, H: HtmlHandler<E>> HtmlHandler<E> for ArroyoHtmlHandler<E,
match element {
#+end_src
Titles need to be checked for export/ignore tags and for whether the current heading sits underneath one that is already ignored; they also add an anchor as an ID and task/priority information as CSS spans. This is also where the parser checks that the heading is one of the "limited" subheadings. This does a lot of stuff.
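That wall of logic boils down to a small visibility predicate; this is a simplified sketch, where the names mirror the handler's fields but the =usize::MAX= sentinel for "no ignored ancestor" is my assumption.

```rust
use std::collections::HashSet;

// Simplified sketch of the visibility decision the Title handler makes.
fn is_public(
    has_ignore_tag: bool,
    level: usize,
    ignore_below_level: usize, // level of the nearest already-ignored ancestor
    limit_headings: &HashSet<String>,
    breadcrumbs: &HashSet<String>, // org IDs of this heading and its ancestors
) -> bool {
    let not_ignored = !has_ignore_tag && level <= ignore_below_level;
    let in_limits = limit_headings.is_empty()
        || limit_headings.intersection(breadcrumbs).count() != 0;
    not_ignored && in_limits
}

fn main() {
    let limits: HashSet<String> = HashSet::new();
    // no ignore tag, no ignored ancestor, no limits: public
    assert!(is_public(false, 1, usize::MAX, &limits, &HashSet::new()));
    // an ignore tag always hides the heading
    assert!(!is_public(true, 1, usize::MAX, &limits, &HashSet::new()));
}
```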
#+begin_src rust
Element::Title(title) => {
let properties = title.properties.clone().into_hash_map();
@ -1037,8 +1039,6 @@ impl<E: From<Error>, H: HtmlHandler<E>> HtmlHandler<E> for ArroyoHtmlHandler<E,
}
}
let breadcrumb_set =
HashSet::from_iter(self.heading_breadcrumbs.clone().into_iter());
let has_ignore_tag = title
.tags
.clone()
@ -1046,13 +1046,16 @@ impl<E: From<Error>, H: HtmlHandler<E>> HtmlHandler<E> for ArroyoHtmlHandler<E,
.map(String::from)
.find(|v| self.options.ignore_tags.contains(v))
.is_some();
self.in_public_heading = !has_ignore_tag || self.ignore_below_level >= title.level;
// if it's a new ignore tag or a subheading of one we already parsed
self.in_public_heading = !has_ignore_tag && title.level <= self.ignore_below_level;
self.in_public_heading = self.in_public_heading
&& (self.options.limit_headings.is_empty()
|| self
.options
.limit_headings
.intersection(&breadcrumb_set)
.intersection(&HashSet::from_iter(
self.heading_breadcrumbs.clone().into_iter(),
))
.count()
!= 0);
@ -1060,22 +1063,34 @@ impl<E: From<Error>, H: HtmlHandler<E>> HtmlHandler<E> for ArroyoHtmlHandler<E,
self.ignore_below_level = title.level;
}
let mut parts = Vec::new();
parts.push(format!(
"<h{}>",
if title.level <= 6 { title.level } else { 6 }
));
let props = title.properties.clone().into_hash_map();
let id = props.get("ID");
match id {
Some(id) => parts.push(format!("<a class=\"id-anchor\" id=\"{}\"></a>", id)),
None => {}
}
match &title.keyword {
Some(kw) => parts.push(format!("<span class=\"task task-{}\"></span>", kw)),
None => {}
}
(title.tags)
.iter()
.for_each(|tag| parts.push(format!("<span class=\"tag tag-{}\"></span>", tag)));
if self.in_public_heading {
let props = title.properties.clone().into_hash_map();
let id = props.get("ID");
match id {
Some(id) => {
write!(w, "<h{}><a class=\"id-anchor\" id=\"{}\"></a>", if title.level <= 6 { title.level } else { 6 }, id.clone())?
},
None => {
write!(w, "<h{}>", if title.level <= 6 { title.level } else { 6 })?
}
}
write!(w, "{}", parts.join(""))?
}
}
#+end_src
Because =orgize= doesn't parse the file-level =PROPERTIES= drawer, it's elided from the export. [[id:2e31b385-a003-4369-a136-c6b78c0917e1][org-fc]] state drawers are, too.
#+begin_src rust
@ -1226,541 +1241,6 @@ why no
export::html -> export::document
export::atom -> export::heading
* INPROGRESS The Atom exporter
:PROPERTIES:
:ID: 20240112T121157.583809
:END:
:LOGBOOK:
- State "INPROGRESS" from "NEXT" [2024-02-02 Fri 14:41]
:END:
The HTML exporter turns an org mode document in to HTML.
The Atom exporter turns a set of org mode headings in to an Atom feed for serving update notifications to a handful, perhaps even a dozen feed reader applications like [[id:20230310T155744.804329][tt-rss]] or feedly or what have you.
For now maybe it is easier to assume that the headings are all in one file; that's how the existing [[id:arcology/atom-gen][Arcology Feed Generator]] behaves, you can turn a page in to an rss feed with an unholy abomination of lua and pandoc and xml templates. Surely something better can be designed now.
the primary tension of the arroyo library now is that its design context is only in the realm of the arcology project's design goals. I need to start deciding whether a design goal of this library is to support non-arcology document systems. surely interoperable but different document systems could be built on top of arroyo
** CANCELLED First Pass
:LOGBOOK:
- State "CANCELLED" from [2024-02-04 Sun 16:02]
:END:
so the first pass of this API could take a file path, extract the feed metadata from keywords and heading properties; it could construct an entire atom feed, falling back to the custom HTML exporter to fill out the feed with text content. That's probably fine, and an API that other document servers could work with.
the trick to designing this is that a lot of different shit has to be bolted together
- the orgize parser has to iterate over each heading
- each heading needs to be htmlized
- each htmlized heading needs to be escaped
- the htmlized headings need to be injected in to the atom doc
it would be nice to do this in one pass...
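The one-pass shape those bullets describe can be sketched in miniature; the escaping function here is a minimal stand-in for orgize's =HtmlEscape=, and the entry layout is deliberately stripped down.

```rust
// Sketch of the pipeline: HTML-ize a heading, escape it, and drop it into
// an <entry>. escape_html is a stand-in for orgize's HtmlEscape.
fn escape_html(s: &str) -> String {
    s.replace('&', "&amp;").replace('<', "&lt;").replace('>', "&gt;")
}

fn atom_entry(title: &str, heading_html: &str) -> String {
    format!(
        "<entry><title>{}</title><content type=\"html\">{}</content></entry>",
        escape_html(title),
        escape_html(heading_html)
    )
}

fn main() {
    let entry = atom_entry("hello", "<h2>hello</h2><p>world</p>");
    // the heading's HTML arrives in the feed as HTML-encoded text
    assert!(entry.contains("&lt;h2&gt;hello&lt;/h2&gt;"));
    println!("{}", entry);
}
```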
This thing is similar in many respects to the HTML Handler, and it uses it directly in anger.
#+begin_src rust :tangle src/export_atom.rs
use anyhow::Result;
use regex;
use std::borrow::Cow;
use std::fs;
use std::io::{Error, Write};
use std::marker::PhantomData;
use orgize::export::{DefaultHtmlHandler, HtmlEscape, HtmlHandler, SyntectHtmlHandler};
use orgize::{elements, Element, Org};
use crate::export_html::ArroyoHtmlHandler;
use crate::export_html::ExportOptions;
#+end_src
This is just some basic implementation definitions and junk.
#+begin_src rust :tangle src/export_atom.rs
pub struct ArroyoAtomHandler<E: From<Error>, H: HtmlHandler<E>> {
pub options: ExportOptions,
pub inner: ArroyoHtmlHandler<E, H>,
pub error_type: PhantomData<E>,
// internal parser state
in_heading: bool,
in_drawer: bool,
heading_lvl: usize,
// Document metadata placed in feed
pub filetags: Vec<String>,
pub authors: Vec<String>,
pub feed_title: String,
pub feed_page_id: String,
pub last_date: String,
}
impl<E: From<Error>, H: HtmlHandler<E>> ArroyoAtomHandler<E, H> {
pub fn new(options: ExportOptions, inner: ArroyoHtmlHandler<E, H>) -> Self {
ArroyoAtomHandler {
inner,
options,
..Default::default()
}
}
}
impl<E: From<Error>, H: HtmlHandler<E>> Default for ArroyoAtomHandler<E, H> {
fn default() -> Self {
ArroyoAtomHandler {
inner: ArroyoHtmlHandler::default(),
error_type: PhantomData,
options: ExportOptions::default(),
in_heading: false,
in_drawer: false,
heading_lvl: 0,
last_date: "".into(),
feed_title: "".into(),
feed_page_id: "".into(),
filetags: vec![],
authors: vec![],
}
}
}
#+end_src
=atomize_file= is the "public" API point; it accepts an =ExportOptions= which is just passed to the underlying HTML exporter, but I really need to come up with a better options interface so that this exporter can take some options as well. These things are all nested like dolls: Atom handler -> HTML handler -> Syntect handler -> Default handler. That nested chain of exporters is then invoked, and the =impl= below is used to export the document node-by-node.
#+begin_src rust :tangle src/export_atom.rs
pub fn atomize_file(path: String, options: ExportOptions) -> Result<String> {
let syntect_handler = SyntectHtmlHandler::new(DefaultHtmlHandler);
let html_handler = ArroyoHtmlHandler::new(options.clone(), syntect_handler);
let mut handler = ArroyoAtomHandler::new(options.clone(), html_handler);
let org = String::from_utf8(fs::read(path.clone())?).unwrap();
let org_tree = &Org::parse_custom(
&org,
&orgize::ParseConfig {
// Need to pull these from environment or options...
todo_keywords: (
vec![
"NEXT".to_string(),
"INPROGRESS".to_string(),
"WAITING".to_string(),
],
vec!["DONE".to_string(), "CANCELLED".to_string()],
),
..Default::default()
},
);
let mut vec = vec![];
org_tree.write_html_custom(&mut vec, &mut handler)?;
Ok(String::from_utf8(vec)?)
}
#+end_src
This uses the same custom org handling that the HTML exporter does, but in a worse fashion.
#+begin_src rust :tangle src/export_atom.rs
impl<E: From<Error>, H: HtmlHandler<E>> HtmlHandler<E> for ArroyoAtomHandler<E, H> {
fn start<W: Write>(&mut self, mut w: W, element: &Element) -> Result<(), E> {
(match element {
#+end_src
The root element contains metadata which needs to be populated with data pulled from the doc, keywords, etc. This is defined below:
#+begin_src rust :tangle src/export_atom.rs
Element::Document { .. } => self.start_document(w, element),
#+end_src
=start_keyword= extracts document metadata required to populate the Atom feed.
#+begin_src rust :tangle src/export_atom.rs
Element::Keyword(kw) => self.start_keyword(w, kw),
#+end_src
Each title is immediately contained in a Heading, but contains all the actual metadata. It is the basis of a new heading and thus a new entry in the Atom document.
There's still some work to do here, but it will only publish headings that have an ID and a PUBDATE property, and don't have certain tags. The basic metadata for the heading is injected in to the final document.
The title handler needs to know to close the previous entry; this is pretty ugly, but the structure of the =Orgize= iterator doesn't work well for this. It's explained better below:
#+begin_src rust :tangle src/export_atom.rs
Element::Title(title) => self.start_title(w, title),
#+end_src
Because Orgize doesn't properly parse file-level =PROPERTIES= drawers in to a pseudo-heading, I have a forked version of Orgize that lets me reach in and do that myself. I handle =Text= elements to do this.
I would love to eliminate this logic, it's lifted from the parser above.
#+begin_src rust :tangle src/export_atom.rs
Element::Drawer(drawer) => {
self.in_drawer = drawer.name == "PROPERTIES" && self.feed_page_id.eq("");
self.start_rest(w, element)
}
Element::Text { value } => self.start_text(w, value),
#+end_src
Any other elements will be HTML-ized and then escaped, and emitted in to the doc as HTML-encoded HTML, assuming we're inside of a heading.
#+begin_src rust :tangle src/export_atom.rs
_t => self.start_rest(w, element),
})
.unwrap(); // if we can't parse something, just fucken panic.
#+end_src
#+begin_src rust :tangle src/export_atom.rs
Ok(())
}
fn end<W: Write>(&mut self, mut w: W, element: &Element) -> Result<(), E> {
(match element {
#+end_src
The end function's job is to close out document entities. We only use it to encode the end of the Document to make the XML valid; everything else goes through the same "HTML and HTML Escape" path as above, with Drawers resetting the internal state.
#+begin_src rust :tangle src/export_atom.rs
// Element::Title(_title) => {}
Element::Document { .. } => self.end_document(w, element),
#+end_src
#+begin_src rust :tangle src/export_atom.rs
Element::Drawer(drawer) => {
self.in_drawer = false;
self.end_rest(w, element)
}
#+end_src
#+begin_src rust :tangle src/export_atom.rs
_ => self.end_rest(w, element),
})
.ok();
Ok(())
}
}
#+end_src
All of those handlers are implemented in the =struct= =impl= below.
#+begin_src rust :tangle src/export_atom.rs
impl<E: From<Error>, H: HtmlHandler<E>> ArroyoAtomHandler<E, H> {
#+end_src
Processing the document to create metadata is going to be a bit of a pain because keywords need to be extracted out of the document and perhaps even turned in to URLs, etc...
#+begin_src rust :tangle src/export_atom.rs
fn start_document<W: Write>(&mut self, mut w: W, _document: &elements::Element) -> Result<()> {
Ok(write!(
w,
"<?xml version=\"1.0\" encoding=\"utf-8\"?>
<feed xmlns=\"http://www.w3.org/2005/Atom\">"
)?)
}
fn end_document<W: Write>(&mut self, mut w: W, _document: &elements::Element) -> Result<()> {
// the last heading/entry is still "open", close it.
write!(
w,
r#" </content>
</entry>
<updated>{}</updated>
</feed>"#,
self.last_date,
)?;
Ok(())
}
#+end_src
Processing a title in to an Atom =<entry>= is pretty ugly. Really what I want is a way to get all this out of Heading objects but those lack enough information to fill out the =<entry>= metadata.
#+begin_src rust :tangle src/export_atom.rs
fn start_title<W: Write>(&mut self, mut w: W, title: &elements::Title) -> Result<()> {
let ignore_tags = vec![
String::from("noexport"),
String::from("NOEXPORT"),
String::from("ignore"),
];
let export_tags = title
.tags
.clone()
.into_iter()
.map(String::from)
.find(|v| ignore_tags.contains(v));
let props = title.properties.clone().into_hash_map();
let id = props
.get("ID")
.map(|id| id.clone().into())
.unwrap_or("".to_string());
let pubdate = props
.get("PUBDATE")
.map(|pubdate| pubdate.clone().into())
.unwrap_or("".to_string());
let the_link = self.inner.rewrite_link_from(&id);
if id == "" || pubdate == "" || export_tags.is_some() && self.in_heading {
self.in_heading = false;
Ok(())
} else if id != "" && pubdate != "" && export_tags.is_none() {
// close previous heading; note that self.heading_lvl defaults to 0
if title.level <= self.heading_lvl {
write!(w, "</content>\n")?;
write!(w, "</entry>\n")?;
}
let date = match rfcize_datestamp(pubdate.clone()) {
Ok(date) => date,
Err(_) => {
dbg!(format!("bad date {}", pubdate.clone()));
HtmlEscape(pubdate.clone()).to_string()
}
};
if self.last_date < date {
self.last_date = date.clone();
}
let title_text = match strip_links_from_str(&title.raw.clone()) {
Ok(text) => HtmlEscape(text),
Err(the_err) => {
dbg!(format!("bad title {} {}", title.raw.clone(), the_err));
HtmlEscape(title.raw.to_string())
}
};
let cat_xmls = "";
let s = format!(
"<entry>
<title>{}</title>
<link href=\"{}\"/>
<id>{}</id>
<updated>{}</updated>\n
{}
<content type=\"html\">&lt;h{}&gt;", // the HTML encoded heading opening needs to be added here!
title_text,
the_link.to_string(), // link
the_link.to_string(), // ID
date,
cat_xmls,
title.level,
);
self.heading_lvl = title.level;
self.in_heading = true;
Ok(write!(w, "{}", s)?)
} else {
Ok(())
}
}
#+end_src
#+begin_src rust :tangle src/export_atom.rs
fn start_keyword<W: Write>(&mut self, mut w: W, kw: &elements::Keyword) -> Result<()> {
// dbg!(kw);
match kw.key.as_ref() {
"FILETAGS" => {
kw.value
.split(":")
.map(String::from)
.filter(|s| !s.is_empty())
.for_each(|s| self.filetags.push(s));
dbg!(&self.filetags);
}
"TITLE" => {
self.feed_title = kw.value.to_string();
write!(w, r#"<title>{}</title>"#, self.feed_title)?;
}
"AUTHOR" => {
let re = regex::Regex::new(r"(?<name>[\w\s\d]+) <(?<email>.*)>").unwrap();
re.captures_iter(&kw.value)
.map(|caps| {
format!(
"<author><name>{}</name><email>{}</email></author>",
&caps["name"], &caps["email"]
)
})
.for_each(|s| {
self.authors.push(s.clone());
write!(w, "{}", s).ok();
});
dbg!(&self.authors);
}
_ => {}
}
Ok(())
}
#+end_src
#+begin_src rust :tangle src/export_atom.rs
fn start_text<W: Write>(&mut self, mut w: W, text: &Cow<str>) -> Result<()> {
if self.in_drawer == true {
let (_, prop_drawer): (_, orgize::elements::PropertiesMap) =
orgize::elements::Drawer::parse_drawer_content(text)
.expect("failed to parse properties drawer");
let properties = prop_drawer.into_hash_map();
let mut cur_id: Option<String> = properties.get("ID").map(|s| s.to_string());
if cur_id.is_none() {
cur_id = properties.get("CUSTOM_ID").map(|s| s.to_string())
}
let the_id = cur_id.expect("Can't publish a page without ID");
self.feed_page_id = the_id.clone();
let page_link = self.inner.rewrite_link_from(&the_id);
write!(
w,
r#"<link href="{}"/>
<id>{}</id>"#,
page_link, page_link
)?;
} else {
self.start_rest(
w,
&Element::Text {
value: text.clone(),
},
)?
}
Ok(())
}
fn start_rest<W: Write>(&mut self, mut w: W, element: &elements::Element) -> Result<()> {
Ok(if self.in_heading == true {
let mut buf = InternalWriter::new();
self.inner.start(&mut buf, element).ok(); // swallow inner handler errors
let s = buf.to_utf8().unwrap();
write!(w, "{}", s)?
})
}
fn end_rest<W: Write>(&mut self, mut w: W, element: &elements::Element) -> Result<()> {
if self.in_heading == true {
let mut buf = InternalWriter::new();
self.inner.end(&mut buf, element).ok();
let s = buf.to_utf8().unwrap();
write!(w, "{}", s)?
}
Ok(())
}
}
#+end_src
*** Strip Links from Strings
#+begin_src rust :tangle src/export_atom.rs
fn strip_links_from_str(in_str: &str) -> Result<String> {
// title.raw.replace("[", "&#91;").replace("]", "&#92;"),
let re = regex::Regex::new(r"\[\[(?<the_link>[^\]]+)\]\[(?<text>[^\]]+)\]\]")?;
Ok(re.replace_all(in_str, "$text").to_string())
}
#+end_src
*** Convert my org-style timestamps to RFC-3339 strings
#+begin_src rust :tangle src/export_atom.rs
fn rfcize_datestamp(in_str: String) -> Result<String> {
let re = regex::Regex::new(
r"<?(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) \w+ (?<hour>\d{2}):(?<minutes>\d{2})>?",
)?;
let date: Option<String> = re
.captures_iter(&in_str)
.map(|caps| {
let year = caps.name("year").unwrap().as_str();
let month = caps.name("month").unwrap().as_str();
let day = caps.name("day").unwrap().as_str();
let hour = caps.name("hour").unwrap().as_str();
let minutes = caps.name("minutes").unwrap().as_str();
let ret: String = format!("{}-{}-{}T{}:{}:00-08:00", year, month, day, hour, minutes);
ret
})
.next();
Ok(date.ok_or(Error::new(std::io::ErrorKind::Other, "invalid date"))?)
}
#+end_src
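For illustration, the same transformation can be sketched without the regex crate; the hardcoded =-08:00= offset matches the code above, and =rfcize_simple= is a hypothetical name.

```rust
// Worked example of the conversion above: an org timestamp like
// <2024-02-02 Fri 14:41> becomes the RFC-3339 string
// 2024-02-02T14:41:00-08:00 (with the offset hardcoded, as above).
fn rfcize_simple(ts: &str) -> Option<String> {
    // strip the angle brackets, then split "<date> <dayname> <time>"
    let inner = ts.trim_start_matches('<').trim_end_matches('>');
    let mut parts = inner.split_whitespace();
    let date = parts.next()?;
    let _dayname = parts.next()?;
    let time = parts.next()?;
    Some(format!("{}T{}:00-08:00", date, time))
}

fn main() {
    assert_eq!(
        rfcize_simple("<2024-02-02 Fri 14:41>").unwrap(),
        "2024-02-02T14:41:00-08:00"
    );
}
```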
*** Internal Buffer Writer for Escaping Entities
I implemented a really simple/dumb =Write= interface which will =HtmlEscape= things written to it. I should make this take a String under the hood instead of =Vec<u8>= but meh, it's good enough for now.
#+begin_src rust :tangle src/export_atom.rs
struct InternalWriter {
inner: Vec<u8>,
}
impl InternalWriter {
pub fn to_utf8(self) -> Result<String> {
return Ok(String::from_utf8(self.inner)?);
}
pub fn new() -> InternalWriter {
return InternalWriter { inner: Vec::new() };
}
}
impl Write for &mut InternalWriter {
#[inline]
fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
let s = String::from_utf8(buf.to_vec()).unwrap();
let esc = format!("{}", s);
self.inner.write(esc.as_bytes())
}
#[inline]
fn write_vectored(&mut self, bufs: &[std::io::IoSlice<'_>]) -> std::io::Result<usize> {
self.inner.write_vectored(bufs)
}
#[inline]
fn flush(&mut self) -> std::io::Result<()> {
self.inner.flush()
}
#[inline]
fn write_all(&mut self, buf: &[u8]) -> std::io::Result<()> {
let s = String::from_utf8(buf.to_vec()).unwrap();
let esc = format!("{}", s);
// dbg!(s.clone());
// dbg!(esc.clone());
self.inner.write_all(esc.as_bytes())
}
#[inline]
fn write_fmt(&mut self, fmt: std::fmt::Arguments<'_>) -> std::io::Result<()> {
let s = match fmt.as_str() {
Some(s) => String::from(s),
_ => fmt.to_string(),
};
let s2 = format!("{}", HtmlEscape(s));
self.inner.write(s2.as_bytes())?;
Ok(())
}
}
#+end_src
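The String-backed rewrite suggested above might look like this; a hypothetical sketch, not the tangled code.

```rust
// A minimal String-backed escaping buffer, the direction the note above
// suggests (a String under the hood instead of Vec<u8>).
struct EscapingWriter {
    inner: String,
}

impl EscapingWriter {
    fn new() -> Self {
        EscapingWriter { inner: String::new() }
    }

    // escape as we append, so the buffer always holds valid HTML text
    fn push_escaped(&mut self, s: &str) {
        for c in s.chars() {
            match c {
                '&' => self.inner.push_str("&amp;"),
                '<' => self.inner.push_str("&lt;"),
                '>' => self.inner.push_str("&gt;"),
                _ => self.inner.push(c),
            }
        }
    }
}

fn main() {
    let mut w = EscapingWriter::new();
    w.push_escaped("<h1>hi</h1>");
    assert_eq!(w.inner, "&lt;h1&gt;hi&lt;/h1&gt;");
}
```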
** About the First Pass
I hate that code. It was worth a try, but it's not good; it's super janky. I'm going to add sub-heading support and compose the feeds on the Django side. this API is cleaner but with a different separation of concerns[citation needed]. This can be done by just adding an ExportOption and a struct state variable tracking whether the parser has reached a heading it should be exporting.
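That state tracking could be as small as this hypothetical sketch: a flag that flips on when the parser reaches the limited heading and off when it climbs back to that heading's level or above.

```rust
// Hypothetical sketch of the state variable described above; names are
// mine, not arroyo's.
struct LimitState {
    exporting: bool,
    limit_level: usize,
}

impl LimitState {
    fn on_heading(&mut self, id: &str, level: usize, limit_id: &str) {
        if id == limit_id {
            self.exporting = true;
            self.limit_level = level;
        } else if self.exporting && level <= self.limit_level {
            // left the limited subtree
            self.exporting = false;
        }
    }
}

fn main() {
    let mut st = LimitState { exporting: false, limit_level: 0 };
    st.on_heading("abc", 2, "abc");
    assert!(st.exporting);
    st.on_heading("child", 3, "abc");
    assert!(st.exporting); // children stay included
    st.on_heading("sibling", 2, "abc");
    assert!(!st.exporting); // a sibling at the same level ends the export
}
```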
The Exporter design model is fine, the whole thing where you can nest them. but the code, my rust ability, and the structure of the element iterator in the orgize library make it sort of bodgy and difficult to understand or change, even though there is a literate discussion surrounding it. a subheading export API can be unit tested in ways the exporter cannot.
so the second pass:
** Second API
there's another option, where an API takes a list of headings and feed metadata, and it parses each heading and its subheadings to HTML. *this is an API I already want to provide to document systems*, and should be written. it could take arbitrary document headings provided through the public interface, and construct multi-page feeds.
this requires the ability to export only a given subheading, which I could implement maybe more simply than the mess I wrote in the first pass.
This API could be memoized on the python side with functools.cache so that the headings only have to be exported once.
this would allow me to microblog from my Journal, by allowing feeds to contain headings from arbitrary pages. this is Good. so let's do that.
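A sketch of what the assembly step of that heading-list API might look like; all names here are hypothetical illustrations, not existing arroyo functions, and the entries are assumed to arrive already HTML-exported and escaped.

```rust
// Hypothetical second-pass API: feed metadata plus pre-exported
// (heading_id, escaped_html) pairs, assembled into one Atom document.
struct FeedMeta {
    title: String,
    page_url: String,
    updated: String,
}

fn build_feed(meta: &FeedMeta, entries: &[(String, String)]) -> String {
    let mut out = format!(
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<feed xmlns=\"http://www.w3.org/2005/Atom\">\n<title>{}</title>\n<id>{}</id>\n<updated>{}</updated>\n",
        meta.title, meta.page_url, meta.updated
    );
    for (id, html) in entries {
        out.push_str(&format!(
            "<entry><id>{}</id><content type=\"html\">{}</content></entry>\n",
            id, html
        ));
    }
    out.push_str("</feed>");
    out
}

fn main() {
    let meta = FeedMeta {
        title: "Journal".into(),
        page_url: "https://example.org/journal".into(),
        updated: "2024-02-17T16:41:08-08:00".into(),
    };
    let feed = build_feed(&meta, &[("h1".into(), "&lt;p&gt;hi&lt;/p&gt;".into())]);
    assert!(feed.starts_with("<?xml"));
    assert!(feed.ends_with("</feed>"));
}
```

Because the headings can come from arbitrary pages, this is the shape that would allow feeds composed from more than one document.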
** Hacky solution
we could just cobble together a version of [[https://github.com/tanrax/RSSingle][RSSingle]]; [[id:personal_software_can_be_shitty][Personal Software Can Be Shitty]].
** Future API
way out there: how do feed readers behave if the "feed" is just the linearized document with updated-at and whatnot applied to it? The feed would send the entire page with each update, but what if each heading could then be processed in to a diff or summary of changes? how could i possibly do that well, anyhow?
* Library definition and exports for the native Python library
:PROPERTIES:
:header-args:rust: :tangle src/lib.rs :mkdirp yes

View File

@ -1,436 +0,0 @@

View File

@ -147,8 +147,6 @@ impl<E: From<Error>, H: HtmlHandler<E>> HtmlHandler<E> for ArroyoHtmlHandler<E,