Title:  Build-a-blog
Date:   2024-06-17T14:46:36-04:00
---
I want to share my thought process for how to go about building something from scratch in a stream-of-consciousness/live coding sort of format.

So for the first stupid post on this stupid blog, I've set a goal to take 1 afternoon + caffeine + some DIY spirit → _something_ resembling a static site/blog generator to build this website.

There will be nothing groundbreaking here - in fact this software will not be good.

But if you decide to continue reading, I hope by the end of this post you might be inspired to re-invent your own wheel for the fun of it. Or don't and just use [Hugo][hugo].

Let's see how hard this will be.

Here are the requirements for this blog:

* Generate each individual post written in markdown -> html
* Generate an index with a list of recent posts.
    * Will need some metadata (ex. date) in each post to do this.
* Generate RSS

That boils down to:

1. Read some files
2. Parse markdown, maybe parse a header with some key/values.
3. Template strings

So there is 1 "exotic" feature in parsing/rendering Markdown as HTML that will need some thought.

The rest is just file and string manipulation.
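
As a rough sketch of step 2, here is what "parse a header with some key/values" could look like with nothing but string methods. The `split_header` helper and the `---` separator convention are assumptions for illustration only, not necessarily what the final script uses.

```python
def split_header(text):
    """Split 'Key: value' header lines from the body at the first '---' line."""
    meta = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == '---':
            # Everything after the separator is the Markdown body.
            return meta, '\n'.join(lines[i + 1:])
        key, sep, value = line.partition(':')
        if sep:
            # partition() splits on the first colon only, so ISO dates survive.
            meta[key.strip().lower()] = value.strip()
    # No separator found: treat the whole text as body.
    return {}, text


sample = "Title:  Build-a-blog\nDate:   2024-06-17T14:46:36-04:00\n---\nHello *world*.\n"
meta, body = split_header(sample)
print(meta)
print(body)
```

Splitting on the first colon only is the one detail that matters here, since the date value itself contains colons.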

Let's get it on.

## Picking the tool for the job

Most scripting languages would be fine tools for this task. But how to handle Markdown?

I've had [Crystal][1] in the back of my mind for this task. It is a nice general purpose language that included Markdown in the stdlib! But unfortunately Markdown was removed in [0.31.0][2]. Other than that, I'm not sure any other languages include a well rounded Markdown implementation out of the box.

I'll likely end up building the site in Docker with an Alpine image down the road, so just a quick search in Alpine's repos to see what could be useful:

```shell
❯ docker run --rm -it alpine
/ # apk update
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/main]
v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/community]
OK: 20079 distinct packages available
/ # apk search markdown
discount-2.2.7c-r1
discount-dev-2.2.7c-r1
discount-libs-2.2.7c-r1
kdepim-addons-23.04.3-r0
markdown-1.0.1-r3
markdown-doc-1.0.1-r3
py3-docstring-to-markdown-0.12-r1
py3-docstring-to-markdown-pyc-0.12-r1
py3-html2markdown-0.1.7-r3
py3-html2markdown-pyc-0.1.7-r3
py3-markdown-3.4.3-r1
py3-markdown-it-py-2.2.0-r1
py3-markdown-it-py-pyc-2.2.0-r1
py3-markdown-pyc-3.4.3-r1
```

[`py3-markdown` in alpine][3] is the popular [`python-markdown`][4]. It's mature and available as a package in my [home distro][5].

Incredible.

## Let's build

First, let's read 1 post file and render some html.

I'll store posts in `posts/` like `posts/build_a_blog.md`.

And store the HTML output in the same directory: `posts/build_a_blog.html`.

```python
import re
import logging

import markdown
destpath_re = re.compile(r'\.md$')

logging.basicConfig(encoding='utf-8', level=logging.INFO)

def render_post(fpath):
	destpath = destpath_re.sub('.html', fpath)
	logging.info("opening %s for parsing, dest %s", fpath, destpath)
	# from: https://python-markdown.github.io/reference/
	with open(fpath, "r", encoding="utf-8") as input_file:
		logging.info("reading %s", fpath)
		text = input_file.read()

	logging.info("parsing %s", fpath)
	out = markdown.markdown(text)

	with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
		logging.info("writing to %s", destpath)
		output_file.write(out)

if __name__ == '__main__':
	render_post('posts/build_a_blog.md')
```

And run it:

```shell
❯ python3 ./main.py
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html

❯ head posts/build_a_blog.html
<h1>Build-a-blog</h1>
<p>I want to share my thought process for how one would go about building a static blog generator from scratch.</p>
<ul>
<li>Generate an index with recent list of posts.</li>
<li>Generate each individual post written in markdown -&gt; html<ul>
<li>Support some metadata in each post</li>
<li>A post title should have a slug</li>
</ul>
</li>
<li>Generate RSS</li>
```

Looking pretty good.

Now let's do this for all `.md` files in `posts/`:

```python
import glob
...

def render_posts():
	files = glob.glob('posts/*.md')
	logging.info('found post files %s', files)
	for fname in files:
		render_post(fname)

if __name__ == '__main__':
	render_posts()
```

And add another simple test post

```shell
❯ echo '# A new post' > ./posts/a_new_post.md
❯ python3 ./main.py
INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
INFO:root:reading posts/a_new_post.md
INFO:root:parsing posts/a_new_post.md
INFO:root:writing to posts/a_new_post.html
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html
❯ head ./posts/a_new_post.html
<h1>A new post</h1>
```

Now on to listing and templating the posts.

## Post ordering and templating

`python-markdown` supports metadata embedded in posts: <https://python-markdown.github.io/extensions/meta_data/>

I thought I'd need to build something here, but turns out it's exactly what I need to assign a few extra attributes to a post.

I'll adjust our "spec" for posts such that each post must include the following metadata at the top of the file:

```txt
Title:  Build-a-blog
Date:   2024-06-17T14:46:36-04:00
---
```

And I'd like to insert the `Title` automatically as a `<h1>` tag in each post so I don't have to write it again in the markdown.

So first, let's test the metadata and adjust the test blog post.

```shell
❯ head -n4 ./posts/build_a_blog.md
Title:  Build-a-blog
Date:   2024-06-17T14:46:36-04:00
---
```

And pop open a python repl to see how this works.

```python
>>> md = markdown.Markdown(extensions = ['meta']); f = open('posts/build_a_blog.md', 'r'); txt = f.read(); out = md.convert(txt); md.Meta
{'title': ['Build-a-blog'], 'date': ['2024-06-17T14:46:36-04:00']}
```

Looks pretty nice!
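
One thing worth noticing in that output is that every value is a list, which is why a `[0]` is needed to get at the actual string. A minimal sketch of a lookup helper, assuming a dict shaped like the `md.Meta` output above (the `first_meta` name is made up for this example):

```python
def first_meta(meta, key, default=None):
    """Return the first value for `key` from a python-markdown style Meta dict."""
    values = meta.get(key)
    if not values:
        # Missing key (or an empty list): fall back instead of raising.
        return default
    return values[0]


# Same shape as the `md.Meta` output shown above.
meta = {'title': ['Build-a-blog'], 'date': ['2024-06-17T14:46:36-04:00']}
print(first_meta(meta, 'title'))
print(first_meta(meta, 'draft', default=''))
```

Without something like this, `meta.get('title')[0]` raises a `TypeError` for a post that forgot its `Title:` line, which may be a perfectly fine way to fail for a personal blog.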

So first I will adjust the rendering function to prepend a line:

```markdown
# {title}
```

This has to be done after the full document renders because the `meta` `python-markdown` extension extracts metadata and converts to html in one call.


```python
def render_post(fpath):
	...

	md = markdown.Markdown(extensions = ['meta'])

	logging.info("parsing %s", fpath)
	out = md.convert(text)

	title = md.Meta.get('title')[0]
	date = md.Meta.get('date')[0]

	out = markdown.markdown('# ' + title) + out
```

Finally, let's return a structure that makes other parts of the program aware of the filename that was rendered and the metadata (title, date).


```python
def render_post(fpath):
	...
	out = markdown.markdown('# ' + title) + out

	with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
		logging.info("writing to %s", destpath)
		output_file.write(out)

	return {
		'title': title,
		'date': date,
		'fpath': fpath,
		'destpath': destpath,
	}
```

Now I have what I need to generate the index page.

### Index templating

Let's start by defining what our index template file will be.

I'll choose `index.html.tmpl` which will write to `index.html` when rendered.

So let's make a function that will take a list of our post structures from above and render it in a `<ul>`.

```python
import datetime
from string import Template
...
def posts_list_html(posts):
	post_tpl = """<li>
		<a href="{href}">{title}</a>
		<time datetime="{date}">{disp_date}</time>
	</li>"""
	out = '<ul class="blog-posts-list">'
	for post in posts:
		disp_date = datetime.datetime.fromisoformat(post.get('date')).strftime('%Y-%m-%d')
		out += post_tpl.format(href=post.get('destpath'),
						 title=post.get('title'),
						 date=post.get('date'),
						 disp_date=disp_date)
	return out + '</ul>'

def render_index(posts):
    fname = 'index.html.tmpl'
    outname = 'index.html'

    with open(fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    posts_html = posts_list_html(posts)

    html = tmpl.substitute(posts=posts_html)

    with open(outname, 'w', encoding='utf-8') as outf:
        outf.write(html)
```

Make sure that `index.html.tmpl` contains a template variable for `${posts}`

```shell
❯ grep -C2 '\${posts}' ./index.html.tmpl
  <div class="col-md-8 col-sm-12">
    <p>Welcome. Something will go here eventually.</p>
    ${posts}
  </div>
  <div class="col-md-4 col-sm-12">
```
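
If `string.Template` is new to you, here is a small standalone sketch of how it behaves, using a made-up inline template rather than the real `index.html.tmpl`:

```python
from string import Template

tmpl = Template('<div>\n  ${posts}\n</div>')

# substitute() fills every placeholder and raises KeyError if one is missing.
html_out = tmpl.substitute(posts='<ul></ul>')
print(html_out)

try:
    tmpl.substitute()
except KeyError as err:
    missing = err.args[0]
    print('missing placeholder:', missing)

# safe_substitute() leaves unknown placeholders in place instead.
untouched = tmpl.safe_substitute()
print(untouched)
```

The strict `substitute()` is a reasonable default for a generator like this, since a typo in a placeholder name fails loudly at build time. One thing to watch for: a literal `$` in the template (inline CSS or JS, say) has to be written as `$$`.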

Now I need to connect `render_posts()`, which returns each post that was processed, to `render_index()`:

```python
def render_posts():
	files = glob.glob('posts/*.md')
	logging.info('found post files %s', files)
	posts = []
	for fname in files:
		p = render_post(fname)
		posts.append(p)
		logging.info('rendered post: %s', p)

	return posts

if __name__ == '__main__':
	posts = render_posts()
	logging.info('rendered posts: %s', posts)
	render_index(posts)
```

And let's run it!

```shell
❯ python3 ./main.py
INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
INFO:root:reading posts/a_new_post.md
INFO:root:parsing posts/a_new_post.md
INFO:root:writing to posts/a_new_post.html
INFO:root:rendered post: {'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html
INFO:root:rendered post: {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}
INFO:root:rendered posts: [{'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}, {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}]
```

And check how the output looks:
```shell
❯ grep -C4 'blog-posts-list' ./index.html
  </nav>
  <section class="container">
    <div class="row">
      <div class="col-md-8 col-sm-12">
        <ul class="blog-posts-list"><li>
		<a href="posts/a_new_post.html">A new post</a>
		<time datetime="2024-06-17T19:48:17-04:00">2024-06-17</time>
	</li><li>
		<a href="posts/build_a_blog.html">Build-a-blog</a>
```

Not bad!

### Post templating

I think I want my blog to just maintain the overall layout from the index page and just render the post body where the main post list is.

So let's make that template rendering a bit more general.

I'll also rename the content-area template variable from `${posts}` to `${content}`.

```python
def render_template(tpl_fname, out_fname, content_html):
    with open(tpl_fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    html = tmpl.substitute(content=content_html)

    with open(out_fname, 'w', encoding='utf-8') as outf:
        outf.write(html)

def render_index(posts):
	content_html = posts_list_html(posts)
	render_template('index.html.tmpl', 'index.html', content_html)
```

And now adjust where posts are written out.

```python
def render_post(fpath):
	...
	out = markdown.markdown('# ' + title) + out
	logging.info("writing to %s", destpath)
	render_template('index.html.tmpl', destpath, out)
```

After running, each `posts/*.html` file should use the full index template, with the generated post HTML in place of the post list.

### Post sorting

With everything wired up, I now just need to sort the posts list by the date metadata.

Let's do a bit of python repl sort testing because I never remember `datetime` usage.

Let's generate a few nicely formatted ISO date strings for testing.

```shell
❯ date -d'2023-01-01' -Is
2023-01-01T00:00:00-05:00
❯ date -Is
2024-06-17T16:30:35-04:00
```

And make a test array

```python
>>> posts = [{'date': '2023-01-01T00:00:00-05:00'}, {'date': '2024-06-17T16:30:35-04:00'}]
```

With our current script, the older post would be listed first. So let's try a sort.

```python
# Double checking datetime parsing
>>> import datetime
>>> newer = datetime.datetime.fromisoformat('2024-06-17T16:30:35-04:00')
>>> newer
datetime.datetime(2024, 6, 17, 16, 30, 35, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000)))
>>> older = datetime.datetime.fromisoformat('2023-01-01T00:00:00-05:00')
>>> older
datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400)))

# Checking python sorting methods work as expected
>>> newer.__gt__(older)
True
>>> newer.__lt__(older)
False
>>> older.__gt__(newer)
False
>>> older.__lt__(newer)
True

# Doing the sort
>>> sorted(posts, key=lambda x: datetime.datetime.fromisoformat(x['date']), reverse=True)
[{'date': '2024-06-17T16:30:35-04:00'}, {'date': '2023-01-01T00:00:00-05:00'}]
```

Now let's apply this to our posts.

```python
if __name__ == '__main__':
	posts = render_posts()
	logging.info('rendered posts: %s', posts)
	sorted_posts = sorted(posts,
					   key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
	render_index(sorted_posts)
```
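
Parsing the dates before sorting might look like overkill when ISO strings already sort nicely as text. It starts to matter once posts carry different UTC offsets. A small sketch with two made-up posts shows the difference:

```python
import datetime

posts = [
    {'title': 'late evening, New York', 'date': '2024-06-17T23:00:00-04:00'},
    {'title': 'early morning, Berlin', 'date': '2024-06-18T01:00:00+02:00'},
]

def post_datetime(post):
    # Aware datetimes compare by the instant they represent, not by local wall time.
    return datetime.datetime.fromisoformat(post['date'])

by_string = sorted(posts, key=lambda p: p['date'], reverse=True)
by_instant = sorted(posts, key=post_datetime, reverse=True)

print([p['title'] for p in by_string])
print([p['title'] for p in by_instant])
```

The New York post was written four hours after the Berlin one, but a plain string sort puts Berlin first because its local date is later.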

### `<title />` Templating

The last bit of templating is to make each post `<title>` different.

I'll try something like `<title>cfebs.com - ${title}</title>`

So in `index.html.tmpl`:

```html
<title>cfebs.com${more_title}</title>
```

For the index page, `more_title` is just an empty string. Note that `render_template` takes a dict of substitutions here instead of a single content string.

```python
def render_index(posts):
	content_html = posts_list_html(posts)
	render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})
```

But for a post:

```python
def render_post(fpath):
    ...
	title = md.Meta.get('title')[0]
	date = md.Meta.get('date')[0]

	out = markdown.markdown('# ' + title) + out

	logging.info("writing to %s", destpath)
	render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
```

This is now a functioning blog generator with templating!


## RSS

This should be pretty easy as RSS is just reformatting our blog index list into different XML.

The `render_template` function will be useful here with a few more tweaks. So I'll make another template file (based off a reference <https://drewdevault.com/blog/index.xml>)

```shell
# Grab the reference
❯ curl -sL 'https://drewdevault.com/blog/index.xml' > index.xml.example

# After a bit of editing
❯ cat ./index.xml.tmpl
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>${site_title}</title>
		<link>${site_link}</link>
		<description>${description}</description>
		<language>en</language>
		<lastBuildDate>${last_build_date}</lastBuildDate>
		<atom:link href="${self_full_link}" rel="self" type="application/rss+xml" />
		${items}
	</channel>
</rss>
```

`render_template` now gets even more generic and passes a `dict` to `Template.substitute()`

```python
def render_template(tpl_fname, out_fname, subs):
    with open(tpl_fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    out = tmpl.substitute(subs)

    with open(out_fname, 'w', encoding='utf-8') as outf:
        outf.write(out)
```

And make sure to adjust any usages of `render_template` that exist.

```python
def render_index(posts):
	content_html = posts_list_html(posts)
	# every placeholder in the template needs a value, including more_title
	render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})

def render_post(fpath):
	...
	render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
```

And now I can hack away at RSS generation:

```python
def render_rss_index(posts):
    subs = {
        'site_title': 'cfebs.com',
        'site_link': 'https://cfebs.com',
        'self_full_link': 'https://cfebs.com/index.xml',
        'description': 'Recent content from cfebs.com',
        'last_build_date': 'TODO',
        'items': 'TODO',
    }
    render_template('index.xml.tmpl', 'index.xml', subs)
```

After this initial test and a `python3 ./main.py` run, I should see the XML filled out.

```shell
❯ cat ./index.xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>cfebs.com</title>
		<link>https://cfebs.com</link>
		<description>Recent content from cfebs.com</description>
		<language>en</language>
		<lastBuildDate>TODO</lastBuildDate>
		<atom:link href="https://cfebs.com/index.xml" rel="self" type="application/rss+xml" />
		TODO
	</channel>
</rss>
```

Now let's finish up by generating each item entry and collecting them to be substituted into the template.

This adds another `md.convert()` call, so why not add a util to reuse a single Markdown instance?

```python
# add a module var and helper function to reuse a single Markdown instance
import datetime
import email.utils
import html

import markdown
from markdown.extensions.toc import TocExtension

md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])

def convert(text):
	# reset() clears state left over from the previous document
	md.reset()
	return md.convert(text)

def render_post(fpath):
	...
	out = convert(text)

	# read the metadata now, the next convert() call replaces md.Meta
	title = md.Meta.get('title')[0]
	date = md.Meta.get('date')[0]

	out = convert('# ' + title) + out
	...

def rss_post_xml(post):
	tpl = """
	<item>
		<title>{title}</title>
		<link>{link}</link>
		<pubDate>{pubdate}</pubDate>
		<guid>{link}</guid>
		<description>{description}</description>
	</item>
	"""

	with open(post['fpath'], 'r', encoding='utf-8') as inf:
		text = inf.read()

	converted = convert(text)

	link = "https://cfebs.com/" + post['destpath']
	pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(post['date']))
	subs = dict(title=post['title'], link=link,
			pubdate=pubdate,
			description=converted)

	# escape every value so HTML in the title or body cannot break the XML
	for k, v in subs.items():
		subs[k] = html.escape(v)

	return tpl.format(**subs)

def render_rss_index(posts):
	items = ''
	for post in posts[:5]:
		items += rss_post_xml(post)

	subs = {
		'site_title': 'cfebs.com',
		'site_link': 'https://cfebs.com',
		'self_full_link': 'https://cfebs.com/index.xml',
		'description': 'Recent content from cfebs.com',
		'last_build_date': email.utils.format_datetime(datetime.datetime.now()),
	}
	for k,v in subs.items():
		subs[k] = html.escape(v)

	subs['items'] = items
	render_template('index.xml.tmpl', 'index.xml', subs)
```

* Need to use `html.escape` anywhere there could be HTML tags in output.
* `posts[:5]` should always take the most recent 5 posts to add to the RSS feed.
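
To make the escaping and date formatting a bit more concrete, here is a standalone sketch of the two stdlib calls doing the heavy lifting. The sample strings are made up, and the timezone-aware `now()` at the end is a possible tweak rather than what the script above does.

```python
import datetime
import email.utils
import html

# html.escape() turns markup into text that is safe inside an XML element.
escaped = html.escape('<p>Tom & "Jerry"</p>')
print(escaped)

# RSS 2.0 dates use the RFC 822 style that email.utils.format_datetime() produces.
post_date = datetime.datetime.fromisoformat('2024-06-17T14:46:36-04:00')
pubdate = email.utils.format_datetime(post_date)
print(pubdate)

# An aware "now" keeps a real UTC offset in lastBuildDate instead of "-0000".
last_build = email.utils.format_datetime(datetime.datetime.now(datetime.timezone.utc))
print(last_build)
```

A naive `datetime.datetime.now()` still formats fine, it just ends in `-0000` because there is no offset to report.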

## Wrapping up

Reached the end of the afternoon, so this is where I'll leave it.

It's not great software.

* No tests, no docs
* No validation of input or function arguments
* Hard coding values like the domain
* Using adhoc dicts for generic structures
* Relies on system python version and packages.
* Does not offer anything a tool like [hugo][hugo] does not already offer.
* Probably slow in extreme cases.

But, it's ~150 lines of python with 1 external dependency.

If python or `python-markdown` drastically changes, it'll probably take <10 minutes to debug.

And - it was fun to write and write about.

View the complete source for generating this blog:

* [main.py](https://gitlab.com/cfebs/cfebs-blog/-/blob/main/main.py)
* [index.html.tmpl](https://gitlab.com/cfebs/cfebs-blog/-/blob/main/index.html.tmpl)
* [index.xml.tmpl](https://gitlab.com/cfebs/cfebs-blog/-/blob/main/index.xml.tmpl)

Or the full repo tree: <https://gitlab.com/cfebs/cfebs-blog/>

## EDIT

Any later additions will go here.

### 2024-06-19, adding draft state

It might be nice to work on a rough draft, generate it for previewing, track it in git, but skip including it in the posts index.

So I'll add a piece of post metadata called `Draft` and use `filter()` before the posts are sorted or applied to the `index.html` or RSS `index.xml`.

This is the result.

```diff
diff --git a/main.py b/main.py
index bfd9382..52ce57b 100644
--- a/main.py
+++ b/main.py
@@ -31,6 +31,9 @@ def render_post(fpath):

 	title = md.Meta.get('title')[0]
 	date = md.Meta.get('date')[0]
+	draft = False
+	if md.Meta.get('draft'):
+		draft = True

 	out = convert('# ' + title) + out

@@ -42,6 +45,7 @@ def render_post(fpath):
 		'date': date,
 		'fpath': fpath,
 		'destpath': destpath,
+		'draft': draft,
 	}

 def render_posts():
@@ -134,6 +138,7 @@ def render_rss_index(posts):
 def main():
 	posts = render_posts()
 	logging.info('rendered posts: %s', posts)
+	posts = filter(lambda p: not p['draft'], posts)
 	sorted_posts = sorted(posts,
 					   key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
 	render_index(sorted_posts)
```
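
One quirk of the check in that diff: any value at all counts as a draft, so `Draft: 0` or `Draft: false` would also hide a post. If that ever gets annoying, a stricter check could look something like this sketch (the `is_draft` helper and its list of "false" words are assumptions for illustration, not part of the actual script):

```python
def is_draft(meta):
    """Interpret a python-markdown style `draft` entry, treating 0/false/no as not a draft."""
    values = meta.get('draft') or ['']
    return values[0].strip().lower() not in ('', '0', 'false', 'no')


posts = [
    {'title': 'Published', 'draft': is_draft({})},
    {'title': 'Test thing', 'draft': is_draft({'draft': ['1']})},
    {'title': 'Explicitly public', 'draft': is_draft({'draft': ['false']})},
]

# filter() returns a one-shot iterator; list() makes the result reusable
# for both the HTML index and the RSS feed.
published = list(filter(lambda p: not p['draft'], posts))
print([p['title'] for p in published])
```

In the diff above the bare `filter()` is fine, because `sorted()` consumes it straight into a list. The `list()` only matters if the filtered posts get iterated more than once.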

And testing it:
```shell
❯ cat ./posts/test_thing.md
Title: Test thing
Date: 2024-06-19T13:38:34-04:00
Draft: 1

❯ python main.py
...
```

The HTML should get generated, but the post should not show up in the index or the RSS XML:
```shell
❯ grep 'Test thing' ./posts/test_thing.html
  <title>cfebs.com - Test thing</title>
        <h1 id="test-thing"><a class="toclink" href="#test-thing">Test thing</a></h1>

❯ grep 'Test thing' ./index.html ./index.xml | wc -l
0
```

### 2024-06-19, is it slow?

A quick benchmark script, `bench.sh`:

```bash
#!/usr/bin/env bash

amt=$1
if [[ -z "$amt" ]]; then
	echo "ERROR: pass number of test posts for bench" 1>&2
	exit 1
fi

echo "INFO: removing old __bench files" 1>&2
rm -f ./posts/*__bench*
for i in $(seq 1 "$amt"); do
	cp ./posts/build_a_blog.md ./posts/build_a_blog_${i}__bench.md
done

echo "INFO: number of *.md files $(find ./posts/ -iname '*.md' | wc -l)" 1>&2
echo "INFO: number of *.html files $(find ./posts/ -iname '*.html' | wc -l)" 1>&2
echo "INFO: running" 1>&2
time -p python main.py 2>/dev/null
rc=$?
if [[ "$rc" != "0" ]]; then
	echo "ERROR: program exited with $rc" 1>&2
	exit 1
fi
echo "INFO: number of *.html files $(find ./posts/ -iname '*.html' | wc -l)" 1>&2
echo "INFO: cleanup __bench files" 1>&2
rm -f ./posts/*__bench*
```

```shell
# Run on an 8 core / 16 thread AMD Ryzen 7 7840U
❯ ./bench.sh 100
INFO: removing old __bench files
INFO: number of *.md files 102
INFO: number of *.html files 2
INFO: running
real 0.94
user 0.92
sys 0.02
INFO: number of *.html files 102
INFO: cleanup __bench files

❯ ./bench.sh 1000
INFO: removing old __bench files
INFO: number of *.md files 1002
INFO: number of *.html files 2
INFO: running
real 8.45
user 8.31
sys 0.12
INFO: number of *.html files 1002
INFO: cleanup __bench files
```

So roughly 0.85s per 100 posts, which starts to get a bit painful in the thousands.

That will be a fun future problem to try to solve.

### 2024-06-19, gotta go fast?

The critical part of the program that gets slower with more files is where each file is rendered from Markdown to HTML.

I'm by no means a python concurrency expert, but after a quick search `multiprocessing.Pool` looks like a really quick win here.

Luckily `render_posts()` is already in a great format for using `Pool.map`

* 1 array of input file names
* Call `render_post` with 1 file name as an argument
* Result is collected in a list.

So here is the diff to make that happen:

```diff
diff --git a/main.py b/main.py
index 52ce57b..2ea80cb 100644
--- a/main.py
+++ b/main.py
@@ -1,9 +1,11 @@
+import os
 import re
 import glob
 import html
 import email
 import logging
 import datetime
+from multiprocessing import Pool
 from string import Template

 import markdown
@@ -12,11 +14,12 @@ from markdown.extensions.toc import TocExtension
 destpath_re = re.compile(r'\.md$')
 logging.basicConfig(encoding='utf-8', level=logging.INFO)

-md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
+cpu_count = os.cpu_count()

 def convert(text):
-	md.reset()
-	return md.convert(text)
+	md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
+	res = md.convert(text)
+	return res, md.Meta

 def render_post(fpath):
 	destpath = destpath_re.sub('.html', fpath)
@@ -27,15 +30,16 @@ def render_post(fpath):
 		text = input_file.read()

 	logging.info("parsing %s", fpath)
-	out = convert(text)
+	out, meta = convert(text)

-	title = md.Meta.get('title')[0]
-	date = md.Meta.get('date')[0]
+	title = meta.get('title')[0]
+	date = meta.get('date')[0]
 	draft = False
-	if md.Meta.get('draft'):
+	if meta.get('draft'):
 		draft = True

-	out = convert('# ' + title) + out
+	title_out, _ = convert('# ' + title)
+	out = title_out + out

 	logging.info("writing to %s", destpath)
 	render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
@@ -52,11 +56,11 @@ def render_posts():
 	files = glob.glob('posts/*.md')
 	logging.info('found post files %s', files)
 	posts = []
-	for fname in files:
-		p = render_post(fname)
-		posts.append(p)
-		logging.info('rendered post: %s', p)
+	logging.info('starting render posts with cpu_count: %d', cpu_count)
+	with Pool(processes=cpu_count) as pool:
+		posts = pool.map(render_post, files)

+	logging.info("render_posts result: %s", posts)
 	return posts

 def posts_list_html(posts):
@@ -102,7 +106,7 @@ def rss_post_xml(post):
 		text = inf.read()


-	converted = convert(text)
+	converted, _ = convert(text)

 	pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(post['date']))
 	subs = {
```

`convert()` now creates a `Markdown` instance on each call and returns the HTML and meta. This protects against multiple processes trying to use the single module level `md` instance.

See <https://python-markdown.github.io/reference/#Markdown> for notes on `Markdown.reset()` usage and thread safety.
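
Stripped of the blog specifics, the `Pool.map` pattern is small. This is a minimal sketch with a made-up `summarize` function standing in for `render_post`, assuming the worker is a plain module-level function that takes one argument and returns something picklable:

```python
from multiprocessing import Pool


def summarize(text):
    """Stand-in for render_post(): pure function of one input, returns a small dict."""
    return {'chars': len(text), 'words': len(text.split())}


def summarize_all(texts, processes=2):
    # Pool.map() splits the inputs across worker processes and
    # returns the results in the same order as the inputs.
    with Pool(processes=processes) as pool:
        return pool.map(summarize, texts)


if __name__ == '__main__':
    # The guard matters: with the "spawn" start method each worker re-imports this module.
    print(summarize_all(['one two', 'three', 'four five six']))
```

Because the results come back in input order, swapping the `for` loop for `pool.map` does not change what the rest of the program sees.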

And re-run the benchmarks:

```shell
# Run on an 8 core / 16 thread AMD Ryzen 7 7840U
❯ ./bench.sh 100
INFO: removing old __bench files
INFO: number of *.md files 102
INFO: number of *.html files 2
INFO: running
real 0.31
user 2.44
sys 0.28
INFO: number of *.html files 102
INFO: cleanup __bench files

❯ ./bench.sh 1000
INFO: removing old __bench files
INFO: number of *.md files 1002
INFO: number of *.html files 2
INFO: running
real 1.34
user 18.09
sys 0.47
INFO: number of *.html files 1002
INFO: cleanup __bench files
```

Did I accidentally duplicate output during [one of the refactors of this multiprocessing change][duped]? Yup!

But it's now down to ~1.3s for 1000 posts 🎉

[1]: https://crystal-lang.org/
[2]: https://github.com/crystal-lang/crystal/releases/tag/0.31.0
[3]: https://pkgs.alpinelinux.org/package/edge/main/x86_64/py3-markdown
[4]: https://python-markdown.github.io/
[5]: https://archlinux.org/packages/extra/any/python-markdown/
[hugo]: https://gohugo.io/
[duped]: https://gitlab.com/cfebs/cfebs-blog/-/commit/4b39494e827245ce1fbf1cbd983786e8db34c645
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
+								Title:  Build-a-blog
 								Date:   2024-06-17T14:46:36-04:00
 								---
-												build_a_blog: writing

											
										
										
											2024-06-20 02:53:00 +00:00
+								I want to share my thought process for how to go about building something from scratch in a stream-of-conciousness/live coding sort of format.
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
-												build_a_blog: writing

											
										
										
											2024-06-20 02:53:00 +00:00
+								So for the first stupid post on this stupid blog, I've set a goal to take 1 afternoon + caffeine + some DIY spirit → _something_ resembling a static site/blog generator to build this website.
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
-												build_a_blog: writing

											
										
										
											2024-06-20 02:53:00 +00:00
+								There will be nothing ground breaking here - in fact this software will not be good.
-												build_a_blog: rm index.xml, intro

											
										
										
											2024-06-18 03:02:33 +00:00
-												build_a_blog: writing

											
										
										
											2024-06-20 02:53:00 +00:00
+								But if you decide to continue reading, I hope by the end of this post you might be inspired to re-invent your own wheel for the fun of it. Or don't and just use [Hugo][hugo].
-												build_a_blog: rm index.xml, intro

											
										
										
											2024-06-18 03:02:33 +00:00
 								Lets see how hard this will be.
 								Here are the requirements for this blog:
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
 								* Generate each individual post written in markdown -> html
-												build_a_blog: writing

											
										
										
											2024-06-20 02:53:00 +00:00
+								* Generate an index with recent list of posts.
 								    * Will need some metadata (ex. date) in each post to do this.
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
+								* Generate RSS
 								That boils down to:
 . Read some files
 . Parse markdown, maybe parse a header with some key/values.
 . Template strings
-												build_a_blog: words

											
										
										
											2024-06-18 00:03:29 +00:00
+								So there is 1 "exotic" feature in parsing/rendering Markdown as HTML that will need some thought.
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
 								The rest is just file and string manipulation.
-												build_a_blog: rm index.xml, intro

											
										
										
											2024-06-18 03:02:33 +00:00
+								Lets get it on.
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
 								## Picking the tool for the job
-												build_a_blog: rm index.xml, intro

											
										
										
											2024-06-18 03:02:33 +00:00
+								Most scripting languages would be fine tools for this task. But how to handle Markdown?
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
+								I've had [Crystal][1] in the back of my mind for this task. It is a nice general purpose language that included Markdown in the stdlib! But unfortunately Markdown was removed in [0.31.0][2]. Other than that, I'm not sure any other languages include a well rounded Markdown implementation out of the box.
-												build_a_blog: words

											
										
										
											2024-06-18 00:03:29 +00:00
+								I'll likely end up building the site in docker with an alpine image down the road, so just a quick search in alpines repos to see what could be useful:
-												build-a-blog

											
										
										
											2024-06-17 23:52:21 +00:00
 								```shell
 								❯ docker run --rm -it alpine
 								/ # apk update
 								fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
 								fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
 								v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/main]
 								v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/community]
 								OK: 20079 distinct packages available
 								/ # apk search markdown
 								discount-2.2.7c-r1
 								discount-dev-2.2.7c-r1
 								discount-libs-2.2.7c-r1
 								kdepim-addons-23.04.3-r0
 								markdown-1.0.1-r3
 								markdown-doc-1.0.1-r3
 								py3-docstring-to-markdown-0.12-r1
 								py3-docstring-to-markdown-pyc-0.12-r1
 								py3-html2markdown-0.1.7-r3
 								py3-html2markdown-pyc-0.1.7-r3
 								py3-markdown-3.4.3-r1
 								py3-markdown-it-py-2.2.0-r1
 								py3-markdown-it-py-pyc-2.2.0-r1
 								py3-markdown-pyc-3.4.3-r1
 								```
 								[`py3-markdown` in alpine][3] is the popular [`python-markdown`][4]. It's mature and available as a package in my [home distro][5].
Incredible.
 								## Let's build
First, let's read one post file and render some HTML.

I'll store posts in `posts/`, like `posts/build_a_blog.md`.

And store the HTML output in the same directory: `posts/build_a_blog.html`.
 								```python
 								import re
 								import logging
 								import markdown
 								destpath_re = re.compile(r'\.md$')
 								logging.basicConfig(encoding='utf-8', level=logging.INFO)
 								def render_post(fpath):
 									destpath = destpath_re.sub('.html', fpath)
 									logging.info("opening %s for parsing, dest %s", fpath, destpath)
 									# from: https://python-markdown.github.io/reference/
 									with open(fpath, "r", encoding="utf-8") as input_file:
 										logging.info("reading %s", fpath)
 										text = input_file.read()
 									logging.info("parsing %s", fpath)
 									out = markdown.markdown(text)
 									with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
 										logging.info("writing to %s", destpath)
 										output_file.write(out)
 								if __name__ == '__main__':
 									render_post('posts/build_a_blog.md')
 								```
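As a side note, the `.md` to `.html` mapping is the only path logic in the script so far. Here is a minimal sketch of the same mapping using `pathlib` instead of the regex. `dest_path` is a hypothetical helper name for illustration, not something in `main.py`:

```python
from pathlib import PurePosixPath

def dest_path(fpath):
    """Map a Markdown source path to its HTML output path (hypothetical helper)."""
    p = PurePosixPath(fpath)
    if p.suffix != '.md':
        # Mirror the regex version, which leaves non-.md paths untouched
        return fpath
    return str(p.with_suffix('.html'))

print(dest_path('posts/build_a_blog.md'))
```

Either approach works for a flat `posts/` directory; the regex just keeps the script at zero extra imports.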
And run it:
 								```shell
 								❯ python3 ./main.py
 								INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
 								INFO:root:reading posts/build_a_blog.md
 								INFO:root:parsing posts/build_a_blog.md
 								INFO:root:writing to posts/build_a_blog.html
 								❯ head posts/build_a_blog.html
 								<h1>Build-a-blog</h1>
 								<p>I want to share my thought process for how one would go about building a static blog generator from scratch.</p>
 								<ul>
 								<li>Generate an index with recent list of posts.</li>
 								<li>Generate each individual post written in markdown -&gt; html<ul>
 								<li>Support some metadata in each post</li>
 								<li>A post title should have a slug</li>
 								</ul>
 								</li>
 								<li>Generate RSS</li>
```
 								Looking pretty good.
Now let's do this for all `.md` files in `posts/`:
 								```python
 								import glob
 								...
 								def render_posts():
 									files = glob.glob('posts/*.md')
 									logging.info('found post files %s', files)
 									for fname in files:
 										render_post(fname)
 								if __name__ == '__main__':
 									render_posts()
 								```
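One thing to be aware of: `glob.glob` does not promise any particular order, so the file list may differ between machines. It matters little here because the posts get sorted by date later on, but a hedged sketch of a deterministic listing could look like this. It uses a throwaway temp directory, and `list_posts` is a hypothetical helper name:

```python
import glob
import os
import tempfile

def list_posts(posts_dir):
    """Return the .md files in posts_dir in a stable, sorted order (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(posts_dir, '*.md')))

# Demo against a temporary directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    for name in ('b.md', 'a.md', 'notes.txt'):
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write('# test\n')
    found = [os.path.basename(p) for p in list_posts(tmp)]
    print(found)
```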
 								And add another simple test post
 								```shell
 								❯ echo '# A new post' > ./posts/a_new_post.md
 								❯ python3 ./main.py
 								INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
 								INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
 								INFO:root:reading posts/a_new_post.md
 								INFO:root:parsing posts/a_new_post.md
 								INFO:root:writing to posts/a_new_post.html
 								INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
 								INFO:root:reading posts/build_a_blog.md
 								INFO:root:parsing posts/build_a_blog.md
 								INFO:root:writing to posts/build_a_blog.html
 								❯ head ./posts/a_new_post.html
 								<h1>A new post</h1>
 								```
Now on to listing and templating the posts.
 								## Post ordering and templating
 								`python-markdown` supports metadata embedded in posts: <https://python-markdown.github.io/extensions/meta_data/>
I thought I'd need to build something here, but it turns out this extension is exactly what I need to assign a few extra attributes to a post.
I'll adjust our "spec" for posts such that each post must include the following metadata at the top of the file:
 								```txt
 								Title:  Build-a-blog
 								Date:   2024-06-17T14:46:36-04:00
 								---
 								```
And I'd like to insert the `Title` automatically as an `<h1>` tag in each post so I don't have to write it again in the Markdown.
So first, let's adjust the test blog post and test the metadata.
 								```shell
 								❯ head -n4 ./posts/build_a_blog.md
 								Title:  Build-a-blog
 								Date:   2024-06-17T14:46:36-04:00
 								---
 								```
And pop open a Python REPL to see how this works.
 								```python
 								>>> md = markdown.Markdown(extensions = ['meta']); f = open('posts/build_a_blog.md', 'r'); txt = f.read(); out = md.convert(txt); md.Meta
 								{'title': ['Build-a-blog'], 'date': ['2024-06-17T14:46:36-04:00']}
 								```
 								Looks pretty nice!
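To make the shape of `md.Meta` a little less magical, here is a rough stdlib-only sketch of how `Key: value` header lines could be parsed into the same kind of dict. This is not how `python-markdown` implements the `meta` extension. `parse_header` is a hypothetical name, and it only handles the simple single-line case used in this post:

```python
def parse_header(text):
    """Split 'Key: value' header lines from the body (simplified, hypothetical)."""
    meta = {}
    lines = text.split('\n')
    idx = 0
    for idx, line in enumerate(lines):
        if line.strip() in ('', '---'):
            # A blank line or '---' ends the header block
            idx += 1
            break
        key, sep, value = line.partition(':')
        if not sep:
            # Not a header line, so the body starts here
            break
        # Lowercase keys and list values, matching the shape seen in md.Meta above
        meta.setdefault(key.strip().lower(), []).append(value.strip())
    else:
        # Every line was a header line, so there is no body
        idx = len(lines)
    return meta, '\n'.join(lines[idx:])

sample = "Title:  Build-a-blog\nDate:   2024-06-17T14:46:36-04:00\n---\nHello"
meta, body = parse_header(sample)
print(meta)
print(body)
```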
Next, I'll adjust the rendering function to prepend a line:
 								```markdown
 								# {title}
 								```
This has to be done after the full document renders because the `meta` extension of `python-markdown` extracts the metadata and converts to HTML in one call.
```python
def render_post(fpath):
    ...
    md = markdown.Markdown(extensions=['meta'])
    logging.info("parsing %s", fpath)
    out = md.convert(text)
    # Meta values are lists of strings, so take the first entry
    title = md.Meta.get('title')[0]
    date = md.Meta.get('date')[0]
    # Render the title as an <h1> and prepend it to the post body
    out = markdown.markdown('# ' + title) + out
```
Finally, let's return a structure that makes other parts of the program aware of the file that was rendered and its metadata (title, date).
```python
def render_post(fpath):
    ...
    out = markdown.markdown('# ' + title) + out
    with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
        logging.info("writing to %s", destpath)
        output_file.write(out)
    # Hand the metadata and paths back to the caller
    return {
        'title': title,
        'date': date,
        'fpath': fpath,
        'destpath': destpath,
    }
```
Now I have what I need to generate the index page.
 								### Index templating
Let's start by defining what our index template file will be.

I'll choose `index.html.tmpl`, which will write to `index.html` when rendered.

So let's make a function that will take a list of our post structures from above and render them in a `<ul>`.
```python
import datetime
from string import Template
...
def posts_list_html(posts):
    post_tpl = """<li>
        <a href="{href}">{title}</a>
        <time datetime="{date}">{disp_date}</time>
    </li>"""
    out = '<ul class="blog-posts-list">'
    for post in posts:
        # Show only the calendar date; the full timestamp stays in the datetime attribute
        disp_date = datetime.datetime.fromisoformat(post.get('date')).strftime('%Y-%m-%d')
        out += post_tpl.format(href=post.get('destpath'),
                               title=post.get('title'),
                               date=post.get('date'),
                               disp_date=disp_date)
    return out + '</ul>'

def render_index(posts):
    fname = 'index.html.tmpl'
    outname = 'index.html'
    with open(fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())
    # Call posts_list_html (defined above) under a name that does not shadow it
    posts_html = posts_list_html(posts)
    html = tmpl.substitute(posts=posts_html)
    with open(outname, 'w', encoding='utf-8') as outf:
        outf.write(html)
```
 								Make sure that `index.html.tmpl` contains a template variable for `${posts}`
 								```shell
 								❯ grep -C2 '\${posts}' ./index.html.tmpl
 								  <div class="col-md-8 col-sm-12">
 								    <p>Welcome. Something will go here eventually.</p>
 								    ${posts}
 								  </div>
 								  <div class="col-md-4 col-sm-12">
 								```
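For reference, `string.Template` replaces `${name}` placeholders, and `substitute()` raises `KeyError` when a placeholder has no value, which is a handy safety net for a template like this. A small sketch with a made-up inline template rather than the real `index.html.tmpl`:

```python
from string import Template

# Made-up stand-in for the contents of index.html.tmpl
tmpl = Template('<div>\n  ${posts}\n</div>')

rendered = tmpl.substitute(posts='<ul><li>A new post</li></ul>')
print(rendered)

# substitute() refuses to render when a placeholder is missing
try:
    tmpl.substitute()
    missing = None
except KeyError as err:
    missing = err.args[0]
print('missing placeholder:', missing)
```

`Template.safe_substitute()` would leave the unknown placeholder in place instead of raising, but failing loudly seems like the better default for a build script.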
And now I need to connect `render_posts()`, which returns each post that was processed, to `render_index()`:
 								```python
 								def render_posts():
 									files = glob.glob('posts/*.md')
 									logging.info('found post files %s', files)
 									posts = []
 									for fname in files:
 										p = render_post(fname)
 										posts.append(p)
 										logging.info('rendered post: %s', p)
 									return posts
 								if __name__ == '__main__':
 									posts = render_posts()
 									logging.info('rendered posts: %s', posts)
 									render_index(posts)
 								```
And let's run it!
 								```shell
 								❯ python3 ./main.py
 								INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
 								INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
 								INFO:root:reading posts/a_new_post.md
 								INFO:root:parsing posts/a_new_post.md
 								INFO:root:writing to posts/a_new_post.html
 								INFO:root:rendered post: {'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}
 								INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
 								INFO:root:reading posts/build_a_blog.md
 								INFO:root:parsing posts/build_a_blog.md
 								INFO:root:writing to posts/build_a_blog.html
 								INFO:root:rendered post: {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}
 								INFO:root:rendered posts: [{'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}, {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}]
 								```
 								And check how the output looks:
 								```shell
 								❯ grep -C4 'blog-posts-list' ./index.html
 								  </nav>
 								  <section class="container">
 								    <div class="row">
 								      <div class="col-md-8 col-sm-12">
 								        <ul class="blog-posts-list"><li>
 										<a href="posts/a_new_post.html">A new post</a>
 										<time datetime="2024-06-17T19:48:17-04:00">2024-06-17</time>
 									</li><li>
 										<a href="posts/build_a_blog.html">Build-a-blog</a>
 								```
 								Not bad!
 								### Post templating
I think I want each post to keep the overall layout of the index page and just render the post body where the main post list is.

So let's make the template rendering a bit more general.

I'll also rename the content area template variable to `${content}`.
```python
def render_template(tpl_fname, out_fname, content_html):
    with open(tpl_fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())
    # The template now exposes a single ${content} placeholder
    html = tmpl.substitute(content=content_html)
    with open(out_fname, 'w', encoding='utf-8') as outf:
        outf.write(html)

def render_index(posts):
    content_html = posts_list_html(posts)
    render_template('index.html.tmpl', 'index.html', content_html)
```
 								And now adjust where posts are written out.
```python
def render_post(fpath):
    ...
    out = markdown.markdown('# ' + title) + out
    logging.info("writing to %s", destpath)
    # Wrap the rendered post in the shared page layout
    render_template('index.html.tmpl', destpath, out)
```
After running, each `posts/*.html` file should use the full index template, with that post's generated HTML in the content area.
 								### Post sorting
With everything wired up, I just need to sort the posts list by the date metadata.

Let's do a bit of sort testing in the Python REPL because I never remember `datetime` usage.

First, let's generate a few nicely formatted ISO date strings for testing.
 								```shell
 								❯ date -d'2023-01-01' -Is
2023-01-01T00:00:00-05:00
❯ date -Is
2024-06-17T16:30:35-04:00
 								```
 								And make a test array
 								```python
 								>>> posts = [{'date': '2023-01-01T00:00:00-05:00'}, {'date': '2024-06-17T16:30:35-04:00'}]
 								```
With our current script, the older post could be listed first. So let's try a sort.
```python
# Double checking datetime parsing
>>> import datetime
>>> newer = datetime.datetime.fromisoformat('2024-06-17T16:30:35-04:00')
>>> newer
datetime.datetime(2024, 6, 17, 16, 30, 35, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000)))
>>> older = datetime.datetime.fromisoformat('2023-01-01T00:00:00-05:00')
>>> older
datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400)))
# Checking python sorting methods work as expected
>>> newer.__gt__(older)
True
>>> newer.__lt__(older)
False
>>> older.__gt__(newer)
False
>>> older.__lt__(newer)
True
# Doing the sort
>>> sorted(posts, key=lambda x: datetime.datetime.fromisoformat(x['date']), reverse=True)
[{'date': '2024-06-17T16:30:35-04:00'}, {'date': '2023-01-01T00:00:00-05:00'}]
```
Now let's apply this to our posts.
 								```python
 								if __name__ == '__main__':
 									posts = render_posts()
 									logging.info('rendered posts: %s', posts)
 									sorted_posts = sorted(posts,
 													   key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
 									render_index(sorted_posts)
 								```
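Parsing the dates before sorting matters because the raw strings can carry different UTC offsets, and comparing them as plain text can then order posts incorrectly. A small sketch with two made-up timestamps (not real posts):

```python
from datetime import datetime

posts = [
    {'date': '2024-06-17T23:30:00-04:00'},  # 03:30 UTC on the 18th
    {'date': '2024-06-17T22:45:00-07:00'},  # 05:45 UTC on the 18th, so actually newer
]

# Text comparison only looks at the characters, not the offset
by_text = sorted(posts, key=lambda p: p['date'], reverse=True)
# Aware datetimes compare by the actual instant in time
by_time = sorted(posts, key=lambda p: datetime.fromisoformat(p['date']), reverse=True)

print([p['date'] for p in by_text])
print([p['date'] for p in by_time])
```

As long as every post is written in the same timezone the two orderings agree, so this is more of a guard against future surprises.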
 								### `<title />` Templating
 								The last bit of templating is to make each post `<title>` different.
 								I'll try something like `<title>cfebs.com - ${title}</title>`
So in `index.html.tmpl`:
 								```html
 								<title>cfebs.com${more_title}</title>
 								```
For the index page, `more_title` will just be an empty string. Note that both calls here pass a `dict` of substitutions; `render_template` is updated to accept one in the RSS section.
 								```python
 								def render_index(posts):
 									content_html = posts_list_html(posts)
 									render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})
 								```
 								But for a post:
 								```python
 								def render_post(fpath):
 								    ...
 									title = md.Meta.get('title')[0]
 									date = md.Meta.get('date')[0]
 									out = markdown.markdown('# ' + title) + out
 									logging.info("writing to %s", destpath)
 									render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
 								```
This is now a functioning blog generator with templating!
 								## RSS
 								This should be pretty easy as RSS is just reformatting our blog index list into different XML.
 								The `render_template` function will be useful here with a few more tweaks. So I'll make another template file (based off a reference <https://drewdevault.com/blog/index.xml>)
 								```shell
 								# Grab the reference
 								❯ curl -sL 'https://drewdevault.com/blog/index.xml' > index.xml.example
 								# After a bit of editing
 								❯ cat ./index.xml.tmpl
 								<?xml version="1.0" encoding="utf-8" standalone="yes"?>
 								<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
 									<channel>
 										<title>${site_title}</title>
 										<link>${site_link}</link>
 										<description>${description}</description>
 										<language>en</language>
 										<lastBuildDate>${last_build_date}</lastBuildDate>
 										<atom:link href="${self_full_link}" rel="self" type="application/rss+xml" />
 										${items}
 									</channel>
 								</rss>
 								```
 								`render_template` now gets even more generic and passes a `dict` to `Template.substitute()`
 								```python
 								def render_template(tpl_fname, out_fname, subs):
 								    with open(tpl_fname, 'r', encoding='utf-8') as inf:
 								        tmpl = Template(inf.read())
 								    out = tmpl.substitute(subs)
 								    with open(out_fname, 'w', encoding='utf-8') as outf:
 								        outf.write(out)
 								```
And make sure to adjust any existing usages of `render_template`.
```python
def render_index(posts):
    content_html = posts_list_html(posts)
    # substitute() raises KeyError for a missing placeholder, so pass more_title too
    render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})

def render_post(fpath):
    ...
    render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
```
And now I can hack away at RSS generation:
```python
 								def render_rss_index(posts):
 								    subs = {
 								        'site_title': 'cfebs.com',
 								        'site_link': 'https://cfebs.com',
 								        'self_full_link': 'https://cfebs.com/index.xml',
 								        'description': 'Recent content from cfebs.com',
 								        'last_build_date': 'TODO',
 								        'items': 'TODO',
 								    }
 								    render_template('index.xml.tmpl', 'index.xml', subs)
 								```
After this initial version and a `python3 ./main.py` run, the XML should be filled out:
```shell
 								❯ cat ./index.xml
 								<?xml version="1.0" encoding="utf-8" standalone="yes"?>
 								<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
 									<channel>
 										<title>cfebs.com</title>
 										<link>https://cfebs.com</link>
 										<description>Recent content from cfebs.com</description>
 										<language>en</language>
 										<lastBuildDate>TODO</lastBuildDate>
 										<atom:link href="https://cfebs.com/index.xml" rel="self" type="application/rss+xml" />
 										TODO
 									</channel>
 								</rss>
 								```
Now let's finish up by generating each item entry and collecting them for substitution into the template.

This adds another `md.convert()` call, so why not add a helper that reuses a single `Markdown` instance?
```python
import html
import email.utils
from markdown.extensions.toc import TocExtension
...
# add a module var and helper method to reuse Markdown instance
md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])

def convert(text):
    # reset() clears state left over from the previous document
    md.reset()
    return md.convert(text)

def render_post(fpath):
    ...
    out = convert(text)
    title = md.Meta.get('title')[0]
    date = md.Meta.get('date')[0]
    out = convert('# ' + title) + out
    ...

def rss_post_xml(post):
    tpl = """
    <item>
        <title>{title}</title>
        <link>{link}</link>
        <pubDate>{pubdate}</pubDate>
        <guid>{link}</guid>
        <description>{description}</description>
    </item>
    """
    with open(post['fpath'], 'r', encoding='utf-8') as inf:
        text = inf.read()
    converted = convert(text)
    link = "https://cfebs.com/" + post['destpath']
    # RSS expects RFC 822 style dates, which format_datetime produces
    pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(post['date']))
    subs = dict(title=post['title'], link=link,
                pubdate=pubdate,
                description=converted)
    for k, v in subs.items():
        subs[k] = html.escape(v)
    return tpl.format(**subs)

def render_rss_index(posts):
    items = ''
    for post in posts[:5]:
        items += rss_post_xml(post)
    subs = {
        'site_title': 'cfebs.com',
        'site_link': 'https://cfebs.com',
        'self_full_link': 'https://cfebs.com/index.xml',
        'description': 'Recent content from cfebs.com',
        'last_build_date': email.utils.format_datetime(datetime.datetime.now()),
    }
    for k, v in subs.items():
        subs[k] = html.escape(v)
    # items is already escaped XML, so add it after the escaping loop
    subs['items'] = items
    render_template('index.xml.tmpl', 'index.xml', subs)
```
* Need to use `html.escape` anywhere there could be HTML tags in the output.
* `posts[:5]` takes the 5 most recent posts for the RSS feed, as long as `render_rss_index()` is given the sorted list.
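
To see what the escaping and date formatting do in isolation, here is a small stdlib-only sketch. The sample HTML string is made up; the date is the one from this post's header.

```python
import html
import email.utils
from datetime import datetime

# Markup must be escaped before it is placed inside <description>
snippet = '<p>Tom & Jerry</p>'
escaped = html.escape(snippet)
print(escaped)

# RSS wants RFC 822 style dates rather than ISO 8601
pubdate = email.utils.format_datetime(datetime.fromisoformat('2024-06-17T14:46:36-04:00'))
print(pubdate)
```

One small wrinkle: `datetime.datetime.now()` is naive, so as far as I can tell `format_datetime` renders its offset as `-0000`. Passing an aware value such as `datetime.now(timezone.utc)` should record a real offset instead.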
 								## Wrapping up
 								Reached the end of the afternoon, so this is where I'll leave it.
 								It's not great software.
* No tests, no docs
* No validation of input or function arguments
* Hard coding values like the domain
* Using adhoc dicts for generic structures
* Relies on system python version and packages.
* Does not offer anything a tool like [hugo][hugo] does not already offer.
* Probably slow in extreme cases.
 								But, it's ~150 lines of python with 1 external dependency.
 								If python or `python-markdown` drastically changes, it'll probably take <10 minutes to debug.
 								And - it was fun to write and write about.
 								View the complete source for generating this blog:
* [main.py](https://gitlab.com/cfebs/cfebs-blog/-/blob/main/main.py)
* [index.html.tmpl](https://gitlab.com/cfebs/cfebs-blog/-/blob/main/index.html.tmpl)
* [index.xml.tmpl](https://gitlab.com/cfebs/cfebs-blog/-/blob/main/index.xml.tmpl)

Or the full repo tree: <https://gitlab.com/cfebs/cfebs-blog/>

## EDIT
A few additional things added after the original post will go here.
 								### 2024-06-19, adding draft state
 								It might be nice to work on a rough draft, generate it for previewing, track it in git, but skip including it in the posts index.
 								So I'll add a piece of post metadata called `Draft` and use `filter()` before the posts are sorted or applied to the `index.html` or RSS `index.xml`.
 								This is the result.
 								```diff
 								diff --git a/main.py b/main.py
 								index bfd9382..52ce57b 100644
 								--- a/main.py
 								+++ b/main.py
@@ -31,6 +31,9 @@ def render_post(fpath):
 								 	title = md.Meta.get('title')[0]
 								 	date = md.Meta.get('date')[0]
 								+	draft = False
 								+	if md.Meta.get('draft'):
 								+		draft = True
 								 	out = convert('# ' + title) + out
@@ -42,6 +45,7 @@ def render_post(fpath):
 								 		'date': date,
 								 		'fpath': fpath,
 								 		'destpath': destpath,
 								+		'draft': draft,
 								 	}
 								 def render_posts():
@@ -134,6 +138,7 @@ def render_rss_index(posts):
 								 def main():
 								 	posts = render_posts()
 								 	logging.info('rendered posts: %s', posts)
 								+	posts = filter(lambda p: not p['draft'], posts)
 								 	sorted_posts = sorted(posts,
 								 					   key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
 								 	render_index(sorted_posts)
 								```
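One caveat with the truthiness check: `md.Meta` values are lists of strings, so `Draft: 0` or `Draft: false` would still count as a draft. If that ever matters, a stricter check could look like this sketch. `is_draft` and the set of accepted values are assumptions for illustration and not part of `main.py`:

```python
def is_draft(meta):
    """Return True only for explicit truthy Draft values (hypothetical helper)."""
    values = meta.get('draft') or ['']
    return values[0].strip().lower() in ('1', 'true', 'yes')

print(is_draft({'draft': ['1']}))
print(is_draft({'draft': ['0']}))
print(is_draft({}))
```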
 								And testing it:
```shell
❯ cat ./posts/test_thing.md
Title: Test thing
Date: 2024-06-19T13:38:34-04:00
Draft: 1
❯ python main.py
...
```
The HTML should get generated, but the post should not show up in the index or the RSS XML:
```shell
❯ grep 'Test thing' ./posts/test_thing.html
  <title>cfebs.com - Test thing</title>
        <h1 id="test-thing"><a class="toclink" href="#test-thing">Test thing</a></h1>
❯ grep 'Test thing' ./index.html ./index.xml | wc -l
0
```
### 2024-06-19, is it slow?
A quick benchmark script, `bench.sh`:
```bash
#!/usr/bin/env bash
amt=$1
if [[ -z "$amt" ]]; then
    echo "ERROR: pass number of test posts for bench" 1>&2
    exit 1
fi
echo "INFO: removing old __bench files" 1>&2
rm -f ./posts/*__bench*
for i in $(seq 1 "$amt"); do
    cp ./posts/build_a_blog.md ./posts/build_a_blog_${i}__bench.md
done
echo "INFO: number of *.md files $(find ./posts/ -iname '*.md' | wc -l)" 1>&2
echo "INFO: number of *.html files $(find ./posts/ -iname '*.html' | wc -l)" 1>&2
echo "INFO: running" 1>&2
time -p python main.py 2>/dev/null
rc=$?
if [[ "$rc" != "0" ]]; then
    echo "ERROR: program exited with $rc" 1>&2
    exit 1
fi
echo "INFO: number of *.html files $(find ./posts/ -iname '*.html' | wc -l)" 1>&2
echo "INFO: cleanup __bench files" 1>&2
rm -f ./posts/*__bench*
```
 								```shell
# Run on an 8 core / 16 thread AMD Ryzen 7 7840U
❯ ./bench.sh 100
 								INFO: removing old __bench files
 								INFO: number of *.md files 102
 								INFO: number of *.html files 2
 								INFO: running
 								real 0.94
 								user 0.92
 								sys 0.02
 								INFO: number of *.html files 102
 								INFO: cleanup __bench files
 								❯ ./bench.sh 1000
 								INFO: removing old __bench files
 								INFO: number of *.md files 1002
 								INFO: number of *.html files 2
 								INFO: running
 								real 8.45
 								user 8.31
 								sys 0.12
 								INFO: number of *.html files 1002
 								INFO: cleanup __bench files
 								```
So roughly 0.8 to 0.9s per 100 posts, which starts to get a bit painful in the thousands.
 								Will be a fun future idea to try to solve.
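Working the numbers from the two runs above (the file counts include the 2 real posts):

```python
# (number of .md files, wall clock seconds) taken from the bench output above
runs = [(102, 0.94), (1002, 8.45)]

# Normalize each run to seconds per 100 posts
per_100 = [round(seconds / files * 100, 2) for files, seconds in runs]
print(per_100)
```

The per-post cost stays about flat, so the total time grows linearly with the number of posts.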
### 2024-06-19, gotta go fast?
The critical part of the program that gets slower with more files is rendering each file from Markdown to HTML.
 								I'm by no means a python concurrency expert, but after a quick search `multiprocessing.Pool` looks like a really quick win here.
 								Luckily `render_posts()` is already in a great format for using `Pool.map`
 								* 1 array of input file names
 								* Call `render_post` with 1 file name as an argument
 								* Result is collected in a list.
 								So here is the diff to make that happen:
 								```diff
 								diff --git a/main.py b/main.py
 								index 52ce57b..2ea80cb 100644
 								--- a/main.py
 								+++ b/main.py
@@ -1,9 +1,11 @@
 								+import os
 								 import re
 								 import glob
 								 import html
 								 import email
 								 import logging
 								 import datetime
 								+from multiprocessing import Pool
 								 from string import Template
 								 import markdown
 								@@ -12,11 +14,12 @@ from markdown.extensions.toc import TocExtension
 								 destpath_re = re.compile(r'\.md$')
 								 logging.basicConfig(encoding='utf-8', level=logging.INFO)
 								-md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
 								+cpu_count = os.cpu_count()
 								 def convert(text):
 								-	md.reset()
 								-	return md.convert(text)
 								+	md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
 								+	res = md.convert(text)
 								+	return res, md.Meta
 								 def render_post(fpath):
 								 	destpath = destpath_re.sub('.html', fpath)
@@ -27,15 +30,16 @@ def render_post(fpath):
 								 		text = input_file.read()
 								 	logging.info("parsing %s", fpath)
 								-	out = convert(text)
 								+	out, meta = convert(text)
 								-	title = md.Meta.get('title')[0]
 								-	date = md.Meta.get('date')[0]
 								+	title = meta.get('title')[0]
 								+	date = meta.get('date')[0]
 								 	draft = False
 								-	if md.Meta.get('draft'):
 								+	if meta.get('draft'):
 								 		draft = True
 								-	out = convert('# ' + title) + out
 								+	title_out, _ = convert('# ' + title)
 								+	out = title_out + out
 								 	logging.info("writing to %s", destpath)
 								 	render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
@@ -52,11 +56,11 @@ def render_posts():
 								 	files = glob.glob('posts/*.md')
 								 	logging.info('found post files %s', files)
 								 	posts = []
 								-	for fname in files:
 								-		p = render_post(fname)
 								-		posts.append(p)
 								-		logging.info('rendered post: %s', p)
 								+	logging.info('starting render posts with cpu_count: %d', cpu_count)
 								+	with Pool(processes=cpu_count) as pool:
 								+		posts = pool.map(render_post, files)
 								+	logging.info("render_posts result: %s", posts)
 								 	return posts
 								 def posts_list_html(posts):
 								@@ -102,7 +106,7 @@ def rss_post_xml(post):
 								 		text = inf.read()
 								-	converted = convert(text)
 								+	converted, _ = convert(text)
 								 	pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(post['date']))
 								 	subs = {
 								```
 								`convert()` now creates a `Markdown` instance on each call and returns both the HTML and the meta. This protects against multiple processes trying to use the single module-level `md` instance.
 								See <https://python-markdown.github.io/reference/#Markdown> for notes on `Markdown.reset()` usage and thread safety.
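Stripped of the blog specifics, the `Pool.map` shape from the diff looks roughly like the sketch below. This is not the real `main.py`: `render_post_stub` and `render_all` are hypothetical stand-ins I made up for illustration, and the stub does no file I/O or markdown parsing. The `processes=1` escape hatch is also my own addition, not something in the diff.

```python
import os
from multiprocessing import Pool


def render_post_stub(fpath):
    """Hypothetical stand-in for render_post(): no file I/O, it only derives
    the destination path and a title from the file name."""
    base, _ext = os.path.splitext(fpath)
    title = os.path.basename(base).replace('_', ' ')
    return {'src': fpath, 'dest': base + '.html', 'title': title}


def render_all(files, processes=None):
    """Map render_post_stub over files and return one result per input file."""
    if processes == 1:
        # Sequential path (an assumption of this sketch), handy for debugging.
        return [render_post_stub(f) for f in files]
    # processes=None lets Pool default to os.cpu_count() workers.
    with Pool(processes=processes) as pool:
        # Pool.map returns results in the same order as the input list.
        return pool.map(render_post_stub, files)


if __name__ == '__main__':
    files = ['posts/build_a_blog.md', 'posts/hello_world.md']
    for post in render_all(files):
        print(post['dest'], '-', post['title'])
```

The worker has to be a plain module-level function so it can be pickled and sent to the worker processes, and the `__main__` guard keeps things safe on platforms where workers are started by re-importing the script.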
 								And re-run the benchmarks:
 								```shell
 								# Run on an 8 core / 16 thread AMD Ryzen 7 7840U
 								❯ ./bench.sh 100
 								INFO: removing old __bench files
 								INFO: number of *.md files 102
 								INFO: number of *.html files 2
 								INFO: running
 								real 0.31
 								user 2.44
 								sys 0.28
 								INFO: number of *.html files 102
 								INFO: cleanup __bench files
 								❯ ./bench.sh 1000
 								INFO: removing old __bench files
 								INFO: number of *.md files 1002
 								INFO: number of *.html files 2
 								INFO: running
 								real 1.34
 								user 18.09
 								sys 0.47
 								INFO: number of *.html files 1002
 								INFO: cleanup __bench files
 								```
 								Did I accidentally duplicate output during [one of the refactors of this multiprocessing change][duped]? Yup!
 								But now it's down to ~1.3s for 1000 posts 🎉
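To put rough numbers on the before/after, here is a quick back-of-the-envelope calculation. The timings are copied straight from the two `bench.sh` runs in this post; the `speedup` and `cpu_overhead` helper names are just mine for this sketch.

```python
# "real" (wall clock) and "user" (CPU) seconds copied from the bench.sh output above.
runs = {
    100: {'before': {'real': 0.94, 'user': 0.92}, 'after': {'real': 0.31, 'user': 2.44}},
    1000: {'before': {'real': 8.45, 'user': 8.31}, 'after': {'real': 1.34, 'user': 18.09}},
}


def speedup(before, after):
    """How many times faster the wall clock time got."""
    return before['real'] / after['real']


def cpu_overhead(before, after):
    """How many times more total CPU time the parallel run burned."""
    return after['user'] / before['user']


for n, r in runs.items():
    print(f"{n} posts: {speedup(**r):.1f}x faster, {cpu_overhead(**r):.1f}x the CPU time")
# 100 posts: 3.0x faster, 2.7x the CPU time
# 1000 posts: 6.3x faster, 2.2x the CPU time
```

So the wall clock win grows with the number of posts, while total CPU time roughly doubles. My guess is that is the price of spinning up worker processes and building a fresh `Markdown` instance on every `convert()` call, but I haven't profiled it.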
 								[1]: https://crystal-lang.org/
 								[2]: https://github.com/crystal-lang/crystal/releases/tag/0.31.0
 								[3]: https://pkgs.alpinelinux.org/package/edge/main/x86_64/py3-markdown
 								[4]: https://python-markdown.github.io/
 								[5]: https://archlinux.org/packages/extra/any/python-markdown/
 								[hugo]: https://gohugo.io/
 								[duped]: https://gitlab.com/cfebs/cfebs-blog/-/commit/4b39494e827245ce1fbf1cbd983786e8db34c645