Title: Build-a-blog
Date: 2024-06-17T14:46:36-04:00
---

I want to share my thought process for how to go about building a static blog generator from scratch.

There will be nothing groundbreaking here. In fact, this software will not be good. So turn back now if you're expecting the new [Hugo][hugo].

Actually you should probably stop reading and just use [Hugo][hugo].

In case you are still interested, the goal is to take 1 afternoon + caffeine + some DIY spirit → _something_ resembling a static site/blog generator.

And I hope by the end of this post you might be inspired to build your own generation scripts, maybe in a new language you always wanted to try.

Let's see how hard this will be.

Here are the requirements for this blog:

* Generate an index with a list of recent posts.
* Generate each individual post written in markdown -> html
    * Support some metadata in each post
    * A post title should have a slug
* Generate RSS

That boils down to:

1. Read some files
2. Parse markdown, maybe parse a header with some key/values.
3. Template strings

So there is 1 "exotic" feature, parsing Markdown and rendering it as HTML, that will need some thought.

The rest is just file and string manipulation.

Let's get it on.

## Picking the tool for the job

Most scripting languages would be fine tools for this task. But how to handle Markdown?

I've had [Crystal][1] in the back of my mind for this task. It is a nice general purpose language that included Markdown in the stdlib! But unfortunately Markdown was removed in [0.31.0][2]. Other than that, I'm not sure any other language includes a well-rounded Markdown implementation out of the box.

I'll likely end up building the site in Docker with an Alpine image down the road, so here is a quick search of Alpine's repos to see what could be useful:

```shell
❯ docker run --rm -it alpine
/ # apk update
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/main]
v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/community]
OK: 20079 distinct packages available
/ # apk search markdown
discount-2.2.7c-r1
discount-dev-2.2.7c-r1
discount-libs-2.2.7c-r1
kdepim-addons-23.04.3-r0
markdown-1.0.1-r3
markdown-doc-1.0.1-r3
py3-docstring-to-markdown-0.12-r1
py3-docstring-to-markdown-pyc-0.12-r1
py3-html2markdown-0.1.7-r3
py3-html2markdown-pyc-0.1.7-r3
py3-markdown-3.4.3-r1
py3-markdown-it-py-2.2.0-r1
py3-markdown-it-py-pyc-2.2.0-r1
py3-markdown-pyc-3.4.3-r1
```

[`py3-markdown` in Alpine][3] is the popular [`python-markdown`][4]. It's mature and also available as a package in my [home distro][5].

Incredible.

## Let's build

First, let's read 1 post file and render some HTML.

I'll store posts in `posts/` like `posts/build_a_blog.md`.

And we'll store the HTML output in the same directory: `posts/build_a_blog.html`.

```python
import re
import logging

import markdown
destpath_re = re.compile(r'\.md$')

logging.basicConfig(encoding='utf-8', level=logging.INFO)

def render_post(fpath):
    destpath = destpath_re.sub('.html', fpath)
    logging.info("opening %s for parsing, dest %s", fpath, destpath)
    # from: https://python-markdown.github.io/reference/
    with open(fpath, "r", encoding="utf-8") as input_file:
        logging.info("reading %s", fpath)
        text = input_file.read()

    logging.info("parsing %s", fpath)
    out = markdown.markdown(text)

    with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
        logging.info("writing to %s", destpath)
        output_file.write(out)

if __name__ == '__main__':
    render_post('posts/build_a_blog.md')
```
And if we run it.

```shell
❯ python3 ./main.py
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html

❯ head posts/build_a_blog.html
<h1>Build-a-blog</h1>
<p>I want to share my thought process for how one would go about building a static blog generator from scratch.</p>
<ul>
<li>Generate an index with recent list of posts.</li>
<li>Generate each individual post written in markdown -> html<ul>
<li>Support some metadata in each post</li>
<li>A post title should have a slug</li>
</ul>
</li>
<li>Generate RSS</li>
```

Looking pretty good.

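The `.md` to `.html` mapping above is done with a regex, which is perfectly fine. As a sketch of an alternative, the same mapping can be written with `pathlib`. The helper name `dest_path` is made up for this example and is not part of the script in this post.

```python
from pathlib import PurePosixPath

def dest_path(fpath):
    """Map a Markdown source path to its HTML output path (hypothetical helper)."""
    path = PurePosixPath(fpath)
    if path.suffix != '.md':
        raise ValueError(f'not a markdown file: {fpath}')
    # with_suffix() swaps only the final extension and keeps the directory.
    return str(path.with_suffix('.html'))

print(dest_path('posts/build_a_blog.md'))  # posts/build_a_blog.html
```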
Now let's do this for all `.md` files in `posts/`

```python
import glob
...

def render_posts():
    files = glob.glob('posts/*.md')
    logging.info('found post files %s', files)
    for fname in files:
        render_post(fname)

if __name__ == '__main__':
    render_posts()
```

And add another simple test post

```shell
❯ echo '# A new post' > ./posts/a_new_post.md
❯ python3 ./main.py
INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
INFO:root:reading posts/a_new_post.md
INFO:root:parsing posts/a_new_post.md
INFO:root:writing to posts/a_new_post.html
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html
❯ head ./posts/a_new_post.html
<h1>A new post</h1>
```

Basically at this point, it's a blog generator!

But I want a few more features:

* The posts listed in the index should be sorted by date.
* Each post should be templated in some HTML wrapper.

## Post ordering and templating

`python-markdown` supports metadata embedded in posts: <https://python-markdown.github.io/extensions/meta_data/>

I thought I'd need to build something here, but it turns out this is exactly what I need to assign a few extra attributes to a post.

I'll adjust our "spec" for posts such that each post must include the following metadata at the top of the file:

```txt
Title: Build-a-blog
Date: 2024-06-17T14:46:36-04:00
---
```

And I'd like to insert the `Title` automatically as an `<h1>` tag in each post so I don't have to write it again in the markdown.

So first, let's test the metadata and adjust the test blog post.

```shell
❯ head -n4 ./posts/build_a_blog.md
Title: Build-a-blog
Date: 2024-06-17T14:46:36-04:00
---
```

And pop open a python repl to see how this works.

```python
>>> import markdown
>>> md = markdown.Markdown(extensions = ['meta']); f = open('posts/build_a_blog.md', 'r'); txt = f.read(); out = md.convert(txt); md.Meta
{'title': ['Build-a-blog'], 'date': ['2024-06-17T14:46:36-04:00']}
```

Looks pretty nice!

Next, I will adjust the rendering function to prepend a line:

```markdown
# {title}
```

This has to be done after the full document renders because the `meta` extension of `python-markdown` extracts the metadata and converts to HTML in one call.

```python
def render_post(fpath):
    ...

    md = markdown.Markdown(extensions = ['meta'])

    logging.info("parsing %s", fpath)
    out = md.convert(text)

    title = md.Meta.get('title')[0]
    date = md.Meta.get('date')[0]

    out = markdown.markdown('# ' + title) + out
```
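As the REPL output shows, every value in `md.Meta` is a list of strings, and `md.Meta.get('title')[0]` raises a `TypeError` when a post has no `Title` line, because `get()` returns `None`. A minimal sketch of a more forgiving lookup is below. `meta_first` is a hypothetical helper, not something provided by `python-markdown`, and the dict only mimics the shape of `md.Meta`.

```python
def meta_first(meta, key, default=None):
    """Return the first value stored under `key`, or `default` if it is missing."""
    values = meta.get(key)
    if not values:
        return default
    return values[0]

# Same shape as the md.Meta dict printed in the REPL session above.
meta = {'title': ['Build-a-blog'], 'date': ['2024-06-17T14:46:36-04:00']}

print(meta_first(meta, 'title'))         # Build-a-blog
print(meta_first(meta, 'draft', False))  # False
```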
Finally, let's return a structure that makes other parts of the program aware of the file that was rendered and its metadata (title, date).

```python
def render_post(fpath):
    ...
    out = markdown.markdown('# ' + title) + out

    with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
        logging.info("writing to %s", destpath)
        output_file.write(out)

    return {
        'title': title,
        'date': date,
        'fpath': fpath,
        'destpath': destpath,
    }
```

Now we have what we need to generate a complete index.

### Index templating

Let's start by defining what our index template file will be.

I'll choose `index.html.tmpl` and after rendering we will write to `index.html`.

So let's make a function that takes a list of the post structures above and renders it as a `<ul>`.

```python
import datetime
from string import Template
...
def posts_list_html(posts):
    post_tpl = """<li>
    <a href="{href}">{title}</a>
    <time datetime="{date}">{disp_date}</time>
</li>"""
    out = '<ul class="blog-posts-list">'
    for post in posts:
        disp_date = datetime.datetime.fromisoformat(post.get('date')).strftime('%Y-%m-%d')
        out += post_tpl.format(href=post.get('destpath'),
                               title=post.get('title'),
                               date=post.get('date'),
                               disp_date=disp_date)
    return out + '</ul>'

def render_index(posts):
    fname = 'index.html.tmpl'
    outname = 'index.html'

    with open(fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    posts_html = posts_list_html(posts)

    html = tmpl.substitute(posts=posts_html)

    with open(outname, 'w', encoding='utf-8') as outf:
        outf.write(html)
```
Make sure that `index.html.tmpl` contains the template variable `${posts}`:

```shell
❯ grep -C2 '\${posts}' ./index.html.tmpl
<div class="col-md-8 col-sm-12">
<p>Welcome. Something will go here eventually.</p>
${posts}
</div>
<div class="col-md-4 col-sm-12">
```
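It helps to know how `string.Template` behaves when the template and the substitutions get out of sync. The sketch below uses a tiny inline template rather than the real `index.html.tmpl`, and the `${footer}` placeholder is made up for illustration.

```python
from string import Template

# ${footer} is a hypothetical placeholder that we deliberately do not fill in.
tmpl = Template('<div>${posts}</div><footer>${footer}</footer>')

# substitute() raises KeyError as soon as one placeholder has no value.
try:
    tmpl.substitute(posts='<ul></ul>')
except KeyError as err:
    print('missing placeholder:', err)

# safe_substitute() leaves unknown placeholders in the output untouched.
print(tmpl.safe_substitute(posts='<ul></ul>'))

# A literal dollar sign in a template has to be written as $$.
print(Template('Price: $$5 for ${item}').substitute(item='coffee'))
```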
And we now need to connect `render_posts()`, which returns each post that was processed, to `render_index()`.

```python
def render_posts():
    files = glob.glob('posts/*.md')
    logging.info('found post files %s', files)
    posts = []
    for fname in files:
        p = render_post(fname)
        posts.append(p)
        logging.info('rendered post: %s', p)

    return posts

if __name__ == '__main__':
    posts = render_posts()
    logging.info('rendered posts: %s', posts)
    render_index(posts)
```
And let's run it!

```shell
❯ python3 ./main.py
INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
INFO:root:reading posts/a_new_post.md
INFO:root:parsing posts/a_new_post.md
INFO:root:writing to posts/a_new_post.html
INFO:root:rendered post: {'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html
INFO:root:rendered post: {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}
INFO:root:rendered posts: [{'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}, {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}]
```

And check how the output looks:

```shell
❯ grep -C4 'blog-posts-list' ./index.html
</nav>
<section class="container">
<div class="row">
<div class="col-md-8 col-sm-12">
<ul class="blog-posts-list"><li>
<a href="posts/a_new_post.html">A new post</a>
<time datetime="2024-06-17T19:48:17-04:00">2024-06-17</time>
</li><li>
<a href="posts/build_a_blog.html">Build-a-blog</a>
```

Not bad!

### Post templating

I think I want my blog to just maintain the overall layout from the index page and render the post body where the main post list is.

So let's make that template rendering a bit more general.

I'll also rename the content area template variable from `${posts}` to `${content}`.

```python
def render_template(tpl_fname, out_fname, content_html):
    with open(tpl_fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    html = tmpl.substitute(content=content_html)

    with open(out_fname, 'w', encoding='utf-8') as outf:
        outf.write(html)

def render_index(posts):
    content_html = posts_list_html(posts)
    render_template('index.html.tmpl', 'index.html', content_html)
```

And now adjust where posts are written out.

```python
def render_post(fpath):
    ...
    out = markdown.markdown('# ' + title) + out
    logging.info("writing to %s", destpath)
    render_template('index.html.tmpl', destpath, out)
```

After running, each `posts/*.html` file should use the full index template, with the generated post HTML in the content area.

### Post sorting

With everything wired up, we now just need to sort the posts list by the date metadata.

Let's do a bit of python repl sort testing because I never remember `datetime` usage.

Let's generate a few nicely formatted ISO date strings for testing.

```shell
❯ date -d'2023-01-01' -Is
2023-01-01T00:00:00-05:00
❯ date -Is
2024-06-17T16:30:35-04:00
```

And make a test array

```python
>>> posts = [{'date': '2023-01-01T00:00:00-05:00'}, {'date': '2024-06-17T16:30:35-04:00'}]
```

With our current script, the older post would be listed first. So let's try a sort.

```python
# Double checking datetime parsing
>>> import datetime
>>> newer = datetime.datetime.fromisoformat('2024-06-17T16:30:35-04:00')
>>> newer
datetime.datetime(2024, 6, 17, 16, 30, 35, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000)))
>>> older = datetime.datetime.fromisoformat('2023-01-01T00:00:00-05:00')
>>> older
datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400)))

# Checking python sorting methods work as expected
>>> newer.__gt__(older)
True
>>> newer.__lt__(older)
False
>>> older.__gt__(newer)
False
>>> older.__lt__(newer)
True

# Doing the sort
>>> sorted(posts, key=lambda x: datetime.datetime.fromisoformat(x['date']), reverse=True)
[{'date': '2024-06-17T16:30:35-04:00'}, {'date': '2023-01-01T00:00:00-05:00'}]
```

Now let's apply this to our posts.

```python
if __name__ == '__main__':
    posts = render_posts()
    logging.info('rendered posts: %s', posts)
    sorted_posts = sorted(posts,
        key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
    render_index(sorted_posts)
```
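Parsing the dates before sorting matters once posts carry different UTC offsets, because the raw ISO strings then no longer sort in chronological order. The sketch below uses two made-up posts (the titles and dates are only for illustration) to show the difference.

```python
from datetime import datetime

# 23:30 in New York (-04:00) is 03:30 UTC on June 18, which is
# half an hour *after* 05:00 in Berlin (+02:00, 03:00 UTC).
posts = [
    {'title': 'evening, New York', 'date': '2024-06-17T23:30:00-04:00'},
    {'title': 'morning, Berlin', 'date': '2024-06-18T05:00:00+02:00'},
]

def by_text(post):
    # Compares the raw strings character by character.
    return post['date']

def by_instant(post):
    # Compares the actual points in time, offsets included.
    return datetime.fromisoformat(post['date'])

newest_by_text = sorted(posts, key=by_text, reverse=True)[0]
newest_by_instant = sorted(posts, key=by_instant, reverse=True)[0]

print(newest_by_text['title'])     # morning, Berlin
print(newest_by_instant['title'])  # evening, New York
```

One caveat with this approach: every `Date` needs an offset. Python refuses to order a timezone-aware datetime against a naive one and raises `TypeError`.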
### `<title />` Templating

The last bit of templating is to make each post `<title>` different.

I'll try something like `<title>cfebs.com - ${title}</title>`

So in `index.html.tmpl`:

```html
<title>cfebs.com${more_title}</title>
```

For the index page, `more_title` is just an empty string. Note that `render_template` now takes a dict of substitutions instead of a single content string (that change is shown in the RSS section below).

```python
def render_index(posts):
    content_html = posts_list_html(posts)
    render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})
```

But for a post:

```python
def render_post(fpath):
    ...
    title = md.Meta.get('title')[0]
    date = md.Meta.get('date')[0]

    out = markdown.markdown('# ' + title) + out

    logging.info("writing to %s", destpath)
    render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
```

At this point we have functioning blog post generation with templating.

## RSS

This should be pretty easy, as RSS is just our blog index list reformatted into different XML.

The `render_template` function will be useful here with a few more tweaks. So I'll make another template file (based on a reference: <https://drewdevault.com/blog/index.xml>)

```shell
# Grab the reference
❯ curl -sL 'https://drewdevault.com/blog/index.xml' > index.xml.example

# After a bit of editing
❯ cat ./index.xml.tmpl
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>${site_title}</title>
    <link>${site_link}</link>
    <description>${description}</description>
    <language>en</language>
    <lastBuildDate>${last_build_date}</lastBuildDate>
    <atom:link href="${self_full_link}" rel="self" type="application/rss+xml" />
    ${items}
  </channel>
</rss>
```
`render_template` now gets even more generic and passes a `dict` to `Template.substitute()`

```python
def render_template(tpl_fname, out_fname, subs):
    with open(tpl_fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    out = tmpl.substitute(subs)

    with open(out_fname, 'w', encoding='utf-8') as outf:
        outf.write(out)
```

And make sure to adjust any usages of `render_template` that exist.

```python
def render_index(posts):
    content_html = posts_list_html(posts)
    render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})

def render_post(fpath):
    ...
    render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
```
And now we can hack away at RSS generation:

```python
def render_rss_index(posts):
    subs = {
        'site_title': 'cfebs.com',
        'site_link': 'https://cfebs.com',
        'self_full_link': 'https://cfebs.com/index.xml',
        'description': 'Recent content from cfebs.com',
        'last_build_date': 'TODO',
        'items': 'TODO',
    }
    render_template('index.xml.tmpl', 'index.xml', subs)
```

After this initial test and a `python3 ./main.py` run, we should see the XML filled out.

```shell
❯ cat ./index.xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>cfebs.com</title>
    <link>https://cfebs.com</link>
    <description>Recent content from cfebs.com</description>
    <language>en</language>
    <lastBuildDate>TODO</lastBuildDate>
    <atom:link href="https://cfebs.com/index.xml" rel="self" type="application/rss+xml" />
    TODO
  </channel>
</rss>
```
Now let's finish up by generating each item entry and collecting them to be substituted into the template.

And this adds another `md.convert()` call, so why not add a util to reuse a single Markdown instance.

```python
import html
import email.utils
from markdown.extensions.toc import TocExtension
...

# add a module var and helper method to reuse Markdown instance
md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
def convert(text):
    md.reset()
    return md.convert(text)

def render_post(fpath):
    ...
    out = convert(text)

    title = md.Meta.get('title')[0]
    date = md.Meta.get('date')[0]

    out = convert('# ' + title) + out
    ...

def rss_post_xml(post):
    tpl = """
<item>
    <title>{title}</title>
    <link>{link}</link>
    <pubDate>{pubdate}</pubDate>
    <guid>{link}</guid>
    <description>{description}</description>
</item>
"""

    with open(post['fpath'], 'r') as inf:
        text = inf.read()

    converted = convert(text)

    link = "https://cfebs.com/" + post['destpath']
    pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(post['date']))
    subs = dict(title=post['title'], link=link,
                pubdate=pubdate,
                description=converted)

    for k,v in subs.items():
        subs[k] = html.escape(v)

    return tpl.format(**subs)

def render_rss_index(posts):
    items = ''
    for post in posts[:5]:
        items += rss_post_xml(post)

    subs = {
        'site_title': 'cfebs.com',
        'site_link': 'https://cfebs.com',
        'self_full_link': 'https://cfebs.com/index.xml',
        'description': 'Recent content from cfebs.com',
        'last_build_date': email.utils.format_datetime(datetime.datetime.now()),
    }
    for k,v in subs.items():
        subs[k] = html.escape(v)

    subs['items'] = items
    render_template('index.xml.tmpl', 'index.xml', subs)
```

* Need to use `html.escape` anywhere there could be HTML tags in output.
* `posts[:5]` takes the 5 most recent posts for the RSS feed, as long as `posts` is already sorted newest first.

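To see why the escaping matters, here is a small standalone sketch that builds one `<item>` and parses it back with the stdlib XML parser. `rss_item` is a hypothetical helper written for this example (it is not the `rss_post_xml` function above), and the title, link and body are made up.

```python
import datetime
import email.utils
import html
import xml.etree.ElementTree as ET

def rss_item(title, link, iso_date, body_html):
    """Build one <item> element as a string (hypothetical helper)."""
    pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(iso_date))
    return (
        '<item>'
        f'<title>{html.escape(title)}</title>'
        f'<link>{html.escape(link)}</link>'
        f'<pubDate>{pubdate}</pubDate>'
        f'<description>{html.escape(body_html)}</description>'
        '</item>'
    )

item = rss_item('Tips & tricks', 'https://example.com/posts/tips.html',
                '2024-06-17T14:46:36-04:00', '<p>Use <code>&amp;</code> wisely</p>')

# Parsing the result back is a cheap well-formedness check.
parsed = ET.fromstring(item)
print(parsed.findtext('title'))        # Tips & tricks
print(parsed.findtext('pubDate'))      # Mon, 17 Jun 2024 14:46:36 -0400
print(parsed.findtext('description'))  # <p>Use <code>&amp;</code> wisely</p>

# An aware "now" keeps a real UTC offset in lastBuildDate.
print(email.utils.format_datetime(datetime.datetime.now(datetime.timezone.utc)))
```

Without `html.escape`, the bare `&` in the title would make the feed invalid XML. Also note that `email.utils.format_datetime` writes `-0000` as the offset for a naive datetime such as `datetime.datetime.now()`, so passing a timezone-aware value is probably the nicer choice for `lastBuildDate`.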
## Wrapping up

I've reached the end of the afternoon, so this is where I'll leave it.

It's not great software.

* No tests, no docs
* No validation of input or function arguments
* Hard coding values like the domain
* Using ad hoc dicts for generic structures
* Relies on system python version and packages
* Does not offer anything a tool like [hugo][hugo] does not already offer
* Probably slow in extreme cases

But, it's ~150 lines of python with 1 external dependency.

If python or `python-markdown` drastically changes, it'll probably take <10 minutes to debug.

And it was fun to write and write about.

View the complete source for generating this blog:

* [main.py](https://git.sr.ht/~cfebs/cfebs.srht.site/tree/main/item/main.py)
* [index.html.tmpl](https://git.sr.ht/~cfebs/cfebs.srht.site/tree/main/item/index.html.tmpl)
* [index.xml.tmpl](https://git.sr.ht/~cfebs/cfebs.srht.site/tree/main/item/index.xml.tmpl)

Or the full repo tree: <https://git.sr.ht/~cfebs/cfebs.srht.site/tree>

## EDIT

Additional things that get added later will go here.

### 2024-06-19, adding draft state

It might be nice to work on a rough draft, generate it for previewing, track it in git, but skip including it in the posts index.

So I'll add a piece of post metadata called `Draft` and use `filter()` before the posts are sorted and rendered into `index.html` or the RSS `index.xml`.

This is the result.

```diff
diff --git a/main.py b/main.py
index bfd9382..52ce57b 100644
--- a/main.py
+++ b/main.py
@@ -31,6 +31,9 @@ def render_post(fpath):
 
     title = md.Meta.get('title')[0]
     date = md.Meta.get('date')[0]
+    draft = False
+    if md.Meta.get('draft'):
+        draft = True
 
     out = convert('# ' + title) + out
 
@@ -42,6 +45,7 @@ def render_post(fpath):
         'date': date,
         'fpath': fpath,
         'destpath': destpath,
+        'draft': draft,
     }
 
 def render_posts():
@@ -134,6 +138,7 @@ def render_rss_index(posts):
 def main():
     posts = render_posts()
     logging.info('rendered posts: %s', posts)
+    posts = filter(lambda p: not p['draft'], posts)
     sorted_posts = sorted(posts,
         key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
     render_index(sorted_posts)
```

And testing it:

```shell
❯ cat ./posts/test_thing.md
Title: Test thing
Date: 2024-06-19T13:38:34-04:00
Draft: 1

❯ python main.py
...
```

The HTML file should get generated, but the post should not show up in the index or the RSS XML:

```shell
❯ grep 'Test thing' ./posts/test_thing.html
<title>cfebs.com - Test thing</title>
<h1 id="test-thing"><a class="toclink" href="#test-thing">Test thing</a></h1>

❯ grep 'Test thing' ./index.html ./index.xml | wc -l
0
```
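One thing to watch with the check in the diff: `md.Meta.get('draft')` returns a list such as `['0']`, and any non-empty list is truthy, so `Draft: 0` or `Draft: false` would still mark the post as a draft. If that ever becomes annoying, a stricter reading could look like the sketch below. `is_draft` and the `FALSY` set are assumptions made up for this example and are not in `main.py`.

```python
# Values that should NOT count as "this is a draft" (an assumption of this sketch).
FALSY = {'', '0', 'false', 'no', 'off'}

def is_draft(meta):
    """Interpret the optional `draft` metadata key as a boolean (hypothetical helper)."""
    values = meta.get('draft')
    if not values:
        return False
    return values[0].strip().lower() not in FALSY

print(is_draft({'draft': ['1']}))      # True
print(is_draft({'draft': ['false']}))  # False
print(is_draft({'title': ['x']}))      # False
```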
### 2024-06-19, is it slow?

Quick benchmark script `bench.sh`

```bash
#!/usr/bin/env bash

amt=$1
if [[ -z "$amt" ]]; then
    echo "ERROR: pass number of test posts for bench" 1>&2
    exit 1
fi

echo "INFO: removing old __bench files" 1>&2
rm -f ./posts/*__bench*
for i in $(seq 1 "$amt"); do
    cp ./posts/build_a_blog.md ./posts/build_a_blog_${i}__bench.md
done

echo "INFO: number of *.md files $(find ./posts/ -iname '*.md' | wc -l)" 1>&2
echo "INFO: number of *.html files $(find ./posts/ -iname '*.html' | wc -l)" 1>&2
echo "INFO: running" 1>&2
time -p python main.py 2>/dev/null
rc=$?
if [[ "$rc" != "0" ]]; then
    echo "ERROR: program exited with $rc" 1>&2
    exit 1
fi
echo "INFO: number of *.html files $(find ./posts/ -iname '*.html' | wc -l)" 1>&2
echo "INFO: cleanup __bench files" 1>&2
rm -f ./posts/*__bench*
```

```shell
# Run on an AMD Ryzen 7 7840U (8 cores / 16 threads)
❯ ./bench.sh 100
INFO: removing old __bench files
INFO: number of *.md files 102
INFO: number of *.html files 2
INFO: running
real 0.94
user 0.92
sys 0.02
INFO: number of *.html files 102
INFO: cleanup __bench files

❯ ./bench.sh 1000
INFO: removing old __bench files
INFO: number of *.md files 1002
INFO: number of *.html files 2
INFO: running
real 8.45
user 8.31
sys 0.12
INFO: number of *.html files 1002
INFO: cleanup __bench files
```

So roughly 0.85s per 100 posts, which starts to get a bit painful in the thousands.

That will be a fun future problem to try to solve.

### 2024-06-19, gotta go fast?

The critical part of the program that gets slower with more files is where each Markdown file is rendered to HTML.

I'm by no means a python concurrency expert, but after a quick search `multiprocessing.Pool` looks like a really quick win here.

Luckily `render_posts()` is already in a great format for using `Pool.map`

* 1 array of input file names
* Call `render_post` with 1 file name as an argument
* Result is collected in a list.

So here is the diff to make that happen:

```diff
diff --git a/main.py b/main.py
index 52ce57b..2ea80cb 100644
--- a/main.py
+++ b/main.py
@@ -1,9 +1,11 @@
+import os
 import re
 import glob
 import html
 import email
 import logging
 import datetime
+from multiprocessing import Pool
 from string import Template
 
 import markdown
@@ -12,11 +14,12 @@ from markdown.extensions.toc import TocExtension
 destpath_re = re.compile(r'\.md$')
 logging.basicConfig(encoding='utf-8', level=logging.INFO)
 
-md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
+cpu_count = os.cpu_count()
 
 def convert(text):
-    md.reset()
-    return md.convert(text)
+    md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
+    res = md.convert(text)
+    return res, md.Meta
 
 def render_post(fpath):
     destpath = destpath_re.sub('.html', fpath)
@@ -27,15 +30,16 @@ def render_post(fpath):
         text = input_file.read()
 
     logging.info("parsing %s", fpath)
-    out = convert(text)
+    out, meta = convert(text)
 
-    title = md.Meta.get('title')[0]
-    date = md.Meta.get('date')[0]
+    title = meta.get('title')[0]
+    date = meta.get('date')[0]
     draft = False
-    if md.Meta.get('draft'):
+    if meta.get('draft'):
         draft = True
 
-    out = convert('# ' + title) + out
+    title_out, _ = convert('# ' + title)
+    out = title_out + out
 
     logging.info("writing to %s", destpath)
     render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
@@ -52,11 +56,11 @@ def render_posts():
     files = glob.glob('posts/*.md')
     logging.info('found post files %s', files)
     posts = []
-    for fname in files:
-        p = render_post(fname)
-        posts.append(p)
-        logging.info('rendered post: %s', p)
+    logging.info('starting render posts with cpu_count: %d', cpu_count)
+    with Pool(processes=cpu_count) as pool:
+        posts = pool.map(render_post, files)
 
+    logging.info("render_posts result: %s", posts)
     return posts
 
 def posts_list_html(posts):
@@ -102,7 +106,7 @@ def rss_post_xml(post):
         text = inf.read()
 
 
-    converted = convert(text)
+    converted, _ = convert(text)
 
     pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(post['date']))
     subs = {
```

The biggest note is that `convert()` now creates a `Markdown` instance on each call and returns the metadata along with the HTML. This protects against multiple processes trying to use the same module level `md`.

See <https://python-markdown.github.io/reference/#Markdown> for notes on `Markdown.reset()` and thread safety.

And re-run the benchmarks:

```shell
# Run on an AMD Ryzen 7 7840U (8 cores / 16 threads)
❯ ./bench.sh 100
INFO: removing old __bench files
INFO: number of *.md files 102
INFO: number of *.html files 2
INFO: running
real 0.31
user 2.44
sys 0.28
INFO: number of *.html files 102
INFO: cleanup __bench files

❯ ./bench.sh 1000
INFO: removing old __bench files
INFO: number of *.md files 1002
INFO: number of *.html files 2
INFO: running
real 1.34
user 18.09
sys 0.47
INFO: number of *.html files 1002
INFO: cleanup __bench files
```

So that's down from ~8.5s to ~1.3s of wall-clock time for 1000 posts. Not a bad start!

[1]: https://crystal-lang.org/
[2]: https://github.com/crystal-lang/crystal/releases/tag/0.31.0
[3]: https://pkgs.alpinelinux.org/package/edge/main/x86_64/py3-markdown
[4]: https://python-markdown.github.io/
[5]: https://archlinux.org/packages/extra/any/python-markdown/
[hugo]: https://gohugo.io/