Title: Build-a-blog
Date: 2024-06-17T14:46:36-04:00

I want to share my thought process for how to go about building a static blog generator from scratch.

There will be nothing groundbreaking here - in fact this software will not be good. So turn back now if you're expecting the new Hugo.

Actually you should probably stop reading and just use Hugo.

In case you are still interested, the goal is to take 1 afternoon + caffeine + some DIY spirit → something resembling a static site/blog generator.

And I hope by the end of this post you might be inspired to build your own generation scripts, maybe in a new language you always wanted to try.

Let's see how hard this will be.

Here are the requirements for this blog:

  • Generate an index with a list of recent posts.
  • Generate each individual post written in markdown -> html
    • Support some metadata in each post
    • A post title should have a slug
  • Generate RSS

That boils down to:

  1. Read some files
  2. Parse markdown, maybe parse a header with some key/values.
  3. Template strings

So there is 1 "exotic" feature in parsing/rendering Markdown as HTML that will need some thought.

The rest is just file and string manipulation.
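
In this post the filename effectively acts as the slug (posts/build_a_blog.md). If the slug were derived from the title instead, a minimal sketch could look like the following. `slugify` is a hypothetical helper for illustration and is not part of main.py.

```python
import re
import unicodedata

def slugify(title):
    """Turn a post title into a lowercase, hyphen-separated slug."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_title = (
        unicodedata.normalize('NFKD', title)
        .encode('ascii', 'ignore')
        .decode('ascii')
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r'[^a-z0-9]+', '-', ascii_title.lower())
    # Trim hyphens left over at either end.
    return slug.strip('-')

print(slugify('Build-a-blog'))
print(slugify('A new post!'))
```

Dropping non-ASCII characters is a deliberate simplification here; titles in other scripts would need a different approach.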

Let's get it on.

Picking the tool for the job

Most scripting languages would be fine tools for this task. But how to handle Markdown?

I've had Crystal in the back of my mind for this task. It is a nice general-purpose language that included Markdown in the stdlib! But unfortunately Markdown was removed in 0.31.0. Other than that, I'm not sure any other language includes a well-rounded Markdown implementation out of the box.

I'll likely end up building the site in Docker with an Alpine image down the road, so here's a quick search of Alpine's repos to see what could be useful:

 docker run --rm -it alpine
/ # apk update
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/main]
v3.18.6-263-g77db018514d [https://dl-cdn.alpinelinux.org/alpine/v3.18/community]
OK: 20079 distinct packages available
/ # apk search markdown
discount-2.2.7c-r1
discount-dev-2.2.7c-r1
discount-libs-2.2.7c-r1
kdepim-addons-23.04.3-r0
markdown-1.0.1-r3
markdown-doc-1.0.1-r3
py3-docstring-to-markdown-0.12-r1
py3-docstring-to-markdown-pyc-0.12-r1
py3-html2markdown-0.1.7-r3
py3-html2markdown-pyc-0.1.7-r3
py3-markdown-3.4.3-r1
py3-markdown-it-py-2.2.0-r1
py3-markdown-it-py-pyc-2.2.0-r1
py3-markdown-pyc-3.4.3-r1

py3-markdown in Alpine is the popular python-markdown. It's mature and available as a package in my home distro.

Incredible.

Let's build

First, let's read 1 post file and render some HTML.

I'll store posts in posts/ like posts/build_a_blog.md.

And we'll store the HTML output in the same directory: posts/build_a_blog.html.

import re
import logging

import markdown

# matches a trailing .md extension so it can be swapped for .html
destpath_re = re.compile(r'\.md$')

logging.basicConfig(encoding='utf-8', level=logging.INFO)

def render_post(fpath):
	destpath = destpath_re.sub('.html', fpath)
	logging.info("opening %s for parsing, dest %s", fpath, destpath)
	# from: https://python-markdown.github.io/reference/
	with open(fpath, "r", encoding="utf-8") as input_file:
		logging.info("reading %s", fpath)
		text = input_file.read()

	logging.info("parsing %s", fpath)
	out = markdown.markdown(text)

	with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
		logging.info("writing to %s", destpath)
		output_file.write(out)

if __name__ == '__main__':
	render_post('posts/build_a_blog.md')
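
As an aside, the .md to .html mapping can also be done without a regex. This is a minimal sketch using pathlib; `dest_path` is a hypothetical name and the script above keeps using `destpath_re`.

```python
from pathlib import PurePosixPath

def dest_path(fpath):
    """Map posts/foo.md to posts/foo.html by swapping the file suffix."""
    # PurePosixPath keeps forward slashes regardless of the host OS.
    return str(PurePosixPath(fpath).with_suffix('.html'))

print(dest_path('posts/build_a_blog.md'))
```

One difference to keep in mind: the regex leaves a path that does not end in .md untouched, while `with_suffix` replaces whatever suffix is there.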

And if we run it:

 python3 ./main.py
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html

 head posts/build_a_blog.html
<h1>Build-a-blog</h1>
<p>I want to share my thought process for how one would go about building a static blog generator from scratch.</p>
<ul>
<li>Generate an index with recent list of posts.</li>
<li>Generate each individual post written in markdown -&gt; html<ul>
<li>Support some metadata in each post</li>
<li>A post title should have a slug</li>
</ul>
</li>
<li>Generate RSS</li>

Looking pretty good.

Now let's do this for all .md files in posts/.

import glob
...

def render_posts():
	files = glob.glob('posts/*.md')
	logging.info('found post files %s', files)
	for fname in files:
		render_post(fname)

if __name__ == '__main__':
	render_posts()

And add another simple test post:

 echo '# A new post' > ./posts/a_new_post.md
 python3 ./main.py
INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
INFO:root:reading posts/a_new_post.md
INFO:root:parsing posts/a_new_post.md
INFO:root:writing to posts/a_new_post.html
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html
 head ./posts/a_new_post.html
<h1>A new post</h1>

Basically at this point, it's a blog generator!

But I want a few more features:

  • I want the posts listed in the index sorted by date.
  • I want each post to be templated in some HTML wrapper.

Post ordering and templating

python-markdown supports metadata embedded in posts: https://python-markdown.github.io/extensions/meta_data/

I thought I'd need to build something here, but it turns out this is exactly what I need to assign a few extra attributes to a post.

I'll adjust our "spec" for posts such that each post must include the following metadata at the top of the file:

Title:  Build-a-blog
Date:   2024-06-17T14:46:36-04:00
---

And I'd like to insert the Title automatically as a <h1> tag in each post so I don't have to write it again in the markdown.

So first, let's test the metadata and adjust the test blog post.

 head -n4 ./posts/build_a_blog.md
Title:  Build-a-blog
Date:   2024-06-17T14:46:36-04:00
---

And pop open a python repl to see how this works.

>>> md = markdown.Markdown(extensions = ['meta']); f = open('posts/build_a_blog.md', 'r'); txt = f.read(); out = md.convert(txt); md.Meta
{'title': ['Build-a-blog'], 'date': ['2024-06-17T14:46:36-04:00']}

Looks pretty nice!

So first I will adjust the rendering function to prepend a line:

# {title}

This has to be done after the full document renders, because the title is only available in md.Meta once convert() has run: the meta python-markdown extension extracts metadata and converts to HTML in one call.

def render_post(fpath):
	...

	md = markdown.Markdown(extensions = ['meta'])

	logging.info("parsing %s", fpath)
	out = md.convert(text)

	title = md.Meta.get('title')[0]
	date = md.Meta.get('date')[0]

	out = markdown.markdown('# ' + title) + out
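
One thing to watch: `md.Meta.get('title')[0]` raises a TypeError when a post has no Title line, because `get` returns None. A minimal sketch of a more forgiving lookup follows. `first_meta` is a hypothetical helper, and the dict below just mimics the shape of `md.Meta` seen in the repl (lowercased keys mapping to lists of strings).

```python
def first_meta(meta, key, default=None):
    """Return the first value stored under key, or default if missing or empty."""
    values = meta.get(key) or []
    return values[0] if values else default

# Same shape as the md.Meta output shown in the repl session.
meta = {'title': ['Build-a-blog'], 'date': ['2024-06-17T14:46:36-04:00']}

print(first_meta(meta, 'title'))
print(first_meta(meta, 'author', default='unknown'))
```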

Finally, let's return a structure that will make other parts of the program aware of the filename that was rendered and the metadata (title, date):

def render_post(fpath):
	...
	out = markdown.markdown('# ' + title) + out

	with open(destpath, "w", encoding="utf-8", errors="xmlcharrefreplace") as output_file:
		logging.info("writing to %s", destpath)
		output_file.write(out)

	return {
		'title': title,
		'date': date,
		'fpath': fpath,
		'destpath': destpath,
	}

Now we have what we need to generate a complete index.

Index templating

Let's start by defining what our index template file will be.

I'll choose index.html.tmpl and after rendering we will write to index.html.

So let's make a function that will take a list of our post structure above and render it in a <ul>.

import datetime
from string import Template
...
def posts_list_html(posts):
	post_tpl = """<li>
		<a href="{href}">{title}</a>
		<time datetime="{date}">{disp_date}</time>
	</li>"""
	out = '<ul class="blog-posts-list">'
	for post in posts:
		disp_date = datetime.datetime.fromisoformat(post.get('date')).strftime('%Y-%m-%d')
		out += post_tpl.format(href=post.get('destpath'),
						 title=post.get('title'),
						 date=post.get('date'),
						 disp_date=disp_date)
	return out + '</ul>'

def render_index(posts):
    fname = 'index.html.tmpl'
    outname = 'index.html'

    with open(fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    # call the list builder defined above (it is named posts_list_html)
    posts_html = posts_list_html(posts)

    html = tmpl.substitute(posts=posts_html)

    with open(outname, 'w', encoding='utf-8') as outf:
        outf.write(html)
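
The list builder above drops titles straight into HTML, which is fine until a title contains `&` or `<`. A minimal sketch of an escaped list item, assuming the same post dict shape as above; `post_item_html` is a hypothetical name and not part of main.py.

```python
import datetime
import html

def post_item_html(post):
    """Render one <li> for the index, escaping values that come from post metadata."""
    disp_date = datetime.datetime.fromisoformat(post['date']).strftime('%Y-%m-%d')
    return '<li><a href="{href}">{title}</a> <time datetime="{date}">{disp_date}</time></li>'.format(
        href=html.escape(post['destpath'], quote=True),
        title=html.escape(post['title']),
        date=html.escape(post['date'], quote=True),
        disp_date=disp_date,
    )

item = post_item_html({
    'title': 'Tips & <tricks>',
    'date': '2024-06-17T14:46:36-04:00',
    'destpath': 'posts/tips.html',
})
print(item)
```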

Make sure that index.html.tmpl contains a template variable for ${posts}

 grep -C2 '\${posts}' ./index.html.tmpl
  <div class="col-md-8 col-sm-12">
    <p>Welcome. Something will go here eventually.</p>
    ${posts}
  </div>
  <div class="col-md-4 col-sm-12">
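
For reference, here is a minimal sketch of how `string.Template` treats that placeholder. The template text is made up for illustration. The part worth remembering is that a literal dollar sign anywhere in the template has to be written as `$$`, otherwise `substitute()` raises a ValueError.

```python
from string import Template

# A tiny stand-in for index.html.tmpl; note the escaped dollar sign.
tmpl = Template('<div class="col-md-8">\n  ${posts}\n</div>\n<p>Hosting costs $$5</p>')

html_out = tmpl.substitute(posts='<ul class="blog-posts-list"></ul>')
print(html_out)
```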

And we now need to connect render_posts(), which returns each post that was processed, to render_index():

def render_posts():
	files = glob.glob('posts/*.md')
	logging.info('found post files %s', files)
	posts = []
	for fname in files:
		p = render_post(fname)
		posts.append(p)
		logging.info('rendered post: %s', p)

	return posts

if __name__ == '__main__':
	posts = render_posts()
	logging.info('rendered posts: %s', posts)
	render_index(posts)

And let's run it!

 python3 ./main.py
INFO:root:found post files ['posts/a_new_post.md', 'posts/build_a_blog.md']
INFO:root:opening posts/a_new_post.md for parsing, dest posts/a_new_post.html
INFO:root:reading posts/a_new_post.md
INFO:root:parsing posts/a_new_post.md
INFO:root:writing to posts/a_new_post.html
INFO:root:rendered post: {'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}
INFO:root:opening posts/build_a_blog.md for parsing, dest posts/build_a_blog.html
INFO:root:reading posts/build_a_blog.md
INFO:root:parsing posts/build_a_blog.md
INFO:root:writing to posts/build_a_blog.html
INFO:root:rendered post: {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}
INFO:root:rendered posts: [{'title': 'A new post', 'date': '2024-06-17T15:09:26-04:00', 'fpath': 'posts/a_new_post.md', 'destpath': 'posts/a_new_post.html'}, {'title': 'Build-a-blog', 'date': '2024-06-17T14:46:36-04:00', 'fpath': 'posts/build_a_blog.md', 'destpath': 'posts/build_a_blog.html'}]

And check how the output looks:

 grep -C4 'blog-posts-list' ./index.html
  </nav>
  <section class="container">
    <div class="row">
      <div class="col-md-8 col-sm-12">
        <ul class="blog-posts-list"><li>
		<a href="posts/a_new_post.html">A new post</a>
		<time datetime="2024-06-17T19:48:17-04:00">2024-06-17</time>
	</li><li>
		<a href="posts/build_a_blog.html">Build-a-blog</a>

Not bad!

Post templating

I think I want my blog to just maintain the overall layout from the index page and just render the post body where the main post list is.

So let's make that template rendering a bit more general.

I'll also rename the content area template variable from ${posts} to ${content}, since it now holds either the post list or a post body.

def render_template(tpl_fname, out_fname, content_html):
    with open(tpl_fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    html = tmpl.substitute(content=content_html)

    with open(out_fname, 'w', encoding='utf-8') as outf:
        outf.write(html)

def render_index(posts):
	content_html = posts_list_html(posts)
	render_template('index.html.tmpl', 'index.html', content_html)

And now adjust where posts are written out.

def render_post(fpath):
	...
	out = markdown.markdown('# ' + title) + out
	logging.info("writing to %s", destpath)
	# pass the rendered post body (out) as the template content
	render_template('index.html.tmpl', destpath, out)

After running, you should see the posts/*.html files, where each one uses the full index template and includes the generated post HTML.

Post sorting

With everything wired up, we now just need to sort the posts list by the date metadata.

Let's do a bit of python repl sort testing because I never remember datetime usage.

Let's generate a few nicely formatted ISO date strings for testing.

 date -d'2023-01-01' -Is
2023-01-01T00:00:00-05:00
 date -Is
2024-06-17T16:30:35-04:00

And make a test list:

>>> posts = [{'date': '2023-01-01T00:00:00-05:00'}, {'date': '2024-06-17T16:30:35-04:00'}]

With our current script, the older post would be listed first. So let's try a sort.

# Double checking datetime parsing
>>> import datetime
>>> newer = datetime.datetime.fromisoformat('2024-06-17T16:30:35-04:00')
datetime.datetime(2024, 6, 17, 16, 30, 35, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000)))
>>> older = datetime.datetime.fromisoformat('2023-01-01T00:00:00-05:00')
datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=68400)))

# Checking python sorting methods work as expected
>>> newer.__gt__(older)
True
>>> newer.__lt__(older)
False
>>> older.__gt__(newer)
False
>>> older.__lt__(newer)
True

# Doing the sort
>>> sorted(posts, key=lambda x: datetime.datetime.fromisoformat(x['date']), reverse=True)
[{'date': '2024-06-17T16:30:35-04:00'}, {'date': '2023-01-01T00:00:00-05:00'}]
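
Parsing the dates before sorting matters once posts carry different UTC offsets, because plain string comparison ignores the offset. A minimal sketch with two made-up dates:

```python
import datetime

posts = [
    {'date': '2024-06-17T23:00:00-04:00'},  # 03:00 UTC on June 18
    {'date': '2024-06-18T01:00:00+02:00'},  # 23:00 UTC on June 17
]

# Sorting the raw strings compares characters, so "06-18" wins.
by_string = sorted(posts, key=lambda p: p['date'], reverse=True)

# Sorting parsed datetimes compares the actual instants.
by_instant = sorted(
    posts,
    key=lambda p: datetime.datetime.fromisoformat(p['date']),
    reverse=True,
)

print(by_string[0]['date'])
print(by_instant[0]['date'])
```

The two orderings disagree here: the string sort puts the +02:00 post first, while the datetime sort correctly puts the -04:00 post first because it happened later.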

Now let's apply this to our posts.

if __name__ == '__main__':
	posts = render_posts()
	logging.info('rendered posts: %s', posts)
	sorted_posts = sorted(posts,
					   key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
	render_index(sorted_posts)

<title /> Templating

The last bit of templating is to make each post <title> different.

I'll try something like <title>cfebs.com - ${title}</title>

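So in index.html.tmpl:

<title>cfebs.com${more_title}</title>

For the index page, more_title is just an empty string. This also means render_template has to accept a dict of substitutions instead of a single content string; that version is shown in the RSS section.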

def render_index(posts):
	content_html = posts_list_html(posts)
	render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})

But for a post:

def render_post(fpath):
	...
	title = md.Meta.get('title')[0]
	date = md.Meta.get('date')[0]

	out = markdown.markdown('# ' + title) + out

	logging.info("writing to %s", destpath)
	render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})

At this point we have functioning blog post generation with templating.

RSS

This should be pretty easy as RSS is just reformatting our blog index list into different XML.

The render_template function will be useful here with a few more tweaks. So I'll make another template file (based off a reference https://drewdevault.com/blog/index.xml)

# Grab the reference
 curl -sL 'https://drewdevault.com/blog/index.xml' > index.xml.example

# After a bit of editing
 cat ./index.xml.tmpl
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>${site_title}</title>
		<link>${site_link}</link>
		<description>${description}</description>
		<language>en</language>
		<lastBuildDate>${last_build_date}</lastBuildDate>
		<atom:link href="${self_full_link}" rel="self" type="application/rss+xml" />
		${items}
	</channel>
</rss>

render_template now gets even more generic and passes a dict to Template.substitute():

def render_template(tpl_fname, out_fname, subs):
    with open(tpl_fname, 'r', encoding='utf-8') as inf:
        tmpl = Template(inf.read())

    out = tmpl.substitute(subs)

    with open(out_fname, 'w', encoding='utf-8') as outf:
        outf.write(out)

And make sure to adjust any usages of render_template that exist.

def render_index(posts):
	content_html = posts_list_html(posts)
	# the template also contains ${more_title}, so it must be supplied here too
	render_template('index.html.tmpl', 'index.html', {'content': content_html, 'more_title': ''})

def render_post(fpath):
	...
	render_template('index.html.tmpl', destpath, {'content': out, 'more_title': ' - ' + title})
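
Since index.html.tmpl contains both ${content} and ${more_title}, every caller has to pass both keys. A minimal sketch of what happens otherwise, using a made-up one-line template in place of the real file:

```python
from string import Template

tmpl = Template('<title>cfebs.com${more_title}</title><main>${content}</main>')

# substitute() raises KeyError when a placeholder has no value.
try:
    tmpl.substitute({'content': '<p>hi</p>'})
    missing = None
except KeyError as err:
    missing = err.args[0]

# Supplying every key works as expected.
full = tmpl.substitute({'content': '<p>hi</p>', 'more_title': ''})

# safe_substitute() leaves unknown placeholders in the output instead of raising.
lenient = tmpl.safe_substitute({'content': '<p>hi</p>'})

print(missing)
print(full)
print(lenient)
```

Sticking with `substitute()` seems like the better fit for a generator like this, since a loud KeyError beats silently publishing a page with `${more_title}` in the title.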

And now I can hack away at RSS generation:

def render_rss_index(posts):
    subs = {
        'site_title': 'cfebs.com',
        'site_link': 'https://cfebs.com',
        'self_full_link': 'https://cfebs.com/index.xml',
        'description': 'Recent content from cfebs.com',
        'last_build_date': 'TODO',
        'items': 'TODO',
    }
    render_template('index.xml.tmpl', 'index.xml', subs)

After this initial test and a python3 ./main.py run, we should see the XML filled out:

 cat ./index.xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>cfebs.com</title>
		<link>https://cfebs.com</link>
		<description>Recent content from cfebs.com</description>
		<language>en</language>
		<lastBuildDate>TODO</lastBuildDate>
		<atom:link href="https://cfebs.com/index.xml" rel="self" type="application/rss+xml" />
		TODO
	</channel>
</rss>

Now let's finish up by generating each item entry and collecting them to be replaced in the template.

And this adds another md.convert() call, so why not add a util to reuse a single Markdown instance.

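from markdown.extensions.toc import TocExtension

# add a module var and helper function to reuse a single Markdown instance
md = markdown.Markdown(extensions=['extra', 'meta', TocExtension(anchorlink=True)])
def convert(text):
    # reset() clears state (such as md.Meta) left over from the previous document
    md.reset()
    return md.convert(text)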

def render_post(fpath):
	...
	out = convert(text)

	title = md.Meta.get('title')[0]
	date = md.Meta.get('date')[0]

	out = convert('# ' + title) + out
	...

import email.utils
import html

def rss_post_xml(post):
	tpl = """
	<item>
		<title>{title}</title>
		<link>{link}</link>
		<pubDate>{pubdate}</pubDate>
		<guid>{link}</guid>
		<description>{description}</description>
	</item>
	"""

	with open(post['fpath'], 'r', encoding='utf-8') as inf:
		text = inf.read()

	converted = convert(text)

	link = "https://cfebs.com/" + post['destpath']
	pubdate = email.utils.format_datetime(datetime.datetime.fromisoformat(post['date']))
	subs = dict(title=post['title'], link=link,
			pubdate=pubdate,
			description=converted)

	for k,v in subs.items():
		subs[k] = html.escape(v)

	return tpl.format(**subs)

def render_rss_index(posts):
	items = ''
	for post in posts[:5]:
		items += rss_post_xml(post)

	subs = {
		'site_title': 'cfebs.com',
		'site_link': 'https://cfebs.com',
		'self_full_link': 'https://cfebs.com/index.xml',
		'description': 'Recent content from cfebs.com',
		'last_build_date': email.utils.format_datetime(datetime.datetime.now()),
	}
	for k,v in subs.items():
		subs[k] = html.escape(v)

	subs['items'] = items
	render_template('index.xml.tmpl', 'index.xml', subs)

A couple of notes:

  • html.escape is needed anywhere there could be HTML tags in the output.
  • posts[:5] takes the 5 most recent posts for the RSS feed, as long as the list passed in is already sorted newest first.
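
To make the two stdlib helpers used above a bit more concrete, here is a minimal sketch of what they produce. The input values are made up. Note that a naive datetime (like the `datetime.datetime.now()` used for lastBuildDate) is formatted with a `-0000` offset, while a timezone-aware one carries its real offset.

```python
import datetime
import email.utils
import html

# RSS wants RFC 822 style dates; format_datetime produces them.
pubdate = email.utils.format_datetime(
    datetime.datetime.fromisoformat('2024-06-17T14:46:36-04:00')
)

# A naive datetime has no offset, so the zone is rendered as -0000.
naive = email.utils.format_datetime(datetime.datetime(2024, 6, 17, 14, 46, 36))

# A timezone-aware "now" in UTC is rendered with +0000.
aware_now = email.utils.format_datetime(
    datetime.datetime.now(datetime.timezone.utc)
)

# Escaping lets rendered HTML sit inside <description> as plain text.
escaped = html.escape('<h1>Tips & tricks</h1>')

print(pubdate)
print(naive)
print(escaped)
```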

Wrapping up

Reached the end of the afternoon, so this is where I'll leave it.

It's not great software.

  • No tests, no docs
  • No validation of input or function arguments
  • Hard coding values like the domain
  • Using ad hoc dicts for generic structures
  • Relies on system python version and packages.
  • Does not offer anything a tool like Hugo does not already offer.

But, it's ~150 lines of python with 1 external dependency.

If python or python-markdown drastically changes, it'll probably take <10 minutes to debug.

And - it was fun to write and write about.

View the complete source for generating this blog:

Or the full repo tree: https://git.sr.ht/~cfebs/cfebs.srht.site/tree

EDIT

Any additional things that get added later will go here.

2024-06-19, adding draft state

It might be nice to work on a rough draft, generate it for previewing, track it in git, but skip including it in the posts index.

So I'll add a piece of post metadata called Draft and use filter() before the posts are sorted or applied to the index.html or RSS index.xml.

This is the result.

diff --git a/main.py b/main.py
index bfd9382..52ce57b 100644
--- a/main.py
+++ b/main.py
@@ -31,6 +31,9 @@ def render_post(fpath):

 	title = md.Meta.get('title')[0]
 	date = md.Meta.get('date')[0]
+	draft = False
+	if md.Meta.get('draft'):
+		draft = True

 	out = convert('# ' + title) + out

@@ -42,6 +45,7 @@ def render_post(fpath):
 		'date': date,
 		'fpath': fpath,
 		'destpath': destpath,
+		'draft': draft,
 	}

 def render_posts():
@@ -134,6 +138,7 @@ def render_rss_index(posts):
 def main():
 	posts = render_posts()
 	logging.info('rendered posts: %s', posts)
+	posts = filter(lambda p: not p['draft'], posts)
 	sorted_posts = sorted(posts,
 					   key=lambda p: datetime.datetime.fromisoformat(p['date']), reverse=True)
 	render_index(sorted_posts)
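
One caveat with the truthiness check in the diff: the meta extension stores values as lists of strings, so `Draft: 0` or `Draft: false` would also count as a draft. A minimal sketch of a stricter check, assuming the same `md.Meta` shape; `is_draft` and the accepted values are my own choice for illustration and not what main.py does.

```python
def is_draft(meta):
    """Treat only explicit truthy strings in the Draft header as a draft."""
    values = meta.get('draft') or ['']
    return values[0].strip().lower() in ('1', 'true', 'yes')

posts = [
    {'title': 'Test thing', 'draft': is_draft({'draft': ['1']})},
    {'title': 'Published', 'draft': is_draft({'draft': ['0']})},
    {'title': 'No header', 'draft': is_draft({})},
]

# A list comprehension does the same job as filter(), but returns a list.
published = [p for p in posts if not p['draft']]
print([p['title'] for p in published])
```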

And testing it:

 cat ./posts/test_thing.md
Title: Test thing
Date: 2024-06-19T13:38:34-04:00
Draft: 1

 python main.py
...

The HTML should get generated, but the post should not show up in the index or the RSS XML:

 grep 'Test thing' ./posts/test_thing.html
  <title>cfebs.com - Test thing</title>
        <h1 id="test-thing"><a class="toclink" href="#test-thing">Test thing</a></h1>

 grep 'Test thing' ./index.html ./index.xml | wc -l
0