post past :: james murty

I have just finished migrating my long-neglected blog from a self-hosted Wordpress installation on my own VPS to a static site hosted on the Amazon S3 storage service using its website hosting feature.

This post captures the main steps involved, in case anyone else is interested in doing the same thing.

Why Migrate from Wordpress?

My blog has run well enough on a self-hosted Wordpress set-up for years, first at a cheap site hosting company (where performance was terrible) then on my own Linode VPS server (where performance has been great).

However, I have never been very happy relying on the PHP-based Wordpress app due to the frequent security updates it requires and my gut-level distrust of PHP itself. I know it well enough not to trust it, especially when my blog can lie neglected for months at a time when I'm busy. It made me nervous knowing I could be one tardy security-update away from a hacked site and server.

Besides, this blog is small enough that any powerful blogging system is overkill for my needs.

Static Site Generation

Because my blog is simple, it is a perfect candidate to be served as a static site instead of a dynamically-generated one. Instead of Wordpress and PHP code running on my server to generate HTML pages when someone lands on a page, why not just serve HTML pages directly? This is about as simple as it can get.

But even a simple blog like this is far too complex to hand-code all the HTML and other files necessary to show do things like display a post, list all posts, provide RSS/Atom syndication feeds for blog reader software, provide sitemap XML files for Google, etc. etc. So I need a middle ground: a static site generator.

At this stage it's worth saying explicitly that using static site generators, while "simpler" from a web-server point of view, requires much more knowledge about how web sites work and how to host them. As the rest of this post will make clear, generating and hosting your own static site is an involved and deeply nerdy process best suited to people who are web programmers, or interested in becoming one.

Pelican

There are roughly fourteen bazillion site generator projects around, with the number growing constantly.

I chose Pelican for my blog because:

It seems relatively well-established and mature in the field.
It allows posts to be written in Markdown (or other human-friendly markup languages)
It's written in Python, a language I like and that I can easily hack on if I want to contribute fixes or improvements to the project.
It has plugins that do things like generating sitemap files and other niceties that would be tedious to build myself.

I installed a very recent code version (pre 3.3) of Pelican into a pre-prepared virtualenv Python environment. I used the latest code because it has improvements that are particularly useful for migrating existing Wordpress sites. I also installed some additional requirements:

# Pelican pre 3.3 code version, with slug and pages import improvements
pip install -e git+git@github.com:jmurty/pelican.git@675d6c81#egg=pelican-dev

# Markdown for migrating and authoring posts in this markup language
pip install Markdown==2.3.1

# BeautifulSoup is required to migrate Wordpress posts
pip install BeautifulSoup

After installation I ran the pelican-quickstart command to create initial configuration files and some helper scripts.

Be warned though: the generated scripts have their own copies of configuration settings you choose while running the quickstart, so if you subsequently change your Pelican configuration files and run these scripts your config changes will have no effect. I found this annoying enough that I ended up removing the helper scripts and use the explicit Pelican commands instead.

Migrate Data from Wordpress

Migrate Your Posts

Migrating my existing Wordpress blog posts was only somewhat difficult and fiddly, to be honest it was easier than I expected:

Manually install the additional Pandoc universal document converter tool per the Pelican Import documentation.
Export and download your Wordpress posts and comments as an XML file. I saved this file as site.wordpress.xml
Run the pelican-import command to convert the Wordpress posts into Markdown-formatted files written to the content directory:
```
pelican-import --wpfile --dir-page \
    --output content -markup markdown \
    site.wordpress.xml
```

At this stage you will hopefully have a collection of Markdown files that correspond very closely to your Wordpress posts. Take a look at some of the files and note the metadata included at the top; fields like Title and Slug capture important information about the original posts.

Unfortunately I found the conversion process was imperfect so I needed to look closely at many of the post files to check for quirks and then fix these issues across all posts.

I also decided to use Tag instead of Category groupings for my posts, which I should have done from the very beginning in Wordpress, but changing this was straight-forward with some global find-and-replace changes in the markdown files.

Migrate Your Comments

One drawback of static sites is that there is no way to do server-side processing of comments submitted by site visitors. At least, not without defeating the point of making your site static in the first place.

If you wish to allow comments on your static site you will need to use a javascript-based commenting system, such as the one provided by Disqus.

I am using Disqus on the now-static site, as you will see if you view or leave comments below. It was quite easy to set this up: I created an account on the service, added my site details, and set the DISQUS_SITENAME setting in Pelican's publishconf.py configuration file.

Because I also wanted to keep the comments from my existing Wordpress blog site I also exported these comments by:

installing the Disqus Comment System plugin on my Wordpress site.
configuring my new Disqus site account in the plugin
exporting the comments from Wordpress to Disqus via the Export Comments button.
monitoring progress of the export on the Disqus dashboard.

The export process took a long time (days) and a few of my blog's comments didn't survive the process, which isn't a big deal for my site but might be a problem for others. Hopefully you will have better luck with this.

Generate the Static Site HTML

Now that your content is in place you can generate a static site to check how it all looks. Here is the command to convert the markdown files in content/ to static files to the html/ directory:

# Add --debug flag to see exactly what is happening
pelican content/ -o html -s pelicanconf.py --delete-output-directory

You can now look directly at the generated files in the html/ directory, or run Pelican in server mode so you can view your blog in a web browser at http://localhost:8000/ using the develop_server.sh helper script generated by the "quickstart" process, or with the following explicit commands:

# Run Python's built-in web server on html/ as a background job
cd html/
python -m SimpleHTTPServer &

Customise the Generated Site Layout

Once you can view the generated site in a browser you can customise a number of things in your pelicanconf.py configuration file to make the site work exactly the way you want.

In my case I wanted the new static site to replace the original Wordpress one without breaking all the links, because Cool URIs don't change. To do this I configured Pelican very carefully to generate URL paths matching the datestamp + slug format I used in Wordpress, for example:

# Post/Article URL links have clean paths with date stamp + slug ...
ARTICLE_URL = '{date:%Y}/{date:%m}/{date:%d}/{slug}/'
# ... while the file is index.html to be auto-served from the dir location
ARTICLE_SAVE_AS = '{date:%Y}/{date:%m}/{date:%d}/{slug}/index.html'

I also customised where RSS/Atom feeds are saved, whether pages are displayed, and other things. Check out the Pelican documentation to see what you can do.

Theme

I created my own theme for the blog. You can tell because it's very simple, and very ugly (I'm no designer).

It was pretty easy to do this by following the Pelican theming documentation and stealing ideas and code snippets from the example themes available in Pelican Themes Github repository.

One particularly cool thing I will mention is how to use the excellent Solarized colour themes for code snippets, so at least that part of my blog isn't ugly. Generate the necessary CSS files like so:

# Install Pygments and solarized style
pip install Pygments pygments-style-solarized

# Generate CSS files of light and dark solarized colours
pygmentize -S solarizeddark -f html > solarizeddark.css
pygmentize -S solarizedlight -f html > solarizedlight.css

You can then add these CSS files to your template to get solarized code highlighting in your posts.

Note: I found I had to set the overall background colour to get things to look right, here's what I did:

/* Add this line to the top of solarizedlight.css */
pre { background: #fdf6e3; }

/* Add this line to the top of solarizeddark.css */
pre { background: #002b36 }

Pelican Plugins

The final Pelican-related tweaks I made were to add some existing plugins to do useful work like generate a sitemap.xml file and make extra information available to help with site navigation, such as lists of next/previous and related posts.

To use the plugins I added the pelican-plugins codebase to my project directory:

git submodule add https://github.com/getpelican/pelican-plugins
git submodule update --init

Then configured Pelican to find and use the relevant plugins:

PLUGIN_PATH = 'pelican-plugins'
PLUGINS = ['sitemap', 'neighbors', 'related_posts']

SITEMAP = {
    'format': 'xml',
    'changefreqs': {
            'pages': 'daily',
    }
}

Site Hosting with Amazon S3 in Website Hosting Mode

With a static blog site instead of a dynamic one, it was no longer necessary to host it on my VPS server. By moving the site entirely to a static-file-serving platform I could free up my VPS to the testing-ground I had originally intended.

With it's relatively recently-added support for website hosting, the Amazon S3 storage service made an attractive choice. It's pretty simple, extremely reliable, fairly priced, and I'm very familiar with it.

The process for setting this up was made a bit complex because I had some extra requirements:

My blog should be accessible at both the james.murty.co root domain and the www.james.murty.co subdomain.
The old RSS feed endpoint at http://www.james.murty.co/feed/ shouldn't change, so that subscribers continue receiving my blog posts (like this one).

To serve the site from S3 I needed to generate the site in a publishable form, set up my S3 account to store and serve it, and tweak my domain's DNS management to hook everything up correctly.

Generate Site for Publishing

By default the Pelican quickstart process produces two configuration files, pelicanconf.py and publishconf.py. The former stores most of your settings and is intended for use when developing or testing your site, while the latter contains settings useful only for the published site.

In my case the publishconf.py file has two extra settings:

Relative URLs are disabled, to ensure in-post links work properly when read in an RSS feed reader.
Disqus is configured to handle comments.

Before you publish your site, make sure you are using the correct configuration file in the -s switch to the pelican command. For example:

pelican content/ -o html -s publishconf.py --debug

Set Up S3 Buckets

Here are the steps I took to set up S3 hosting (using the S3 console website):

Create S3 bucket names corresponding to my site's domain names: james.murty.co and www.james.murty.co
Upload the generated site files in the html/ directory to the bucket james.murty.co. I used my Synchronize application, but any S3-compatible file copying program will do.
Visit the bucket's exact Endpoint URL as shown in the Static Website Hosting properties area to make sure everything looks and works properly.

To configure the www.james.murty.co root domain bucket to redirect requests to the authoritative james.murty.co bucket:

Edit the bucket Properties and open the Static Website Hosting section.
Select Redirect all requests to another host name
Enter the root domain: james.murty.co

Because my original blog used /feed/ as the RSS feed endpoint, I also needed to make this URL path point to the new RSS feed location /feeds/rss.xml. This was easy enough to do with a custom routing rule specified with the following XML snippet in the Edit Redirection Rules section of the Enable Web Hosting properties:

<RoutingRules>
    <RoutingRule>
        <Condition>
            <KeyPrefixEquals>feed/</KeyPrefixEquals>
        </Condition>
        <Redirect>
            <ReplaceKeyWith>feeds/rss.xml</ReplaceKeyWith>
        </Redirect>
    </RoutingRule>
</RoutingRules>

Set Up DNS using Amazon Route 53

To serve websites from S3 using a root domain, such as a domain without a leading subdomain prefix like www or blog, you must use Amazon Route 53 as a DNS server for the site. Unfortunately this adds to the complexity and cost of S3-hosted websites, but I decided it is worth the hassle.

The necessary steps are covered in detail in Amazon's website hosting instructions, but in brief here's what I did to set up the domain-to-bucket mappings:

Create a hosted zone for your domain in Route 53
Note the nameservers assigned to that hosted zone (called Delegation Set in Amazon's console)
Create a record set for the hosted zone
Add an A record alias (Type: A, Alias: Yes) to the record set to map my root domain james.murty.co to the Amazon URL for the authoritative S3 bucket, in my case james.murty.co.s3-website-us-east-1.amazonaws.com.
The target bucket should be selectable in the Alias Target drop down list.
Add a CNAME record to the record set to map the www.james.murty.co subdomain to the root domain james.murty.co.

Because I was migrating existing domains, at this point I also copied some additional DNS settings from my original DNS provided over to Route 53, such as MX mail records.

Once you are happy that the Route 53 settings are correct, set up (or switch over) the nameservers at your domain name registrar to point to the Delegation Set endpoints you noted above.

Before long, your domain names should resolve to the appropriate S3 buckets and your static site should be available.

The newest version of JetS3t has been released and is now available for download: JetS3t 0.9.0

This release has been a long time coming, sorry about that. I had intended to get a release out late last year but personal factors left me short of time: relocating back to Australia from the U.S. and helping out with our newly-arrived baby boy have kept me pretty busy.

Still, the 0.9.0 version is now official. Here are some of the major new features and improvements.

HttpComponents Upgrade

A major change in the new version is JetS3t's use of the newer 4.x generation of the key HttpClient library (now more accurately called HttpComponents). The older HttpClient 3.x had been end-of-lifed and while it still worked fine, relying on the obsolete version was not a good long-term option.

I need to thank two contributors in particular for doing a lot of this work. Cheers to Gilles Gaillard and David Kocher for their invaluable work; the upgrade wouldn't have happened without them.

Note that since this upgrade involved updates to the core JetS3t HTTP code layer, there is a risk of subtle bugs in the HTTP handling with this release. I think the risk is small and some people have been using the pre-release 0.9.0 code successfully for a while now, but when you update to 0.9.0 it's worth doing a little more testing than you might normally.

Here are some of the new service-specific features.

Amazon S3

Support for multiple object deletes in a single request
Explicit support for new S3 locations: Oregon (us-west-2), South America (sa-east-1), GovCloud US West (s3-us-gov-west-1), GovCloud US West FIPS 140-2 (s3-fips-us-gov-west-1)
Support for server-side encryption, with per-object setting of algorithm and default algorithm configuration with the new s3service.server-side-encryption property
Support for Multipart Upload Part - Copy operation, to add data from existing S3 objects to multipart uploads.
Support for signing S3 requests with response-altering request parameters like response-content-type and response-content-disposition

Google Storage

Support for OAuth2 authentication mechanism, with automatic access token refresh.

Conclusion

Please grab the latest version, try it out and let me know how it goes.

Visit the JetS3t web site to download the latest packaged release, view the code samples or read the API Javadoc.

For a more complete list of changes see the Release History or Release Notes documents.

Or go to the BitBucket developer site to access the latest code, report issues in the bug tracker, and contribute to the project.

P.S. The latest release is on its way to the official Maven2 repository and should be available within a day or so.