post past :: james murty


Python code to convert UTF-8 to Latin-1

After dealing with UTF-8 to Latin-1 encoding issues repeatedly over the years, I finally put the time into crafting a somewhat complete conversion script in Python that handles things like "smart" quotes and other commonly used symbols.

This probably isn't the best place to put this code but hopefully it will help someone, most likely me at some time in the future. Unless perhaps sanity prevails and everyone starts using UTF-8 everywhere...

```
import re

def encode_utf8_to_iso88591(utf8_text):
    """
    Encode and return the given UTF-8 text as ISO-8859-1 (latin1) with
    unsupported characters replaced by '?', except for common special
    characters like smart quotes and symbols that we handle as well as we
    can.
    For example, the copyright symbol => '(c)' etc.

    If the given value is not a string it is returned unchanged.

    References:

    en.wikipedia.org/wiki/Quotation_mark_glyphs#Quotation_marks_in_Unicode
    en.wikipedia.org/wiki/Copyright_symbol
    en.wikipedia.org/wiki/Registered_trademark_symbol
    en.wikipedia.org/wiki/Sound_recording_copyright_symbol
    en.wikipedia.org/wiki/Service_mark_symbol
    en.wikipedia.org/wiki/Trademark_symbol
    """
    if not isinstance(utf8_text, basestring):
        return utf8_text
    # Replace "smart" and other single-quote like things
    utf8_text = re.sub(
        u'[\u02bc\u2018\u2019\u201a\u201b\u2039\u203a\u300c\u300d]',
        "'", utf8_text)
    # Replace "smart" and other double-quote like things
    utf8_text = re.sub(
        u'[\u00ab\u00bb\u201c\u201d\u201e\u201f\u300e\u300f]',
        '"', utf8_text)
    # Replace copyright symbol
    utf8_text = re.sub(u'[\u00a9\u24b8\u24d2]', '(c)', utf8_text)
    # Replace registered trademark symbol
    utf8_text = re.sub(u'[\u00ae\u24c7]', '(r)', utf8_text)
    # Replace sound recording copyright symbol
    utf8_text = re.sub(u'[\u2117\u24c5\u24df]', '(p)', utf8_text)
    # Replace service mark symbol
    utf8_text = re.sub(u'[\u2120]', '(sm)', utf8_text)
    # Replace trademark symbol
    utf8_text = re.sub(u'[\u2122]', '(tm)', utf8_text)
    # Replace/clobber any remaining UTF-8 characters that aren't in ISO-8859-1
    return utf8_text.encode('ISO-8859-1', 'replace')
```

Be sure to feed this function only text that has been decoded from UTF-8 (a unicode object in Python 2), not raw bytes in some other encoding.
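The function above is Python 2 code (it relies on `basestring`). For anyone on Python 3, the same idea might be sketched roughly as below. This is an illustrative port, not the original script: the name `encode_text_to_latin1` and the table-driven layout are my own assumptions, and it expects an already-decoded `str` and returns `bytes`.

```python
import re

# Hypothetical Python 3 port of the function above. The replacement
# table mirrors the original regular expressions one for one.
_REPLACEMENTS = [
    # "Smart" and other single-quote like things
    ('[\u02bc\u2018\u2019\u201a\u201b\u2039\u203a\u300c\u300d]', "'"),
    # "Smart" and other double-quote like things
    ('[\u00ab\u00bb\u201c\u201d\u201e\u201f\u300e\u300f]', '"'),
    # Copyright, registered trademark, sound recording copyright
    ('[\u00a9\u24b8\u24d2]', '(c)'),
    ('[\u00ae\u24c7]', '(r)'),
    ('[\u2117\u24c5\u24df]', '(p)'),
    # Service mark and trademark
    ('\u2120', '(sm)'),
    ('\u2122', '(tm)'),
]


def encode_text_to_latin1(text):
    """Return text encoded as ISO-8859-1 bytes.

    Anything that is not a str (including bytes) is returned unchanged.
    Characters with no ISO-8859-1 equivalent become '?'.
    """
    if not isinstance(text, str):
        return text
    for pattern, replacement in _REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text.encode('iso-8859-1', 'replace')
```

Calling `encode_text_to_latin1('\u201cHi\u201d \u00a9 2011')` gives `b'"Hi" (c) 2011'`, while a character such as `'\u4e2d'` that has no Latin-1 counterpart comes back as `b'?'`.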

Tags: Coding Python


Vimdiff for three-way merges in Mercurial

2011-05-06
I've been using vim as my sole code editor for a couple of years now at work. I find that the more I use it and the more I learn (there will always be more to learn about vim) the happier I am with this fantastic tool.

After working through some hairy code merges recently I realised I needed a better approach than relying on inline diffs, where merge conflicts are represented in a single file like so:
```
<<<<<<< incoming
Someone else's code
=======
My code
>>>>>>> outgoing
```
Inline diffs are great for resolving relatively simple conflicts but can quickly become confusing if conflicts span many lines or there are significant differences between files.

So I configured Mercurial to open vimdiff upon merge conflicts, but the default three-paned vertical-split view wasn't quite what I wanted. It didn't include the base version of the conflicted file, and the default window layout made it hard to see exactly what was going on.

A little research turned up a blog post showing how to better configure vimdiff when using git. We use Mercurial at work so I adapted this hint to work with Mercurial's MergeProgram configuration:
```
# Three-way merge with vimdiff (shows result in bottom window)
# Based on http://mercurial.selenic.com/wiki/MergingWithVim
# and http://www.toofishes.net/blog/three-way-merging-git-using-vim/

[ui]
merge = vimdiff

[merge-tools]
vimdiff.executable = vim
vimdiff.args = -d -c "wincmd J" "$output" "$local" "$other" "$base"
```
This will show the merged file in a large window at the bottom with the three pre-merge files of interest -- local changes, incoming/other changes, the base file -- in a three-pane vertical split at the top. With this set-up and some practice using the vimdiff commands, complex conflicting merges are much easier to deal with.
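To make the `vimdiff.args` line a little less opaque, here is a small Python sketch of the command that results once the `$output`, `$local`, `$other` and `$base` placeholders have been replaced with file paths. The function name and the file names are hypothetical, purely for illustration; Mercurial does the substitution itself.

```python
import shlex


def build_vimdiff_command(output, local, other, base, executable='vim'):
    """Return the argument list matching the vimdiff.args setting above.

    -d opens all four files in diff mode, with the first file (the merge
    output) in the focused window. The "wincmd J" command then moves that
    focused window to the very bottom at full width, leaving the other
    three files side by side above it.
    """
    return [executable, '-d', '-c', 'wincmd J', output, local, other, base]


# Hypothetical file names standing in for the paths Mercurial supplies.
command = build_vimdiff_command(
    'app.py', 'app.py.orig', 'app.py.other', 'app.py.base')
print(shlex.join(command))
```

Running the printed command by hand on four versions of a file is a handy way to experiment with the layout outside of a real merge.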

If you use vim or want to, be sure to check out the excellent Vim Casts video podcasts to learn (or re-learn) how to get the most out of it. Some recent episodes discuss vimdiff in the context of a git workflow but are still full of useful pointers for those not using git.
Tags: Coding Tips

JetS3t 0.8.1 in the wild

2011-04-10
The newest version of JetS3t has been released and is now roaming free. Meet 0.8.1.

This release has been a long time coming, mainly due to my reluctance to finish the documentation. But it's finally here and comes with some great new features.

Goodies
- Support for Amazon S3's multipart uploads, both at the API level and with a MultipartUtils tool that makes it very easy to upload files in multiple parts.
- Support for Amazon S3's website configuration, which makes an S3 bucket act more like a traditional website. I'm using this new feature to great effect on JetS3t's new home domain www.jets3t.org.
  The new domain is served from S3 like the old jets3t.s3.amazonaws.com version, but it works much better if you visit places like the root directory (versus this) or a missing page (versus that).
  Now the new domain just needs some Google-juice, so please update your links to point to www.jets3t.org.
- Massive improvements to the Synchronize application to reduce its memory footprint when syncing large directory hierarchies and improve its efficiency when comparing local and remote files.
  Synchronize also now supports multipart uploads, so you can back up files larger than 5GB and improve reliability by uploading large files in smaller pieces (see the upload.max-part-size configuration setting in synchronize.properties).
- Support for custom (non-S3) distribution origins in the CloudFront API. Note that these service changes are not backwards-compatible.
- A number of bug fixes and other tweaks.
See the full list of changes in the Release History or Release Notes documents.
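
As a rough illustration of what "uploading large files in smaller pieces" means for the multipart support mentioned above, the Python sketch below divides a file of a given size into consecutive byte ranges no larger than a maximum part size. It is not JetS3t's implementation (JetS3t is Java); the function name is hypothetical and only the arithmetic is shown.

```python
def split_into_parts(file_size, max_part_size):
    """Return a list of (offset, length) pairs covering file_size bytes.

    Every part is max_part_size bytes long except possibly the last,
    which holds whatever remains.
    """
    if file_size < 0 or max_part_size <= 0:
        raise ValueError('file_size must be >= 0 and max_part_size > 0')
    parts = []
    offset = 0
    while offset < file_size:
        length = min(max_part_size, file_size - offset)
        parts.append((offset, length))
        offset += length
    return parts
```

For example, `split_into_parts(10, 4)` returns `[(0, 4), (4, 4), (8, 2)]`. If one piece fails to upload, only that range needs to be sent again rather than the whole file.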

Yes Please

Visit the JetS3t web site to download the latest packaged release, view the latest code samples or read the API Javadoc.

Or go to the BitBucket developer site to access the latest code, report issues in the bug tracker, and contribute to the project.

P.S. The latest release is on its way to the official Maven2 repository and should be available within a day or so.
Tags: AWS Cloud Computing Java JetS3t

JetS3t support for S3 Website Hosting

2011-02-17
I have just released code for JetS3t that adds API-level support for Amazon S3's new Website Hosting feature.

With a Website Hosting configuration applied to an S3 bucket, the bucket can serve static content but will also act in a somewhat dynamic way to serve index and error documents if someone visits URL paths that don't match a real file.

This makes it much more feasible to serve static website content from S3 without having to worry about users receiving strange XML error messages if they venture off the beaten track or try to access partial URL paths. In particular, it allows you to serve an index.html file from the root of a bucket, just like a real web server.
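
The behaviour described above can be modelled with a few lines of Python. This is a deliberately simplified toy, not how S3 or JetS3t actually implement it: the function name and the `error.html` document name are assumptions, and a bucket is represented as a plain set of object keys.

```python
def resolve_website_request(path, keys,
                            index_document='index.html',
                            error_document='error.html'):
    """Return (status, key) for a request path against a set of keys.

    Paths ending in '/' (including the root) are mapped to the index
    document in that "directory". Anything that does not match a real
    key is answered with the error document and a 404 status.
    """
    key = path.lstrip('/')
    if key == '' or key.endswith('/'):
        key += index_document
    if key in keys:
        return 200, key
    return 404, error_document


# Hypothetical bucket contents.
bucket_keys = {'index.html', 'error.html', 'about.html', 'docs/index.html'}
```

With these keys, a request for `/` resolves to `(200, 'index.html')`, `/docs/` to `(200, 'docs/index.html')`, and `/missing.html` to `(404, 'error.html')` instead of a raw XML error.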

To find out more read these:
To try out the feature in JetS3t, grab the latest development code and read the example test code to see how it works.
Tags: JetS3t

Work-around for Mac OS X python package install error -- "lipo: can't figure out the architecture type"

2011-01-29
This is just some quick documentation of a couple of work-arounds I needed to install a Python package on Mac OS X 10.6.6 (Snow Leopard). This solution could work in a number of cases, not just for this one package, so I thought I should write it down.

I was trying to install the nose-cov unit test coverage package, which depends on coverage, using the standard pip/easy_install commands without much luck. The compilation phase spat out a stream of error lines followed by the final message: failed with error code 1.

Mac OS X 10.4 Support

The first issue was that the coverage package required the optional "Mac OS X 10.4 Support" component of Xcode, as hinted at by this message I found in the middle of the error log:
```
Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
Please check your Xcode installation
```
I did not have the 10.4 components installed on my 10.6 machine despite having the rest of the (presumably 10.5+) components. This is easy to fix by re-running the Xcode installation from your OS X disk and choosing the extra option.

GCC Version 4.0 FTW

After installing the 10.4 pieces the installation still failed, this time with the following message near the end of the error log:
```
lipo: can't figure out the architecture type of: /some/file

error: command 'gcc' failed with exit status 1
```
[Edit: 2011-02-25]

A much cleaner way to make your system use GCC version 4.0 instead of a later version is to use the CC environment variable. Prefix your pip/easy_install command with CC=/path/to/desired/gcc-version like so:
```
CC=/usr/bin/gcc-4.0 pip install SOMETHING
```
Try the CC environment variable approach above before you attempt the hack below.
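
If you drive installs from a script rather than the shell, the same CC trick might look something like this in Python. It is only a sketch: the function names are hypothetical, and it assumes pip is available to the interpreter running the script.

```python
import os
import subprocess
import sys


def build_env_with_compiler(cc_path, base_env=None):
    """Return a copy of the environment with CC set to cc_path.

    The copy means the current process environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env['CC'] = cc_path
    return env


def pip_install_with_compiler(package, cc_path):
    """Run "pip install package" with CC pointing at the chosen compiler."""
    return subprocess.call(
        [sys.executable, '-m', 'pip', 'install', package],
        env=build_env_with_compiler(cc_path))
```

Something like `pip_install_with_compiler('nose-cov', '/usr/bin/gcc-4.0')` would then be the scripted equivalent of the one-liner above, returning pip's exit status.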

Google led me to this post, which provided the solution, though it feels like a nasty hack.

My version of OS X (10.6.6) includes two versions of the gcc compiler executable in /usr/bin:
```
$ ls -al /usr/bin | grep gcc
lrwxr-xr-x     1 root   wheel           7 29 Jan 10:01 cc -> gcc-4.2
lrwxr-xr-x     1 root   wheel           7 29 Jan 10:01 gcc -> gcc-4.2
-rwxr-xr-x     1 root   wheel       97392 18 May  2009 gcc-4.0
-rwxr-xr-x     1 root   wheel      166128 18 May  2009 gcc-4.2
. . .
```
The gcc symlink points to gcc-4.2 but only the gcc-4.0 version successfully compiles the package. I temporarily moved the original gcc symlink out of the way and created a new one pointing to the 4.0 version, like so:
```
$ cd /usr/bin
$ sudo mv gcc gcc_orig
$ sudo ln -s gcc-4.0 gcc
```
I was then able to install the nose-cov package using pip and it ran just fine. Great!

Finally I replaced the original symlink, because forgetting to do so would almost certainly cause something else to break sooner or later:
```
$ cd /usr/bin
$ sudo rm gcc
$ sudo mv gcc_orig gcc
```
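Because forgetting the restore step is the real danger with this hack, one way to make it harder to forget is to wrap the swap in a context manager that always puts the original link back. The Python sketch below is an illustration under that assumption, with a hypothetical function name; pointing it at /usr/bin would still need root privileges, so try it on a scratch directory first.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_symlink(link_path, temporary_target):
    """Point link_path at temporary_target, restoring the original on exit.

    Mirrors the manual steps above: move the existing link aside, create
    a replacement, and put the original back even if an error occurs.
    """
    backup_path = link_path + '_orig'
    os.rename(link_path, backup_path)  # renames the link itself
    try:
        os.symlink(temporary_target, link_path)
        yield link_path
    finally:
        if os.path.lexists(link_path):
            os.remove(link_path)  # removes only the temporary link
        os.rename(backup_path, link_path)
```

Used as `with temporary_symlink('/usr/bin/gcc', 'gcc-4.0'): ...`, the package install would go inside the `with` block and the original gcc link would come back as soon as the block ends.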
So in the end my /usr/bin directory looks exactly the same as it used to and the package is installed. Time to go do some real work.
Tags: Python Tips
