Haui's bytes

news, diary, journal, whatever

Encoding Hell

Some years ago, when I first started using Linux, my system’s locale was set to ISO8859-15. Over the years, I switched to UTF-8, of course. Though I now tend to use proper filenames for all my files, I still come across relics from the old days once in a while - files whose names I littered with crappy[1] or even crappier[2] characters. In my defence, I have to say that lots of these files carry names I didn’t choose myself, because they were auto-generated by CD rippers or other software. Some files even date back to the time when I was exclusively using Windows and didn’t care about filenames or encodings at all.
Using the command from my posting about rename can usually fix all these filenames, but this might not always be what you want - a folder named glückliche_kühe is renamed to gl_ckliche_k_he, which is not a perfect solution. What you might really want is to convert the filename from one encoding to another, and good for you, somebody already did all the work and created a nifty little program called convmv, which supports 124 different encodings. The syntax is quite easy:

convmv -f iso8859-15 -t utf-8 *

This shows which filenames in the current directory would be converted from ISO8859-15 to UTF-8. By default, convmv runs in test mode; to actually perform the conversion, you have to explicitly add the --notest option to the command line.

That’s the easy way, but let’s assume you want to work with the glückliche_kühe folder without re-encoding the filename. Be aware that some graphical file managers may not handle filenames with wrong encodings correctly. On my system, krusader couldn’t open the ISO8859-15-encoded test folder, while gentoo (yes, this is indeed a file manager) only displayed a warning. Additionally, there are situations where no graphical environment is available at all.

So, the far more interesting question is how to work with these files in a shell environment. The naive approach cd glückliche_kühe fails because the ISO8859-15 ü is a different byte sequence than the UTF-8 ü - our UTF-8 environment will correctly respond that there’s no such folder. A simple ls shows a question mark for every crappier character in the filename, and that’s not exactly useful either, since we can’t uniquely identify the names this way. How would you change into glückliche_kühe if there’s also a folder called gl_ckliche_k_he? Typing cd gl?ckliche_k?he is ambiguous, since the question mark is treated as a special character by Bash and matches any single character. Depending on the situation, this might or might not work, as Bash expands gl?ckliche_k?he to a list of all matching filenames.
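To see the problem in isolation, here is a minimal, reproducible sketch - the /tmp/q-demo path and the file name are invented for the demonstration, and GNU ls is assumed:

```shell
# Create a file whose name contains the raw ISO8859-15 byte for
# u-umlaut (0xFC, octal 374), which is invalid in UTF-8.
mkdir -p /tmp/q-demo && cd /tmp/q-demo
touch -- "$(printf 'gl\374ckliche_datei')"
# -q makes ls replace every nongraphic character with a '?'
ls -q
```

Newer GNU ls versions may instead shell-quote such names when writing to a terminal, which at least makes them unambiguous.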
One solution is to run ls with the -b option - this way, we instruct ls to print unprintable characters as octal escapes:

user@localhost /tmp/test $ ls -b
gl\374ckliche_k\374he

This gives us something to work with. echo can interpret these escape sequences, and Bash’s command substitution offers a way to use echo’s output as a value.

user@localhost /tmp/test $ cd "$(echo -e "gl\0374ckliche_k\0374he")"
user@localhost /tmp/test/glückliche_kühe $ pwd
/tmp/test/glückliche_kühe

There are three things you should note here. First of all, in order to mark the escape sequences as octal numbers, you need to add a leading zero in the way I did in this example. Secondly, the -e parameter is required to tell echo to interpret escape sequences rather than printing the literal characters. The last thing is not exactly related to the encoding problem, but always worth mentioning: the quotes are there for a reason!
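As a side note, printf offers a slightly more portable alternative to echo -e: it interprets octal escapes in its format string without any option, and without the leading zero. A sketch, reusing the example folder name in an invented /tmp/printf-demo directory:

```shell
# printf interprets \374 directly in its format string, so no -e flag
# and no leading zero are needed. Recreating the example folder first:
mkdir -p /tmp/printf-demo && cd /tmp/printf-demo
mkdir -- "$(printf 'gl\374ckliche_k\374he')"
cd "$(printf 'gl\374ckliche_k\374he')"
pwd
```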

So, now the encoding hell shouldn’t look so scary anymore - at least not with respect to filenames. ;)

Oh, and by the way, if you just want to check if you got any wrongly encoded filenames, this one-liner could help:

find . -print0 | xargs -0 ls -db | grep -E '\\[0-9]{3}'
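An alternative sketch that avoids the ls/grep round trip: under the C locale, a negated byte range in find’s -name pattern matches any byte outside printable ASCII. The /tmp/enc-check directory and the file names are invented for the demonstration:

```shell
# Setup: one clean name, one name containing a raw 0xFC byte.
mkdir -p /tmp/enc-check && cd /tmp/enc-check
touch -- plain_file "$(printf 'gl\374ckliche_datei')"
# In the C locale, [! -~] matches any byte outside the printable
# ASCII range (space through tilde), i.e. suspicious filename bytes.
LC_ALL=C find . -name '*[! -~]*'
```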


[1] every character c that is not in [a-zA-Z0-9._-]+
[2] every character c where utf8(c) != iso8859-15(c)

Rebuilding Debian packages

Most of the software installed via APT, Debian’s package management system, runs perfectly fine without any reason to complain. In some rare cases however, you might find yourself unsatisfied with a package and have the itch to recompile it. For me, Debian’s package for the Vim text editor is one of these cases - the package available in the repositories was compiled without support for the Perl interface. Of course, one could just visit vim.org, download the latest sources for Vim, check the build requirements and install the missing libraries manually, call ./configure with the correct parameters, compile the program and finally install it. Apart from being a quite cumbersome procedure, APT would not include this version of Vim in its database. So, there has to be a better way to do this, and indeed, there is one.

First of all, two packages and their dependencies are required for the next steps - build-essential and devscripts. They should be available in the repositories and can be installed as usual:

su root -c "apt-get install build-essential devscripts"

Once this is done, we’ll change to our developer directory and download the sources for Vim as well as the build dependencies.

mkdir -p ~/devel
cd ~/devel
apt-get source vim
su root -c "apt-get build-dep vim"

When this is finished, a new directory ~/devel/vim-*VERSION*/ should contain the sources for Vim as well as Debian-specific patches and configuration. Now, one could make all kinds of changes to Vim’s source code, but we just want to modify a configuration parameter. This is done by editing the debian/rules file, which contains the default configure flags for the package. The flags defined here are passed to the configure script during the build process. The Perl interface can be enabled by swapping a parameter from --disable-perlinterp to --enable-perlinterp. Thereafter, you just need to invoke the following command and wait until the compilation process is finished:

debuild -us -uc
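By the way, the flag swap in debian/rules can also be done non-interactively with sed. A sketch on a mock rules file - the /tmp/demo path and the file’s contents are made up, the real file lives in ~/devel/vim-*VERSION*/debian/rules:

```shell
# Create a mock debian/rules containing the flag we want to flip.
mkdir -p /tmp/demo/debian
printf 'OPTFLAGS = --with-features=huge --disable-perlinterp\n' > /tmp/demo/debian/rules
# Swap --disable-perlinterp for --enable-perlinterp in place (GNU sed).
sed -i 's/--disable-perlinterp/--enable-perlinterp/' /tmp/demo/debian/rules
grep -- '--enable-perlinterp' /tmp/demo/debian/rules
```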

If no errors occurred, you’ll find several *.deb files inside your ~/devel directory. To install Vim, just pick vim-*VERSION*_*ARCH*.deb and install it via dpkg, e.g. on my box:

su root -c "dpkg -i vim_7.3.547-4_amd64.deb"

vim --version should now show +perl instead of -perl, and :perldo is finally available. ;)

Delete all files except one

A couple of days ago, I was asked if I knew an easy way to delete all but one file in a directory. If you didn’t already guess it from this blog entry’s title, there is a simple way - or, to be more precise, there are several ways. The first one is quite straightforward and uses the find command:

find . -not -name do_not_delete_me -delete

This works recursively and also preserves files named do_not_delete_me contained in sub-folders of the current directory:

user@host /tmp/test $ ls -R
.:
a  b  c  do_not_delete_me  foo

./a:
foo

./b:
bar  do_not_delete_me

./c:
baz
user@host /tmp/test $ find . -not -name do_not_delete_me -delete
find: cannot delete `./b': Directory not empty
user@host /tmp/test $ ls -R
.:
b  do_not_delete_me

./b:
do_not_delete_me

As you can see, find tries to delete the folder b but fails because the folder is not empty. If you don’t care about files in sub-directories, it gets a bit more complicated with find:

find . -mindepth 1 -maxdepth 1 -not -name do_not_delete_me -exec rm -rf -- {} +

The -mindepth/-maxdepth parameters tell find to ignore sub-directories, because we’re not interested in their contents. This should also save some execution time - especially if the directory hierarchy is really deep.
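To make the difference to the recursive variant concrete, here is a small sketch (directory and file names invented). Note that a sub-directory at depth 1 is removed wholesale, including any do_not_delete_me inside it:

```shell
# Setup: protected file at the top level, plus a sub-directory that
# also contains a file named do_not_delete_me.
mkdir -p /tmp/del-demo/sub
cd /tmp/del-demo
touch do_not_delete_me foo sub/do_not_delete_me sub/bar
find . -mindepth 1 -maxdepth 1 -not -name do_not_delete_me -exec rm -rf -- {} +
ls   # only do_not_delete_me survives; sub/ was deleted entirely
```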

While this works well, Bash’s pattern matching offers an easier solution for this:

rm -rf !(do_not_delete_me)

As the manpage explains, the text enclosed by the brackets is considered to be a pattern list, i.e. constructs like !(*.jpg|*.png) are perfectly valid. Note that these extended patterns only work when the extglob shell option is enabled (shopt -s extglob) - it may already be set in your interactive session, but not necessarily in scripts. If you don’t care about files in sub-directories, this might be the preferred way - it’s shorter and maybe even faster than the solutions using find.
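A minimal sketch of the pattern in action (invented directory name); since a script can’t rely on extglob being enabled, bash -O extglob turns it on for a one-off invocation:

```shell
mkdir -p /tmp/extglob-demo && cd /tmp/extglob-demo
touch do_not_delete_me a b c
# -O extglob enables the option before bash parses the pattern
bash -O extglob -c 'cd /tmp/extglob-demo && rm -rf -- !(do_not_delete_me)'
ls   # only do_not_delete_me is left
```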

No matter which solution you choose, refrain from error-prone constructs like rm -rf `ls | grep -v do_not_delete_me`.

New design

A few days ago, I decided to give my blog a new look. Consequently, I wanted to upgrade nanoblogger from version 3.3 to at least 3.4 or even 3.5. Thinking about the upgrade procedure, however, gave me a headache - there are just too many small fixes and workarounds I tinkered into nanoblogger’s source code. After I had spent some time searching for alternatives, I eventually ended up with Tinkerer, a Python-based static blog compiler. Apart from being actively developed, Tinkerer has two advantages over nanoblogger I especially want to emphasize.

First of all, Tinkerer is fast. Completely rebuilding my blog takes just about 2 seconds - nanoblogger needs over 3 minutes for the same task. Secondly, Tinkerer offers source code highlighting for many programming and markup languages by using Pygments.

Additionally, transferring the old blog postings from nanoblogger was easier than expected. I wrote a small shell script that converts the *.txt files inside nanoblogger’s data directory into a format known by Tinkerer. Of course, this just automates some steps of the process and can’t spare you the work of manually fixing errors and warnings Tinkerer might report. Still, it saved me a lot of work.

On the downside, I already stumbled upon some bugs - if you plan to use Tinkerer for your own blog and repeatedly run into unexplainable UnicodeErrors, knowing about them might save you a lot of trouble.

Hardly known

Most Linux and some Ubuntu users know a certain set of command-line programs for interactive shell usage. Most importantly, there are the standard tools from the GNU core utilities which cover many aspects of everyday work. You’ll find these tools preinstalled on almost every Linux-based desktop or server system (embedded systems often tend to use all-in-one tools like BusyBox as a replacement for the core utilities). Additionally, some of the commonly used tools like grep or strings are found in separate packages, which are also available on most systems.
