Sunday, 3 April 2011

Shell scripts to handle filenames with spaces

Posix/Unix/Linux was not designed to handle filenames with spaces in them. However, Linux and Windows filesystems allow them, along with many other "funny" characters. This has been brewing as a topic in Linux Journal recently, and Dave Taylor has just written an article on it in the February 2011 issue. He spots files with spaces in them by the shell pattern "*\ *" and then mucks around changing spaces into other things. It's good stuff, but overkill for some cases.

For a long time now, I've been writing scripts that handle filenames both with and without spaces. You've got to know your shell and how Posix commands work! Commonly, I want to list files in a directory and do things to them whether or not they have spaces in them.

The shell breaks the result of an unquoted expansion such as $(ls) or $files into "words" based on whitespace (spaces, tabs, newlines). This stuffs up a filename if it has spaces in it, since the name then gets split into separate words. But commands such as "ls" (when not directed to a terminal) list each filename on a separate line. So if you have something that distinguishes between spaces/tabs and newlines then you can get complete filenames with or without spaces.
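A quick way to see the difference between splitting on whitespace and splitting on lines is to count both for the same directory. This is only a sketch: the scratch directory and the file name "two words.txt" are made up for the illustration, and it assumes mktemp is available.

```shell
#!/bin/sh
# Hypothetical scratch directory holding one file whose name contains a space.
dir=$(mktemp -d)
touch "$dir/two words.txt"

# Unquoted command substitution: the shell splits the output on any whitespace.
words=0
for w in $(ls "$dir")
do
    words=$((words + 1))
done

# Counting lines instead: ls writes one filename per line when piped.
lines=$(($(ls "$dir" | wc -l)))

echo "words=$words lines=$lines"   # prints: words=2 lines=1
rm -rf "$dir"
```

One file, but two "words": that is the whole problem in miniature.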

The shell command "read" reads a line and breaks it into words. So
 read a b c
with input
 a line of text
will assign
 a="a"
 b="line"
 c="of text"
whereas
 read line
will read all of the line into the variable. It stops reading at end-of-line, so it makes exactly the distinction I need.
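The behaviour of "read" described above can be tried out directly. In this sketch the function names split_three and whole_line are invented for the demonstration; the -r option and the empty IFS are my additions, there to stop read from interpreting backslashes and trimming leading or trailing blanks.

```shell
#!/bin/sh
# split_three LINE: read LINE into three variables; the last takes the remainder.
split_three() {
    printf '%s\n' "$1" | {
        read -r a b c
        printf 'a=%s b=%s c=%s\n' "$a" "$b" "$c"
    }
}

# whole_line LINE: read LINE into a single variable, unsplit.
whole_line() {
    printf '%s\n' "$1" | {
        IFS= read -r line
        printf 'line=%s\n' "$line"
    }
}

split_three "a line of text"   # prints: a=a b=line c=of text
whole_line "a line of text"    # prints: line=a line of text
```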

But how to use it? Well, the shell while loop is just another command, and as such can have its I/O redirected. So I do this:
 ls |
 while IFS= read -r filename   # IFS= and -r keep the name exactly as listed
 do
     # process filename, e.g.
     cp "$filename" ~/backups
 done
This works for filenames with or without spaces. Just don't forget the quotes while processing the file!
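Here is the same loop wrapped up so that it can be run safely in a scratch area. Treat it as a sketch under a few assumptions: copy_each is a name I have made up, the two directories come from mktemp, and the regular-file check is an extra so that cp is not handed a subdirectory.

```shell
#!/bin/sh
# copy_each SRC DEST: copy every regular file in SRC to DEST,
# reading the names one line at a time so that spaces survive.
copy_each() {
    ls "$1" |
    while IFS= read -r filename
    do
        if [ -f "$1/$filename" ]
        then
            cp "$1/$filename" "$2"
        fi
    done
}

# Hypothetical test data: one ordinary name, one with spaces.
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/plain.txt" "$src/my holiday photo.jpg"

copy_each "$src" "$dest"
ls "$dest"
```

Both files should turn up in the destination directory, spaces and all.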

Of course, this doesn't work for all uses; a filename containing a newline will still be split, for example. For those cases, note the find and xargs combination that Dave also commented on:
 find . -print0 | xargs -0 ...
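To make that pipeline concrete, here is one possible way to fill in the command, copying files whose names contain a space or even a newline. It is a sketch only: the directories and file names are invented, and it assumes a find and xargs that support -print0, -0 and -I (the GNU and BSD versions do).

```shell
#!/bin/sh
# Hypothetical test data, including a name with an embedded newline.
src=$(mktemp -d)
dest=$(mktemp -d)
nl_name=$(printf 'odd\nname.txt')
touch "$src/two words.txt" "$src/$nl_name"

# find ends each name with a NUL byte and xargs -0 splits on NUL only,
# so spaces and newlines inside a name are passed through untouched.
find "$src" -type f -print0 | xargs -0 -I{} cp {} "$dest"

ls "$dest"
```

The NUL byte is the one character that cannot appear in a filename, which is why it makes a safe separator where a newline does not.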

Saturday, 2 April 2011

HTML 5 has a serious flaw

HTML 5 is long overdue, after the WWW Consortium's failed attempt at convincing us to use XHTML. It has many useful features, but one glaring fault: it has discarded version control. I've been writing and designing distributed systems for over twenty years, and one thing has become very clear: if you don't include version numbers in your protocol then you are asking for trouble.

The document type has been simplified. It used to have horrible things like
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 

But now it has been simplified overmuch to just

<!DOCTYPE html>
There are some people who like this (e.g. John Resig). But I don't. HTML will continue to evolve - there will be new tags and attributes, and the existing behaviour will be clarified or changed. But without a version number, how will a browser (or any user agent) be able to work out which version it is dealing with? And how can a content generator signal which version it is creating? Already there is considerable confusion about which bits of HTML 5 are supported by different browsers.

The simple answer is perhaps that this allows vendors free rein to do what they want - and we saw what a mess that caused before HTML 4 put a standard in the ground. There is still time for the WWW Consortium to fix at least this one error before it is too late.