Monday, 24 October 2011

How to import content from Blosxom to Google Blogger

I have decided to give Google's Blogger service a try. I am an old dinosaur, used to the command line and tired of mice and menus, but since there is GoogleCL I am not scared. The problem is my old posts: GoogleCL offers no way to post backdated blog entries.

Fortunately, Blogger has export/import features: one can back up the blog content and/or upload it back to Google. In particular, to import posts (and comments) into a blog, one has to click Import Blog in the blog's Settings, then select the appropriate file and fill out the word verification beneath. The Blogger data format is Atom, so to import my old Blosxom entries successfully I have to convert them to Atom.

I made a few test entries and exported them to check what the data looks like. Pretty weird, but most of the content is irrelevant as it concerns formatting (CSS styles and the like are included). Also, since I had comments disabled on my previous blog, the problem is further simplified.

I consulted the Atom schema and tried the following:


<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
xmlns:georss="http://www.georss.org/georss"
xmlns:gd="http://schemas.google.com/g/2005"
xmlns:thr="http://purl.org/syndication/thread/1.0">

<id>tag:blogger.com,1999:blog-1928418645181504144.archive</id>
<updated>2011-10-22T12:34:14.746-07:00</updated>
<title type='text'>pinkaccordions.blogspot.com</title>
<generator version='7.00' uri='http://www.blogger.com'>Blogger</generator>

The meaning of the elements should be obvious. The last element (generator) is required by the Blogger import facility; without it an error message is returned.

According to the schema, the feed element contains zero or more entry elements:


<entry>
<id>ID</id>
<published>DATE</published>
<updated>DATE</updated>
<category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/blogger/2008/kind#post"/>

<!-- tags, each as value of attribute `term' of element category -->
<category scheme='http://www.blogger.com/atom/ns#' term='tag1'/>
<category scheme='http://www.blogger.com/atom/ns#' term='tag2'/>
<title type='text'>title</title>
<content type='html'>post content ... </content>
</entry>

A final </feed> closes the file so that the XML is well-formed.
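Since the post body travels inside content type='html' as escaped markup, and tags end up in the term attribute, any value dropped into this template has to be XML-escaped or the file stops being well-formed. Below is a minimal sketch of such a helper. The name xml_escape and the sample values are made up for illustration, and the sketch assumes its input has not been escaped for the feed already (a title copied out of an XHTML file normally is, so it should not go through the helper a second time).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the five characters that are special in XML.
# The ampersand is handled first; otherwise the entities produced by the
# later substitutions would themselves be escaped again.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&apos;/g;
    return $text;
}

# Sample values, invented for this sketch.
my $body = '<p>a &amp; b</p>';
my $tag  = "rock'n'roll";

print "<category scheme='http://www.blogger.com/atom/ns#' term='",
      xml_escape($tag), "'/>\n";
print "<content type='html'>", xml_escape($body), "</content>\n";
```

Escaping the apostrophe matters only for values placed inside single-quoted attributes, such as tags; for element content the first three substitutions would be enough.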

I assumed that the only important feature of the id element is that its content is unique. I decided to use the MD5 sum of the post content as the ID, which makes it unique for all practical purposes.
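The idea can be sketched in a few lines. The helper name post_id and the sample bodies are assumptions made for this example; the tag:blogger.com,1999:post- prefix is the one seen in the exported test feed. Note that two posts with byte-identical bodies would get the same ID, so the sketch reports duplicates instead of relying on uniqueness blindly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# Hypothetical helper: derive an entry id from the post body.
# md5_hex() works on bytes, so the text is encoded to UTF-8 first;
# a string containing wide characters would otherwise make it die.
sub post_id {
    my ($content) = @_;
    return 'tag:blogger.com,1999:post-' . md5_hex(encode('UTF-8', $content));
}

my %seen;
for my $content ('first post body', 'second post body', 'first post body') {
    my $id = post_id($content);
    # Identical bodies give identical ids, so repeated ids are reported.
    warn "duplicate id $id\n" if $seen{$id}++;
    print "$id\n";
}
```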

Finally, my old Blosxom-compatible entries look similar to the example below:


<?xml version='1.0' encoding='iso-8859-2' ?>
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Przed finałami RWC 2011</title>
<!-- Tags: rwc2011,rugby,francja,polsat-->
</head><body><!-- ##Published : 2011-10-20T07:20:26CEST ##-->


<p>W RWC 2011 zostały już tylko dwa mecze: jutro (piątek), o trzecie miejsce oraz

So it was extremely easy to extract the title, tags and publication date and to format an Atom-compliant XML file with the following Perl script:


#!/usr/bin/perl
# Variant of Blosxom to Blogger conversion
# 2011/10 t.przechlewski
#
use Digest::MD5 qw(md5_hex);

print '<?xml version="1.0" encoding="UTF-8"?>
<!-- id, title and updated are required in feed/entry elements, the rest is optional -->
<!-- this looks like the minimal markup -->
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
xmlns:georss="http://www.georss.org/georss"
xmlns:gd="http://schemas.google.com/g/2005"
xmlns:thr="http://purl.org/syndication/thread/1.0">';

print "<id>tag:blogger.com,1999:blog-1928418645181504144.archive</id>";
print "<updated>2011-10-22T12:34:14.746-07:00</updated>";
print "<title type='text'>pinkaccordions.blogspot.com</title>";
print "<generator version='7.00' uri='http://www.blogger.com'>Blogger</generator>\n";

foreach $post_file (@ARGV) {

    ## reset the per-file state ($tags included, so tags never leak between posts):
    my ($post_title, $post_content, $md5sum, $published, $tags) = ('') x 5;
    my @post_kws = ();
    my ($body, $in_pre) = (0, 0);
    my $rel_URLs = 0;

    ## `or' instead of `||': with `||' the die could never be reached
    open(POST, '<', $post_file) or die "*** cannot open $post_file: $! ***\n";
    print STDERR "\n$post_file opened!\n";

    while (<POST>) {
        chomp();

        if (/<title>(.+)<\/title>/) { $post_title = $1; next; }
        if (/<!--[ \t]*Tags:[ \t]*(.+)[ \t]*-->/) { $tags = $1; next; }

        if (/<\/head><body>/) {
            $body = 1;
            ## </head><body><!-- ##Published : 2011-10-20T07:20:26CEST ##-->
            if (/##Published[ \t]+:[ \t]+([0-9T\-\:]+).+##/) { $published = $1; }
            print STDERR "Published: $published\n";
            next;
        }

        if (/<\/body><\/html>/) { $body = 0; next }

        if ($body) {
            ## check that URLs are portable:
            if (/src[ \t]*=/) {
                if (/pinkaccordions.homelinux.org/ || !(/http:\/\//)) { $rel_URLs = 1; }
            }
            ## the content of pre must not be joined into a single line:
            if (/<pre>/) { $in_pre = 1; $post_content .= "$_\n"; next; }
            if (/<\/pre>/) { $in_pre = 0; $post_content .= "$_ "; next; }
            if ($in_pre) { $post_content .= "$_\n"; }
            else {
                $post_content .= "$_ "; # ** the trailing space is required **
            }
        }
    }

    ### ### ###

    if ($published eq '') {
        warn "*** something wrong with: $post_file. Not published? Skipping....\n";
        close(POST);
        next;
    }
    if ($tags eq '' || $post_title eq '') {
        die "*** something wrong with: $post_file (tags: $tags/title: $post_title)\n";
    }
    if ($rel_URLs) { die "*** suspicious relative URIs: $post_file\n"; }

    ## the post body goes into <content type='html'> as escaped markup:
    $post_content =~ s/\&/&amp;/g;
    $post_content =~ s/</&lt;/g;
    $post_content =~ s/>/&gt;/g;

    print STDERR "Title: $post_title Tags: $tags\n";

    @post_kws = split /,/, $tags;
    $md5sum = md5_hex($post_content);
    print STDERR "MD5sum: $md5sum\n";

    print "<entry>";
    print "<id>tag:blogger.com,1999:post-$md5sum</id>";
    print "<published>$published</published>";
    print "<updated>$published</updated>";
    print '<category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/blogger/2008/kind#post"/>';

    ## tags:
    foreach $k (@post_kws) { print "<category scheme='http://www.blogger.com/atom/ns#' term='$k'/>"; }

    print "<title type='text'>$post_title</title>";
    print "<content type='html'>$post_content</content></entry>";

    close(POST);

}

print "</feed>";
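One detail the script leaves open: the regular expression for ##Published keeps only the digits and separators, so 2011-10-20T07:20:26CEST is written out as 2011-10-20T07:20:26, with no time zone. Atom dates follow RFC 3339, which expects an offset such as +02:00. How the Blogger import treats a missing offset is unclear; should it matter, something along these lines could supply one. The helper name to_rfc3339 and the two-entry zone table are assumptions made for this sketch, which handles only the CET and CEST suffixes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed mapping: only the two zone names used in Poland are handled.
my %offset_for = ( CET => '+01:00', CEST => '+02:00' );

# Hypothetical helper: turn "2011-10-20T07:20:26CEST" into
# "2011-10-20T07:20:26+02:00". Returns nothing for any stamp that
# does not match the expected shape or uses an unknown zone name.
sub to_rfc3339 {
    my ($stamp) = @_;
    return unless defined $stamp
        && $stamp =~ /^(\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d)([A-Z]+)$/;
    my ($local, $zone) = ($1, $2);
    return unless exists $offset_for{$zone};
    return $local . $offset_for{$zone};
}

print to_rfc3339('2011-10-20T07:20:26CEST'), "\n";   # 2011-10-20T07:20:26+02:00
```

To use it, the capture group in the script would have to keep the zone suffix as well, for instance by capturing everything up to the trailing whitespace instead of only digits, T, hyphens and colons.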

A minor problem was the default formatting of <pre>...</pre>, which I use to show code snippets. I have to preserve the line breaks in the content of the pre element (cf. $in_pre in the Perl script above) and also add the following to the default CSS styles (CSS can be modified via Project → Template Designer → Advanced → Add CSS [1]):


pre { white-space:nowrap; font-size: 80%; }
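The line-joining rule behind $in_pre can be isolated in a small function, which makes it easier to try on a few lines before running the whole conversion. This is only a sketch: join_body_lines is a made-up name, and unlike the script it also copes with <pre> and </pre> appearing on the same line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join body lines into one string, keeping a newline
# after each line that is inside a pre element and a single space after
# every other line.
sub join_body_lines {
    my (@lines) = @_;
    my ($content, $in_pre) = ('', 0);
    for my $line (@lines) {
        $in_pre = 1 if $line =~ /<pre>/;
        # A closing tag ends the block, even on the line that opened it.
        $in_pre = 0 if $line =~ /<\/pre>/;
        $content .= $line . ($in_pre ? "\n" : ' ');
    }
    return $content;
}

# Sample lines, invented for this sketch.
my @sample = ('<p>one', 'two</p>', '<pre>', 'a', ' b', '</pre>', '<p>x</p>');
print join_body_lines(@sample), "\n";
```

The two paragraph lines end up on one line separated by a space, while the lines between <pre> and </pre> keep their own line breaks and leading whitespace.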

To convert, simply run the script as follows:


perl blogspot-import.pl post1 post2 post3.... > converted-posts.xml

The script described above can be downloaded from here.

[1] In Polish: Projekt → Projektant szablonów → Advanced → Dodaj Arkusz CSS
