Wednesday, March 28, 2012

Importing a WordPress database to Blogger

This will be the first post in a multipart series. First, I'll describe moving the database. Next, I'll cover moving the images, which is significantly more complicated.

We recently helped move a very large blog from WordPress to Blogger. The primary conversion was done with this Python utility. There is also an online converter, but it only accepts files up to 1 MB. This database was over 300 MB.

Why move to Blogger

WordPress powers many of the flashy sites out there, and it is very powerful. Unlike Blogger, it can be installed on your own server, so you have total control. With Blogger, even if you have a custom domain, the entire blog is hosted on Google's servers. Of course I understand that many of you won't feel comfortable handing over all this information and power to Google, but doing so brings some significant advantages. Aside from never needing to maintain a server, the following three things immediately jump to mind.
  1. Security. Blogger has a much better security record than WordPress. Period. Add to this 2-step verification, and the chances of your account getting hacked are extremely low.
  2. Greatly reduced bandwidth bill. Blogger images are hosted on Picasa, and signing up for Google+ increases your free quota such that "photos up to 2048 x 2048 pixels and videos up to 15 minutes won't count towards your free storage."
  3. Better protection against DMCA abuse. Blogger moving to country-code top-level domains is huge. If your blog gets a takedown request, it is brought down only for the country in which the request originated. Completely erasing content from the internet now requires filing takedowns for each Blogger ccTLD.
Of course similar benefits regarding server maintenance and bandwidth usage can come from hosting on as well.

Convert the database

Step 1. Export the database from the WordPress dashboard. If you have a large database, do it in chunks. Use a systematic file naming convention with no spaces. The online converter can be used on XML files under 1 MB. Smaller files also make it easier to narrow down the location of any issues and are easier to upload to Google. One massive 300 MB XML file is more likely to fail to import than ten 30 MB files, simply because there is more in it that can go wrong.
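
For what it's worth, here is a minimal sketch of how a folder of exported chunks could be checked before conversion. It assumes the "1 MB" limit means roughly 1,000,000 bytes, and the function names (classify_export, plan_exports) and the example file names are made up for illustration.

```python
from pathlib import Path

# Assumption: the online converter's "1 MB" limit is about 1,000,000 bytes.
ONLINE_LIMIT_BYTES = 1_000_000


def classify_export(name, size_bytes):
    """Describe how one exported chunk should be handled.

    Returns a dict with the file name, whether the name contains spaces,
    and which conversion route fits the file size.
    """
    if size_bytes < ONLINE_LIMIT_BYTES:
        route = "online converter"
    else:
        route = "local script"
    return {"file": name, "has_spaces": " " in name, "route": route}


def plan_exports(folder):
    """Classify every .xml file in a folder, sorted by file name."""
    return [
        classify_export(path.name, path.stat().st_size)
        for path in sorted(Path(folder).glob("*.xml"))
    ]
```

A naming scheme along the lines of blog_2011-03.xml (hypothetical) keeps the files sorted chronologically, which makes it easier to see which chunks have already been converted and imported.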

Step 2. Convert with the online converter if your files are under 1 MB. If not, you'll need to download the scripts and run them in a terminal window. This requires Python, which comes standard with most Linux distros and Mac OS X. To execute the script, do this, changing the names of the files accordingly: input.xml > output.xml

Step 3. Fix any errors. A common error we ran into was Input WordPress document is not valid XML!!. Fortunately, the message also indicates the general location of the error. For example, Error appears around line 652861, column 33. We found the cause to often be malformed Unicode, generally in the comments. We had some success with re-encoding the file as UTF-8 and then opening the XML file with a standard text editor. After doing that we'd often find a mangled character that could then be deleted. If that doesn't work, it might be best to strip out the entire offending comment.
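
We did this cleanup by hand, but the hunt for the bad character can be partly automated. The sketch below is not part of the converter scripts. It uses Python's standard xml.etree.ElementTree parser to report where the first parse error sits, and a regular expression to drop characters that XML 1.0 does not allow. The function names are made up for illustration, and the parser's line and column may not match the converter's numbers exactly.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the character ranges allowed by XML 1.0.
ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_xml_error(data):
    """Return (line, column, snippet) for the first parse error, or None.

    `data` is the raw file content as bytes.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position
        lines = data.split(b"\n")
        bad_line = lines[line - 1] if 1 <= line <= len(lines) else b""
        # Show a little context on either side of the reported column.
        snippet = bad_line[max(0, column - 30): column + 30]
        return line, column, snippet.decode("utf-8", errors="replace")
    return None


def strip_illegal_chars(text):
    """Remove characters that are not legal in an XML 1.0 document."""
    return ILLEGAL_XML_CHARS.sub("", text)
```

One way to use it: read the export with open(path, "rb"), call find_xml_error on the bytes, and only fall back to strip_illegal_chars (on the text decoded with errors="replace") if the reported snippet really does show a mangled character.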

To do so, remove everything wrapped in (and including) the <wp:comment></wp:comment> tags.
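
If the file is too large to edit comfortably by hand, the same removal can be sketched in Python. This is only an illustration: drop_comment_at_line is a hypothetical helper, and it assumes the comment tags appear literally as <wp:comment> and </wp:comment> and that the line number comes from the error message.

```python
import re

# One whole comment, including its opening and closing tags.
COMMENT_BLOCK = re.compile(r"<wp:comment>.*?</wp:comment>", re.DOTALL)


def drop_comment_at_line(text, line_number):
    """Remove the <wp:comment> block that overlaps the given 1-based line.

    Returns the text unchanged if no comment touches that line.
    """
    lines = text.split("\n")
    if not 1 <= line_number <= len(lines):
        return text
    # Character offsets where the reported line starts and ends.
    line_start = sum(len(line) + 1 for line in lines[: line_number - 1])
    line_end = line_start + len(lines[line_number - 1])
    for match in COMMENT_BLOCK.finditer(text):
        if match.start() < line_end and match.end() > line_start:
            return text[: match.start()] + text[match.end():]
    return text
```

Write the result to a new file rather than over the original, so the untouched export is still there if the wrong comment gets removed.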

Clean up the converted database

Unfortunately, the converted database needs to be modified to make it look good. Below is the list of search and replace terms I used to get the content to render exactly the same in Blogger as it did in WordPress.

The problem is the addition of extra line breaks. Part of this comes from the default CSS in common Blogger templates like "simple" and "awesome inc", which includes margins after blockquote and list elements. Part of it comes from Blogger's lack of love for the paragraph tag (<p>). But by far, most of the extra space is because the version of the wordpress2blogger scripts we used likes the break tag (<br />). A lot.

Consider the following quote.
I like cats.
The HTML to render this quote looked like this in the XML file exported from WordPress:
<blockquote>I like cats.</blockquote>

After conversion to Blogger, it had two extra breaks appended:
<blockquote>I like cats.</blockquote><br /><br />

This will result in an unsightly amount of blank space after any quote. I identified certain HTML tags that would trigger the addition of extra breaks, but found it faster to just let the conversion run and then do a multifile search and replace on the converted database.

Here's the list of all the search and replace terms I used. I'm sure someone could come up with a regular expression to catch all the multiple breaks with one command, but I didn't really think about it. NOTE: the converted database uses HTML entities, so instead of a break tag looking like <br />, it will look like this: &lt;br /&gt;

1. Multiple line breaks
Find: <br/><br/><br/><br/><br/>
Replace: <br/><br/>

Find: <br/><br/><br/><br/>
Replace: <br/><br/>

Find: <br/><br/><br/>
Replace: <br/><br/>

2. Breaks starting lists
Find: <ul><br/>
Replace: <ul>

3. Breaks before end of lists
Find: <br/></ul>
Replace: </ul>

4. Breaks after lists.
Find: </ul><br/>
Replace: </ul>

5. Paragraph tags (blogger doesn't need them in the html)
Remove: <p>

6. Ending paragraph tag
Remove: </p>

7. Breaks after list elements
Find: </li><br/>
Replace: </li>

8. Breaks after blockquotes
Find: </blockquote><br/><br/>
Replace: </blockquote><br/>
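
For anyone who would rather script this than run each search and replace by hand, here is a rough Python sketch of the same rules. It is not what I actually ran. It assumes the tags are entity-encoded as described in the note above, that a break may be spelled with or without the space before the slash, and that runs of breaks have nothing between them. The list rule removes a break sitting directly before either the opening or the closing list tag. The names BR, RULES and clean_converted are made up for illustration.

```python
import re

# Assumption: an entity-encoded break, written as <br>, <br/> or <br />.
BR = r"&lt;br\s*/?&gt;"
TWO_BREAKS = "&lt;br /&gt;&lt;br /&gt;"

# (pattern, replacement) pairs, applied in order.
RULES = [
    # Three or more consecutive breaks collapse to two.
    (re.compile(rf"(?:{BR}){{3,}}"), TWO_BREAKS),
    # Break right after the opening list tag.
    (re.compile(rf"(&lt;ul&gt;){BR}"), r"\1"),
    # Break right before an opening or closing list tag.
    (re.compile(rf"{BR}(&lt;/?ul&gt;)"), r"\1"),
    # Break right after the closing list tag.
    (re.compile(rf"(&lt;/ul&gt;){BR}"), r"\1"),
    # Opening and closing paragraph tags are removed entirely.
    (re.compile(r"&lt;/?p&gt;"), ""),
    # Break right after a list element.
    (re.compile(rf"(&lt;/li&gt;){BR}"), r"\1"),
    # Two breaks after a blockquote become one.
    (re.compile(rf"(&lt;/blockquote&gt;)(?:{BR}){{2}}"), r"\1&lt;br /&gt;"),
]


def clean_converted(text):
    """Apply every cleanup rule once, in order, and return the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

The first rule is the single expression that replaces the three separate multiple-break searches. Try it on a copy of one small converted file first and compare the result in Blogger before running it over everything.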

Import the database

Now you're ready to import the database to Blogger. From Settings > Other, click Import blog.

Select the file you want to upload, pass the reCAPTCHA (you only need to get one of the words correct), and click Import. When we did this, there was a bug that prevented automatic publication of posts. Fortunately for us, there is a select all button on the Blogger dashboard, so you can select all the posts displayed on a page and publish them at once. Unfortunately for us, we could only display 100 posts at a time and there were over 6000 to publish.

If any files fail to upload due to random errors, try uploading again. You can also try re-exporting and reconverting. For us, there were several entries, again in the comments, that caused problems. All we could do was split the database into smaller and smaller chunks and attempt to re-upload, eventually narrowing it down to a particular month that had a database issue. Then to a particular week. Next to an actual post. And finally to a malformed character in a single comment that didn't cause the conversion to Blogger format to fail but did cause the import to Blogger to fail. (This is why you want to use a systematic naming convention for the XML files, since you can easily end up with duplicate posts if you don't keep track.)
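
We narrowed things down by re-exporting smaller date ranges, but a file can also be split after the fact. The sketch below is one hedged way to do that in Python. It assumes each post sits in a literal <item>...</item> block, that the text <item> never appears inside post content, and that the file fits in memory. split_export and chunk_name are hypothetical helpers, not part of the converter scripts.

```python
import re

# One whole post, including its opening and closing tags.
ITEM_BLOCK = re.compile(r"<item>.*?</item>", re.DOTALL)


def split_export(text, items_per_chunk):
    """Split an export into smaller documents with the same header and footer.

    Everything before the first <item> is copied to the top of each chunk,
    and everything after the last </item> to the bottom.
    """
    matches = list(ITEM_BLOCK.finditer(text))
    if not matches:
        return [text]
    header = text[: matches[0].start()]
    footer = text[matches[-1].end():]
    items = [match.group(0) for match in matches]
    chunks = []
    for start in range(0, len(items), items_per_chunk):
        body = "\n".join(items[start: start + items_per_chunk])
        chunks.append(header + body + footer)
    return chunks


def chunk_name(prefix, index):
    """Build a sortable file name with no spaces."""
    return f"{prefix}_part{index:03d}.xml"
```

Halving the failing chunk each time and re-uploading only the halves finds the bad post in a handful of rounds, and the numbered names make it clear which pieces have already gone in, which helps avoid duplicate posts.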

1 comment:

  1. "use a systematic file naming convention with no spaces" what do you mean please explain more casuse i get fatal error allocated memory when i try to export in Wordpress