This is the sixth installment of a series of articles. To read the previous one, please click here


Time for the clean-up of your manuscript

Now that we’ve exhaustively covered the preliminaries, it is finally time to put it all to work for us and begin creating an actual eBook source file. I know you’ve been waiting for this with held breath, so let’s just roll.

The first thing we need is a cleaned up text version of your manuscript. By that, I mean a version that has proper curly quotes, correct dashes, including em dashes, ellipses and so forth.

I can’t even count how many times I have read on message boards, not to use curly quotes, ellipses etc. and I cannot stress how misguided those recommendations are. They usually stem from people not properly understanding the workings of eBook creation and going for a cop-out instead of trying to really address the problems they might have encountered. Bad advice! I will show you how to do it right because publishing a book without proper typographical characters is like writing text without ever using the letter ‘e’.

The way I clean up my text is usually by loading it into a word processor and doing a series of search and replaces. The first one is replacing all occurrences of " with ". Yes, this is no typo, I am really replacing all quotes with an identical quote. By doing this I am putting the word processor’s logic to work. By replacing all quotes in the text with themselves, the program automatically smart quotes them, creating the correct, corresponding curly quotes for me throughout the text. Now that was cool, wasn’t it?

Next step, we do the same thing with single quotes, by replacing all occurrences of ' with '. Again the software will make sure to use the typographically correct curled single quotes in all instances.

Next up, em dashes. I have a habit to mark em dashes by writing two regular dashes in my text, so a quick search that replaces -- with — does the trick for me in no time.

The last step are usually ellipses, in which a search and replace of all occurrences of ... with … will automatically create proper ellipses for me. This is important because it allows the eBook reader to do proper line breaks after the ellipses, whereas three individual periods can easily confuse the device and render the first period on one line and the remaining two on the next — which is a serious typographical flaw. In addition, ellipses are spaced correctly for each font for best readability, and are part of the typographic vocabulary for a reason, so don’t just ignore them.

If you have a word processor that allows you to search for text styles — some do, others don’t — you can now do a search and replace that will save you considerable time down the line. Try to find all instances of italic text and wrap them with <i> tags now. Using wild cards, you can pretty much automate this process and save yourself hours of manual work with just a few mouse clicks here. In Word, for example, go to the search box and hit Ctrl-i to select italic, and in the replace box enter <i>^&</i> and then hit Replace All and you should be all set.

Do not fall for the temptation to do the same thing with your bold text, however, such as your chapter headings! We will tackle those differently a little later on.

We now have a clean text file. Select the entire text now and copy it to your clipboard. We are leaving the word processor and enter the domain of HTML.

Nice, clean and predictable in HTML

Open your programming editor (See Part III of the series for a quick discussion of programming editors), create a new file and paste your text into it. You will notice that all formatting is lost, and that is just as well. In fact, that is what we want. It is probably the most important step of the entire process, to get rid of the unpredictable word processor formatting. We will now begin to massage our text back to shape with a few, elegantly applied steps.

Once you got over the shock that all formatting is lost, you may also notice that every paragraph of your original text is now in one single, long line. (If that is not the case, you should adjust the line width of the editor to its maximal possible length through the Options settings.)

We will use this fact to our advantage and wrap every single line with a paragraph tag. This can be easily done using a regular expression search and replace. Regular expressions are extremely cryptic and I do not expect you to understand how they work, so just follow the next few instructions, if you may.

Open the search and replace window in your editor and make sure Regular Expressions are enabled. Occasionally you may find a checkbox in the search window, so give it a quick look. Now enter ^(.+)$ as the search term. Then enter <p>$1</p> in the replacement line. Run the search and replace across the entire text and take a look at your results. Every line of text should now be wrapped neatly by an opening <p> and a closing </p> tag. If they are not, your editor might use a slightly different syntax. Undo whatever the editor just did and enter <p>\1</p> in the replacement line instead of the previously used enter <p>$1</p> replacement term. Run the replacement and check the results. If it is still not correct, your editor might not support regular expressions.

In theory you could do these replacements in your word processor also, though quite honestly, I don’t really trust them that well, and personally prefer the use of a programming editor instead, which is also significantly faster.

Dealing with special characters the right way

The next step for us to do is to replace all special characters with their proper HTML entities. I’ve seen a lot of discussion about this, and how it’s not working right or is platform dependent, but trust me, when I say, that it is all bologna. There is a very safe way to handle this in HTML that will properly display on every HTML device, regardless of font or text encoding. The key to success lies in HTML’s named entities.

If we take the ellipses (…), for example, in HTML there is a special code that tells the device to draw that particular character. It is called &hellip; With this entity, the device knows to draw an ellipse that cannot be broken into parts and is treated as a single character.

If you use the entity &mdash; the device will render a proper em dash. Proper length, proper size and all.

Next up are quotes. For that purpose, HTML offers &ldquo; and &rdquo; , entities that represent curly left and right double quotes, just the way we love them. Correspondingly, &lsquo; and &rsquo; are the entities to draw curly single quotes.

And as easy as that, we have circumnavigated all compatibility issues for special characters. These named entities will always be rendered correctly, unlike the cryptic numeric entities that some people are using.

If you happen to see something like this in your HTML code – &#175; – you know you’re asking for trouble, so make sure to use named entities only!

There are, of course many more, including entities for currency symbols, accented characters etc. and there are two basic ways to go about having them all replaced.

The brute force approach would be to search and replace all of them by hand, one entity at a time. This is not only time consuming but also prone to error, as you could all too easily overlook some in your text — but it may be the only option available to you.

The second — and easier way — is to automate the process. TextMate, the programming editor I am using, has a function called “Convert Selection to Entities excluding Tags” and it does exactly what we need. With it, it takes me one mouse-click to have all special characters in my entire book converted to named entities. Remember, using the right tools for the job will always make your life easier!

Alternatively, there are a few websites on the Internet that allow you to paste in your text and it will convert it for you, such as http://word2cleanhtml.com. However, I take no responsibility for the quality of the conversion and I want to point out that you are inserting your entire book into a website you are not familiar with, where it could — theoretically — be stored and re-distributed. I’m usually not paranoid but it is something I thought I should point out.

If you have not been able to wrap all your italic text instances with <i> tags in your word processor, now would be the time to do that — by hand. It may be a bit tedious, as you will have to look for every instance of italic text in your manuscript and manually wrap it with the tags, but I found that usually their number are limited and it doesn’t take too long to do.

Once we are done with all that, we have a very basic HTML source file for our eBook — one that is guaranteed without strange formatting errors and things that plague countless eBooks. Make sure you save this file somewhere, using an .html file extension. This will later allow us to quickly evaluate and check the eBook file in an ordinary web browser. In fact, if you double-click the file, you should already be able to take a look at it in your browser. Paragraphs should be nicely separated and italic text should show as such.

As you can see we’re quickly getting there now, but, of course, we are not done yet. In the next installment we will begin to fine-tune the various elements of the book and give it the polish it deserves.


Take pride in your eBook formatting
Part IPart IIPart IIIPart IVPart VPart VIPart VIIPart VIIIPart IX

Need help with an eBook project? Check here for more information.