Take pride in your eBook formatting (Part VI)
Posted by Guido ·Jan 13
This is the sixth installment of a series of articles. To read the previous one, please click here
Time for the clean-up of your manuscript
Now that we’ve exhaustively covered the preliminaries, it is finally time to put it all to work for us and begin creating an actual eBook source file. I know you’ve been waiting for this with held breath, so let’s just roll.
The first thing we need is a cleaned up text version of your manuscript. By that, I mean a version that has proper curly quotes, correct dashes, including em dashes, ellipses and so forth.
I can’t even count how many times I have read advice on message boards not to use curly quotes, ellipses, etc., and I cannot stress enough how misguided those recommendations are. They usually stem from people not properly understanding the workings of eBook creation and going for a cop-out instead of trying to really address the problems they might have encountered. Bad advice! I will show you how to do it right because publishing a book without proper typographical characters is like writing text without ever using the letter ‘e’.
The way I clean up my text is usually by loading it into a word processor and doing a series of search and replaces. The first one is replacing all occurrences of " with ". Yes, this is no typo, I am really replacing all quotes with an identical quote. By doing this I am putting the word processor’s logic to work. By replacing all quotes in the text with themselves, the program automatically smart quotes them, creating the correct, corresponding curly quotes for me throughout the text. Now that was cool, wasn’t it?
Next step, we do the same thing with single quotes, by replacing all occurrences of ' with '. Again the software will make sure to use the typographically correct curled single quotes in all instances.
Next up, em dashes. I have a habit of marking em dashes by writing two regular dashes in my text, so a quick search that replaces -- with — does the trick for me in no time.
The last step is usually ellipses, in which a search and replace of all occurrences of ... with … will automatically create proper ellipses for me. This is important because it allows the eBook reader to do proper line breaks after the ellipses, whereas three individual periods can easily confuse the device and render the first period on one line and the remaining two on the next, which is a serious typographical flaw. In addition, ellipses are spaced correctly for each font for best readability, and are part of the typographic vocabulary for a reason, so don’t just ignore them.
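For readers who would rather script this clean-up than click through a word processor, here is a minimal Python sketch of the same replacements. It is not what Word does internally; the function name clean_manuscript and the simple rule for telling opening quotes from closing ones (a quote at the start of the text, or after whitespace or an opening bracket, counts as opening) are assumptions made for illustration.

```python
import re

def clean_manuscript(text: str) -> str:
    """Apply the typographic replacements described above to plain text."""
    # Three periods become the single ellipsis character.
    text = text.replace("...", "\u2026")
    # Two regular dashes become a real em dash.
    text = text.replace("--", "\u2014")
    # Double quotes: opening at the start of the text or after
    # whitespace or an opening bracket, closing everywhere else.
    text = re.sub(r'(^|[\s(\[{])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")
    # Single quotes and apostrophes follow the same rule.
    text = re.sub(r"(^|[\s(\[{\u201c])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text

print(clean_manuscript('"Hi," she said. "It\'s late..."'))
```

A heuristic this simple will curl the apostrophe the wrong way in words that start with one, so the result still deserves a read-through.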
If you have a word processor that allows you to search for text styles — some do, others don’t — you can now do a search and replace that will save you considerable time down the line. Try to find all instances of italic text and wrap them with <i> tags now. Using wild cards, you can pretty much automate this process and save yourself hours of manual work with just a few mouse clicks here. In Word, for example, go to the search box and hit Ctrl-i to select italic, and in the replace box enter <i>^&</i> and then hit Replace All and you should be all set.
Do not fall for the temptation to do the same thing with your bold text, however, such as your chapter headings! We will tackle those differently a little later on.
We now have a clean text file. Select the entire text now and copy it to your clipboard. We are leaving the word processor and entering the domain of HTML.
Nice, clean and predictable in HTML
Open your programming editor (See Part III of the series for a quick discussion of programming editors), create a new file and paste your text into it. You will notice that all formatting is lost, and that is just as well. In fact, that is what we want. It is probably the most important step of the entire process, to get rid of the unpredictable word processor formatting. We will now begin to massage our text back to shape with a few, elegantly applied steps.
Once you get over the shock that all formatting is lost, you may also notice that every paragraph of your original text is now in one single, long line. (If that is not the case, you should adjust the line width of the editor to its maximum possible length through the Options settings.)
We will use this fact to our advantage and wrap every single line with a paragraph tag. This can be easily done using a regular expression search and replace. Regular expressions are extremely cryptic and I do not expect you to understand how they work, so just follow the next few instructions, if you may.
Open the search and replace window in your editor and make sure Regular Expressions are enabled. Occasionally you may find a checkbox in the search window, so give it a quick look. Now enter ^(.+)$ as the search term. Then enter <p>$1</p> in the replacement line. Run the search and replace across the entire text and take a look at your results. Every line of text should now be wrapped neatly by an opening <p> and a closing </p> tag. If they are not, your editor might use a slightly different syntax. Undo whatever the editor just did and enter <p>\1</p> in the replacement line instead of the previously used <p>$1</p> replacement term. Run the replacement and check the results. If it is still not correct, your editor might not support regular expressions.
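If your editor refuses to cooperate, the same search and replace can be run as a short script. The following Python sketch uses the search term from above; the function name wrap_paragraphs is only an illustrative choice. Note that Python belongs to the camp that writes the captured text as \1 rather than $1.

```python
import re

def wrap_paragraphs(text: str) -> str:
    """Wrap every non-empty line in <p>...</p> tags."""
    # With re.MULTILINE, ^ and $ match at the start and end of each line.
    # (.+) needs at least one character, so blank lines are left alone.
    return re.sub(r"^(.+)$", r"<p>\1</p>", text, flags=re.MULTILINE)

print(wrap_paragraphs("First paragraph.\n\nSecond paragraph."))
```

The blank line between the two paragraphs stays untouched, because the pattern only matches lines that contain text.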

In theory you could do these replacements in your word processor also, though quite honestly, I don’t really trust them that well, and personally prefer the use of a programming editor instead, which is also significantly faster.
Dealing with special characters the right way
The next step for us is to replace all special characters with their proper HTML entities. I’ve seen a lot of discussion about this, and how it’s not working right or is platform dependent, but trust me when I say that it is all baloney. There is a very safe way to handle this in HTML that will properly display on every HTML device, regardless of font or text encoding. The key to success lies in HTML’s named entities.
If we take the ellipsis (…), for example, in HTML there is a special code that tells the device to draw that particular character. It is called &hellip;. With this entity, the device knows to draw an ellipsis that cannot be broken into parts and is treated as a single character.
If you use the entity &mdash; the device will render a proper em dash. Proper length, proper size and all.
Next up are quotes. For that purpose, HTML offers &ldquo; and &rdquo;, entities that represent curly left and right double quotes, just the way we love them. Correspondingly, &lsquo; and &rsquo; are the entities to draw curly single quotes.
And as easy as that, we have circumnavigated all compatibility issues for special characters. These named entities will always be rendered correctly, unlike the cryptic numeric entities that some people are using.
If you happen to see something like this in your HTML code – &#175; – you know you’re asking for trouble, so make sure to use named entities only!
There are, of course, many more, including entities for currency symbols, accented characters, etc., and there are two basic ways to go about having them all replaced.
The brute force approach would be to search and replace all of them by hand, one entity at a time. This is not only time consuming but also prone to error, as you could all too easily overlook some in your text — but it may be the only option available to you.
The second — and easier way — is to automate the process. TextMate, the programming editor I am using, has a function called “Convert Selection to Entities excluding Tags” and it does exactly what we need. With it, it takes me one mouse-click to have all special characters in my entire book converted to named entities. Remember, using the right tools for the job will always make your life easier!
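If your editor has no such function, a few lines of Python can stand in for it. This is a minimal sketch and not TextMate's implementation: it relies on the codepoint2name table from Python's standard html.entities module, and the function name to_named_entities is made up for this example. Only non-ASCII characters are converted, which is what leaves the angle brackets of your tags alone.

```python
from html.entities import codepoint2name

def to_named_entities(text: str) -> str:
    """Replace special characters with named entities, leaving tags intact."""
    out = []
    for ch in text:
        code = ord(ch)
        # Plain ASCII (including <, > and &) is passed through unchanged.
        if code > 127 and code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            out.append(ch)
    return "".join(out)

print(to_named_entities("<p>\u201cHi\u201d \u2026</p>"))
```

Characters that have no named entity in that table are passed through unchanged, so it is worth searching the result for anything that was left over.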
Alternatively, there are a few websites on the Internet that let you paste in your text and convert it for you, such as http://word2cleanhtml.com. However, I take no responsibility for the quality of the conversion and I want to point out that you are inserting your entire book into a website you are not familiar with, where it could, theoretically, be stored and re-distributed. I’m usually not paranoid but it is something I thought I should point out.
If you have not been able to wrap all your italic text instances with <i> tags in your word processor, now would be the time to do that by hand. It may be a bit tedious, as you will have to look for every instance of italic text in your manuscript and manually wrap it with the tags, but I found that usually their number is limited and it doesn’t take too long to do.
Once we are done with all that, we have a very basic HTML source file for our eBook — one that is guaranteed without strange formatting errors and things that plague countless eBooks. Make sure you save this file somewhere, using an .html file extension. This will later allow us to quickly evaluate and check the eBook file in an ordinary web browser. In fact, if you double-click the file, you should already be able to take a look at it in your browser. Paragraphs should be nicely separated and italic text should show as such.
As you can see we’re quickly getting there now, but, of course, we are not done yet. In the next installment we will begin to fine-tune the various elements of the book and give it the polish it deserves.
99 comments
Comment by Vicki on January 13, 2011 at 12:11 pm
Your “Take pride in your eBook formatting” series has been excellent, Guido. You’ve written it so that even someone with limited computer knowledge could follow your instructions.
Comment by Guido on January 13, 2011 at 12:32 pm
Thank you very much for the compliments, Vicki. Much obliged.
Comment by Maurice Alvarez on January 14, 2011 at 1:46 pm
I happened to run into David Burton in a web forum and he kindly turned me to your blog series. It’s been a great read, highly informative and immediately useful. Thanks!
Comment by Guido on January 14, 2011 at 2:29 pm
I am glad that you find the series helpful and hope you’ll be able to apply some of it for your own work.
Comment by Maurice Alvarez on January 20, 2011 at 6:48 am
Regarding the entity codes, I happen to have some pinyin (Chinese phonetics) which uses characters in the Latin-B unicode group which don’t seem to have entity codes (ie: Nǐ hǎo). Will I really have a problem if I just use the numeric unicode for these?
Comment by Guido on January 21, 2011 at 8:55 am
I have to admit that I have no experience with Chinese phonetics and am not sure how standardized their Unicode implementation is. I would expect it to be fairly safe to use numeric entities in that case but I would definitely double check on a few platforms and load it on a Kindle, a PC, an iPhone, a Mac etc just to make sure.
Comment by Deb on January 23, 2011 at 2:06 pm
Guido, is there a fast way to replace all left and right quotes in jedit? Does it have something similar to the textmate you use?
Comment by Guido on January 23, 2011 at 2:24 pm
If you mean replacing regular quotes to turn them into curly quotes, I don’t think any programming editor can do that. It requires some logic to identify the opening and closing quotes as they require different replacements. Word processors usually have that logic when “smart quotes” are turned on, but not programming editors.
If by replacing you mean replacing the curly quotes with their respective entity values, you can do that in a quick two-step search and replace. Simply search and replace all occurrences of “ with the entity &ldquo; and then replace all occurrences of ” with the entity &rdquo;
Comment by Jason Vanhee on February 20, 2011 at 10:29 am
I’m trying to follow the formatting steps, and everything goes perfectly except that I can’t make jedit actually find the curly quotes to replace them with the entity values. I set it to find, and it claims there’s none in the document, even though I can clearly see there are. Am I missing something obvious, or am I doomed to have to replace every one of my 3600 instances of various curly quotes by hand?
Thanks if you have any ideas. And if not, thanks anyway, this is a really awesome step by step guide all the same. And way faster than I thought it would be.
Pingback by Formatting Your Ebook for Kindle: .Mobi File Format – The .HTML Component – Part 5 | Helen Hanson on February 22, 2011 at 8:48 am
[...] to this post by Guido Henkel, I now know how to clip and paste from a word processor to an HTML editor without hand coding the [...]
Comment by Guido on February 25, 2011 at 1:42 pm
Jason,
The best way is to highlight one of the curly quotes in your text and copying it to the clipboard. Then paste it from the clipboard straight into the search box. That way you’re making sure you have exactly the same code there as in the text.
I hope this will help.
Comment by Larissa Lyons on February 26, 2011 at 3:57 pm
Dreamweaver proved tricky on getting the paragraph tags properly rendered; I wasn’t able to use either of your above examples, but here’s what did work if anyone’s interested: FIND: REPLACE WITH:
Note-I did highlight specific text when doing this and used Source/Code view. If I’d done the cntl+A option (which I normally would have) & done the find/replace, there would have been an extraneous hanging tag or two.
Guido – thank you! This is fantastic info & explanations.
Comment by Larissa Lyons on February 27, 2011 at 10:31 am
Oops! Something didn’t like those Find/Replace codes. Here they are posted as a jpg on my site:
http://www.larissalyons.com/images/para_fix.jpg
Comment by Guido on February 27, 2011 at 10:36 am
Looks good to me. What exactly is happening? Feel free to send me the file if you wish.
Comment by Larry Kahn on March 6, 2011 at 12:07 pm
Guido, thanks for this excellent series. I’m taking a crack at converting my novel myself thanks to your encouragement, but I have a question. My novel includes a second style of paragraphing designed to mimic an online chatroom format–a different font, bold screen name and a colon followed by tabbed, regular font text. Do you have a suggestion for formatting this either in Word or the text editor? Is there an opportunity to make this even more distinctive in e-book format by inserting the chat dialogue in a box?
Thanks for your thoughts!
Larry
Comment by Guido on March 6, 2011 at 12:28 pm
Larry, this is fairly easy to do. All you have to do is create a separate paragraph style for this, say, you call it “chat.” Then you wrap all the text with the corresponding tags, like this
<p class="chat">Here is where your text goes</p>
The key is now to modify the “chat” style so that it uses a Courier font-family, maybe adds a margin to indent the text and sets a border to visually frame the entire block with a thin line.
Comment by Andy Conway on April 6, 2011 at 12:09 pm
Hi Guido
Thanks for this guide. It was indispensable in publishing my first ebook, but I’m having problems now with my second. The ‘replace italics’ instruction is no longer working in Word.
I’m doing exactly what I did before but it’s now placing the italic anchors AFTER each italicised word or phrase and not wrapping around.
I’ve no idea why it would do this. Is it a problem you’ve encountered before?
Appreciate your help.
Cheers
Andy
Comment by Guido on April 6, 2011 at 10:50 pm
I have not seen this problem before, but sometimes italic conversions do not work properly, when there is an italic font in use instead of the italic font setting. I had this in a client’s book just this week where she used “Courier New Italic” in her manuscript, instead of “Courier New” and then making it italic.
I had to adjust my search and replace accordingly to locate and wrap those instances. You might want to take a closer look at your own italic instances and see if there is something wonky there.
Comment by Andy Conway on April 7, 2011 at 8:16 am
Thanks for the response. It’s not that, though. I’m using (Default) Arial and I’ve checked formatting against the previous document that converted correctly and there’s no difference. It’s a puzzler. Looks like I’ll have to do them manually once I’ve put the novel into Dreamweaver.
Is there a problem if the anchor uses ‘em’ and not ‘i’ – ‘em’ seems to be the default for italics in Dreamweaver.
Comment by Katharina Gerlach on April 10, 2011 at 11:45 pm
“By replacing all quotes in the text with itself, the program automatically smart quotes them”
This only works if you activate “replace straight quotes for curly quotes” (I’m paraphrasing here because I’ve got a different language version of Word) in Word (it’s on the auto-format tab in the dialog box that pops up if you choose Extra -> Auto Correct from the menu).
Comment by Jaime Buckley on April 30, 2011 at 4:10 pm
Guido, you are a complete stud. Flat out. Thank you for this information and I wanted to help by making a tweak for any of your readers who might use Open Office instead of MSW:
If you use <i>^&</i> to replace the italics, you’ll find the ^ in your final results. So when you find and replace in OO, just type <i>&</i> and it will turn out perfect.
My two cents back to you kind sir, as a thank you.
Jaime Buckley =)
Comment by Jaime Buckley on April 30, 2011 at 4:13 pm
Blast. You see how important it is to code things right?
Let’s try that again…
If you use “<i>^&</i>” to replace the italics, you’ll find the “^” in your final results. So when you find and replace in OO, just type “<i>&</i>” and it will turn out perfect.
Hmmmm.
Comment by Jaime Buckley on April 30, 2011 at 4:14 pm
Ok Guido—you’ll have to fix the code so it actually shows up…lol.
I tried to help anyway. (smirk)
Comment by Guido on May 1, 2011 at 7:35 pm
Yeah, it is a little tricky to post HTML tags without them being interpreted.
Comment by Cora on May 7, 2011 at 8:44 pm
First of all, your guide has been very helpful in formatting my first e-book. And besides, it’s always nice to meet a fellow John Sinclair fan.
However, I have a specific problem. I use the German edition of Word, so whenever I use curly quotes, they appear in the common German layout, first lower quotes and then upper quotes. This is a pain in the backside, if you primarily write in English, so I always switch off the curly quotes first thing when I get a new computer. I switched them back on when preparing my text for e-book formatting. But when I did search and replace, I got the bloody lower and upper curly quotes again. Setting the document language to English doesn’t help either, I keep getting lower and upper quotes.
Do you know a way to work around this? Or is it best just to keep the straight quotes, even if it’s not as pretty. Because German quote format in an English language e-book would probably cause more confusion than plain straight quotes.
Comment by Guido on May 7, 2011 at 11:32 pm
I would have expected that switching the document language to English would fix the problem but since it is a Microsoft product I am not surprised that it doesn’t. Nothing Microsoft does works the way it should.
Here’s what I would do. Leave the quotes as they are in German. Then convert them to named entities the way I described. After that I would manually search and replace all occurrences of &ldquo; with something like &dummy;
Then I’d replace all occurrences of &rdquo; with &ldquo; and finally replace all occurrences of &dummy; with &rdquo;
You have now effectively exchanged the two symbols with each other. It takes a few extra steps but fixes your problem.
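For anyone who prefers to script it, the three-step swap can be sketched in Python like this. It assumes the quotes have already been converted to named entities; the function name swap_entities and its default arguments are illustrative choices, and which pair of entities you actually need to exchange depends on what your word processor produced.

```python
def swap_entities(html: str, first: str = "&ldquo;", second: str = "&rdquo;",
                  placeholder: str = "&dummy;") -> str:
    """Exchange two entities by parking one of them in a placeholder."""
    # The placeholder must be unused, or the final step would corrupt the text.
    if placeholder in html:
        raise ValueError("placeholder already occurs in the text")
    html = html.replace(first, placeholder)   # step 1: park the first entity
    html = html.replace(second, first)        # step 2: second becomes first
    return html.replace(placeholder, second)  # step 3: parked ones become second

print(swap_entities("&ldquo;Hallo&rdquo;"))
```

The placeholder is what keeps the two replacements from undoing each other.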
Comment by Cora on May 8, 2011 at 5:16 pm
Thanks for the help. Your comment about the inefficiency of Microsoft products, which I completely agree with BTW, actually gave me another idea. I imported the text into my old Lotus Word Pro program, did the search and replace and with Lotus, changing the document language actually works. But then it’s not a Microsoft product.
Thanks anyway and also thanks again for this extremely helpful guide.
Comment by Guido on May 8, 2011 at 6:49 pm
That’s not a bad idea, really. Glad that worked out for you.
Comment by Heather on May 28, 2011 at 12:43 am
I think I’m in love with you.
Seriously.
Okay, got that out of my system. Now, when I write in Word, I utilize “page breaks” between chapters. Will this cause a problem during the rest of my conversion process? Is there any specific way I should format my chapters to help the conversion?
Comment by Guido on May 28, 2011 at 10:30 am
Thank you, Heather. The page breaks do not interfere with the process. When transferring your text from the word processor to the text editor, these page breaks will be lost, but you will re-introduce them later through the chapter headings and the “chapter” style.
Comment by meryl on June 20, 2011 at 8:36 am
Hey Guido. Much thanks for the guide. I couldn’t get &rsquot; [right single quote] to work, so I used ' [apostrophe] instead. Is that a good idea?
Comment by meryl on June 20, 2011 at 8:56 am
Nevermind…I figured it out.
Comment by Guido on June 20, 2011 at 1:59 pm
It is &rsquo; – was that your mistake?
Comment by Guy Anthony De Marco on July 17, 2011 at 1:02 pm
Excellent series, Guido. I’ve been pushing these links on the social media sites I frequent.
Comment by Elisabeth on July 18, 2011 at 12:59 pm
I think I’m going to re-read this series once a week until I’m ready to put my ebook together.
One question: Forgive me if I’ve missed something, but why do we need to do the find-and-replace in Word first and then again in the HTML editor? Does pasting it into the editor wipe out all those changes made in Word?
When I type in Word it automatically converts double dashes into em dashes and three periods into ellipses once you add a space after the word that follows them. I have to manipulate my quotes a little, though, because if you have a character break off in the middle of speaking with an em dash, Word uses an opening quote instead of a closing quote – wrong direction. My solution is to type a random letter between the em dash and the quote, then go back and take it out. Same thing with using single quotes to denote part of a word left out, such as ’tis or ‘em – Word puts it in the wrong direction and I have to change it.
Comment by Elisabeth on July 18, 2011 at 1:01 pm
That is to say, puts it in the wrong direction like this blogging software just did on my second example.
Pingback by Ebook Formatting – Part One – Overview « Donnie Light – Writing Darkness on July 19, 2011 at 8:32 am
[...] this is covered in Part VI of Guido’s formatting [...]
Comment by Amber on August 21, 2011 at 12:35 pm
Hi Guido,
I’m a bit lost about how to do a mass search and replace of italicized and/or bolded words in a word processor. Is this what you suggested or am I lost about that too?
Specifically which word processors can do that? I use Atlantis and Word mostly.
Thanks.
Comment by Guido on August 21, 2011 at 1:32 pm
I am not familiar with Atlantis but Word can most definitely do it.
Comment by Amber on August 25, 2011 at 7:48 am
Thanks,
But how do I do it in Word, and which version please. I know how to search for specific words or phrases. Can I search and replace for ALL italicized words in a single go?
I’m not sure if I’m getting this.
Comment by Alisa on September 18, 2011 at 1:25 pm
Hi Guido
I’m confused with the &hellip, &mdash, etc. How does this look in the copy? I can’t get the jEdit search and replace to see the quotes, …, etc. I did have the “replace with smart quotes” checked in Word. Was I supposed to?
It’s such a confusing subject for first-timers, but I’m just step-by-stepping it with you. Thanks!
Comment by Alisa on September 18, 2011 at 1:46 pm
Ok, this is getting embarrassing, but I don’t know how to do the following. Can you really dumb it down for me, because when I copy it in the text and try to paste it in the “find” box in jEdit, nothing happens. Also, I still don’t get how to open my .html file in my web browser. Sorry to be so clueless!
Alisa
“The best way is to highlight one of the curly quotes in your text and copying it to the clipboard. Then paste it from the clipboard straight into the search box. That way you’re making sure you have exactly the same code there as in the text.”
Comment by Guido on September 18, 2011 at 1:57 pm
The named entities work like this. Let’s assume you have an ellipsis in your text and it looks like this.
this is a test…
In your HTML source you would have to replace the ellipsis with a named entity so it reads like this
this is a test&hellip;
When this HTML file is being displayed in a browser or eBook reader it will then correctly display the &hellip; named entity as the “…” symbol.
Comment by Guido on September 18, 2011 at 2:01 pm
To open your HTML file in a web browser, all you really have to do is to double-click it. It will then automatically be displayed in your default web browser, usually.
Why your search doesn’t work, I am not entirely sure. But if you want me to I can take a look real quick. Send me your HTML file by email and I’ll take a quick look.
Comment by Alisa on September 18, 2011 at 7:01 pm
Oh, oh, I got the ellipses, etc. fixed! I had to highlight it in the copy in jEdit and then open the find box, and it would be in the find box.
Where would I be when I double-click on the html file? That’s my confusion. Am I in Word? On the internet? Sorry to be so dumb!
Comment by Guido on September 18, 2011 at 8:02 pm
You are in no application when you double-click the file. Locate the HTML file with File Explorer when you’re in Windows or with the Finder on a Mac. Once you have found the file, simply double-click it.
Comment by Alisa on September 18, 2011 at 8:45 pm
Okay, got it, thanks so much. You are so patient, I’m hoping you can help again — my quotes, etc., aren’t coming up (the codes are showing in the nook file). What am I doing wrong? Here’s what I have at the beginning:
html, body, div, h1, h2, h3, h4, h5, h6, ul, ol, dl, li, dt, dd, p, pre, table, th, td, tr { margin: 0; padding: 0.1em; }
p
{
text-indent: 1.5em;
margin-bottom: 0.2em;
}
p.chapter
{
text-indent: 1.5em;
font-weight: bold;
font-size: 2em;
page-break-before: always;
margin-top:5em;
margin-bottom:2em;
}
p.centered
{
text-indent: 0em;
text-align: center;
}
span.centered
{
text-indent: 0em;
text-align: center;
}
CHAPTER 1
blah blah blah &mdash I
Comment by Alisa on September 18, 2011 at 8:52 pm
Oh, Lord, I forgot the codes would change here. In front of “Chapter 1″ it says
(open carrot)p class=”chapter”(close carrot)(open carrot)p class=”centered”(close carrot)(open carrot)span class=”centered”(close carrot)CHAPTER 1(open carrot)/span(close carrot)(open carrot)/p(close carrot)
(open carrot)p(close carrot)
I swear I’m not normally this dumb.
Comment by Guido on September 18, 2011 at 9:47 pm
You are missing a semicolon after the mdash entity. You need to write &mdash; for it to work. All entities start with the & sign and end with a semicolon.
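A missing semicolon like this is easy to overlook in a long file, so it can help to let a script look for it. The Python sketch below is only an illustration; the function name find_unterminated_entities is invented for this example, and the pattern will also flag ordinary text such as company names that contain an ampersand directly followed by a letter, so treat its output as a list of places to inspect.

```python
import re

# An ampersand followed by a name that is not closed by a semicolon.
# The lookahead also rejects letters and digits so that a correctly
# terminated entity cannot be matched by a shortened version of its name.
UNTERMINATED = re.compile(r"&[A-Za-z][A-Za-z0-9]*(?![A-Za-z0-9;])")

def find_unterminated_entities(html: str) -> list[str]:
    """Return every entity-like token that lacks its closing semicolon."""
    return UNTERMINATED.findall(html)

print(find_unterminated_entities("blah blah blah &mdash I said &hellip; ok"))
```

A lone ampersand surrounded by spaces is not reported, since the pattern requires a letter right after the & sign.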
Comment by Alisa on September 19, 2011 at 5:41 am
You are officially my favorite person, Guido! Thanks so much for all of this!
Comment by Dan on October 1, 2011 at 12:35 pm
Guido,
Having trouble with this seemingly simple step in your penultimate paragraph: “Make sure you save this file somewhere, using an .html file extension.” I went from Word to jEdit, supplanting the former’s formatting with HTML tags successfully, but can’t seem to find a “Save As…” or “Export…” function which allows me to give the final product an .html extension. What am I overlooking?
And, in any event, thank you for this practical and generous resource.
Dan
Comment by Guido on October 1, 2011 at 12:49 pm
All you have to do is select “Save as” – which I am sure jEdit has – and then type in a file name that ends with .html, such as test.html
Comment by Dan on October 2, 2011 at 6:53 pm
I figured it was something absurdly simple.
Thanks immensely,
Dan
Comment by Guido on October 2, 2011 at 6:57 pm
Hey, we all can’t see the forest for all the trees sometimes.
Comment by Michael on October 11, 2011 at 11:59 am
The second — and easier way — is to automate the process. TextMate, the programming editor I am using, has a function called “Convert Selection to Entities excluding Tags”
This sounds great but how do I access it? I am using Textmate and cannot find the function.
Comment by Guido on October 11, 2011 at 12:06 pm
From the menu bar, go to “Bundles->HTML->Entities”
Comment by Mark on October 26, 2011 at 4:57 am
Finally you get to the subject of your series! You are the Tristram Shandy of Kindle formatting writers.
Comment by Mark on October 26, 2011 at 5:00 am
BBEdit also has a lot of pre-built regex functions that help in doing these tasks (under the Text and Markup menus), as well as standard grep/regex, and you can sequence these up in “text factories” that you can use over and over on different documents.
Comment by Mark on October 26, 2011 at 5:03 am
Admittedly rare, but cases like ’tis (a contraction of it is) will not be caught by most smart quotes filters.
You can catch it by a white space plus dumb apostrophe search and just manually inspect each instance, or build up your own list of such exceptions and build them into your regex sequence. Use
\s' or [:space:]'
… depending on your regex engine.
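As a rough illustration of that search, here is a Python sketch that lists every word preceded by whitespace and an apostrophe, so each hit can be inspected by hand. The function name leading_apostrophe_words is hypothetical, and the pattern also accepts a left single quote, on the assumption that a smart quotes filter has already curled the apostrophe the wrong way.

```python
import re

# Whitespace (or the start of the text), then a straight apostrophe or a
# left single quote, then the word that follows it.
LEADING_APOSTROPHE = re.compile(r"(?:^|\s)['\u2018](\w+)")

def leading_apostrophe_words(text: str) -> list[str]:
    """List words that begin with an apostrophe, for manual inspection."""
    return [m.group(1) for m in LEADING_APOSTROPHE.finditer(text)]

print(leading_apostrophe_words("Oh, 'tis late. Get 'em! It's fine."))
```

Genuine opening single quotes are reported as well, which is why the hits need to be checked one by one rather than replaced blindly.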
Comment by Mark on October 26, 2011 at 5:06 am
Personally I prefer to use search and replace to put tags on titles and links and the like (e.g., for headings and chapter starts) before doing entity encoding, because in rare circumstances an entity-converted character can screw up a search for whatever you’re using as the hook for your searches. The BBEdit HTML-to-entity filter has an option to ignore angle brackets, so it can be applied post HTML tagging.
Comment by Steve on December 1, 2011 at 7:51 am
Hi Guido,
This is such a wonderful de-mystifying blog. Thank you.
Can you say a bit more about fonts, and how hard it might be to use a different font for the main text of an ebook? Calibri is soooo…well, plain. I get how an image could be used for chapter heading fonts, etc., but I’m wondering if you can explain what’s up with the limitations on the main body of the text, and what can be done about it, if anything.
Thanks!
Steve
Comment by Guido on December 1, 2011 at 8:03 am
Font options are virtually non-existent at this time. This will change with future readers, perhaps, where you will have the possibility to actually include font data in your eBooks, but for now you are limited to the three very basics of HTML.
Using the font-family setting in your styles, you can select either a “serif”, “sans-serif” or “monospace” typeface. That is all. Using any particular font names is prone to causing problems because not all readers have the same font implementations. Therefore it is best to stick to the generic types that I mentioned.
Comment by Josh Irving on December 3, 2011 at 12:04 pm
Hello Guido,
Not sure if I am being dumb here but when I view the document in a browser (Firefox and Safari) I am getting apostrophes showing up like this: they’re
Quotes also show up in the browser like this: â€
They look fine in TextMate, but when viewed as an HTML file they mess up. Should I be concerned?
Thanks,
Comment by Josh Irving on December 3, 2011 at 2:04 pm
UPDATE:
Guido,
I was getting ahead of myself.
Upon finishing your instructions re Calibre all is well and looking good.
Apologies for the premature question and thanks for the great article.
All best,
Comment by Guido on December 3, 2011 at 2:37 pm
Still, make sure you are properly converting the special characters to named entities. Otherwise, even though it may look good on your end after Calibre went over it, it may look just as broken as before on certain devices.
Named entities are a MUST for solid eBooks.
Comment by Lassal on December 8, 2011 at 1:37 pm
Hi Guido,
thanks for this guide.
I am preparing a file for the first time ever, and it is all easy to follow and goes smoothly. (So far. Knocking on wood)
Only thing is that I am actually writing in German and need the named entities for the lower and upper curly quotes.
I googled for it but I do not trust myself choosing correctly …
Thanks for your help,
Comment by Guido on December 8, 2011 at 1:40 pm
The Umlauts are named as such
&auml; &ouml; &uuml; &Auml; &Ouml; and &Uuml;
For more info, here is a reference overview – http://www.w3schools.com/tags/ref_entities.asp
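As for the German quotation marks themselves, the lower opening and upper closing quotes are ordinary Unicode characters, so their entity names can be looked up rather than guessed. A small sketch, assuming Python's standard `html.entities` table; the helper name `entity_for` is invented for this example:

```python
from html.entities import codepoint2name

# German-style quotes: lower/upper double, then lower/upper single.
GERMAN_QUOTES = "\u201e\u201c\u201a\u2018"


def entity_for(ch: str) -> str:
    """Return the named HTML entity for a single character.

    Hypothetical helper; raises KeyError if no named entity exists.
    """
    return f"&{codepoint2name[ord(ch)]};"


if __name__ == "__main__":
    for ch in GERMAN_QUOTES:
        print(ch, entity_for(ch))
```

The lookup yields `&bdquo;` and `&ldquo;` for the double quotes and `&sbquo;` and `&lsquo;` for the single ones.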
Comment by Lassal on December 8, 2011 at 1:55 pm
Thanks Guido
Comment by Lassal on December 8, 2011 at 1:58 pm
And you say TextMate would do this all on one go?
Next stop: http://macromates.com
Comment by Guido on December 8, 2011 at 2:18 pm
Yes, TextMate does take care of those conversions with one button press.
Comment by Sean McGuire on December 14, 2011 at 11:11 am
Has anyone tried using Notepad++ for the HTML tweaking? This is the software I’m working with, as jEdit stoutly refuses to be installed onto my laptop, and I can’t convince the Replace function to put the <p> and </p> where they belong. I used the search and replace codes that Mr. Henkel suggested. They didn’t work.
Has anyone had this problem?
Comment by Guido on December 14, 2011 at 11:53 am
Yes, Notepad++ works fine and can do regular expression search and replace, I believe. The key is to turn on regular expression mode in the search dialog box. Maybe this page will help a little.
http://markantoniou.blogspot.com/2008/06/notepad-how-to-use-regular-expressions.html
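If the editor's regular expressions refuse to cooperate, the paragraph wrapping can also be done outside the editor. The following is a rough sketch, not the exact search and replace codes from the article: the pattern and the function name `wrap_paragraphs` are my own, and it assumes one paragraph per line with plain `\n` line endings.

```python
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap every non-empty line in <p>...</p>.

    Assumes each line holds exactly one paragraph and that lines
    end in "\n" only. Empty lines are left untouched.
    """
    # ^ and $ match at each line boundary because of re.MULTILINE;
    # ".+" requires at least one character, so blank lines are skipped.
    return re.sub(r"^(.+)$", r"<p>\1</p>", text, flags=re.MULTILINE)


if __name__ == "__main__":
    print(wrap_paragraphs("First paragraph.\n\nSecond paragraph."))
```

The same idea should carry over to an editor's replace dialog with regular expressions enabled, although the exact syntax for the back-reference can differ between editors and versions.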
Comment by Sean McGuire on December 14, 2011 at 10:45 pm
Thanks for the link! And thanks for this series, by the way.
Comment by JJ on December 16, 2011 at 7:46 am
Brilliant blog, Guido!
Do you know if “Convert Selection to Entities excluding Tags” is available on E Text? For those of us without a Mac it would be enormously useful.
JJ
Comment by Guido on December 16, 2011 at 7:54 am
I’m sorry I do not know. I don’t even know what E Text is, to be honest.
Comment by Tarin on December 29, 2011 at 7:51 am
Hey Guido, I think I am messing up here. I tried the find/replace thing but I still get things like these: “Copyright © ” “How I’ll Know” and ” “I Love You— (and no, I am not saying I love you, it’s cut and paste from what I am seeing).
Basically the single quotes, double quote marks and the like are vanishing when I check the txt file in my browser. So is this me being thick or something? Please help me with this. I am using a Mac and the Pages word processor, copy-pasting from Pages into TextMate, if that helps at all, and I did follow the steps of replacing the things, but it doesn’t seem to work. So a step-by-step dummies version, perhaps? Yours with much appreciation.
Rin
P.s. this blog rocks as do you sir!
Comment by Guido on December 31, 2011 at 11:39 am
It would appear you are not converting those special characters to named entities, as described in part six of my tutorial – http://guidohenkel.com/2011/01/take-pride-in-your-ebook-formatting-part-vi/
Comment by Tarin on January 2, 2012 at 8:47 am
I must be doing it wrong then because I did what you said, the find and replace thing, several times over.
each new attempt seems to bring the same result. I must be doing something wrong. Thank you for quick response am going to keep trying.
Comment by Guido on January 2, 2012 at 9:39 am
Tarin, feel free to email me the HTML file real quick and I’ll take a look.
Comment by Tarin on January 3, 2012 at 11:45 am
Hey Guido, it’s ok. When I put it into Calibre everything was normal again, so I guess I must have done something right afterwards when I redid it all over from scratch. But now I am having issues centering stuff, and I am determined to figure out what I am messing up now. I must learn.
Thank you Guido you totally rock!
Comment by Pam B-C on January 9, 2012 at 11:05 am
Thank you so much for this series. I am always trying to improve the look of my eBooks and those of my friends. You write in such a clear and concise way that I feel sure I’ll be able to apply what I’ve learned. (If not I’ll be back to cry on your shoulder.)
Comment by Matt on January 27, 2012 at 2:00 pm
Guido, thanks for the guide. On copying the text from MS Word to JEdit, I found all the text appeared on one line. It seemed like your post above indicated it would be one line per paragraph, not one line for the entire manuscript.
Did I misunderstand?
Everything on one line makes it a little harder to work with.
Comment by Guido on January 27, 2012 at 2:41 pm
Yes, something is wrong then. It would appear as if Word has inserted only soft line breaks. You should search and replace all of those with hard line breaks. Once you’ve done that, the text should show up as one line per paragraph in the editor.
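For anyone who would rather fix the breaks after pasting, here is a minimal sketch of the replacement in Python. Which characters Word actually emitted is an assumption on my part: the pattern covers a bare carriage return, the vertical tab that Word commonly uses for manual line breaks, and the Unicode line separator. The name `normalize_breaks` is made up for this example.

```python
import re

# Assumed candidates for "soft" or foreign line breaks:
# CR+LF, bare CR, vertical tab, Unicode line separator.
SOFT_BREAKS = re.compile(r"\r\n|\r|\x0b|\u2028")


def normalize_breaks(text: str) -> str:
    """Turn every recognised break into a plain newline."""
    # "\r\n" is listed first so a Windows line ending becomes
    # one newline rather than two.
    return SOFT_BREAKS.sub("\n", text)


if __name__ == "__main__":
    sample = "Paragraph one.\x0bParagraph two.\rParagraph three."
    print(normalize_breaks(sample))
```

If the text still lands on a single line afterwards, the cause is more likely an editor setting than the manuscript itself.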
Comment by Matt on January 27, 2012 at 3:08 pm
Thanks, Guido. Can you explain how to do that in Word? (I’m using Word 2008 for Mac, if it matters.)
Comment by Matt on January 27, 2012 at 5:33 pm
Actually, after googling it, I tried the replacement, and that doesn’t seem to be the issue. It’s all paragraph breaks, already. Does JEdit require any kind of configuration?
Comment by Matt on January 27, 2012 at 5:48 pm
I broke down and got TextMate. It’s working properly in that, so I assume it’s something that I needed to configure in JEdit, not in MS Word.
Comment by Guido on January 27, 2012 at 6:14 pm
Glad you got it to work. I don’t use JEdit, so I’m not really familiar with its behavior or settings.
Comment by Timothy Clayton on January 27, 2012 at 11:03 pm
Thank you for the guide…extremely clear and time saving. I wanted to post here in case there are still questions about how to do “Convert Selection to Entities” for Notepad++ users. There is a plugin for Notepad++ called HTML Tag. It is available through the Plugins menu in Notepad++: select Plugins > Plugin Manager, select HTML Tag from the list and install. Once you have it you can “Select All” of your text, then choose Plugins > HTML Tag > Encode HTML Entities. This will change your special characters to the HTML codes Guido discusses (for example, “…” to “&hellip;”).
It is best to do this before you add the paragraph tags, because doing so afterwards will change the characters “<” and “>” into their corresponding HTML codes, like this: “&lt;p&gt;” and “&lt;/p&gt;”. But if you have already added the paragraph tags when you run “Encode HTML Entities,” don’t worry. Just do “Find and replace” with find=&lt;p&gt; and replace=<p>. Then find=&lt;/p&gt; and replace=</p>. Then your paragraph tags are back!
Hope this helps anyone still seeking this solution.
Cheers
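The tag-restoring step can be scripted too. This is a small sketch under the assumption that only paragraph tags were escaped; the function name `restore_paragraph_tags` is invented for illustration and has nothing to do with the HTML Tag plugin itself.

```python
def restore_paragraph_tags(html_text: str) -> str:
    """Undo the escaping of <p> and </p> after entity encoding.

    Only the two paragraph tags are restored; every other entity,
    such as &hellip; or &rsquo;, is left exactly as it is.
    """
    return (
        html_text
        .replace("&lt;p&gt;", "<p>")
        .replace("&lt;/p&gt;", "</p>")
    )


if __name__ == "__main__":
    print(restore_paragraph_tags("&lt;p&gt;Wait&hellip;&lt;/p&gt;"))
```

Other tags that were already in the file, such as emphasis or headings, would need their own replacements in the same style.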
Comment by Joanne on January 31, 2012 at 1:49 pm
Hi Guido,
This is a fantastic guide! I’m almost done formatting my ebook and think I’ve gotten most of it. Using TextWrangler.
Here’s my question: do I need to use an html entity for a hyphenated word? Also, do you happen to know with words like ’tis or ‘cuz, which way that single quote should face?
Thanks so much!
Joanne
Comment by Guido on January 31, 2012 at 2:10 pm
Joanne, you will not need special entities for standard hyphens, so don’t worry about hyphenated words.
As for the single quotes in words like ’cuz, it is my understanding that an omission at the front of a word is marked with an apostrophe, which is the same character as the right single-quote (&rsquo;), so it should curl the same way as in ’tis. The left single-quote is only used to open a quotation.
However, it is a rule that very few people actually adhere to these days, and most word processors are not smart enough to follow it: their smart-quote logic inserts a left single-quote at the start of a word, so these cases need to be fixed by hand.
Comment by Joanne on January 31, 2012 at 2:32 pm
Thanks for responding so promptly – and, again, for your excellent guide!
Comment by Mark on February 15, 2012 at 8:21 pm
Hi Guido,
I am reading your formatting document and using TextMate. Where is the following function: “Convert Selection to Entities excluding Tags”? I can’t find it.
Thanks
Comment by Guido on February 15, 2012 at 10:28 pm
In the “Bundles->HTML->Entities” menu.
Alternatively simply type “entities” in the Help box and you will see it listed there.
Comment by Matt on February 17, 2012 at 6:16 pm
I am wondering if I need to use the entity codes for some of the usual ASCII symbols like @ and \
All I can find for them is the numeric codes which you have suggested against.
Also, I’m not sure if this is the best section to ask this question in, but how do you go about hyperlinking text in HTML?
Thanks for the amazing guide!
Comment by Guido on February 17, 2012 at 10:02 pm
No, symbols like @ and \ do not need to be encoded as entities. They are part of the standard ASCII code set.
Linking is done the same way as in a webpage. Use <a href="http://URL_you_want_to_link_to">This is a link</a> to link to webpages, or use anchors via <a id="ThisIsAnAnchor"></a> inside your text to link to.
Comment by J.M. Porup on February 28, 2012 at 7:25 pm
Guido,
Thanks for the great guide. Question for you. I’m publishing a lengthy text in Spanish. When I first published this online, to avoid having to use HTML escape codes for every single word, I simply declared the text type as UTF-8 instead of ASCII. How well do e-readers support UTF-8 and Unicode? What’s your experience?
Comment by Guido on February 28, 2012 at 9:26 pm
UTF-8 is fully supported on eBook readers. In fact, it is the only encoding you should use for eBooks. However, that does not solve your problem that all special characters need to be converted into named entities, because UTF-8 on one device is not the same as UTF-8 on another device. Only named entities will circumvent those incompatibilities.
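One concrete way such trouble can arise (my own illustration, not necessarily the only cause) is software decoding UTF-8 bytes with the wrong character set. The sketch below, with the made-up helper `decode_both_ways`, decodes the same bytes once as UTF-8 and once as Windows-1252. A raw curly apostrophe is garbled by the wrong guess, while text that uses a named entity is plain ASCII and comes out identical either way.

```python
def decode_both_ways(text: str) -> tuple[str, str]:
    """Encode text as UTF-8, then decode it correctly and incorrectly.

    Returns (decoded as UTF-8, decoded as Windows-1252). The second
    value imitates software that guesses the wrong character set.
    """
    raw = text.encode("utf-8")
    return raw.decode("utf-8"), raw.decode("cp1252")


if __name__ == "__main__":
    # A raw curly apostrophe becomes three bytes in UTF-8 ...
    print(decode_both_ways("they\u2019re"))
    # ... while the entity version consists of ASCII bytes only.
    print(decode_both_ways("they&rsquo;re"))
```

The first call shows the familiar three-character garbage in place of the apostrophe; the second returns the same string twice, which is the practical argument for entities.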
Comment by Melanie on April 12, 2012 at 11:36 am
Guido – thank you so much for all your patient work here, it is a brilliant resource for html/e-book newcomers. I have a very quick question as TextWrangler is threatening to wrangle me in for good or I could just be overthinking, but is the entity for right-hand single curly quotes, &rsquo;, also used to replace the ' in words such as don’t, wasn’t, didn’t, hasn’t? Or are single curly quotes only utilised in phrases such as – known as ‘the king of kids’.
I was travelling along OK until the find/replace asked me this and, well, I am sure it is a simple answer… Thank you, Mel.
Comment by Guido on April 15, 2012 at 11:42 am
Yes, that is correct. The &rsquo; entity is also used for apostrophes.