Example using the default setup

Parsing HTML using the default setup is as easy as creating a PDF in five steps:

  1. Create a Document object
  2. Get a PdfWriter instance
  3. Open the Document
  4. Invoke XMLWorkerHelper.getInstance().parseXHtml()
  5. Close the Document

Let's take a look at a code snippet that converts the walden.html file to PDF. In this snippet we use the XMLWorkerHelper class and its parseXHtml() method to do all the work:

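    // Step 1: create a Document object
    Document document = new Document();
    // Step 2: get a PdfWriter instance that writes to the target file
    PdfWriter writer = PdfWriter.getInstance(document,
        new FileOutputStream("results/walden1.pdf"));
    // Step 3: open the Document
    document.open();
    // Step 4: parse the XHTML resource and add its content to the Document
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
        HTMLParsingDefault.class.getResourceAsStream("/html/walden.html"), null);
    // Step 5: close the Document
    document.close();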

See HTMLParsingDefault and the resulting PDF, walden1.pdf.
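The snippet above takes two things for granted: that the results directory already exists, and that /html/walden.html can be found on the classpath. If either assumption fails, FileOutputStream throws a FileNotFoundException, or getResourceAsStream() returns null and the failure only surfaces later, inside the parser. The sketch below is one possible way to guard against both cases. SnippetIo and its two methods are hypothetical helper names introduced here for illustration; they are not part of iText or XML Worker, and the sketch uses only the standard Java library.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration only; not part of iText or XML Worker.
final class SnippetIo {

    private SnippetIo() {
        // static helpers only
    }

    // Opens the output file, creating missing parent directories first.
    // Without this, FileOutputStream fails when "results/" does not exist.
    static OutputStream openOutput(String path) throws IOException {
        File file = new File(path).getAbsoluteFile();
        File parent = file.getParentFile();
        if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create directory " + parent);
        }
        return new FileOutputStream(file);
    }

    // Loads a classpath resource and fails early with a clear message
    // instead of handing a null stream to the parser.
    static InputStream openResource(Class<?> anchor, String name)
            throws FileNotFoundException {
        InputStream in = anchor.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException(
                "Resource not found on classpath: " + name);
        }
        return in;
    }
}
```

With such a helper, the snippet could pass SnippetIo.openOutput("results/walden1.pdf") to PdfWriter.getInstance() and SnippetIo.openResource(HTMLParsingDefault.class, "/html/walden.html") to parseXHtml(); the five steps themselves stay the same.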

The HTML was taken from Project Gutenberg. It's a book by H.D. Thoreau: Walden, or Life in the Woods.

When we look at the first page that is generated by iText, we see that something went wrong: the first lines of the HTML are rendered as a line of gibberish. What went wrong and how can we fix it?