html2tex Version 2.5 (beta)

This page describes versions 2.1 to 2.5 of html2tex, a program which converts a collection of related HTML files into a single LaTeX file. (The newest version is version 2.6.) Such a LaTeX file can be processed into a PostScript file. To generate a single LaTeX file from a number of HTML files, the user needs to give a skeleton LaTeX file and indicate where translated versions of the HTML files should be included. The user also has to specify for each HTML file the level (chapter, section, subsection, ...) at which it should be included. Links between the different HTML files are mapped to references in the LaTeX.

The generation of LaTeX is configurable. The mapping of each HTML tag to LaTeX commands can be specified. (This mapping can even be changed dynamically during the processing of the HTML file.) It is also possible to exclude certain parts of the HTML files from the generated LaTeX file, or to include LaTeX parts in HTML comment lines, which are ignored by HTML viewers. This makes it possible to maintain sources for both HTML and LaTeX in the same HTML files.

The program performs some checking of the HTML files in order to be able to generate correct LaTeX output, but this checking does not conform to any HTML standard. In some places the checking might be more relaxed, while in other places it is more restrictive than HTML 2.0. So far, there is not much support for extensions beyond HTML 2.0.

The program does extensive checking of links between the different files. For this reason it can also be used as a link checking program, by giving it a single HTML file and specifying that it should scan all referenced pages in the local directory (and its sub-directories).
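To make the idea of link checking concrete, here is a minimal sketch in Python. It is not how html2tex itself works (that is a C program whose internals are not described here). The names HrefCollector and check_local_links are invented for this illustration, and the rule that only files below the start directory are scanned is my reading of the text above.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlparse


class HrefCollector(HTMLParser):
    """Collect the HREF value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def check_local_links(start_file):
    """Follow local links from start_file; return (scanned files, broken links)."""
    root = os.path.dirname(os.path.abspath(start_file))
    pending = [os.path.abspath(start_file)]
    seen = set()
    broken = []
    while pending:
        path = pending.pop()
        if path in seen:
            continue
        seen.add(path)
        collector = HrefCollector()
        with open(path, encoding="latin-1") as handle:
            collector.feed(handle.read())
        for href in collector.hrefs:
            target, _fragment = urldefrag(href)
            parsed = urlparse(target)
            if parsed.scheme or parsed.netloc or not target:
                continue  # remote URL, mailto:, or a same-page anchor
            resolved = os.path.normpath(
                os.path.join(os.path.dirname(path), target))
            if os.path.commonpath([root, resolved]) != root:
                continue  # outside the start directory (assumed not scanned)
            if not os.path.isfile(resolved):
                broken.append((path, href))
            elif resolved.lower().endswith((".html", ".htm")):
                pending.append(resolved)
    return seen, broken
```

Calling check_local_links("index.html") would return the set of files visited and a list of (file, href) pairs whose target does not exist.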

Links to excluded HTML files (and other URLs) can be reported either as footnotes or as a sorted bibliography in the LaTeX file.

Error messages are reported on the standard output. The program can also generate an extensive cross-references file mentioning all the anchor tags.

Functionality

The HTML to LaTeX conversion program is implemented by the C program html2tex.c, which needs to be compiled first. (The program was developed with the popular gcc compiler, which is freely available under the GNU General Public License.)

The program takes a single file as input. This should be a skeleton LaTeX file without any extension (or, if the program is only used for link checking, an HTML file with the extension .html). It will generate a LaTeX file with the same name as the input file, but with the extension .tex.
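The naming rule just described can be summarised in a few lines of Python. This is only a sketch of the rule as stated. The function name plan_run and the mode labels are made up, and rejecting other extensions is an assumption, since the text does not say what the program does with them.

```python
import os


def plan_run(input_name):
    """Return (mode, output_name) for a given input file name."""
    _base, extension = os.path.splitext(input_name)
    if extension == ".html":
        # Link checking only: no LaTeX file is written.
        return ("check-only", None)
    if extension == "":
        # Skeleton LaTeX file: output has the same name plus ".tex".
        return ("convert", input_name + ".tex")
    # Assumption: any other extension is treated as an error here.
    raise ValueError("expected no extension or .html: " + input_name)
```

For example, plan_run("manual") gives ("convert", "manual.tex"), while plan_run("index.html") gives ("check-only", None).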

The input file

The input file should contain valid LaTeX commands. All lines in the file starting with %html will be interpreted as special lines by the conversion program. These are used to indicate which HTML files should be included, and to set the various options. The following special commands are recognized by html2tex:

Special commands in the HTML files

The following special commands (inside HTML comments) are recognized in the HTML files:

The program recognizes comments inside a pair of double dashes (--), in any of the HTML tags including <! >. It also recognizes any text in a <! > tag not surrounded by double dashes as comment, but generates a warning message for it.
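As a rough illustration of this comment rule, the following Python sketch splits the text found between "<!" and ">" into comment pieces. It is a guess at the behaviour based only on the description above, not a transcription of html2tex.c. The function name comment_parts is hypothetical, and treating an unterminated "--" as text outside dashes is my assumption.

```python
def comment_parts(tag_body):
    """Split the text between '<!' and '>' into comments.

    Returns (comments, warn). Text between a pair of '--' is a comment.
    Any other non-blank text is also kept as a comment, but sets warn.
    """
    pieces = tag_body.split("--")
    comments = []
    warn = False
    for index, piece in enumerate(pieces):
        # Odd-numbered pieces sit between an opening and a closing '--',
        # except for a last piece that was never closed.
        inside_dashes = index % 2 == 1 and index < len(pieces) - 1
        if inside_dashes:
            comments.append(piece)
        elif piece.strip():
            comments.append(piece)
            warn = True
    return comments, warn
```

With this sketch, the body of "<!-- note -->" yields one comment and no warning, while the body of "<! note >" yields the same comment text with the warning flag set.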

Defining mappings

As mentioned above, the various mappings of HTML tags to LaTeX can be changed both in the input file (as a line of the form %html -d tag-name options "LaTeX-open" "LaTeX-close") and inside comments in the HTML files (in the form latex-def tag-name options "LaTeX-open" "LaTeX-close").

Both forms change the mapping of the tag-name HTML tag to the given LaTeX formatting commands. The strings LaTeX-open and LaTeX-close are put around the text that is marked by the HTML tag. (The string in LaTeX-close is generated at the proper place, in case the closing tag is not obligatory in the HTML syntax.) If the LaTeX command has to include a double quote, use two double quotes in the string. If a real newline (the `\n' character) has to be included, use `\nl' instead. (There is no LaTeX command starting with this sequence, but there are many starting with `\n'.)
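The quoting rules can be tried out with a small Python sketch that reads one %html -d line and decodes its two strings. This is an illustration written from the description above, not the parser used by html2tex. The names decode and parse_definition are invented, and I assume the options, when present, form a single word starting with a dash.

```python
import re

# A quoted string in which a double quote is written as two double quotes.
_QUOTED = r'"((?:[^"]|"")*)"'
_DEF_LINE = re.compile(
    r'%html\s+-d\s+(\S+)\s+(?:(-\S+)\s+)?' + _QUOTED + r'\s+' + _QUOTED + r'\s*'
)


def decode(text):
    """Turn "" into a double quote and \\nl into a real newline."""
    return text.replace('""', '"').replace("\\nl", "\n")


def parse_definition(line):
    """Return (tag, options, latex_open, latex_close) for a %html -d line."""
    match = _DEF_LINE.fullmatch(line)
    if match is None:
        raise ValueError("not a %html -d line: " + line)
    tag, options, opening, closing = match.groups()
    return tag, options or "", decode(opening), decode(closing)
```

Applied to the default line for ul, this gives the tag "ul", the options "-igh", and the two LaTeX strings with each \nl turned into a newline.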

The options are used for some special kinds of translation. The following options are possible:

Because HTML files can be included at different levels, the heading tags (H1 to H6) do not refer to the heading tags as they occur in the HTML files, but to their translated equivalents. For this reason, we have added an additional H7 tag for an additional nested level. In case heading tags are used inside other tags as a means of formatting, they are internally translated to the F1 to F7 tags.

The default settings (for Version 2.5, slightly different from Version 2.2) are the ones given below, using the format to be used in the input file:

%html -d html    ""  ""
%html -d head    ""  ""
%html -d title   ""  ""
%html -d body    -on ""  ""
%html -d address ""  ""
%html -d h1      "\nl\nl\chapter{"  "}\nl\nl"
%html -d h2      "\nl\nl\section{"  "}\nl\nl"
%html -d h3      "\nl\nl\subsection{"  "}\nl\nl"
%html -d h4      "\nl\nl\subsubsection{"  "}\nl\nl"
%html -d h5      "\nl\nl\paragraph{"  "}\nl"
%html -d h6      "\nl\nl\subparagraph{"  "}\nl"
%html -d h7      ""  ""
%html -d f1      "{\LARGE \bf "  "}"
%html -d f2      "{\Large \bf "  "}"
%html -d f3      "{\large \bf "  "}"
%html -d f4      "{\bf "  "}"
%html -d f5      "{\small \bf "  "}"
%html -d f6      "{\footnotesize \bf "  "}"
%html -d p       "\nl\nl"  ""
%html -d ul      -igh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d menu    -igh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d dir     -gnh "\nl\begin{itemize}"  "\nl\end{itemize}\nl"
%html -d ol      -igh "\nl\begin{enumerate}"  "\nl\end{enumerate}\nl"
%html -d li      "\nl\item "  ""
%html -d lh      "\nl\item "  ""
%html -d dl      -igh "\nl\begin{description}"  "\nl\end{description}\nl"
%html -d dt      "\nl\item["  "]"
%html -d dd      ""  ""
%html -d a       ""  ""
%html -d q       "``"  "''"
%html -d i       -iim "{\em "  "}"
%html -d em      "{\em "  "}"
%html -d b       "{\bf "  "}"
%html -d strong  "{\bf "  "}"
%html -d tt      "{\tt "  "}"
%html -d samp    "{\tt "  "}"
%html -d kbd     "{\tt "  "}"
%html -d var     "{\sl "  "}"
%html -d dfn     "{\sc "  "}"
%html -d code    -math "$"  "$"
%html -d blink   ""  ""
%html -d cite    "\begin{quote} "  "\end{quote}\nl"
%html -d blockquote  -igh "\begin{quotation} "  "\end{quotation}"
%html -d bq      -igh "\begin{quotation} "  "\end{quotation}"
%html -d u       "\underbar{"  "}"

%html -d pre     -verb "\begin{verbatim} "  "\end{verbatim}\nl"
%html -d xmp     -verb "\begin{verbatim} "  "\end{verbatim}\nl"
%html -d listing -verb "\begin{verbatim} "  "\end{verbatim}\nl"

%html -d br      -br "\newline\nl"  ""
%html -d hr      "\vspace{1mm}\hrule "  ""
%html -d img     ""  ""
%html -d isindex ""  ""
%html -d select  ""  ""
%html -d link    ""  ""
%html -d center  "{\centering "  "}"
%html -d meta    ""  ""
%html -d table   ""  ""
%html -d tr      ""  ""
%html -d td      ""  ""

Options

The options can be used to configure the LaTeX fragments that are generated by the program for the various kinds of references. The options can be given in the input file (as a line of the form %html -o option-name option-value), and inside comments in the HTML files (in the form of latex-opt option-name option-value).

There are options that determine in which cases references should be generated. For example, an HTML file will often contain an HREF tag wherever an email address is given, so that it can be used to send an email. As the essential information is already provided in the text, it is not necessary to repeat it in a footnote or a bibliographic entry. The following options can be used for this purpose:

By default all these options are on.

The references can be divided into internal and external. Internal references are HREF tags that point to a file that is included in the LaTeX output; external references are those that do not. Internal references can be mapped to phrases that direct the reader to the corresponding section. External references have to be given completely, either as a footnote at the bottom of the page or as a bibliographic entry. They are generated as bibliographic entries if the input file contains a line with `%html -b' (or if the program option -b is given), otherwise they are generated as footnotes. There are four generation modes:

These four modes can be set for three different environments, namely: the headers, LaTeX alltt environments, and all the remaining parts. The options for this are:

There are also options that determine the format in which the various kinds of references are to be generated (including the format of the bibliographic entries). All these options make use of format strings (like those used in C), where the percentage symbol followed by a letter indicates a place holder for a string or number that has to be output. A double percentage symbol is used to denote a percentage symbol. All these options should contain LaTeX formatting commands. Because references can be generated in fragile environments, `%p' has to be used at places where a `\protect' is required in a fragile environment. Also, because a `\footnote' is not allowed everywhere, a `%F' has to be used instead.
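The way such format strings are filled in can be sketched in Python as follows. This is only an illustration of the general idea. The function name expand and the place holder letter `l' used in the example are hypothetical, and I assume that `%p' produces nothing outside a fragile environment. `%F' is left out, because the text above does not say what it turns into where a footnote is not allowed.

```python
import re


def expand(fmt, values, fragile=False):
    """Fill in a format string; values maps place holder letters to data."""

    def substitute(match):
        letter = match.group(1)
        if letter == "%":
            return "%"  # a double percentage symbol gives one symbol
        if letter == "p":
            # Assumption: \protect only appears in fragile environments.
            return "\\protect" if fragile else ""
        return str(values[letter])

    return re.sub(r"%(.)", substitute, fmt)
```

For instance, expand(r"see section %p\ref{%l}", {"l": "intro"}, fragile=True) gives the text see section \protect\ref{intro}, and the same call with fragile=False leaves out the \protect.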

These are the options for internal references:

The options for external references as footnotes are:

The options for citations are:

The options for the bibliographic entries are:

The following options deal with the formatting of all kinds of references. They make it possible to add additional formatting around the anchor text or the image tag. The "%R" indicates the place where the reference should be placed. This can either be an internal or an external reference, in the running text or as a footnote. In case the "%R" appears in a fragile environment, it should be changed into "%fR". In case it appears in a place where a \footnote would not be proper, a combination of an "%mR" and an "%tR" can be used to indicate the place of the footnote marker and the footnote text, respectively. (An "f" can be added if they occur in a fragile environment.)

Program options

If the program is given an input file with the extension .html, it does not generate a LaTeX output file, but only analyses the file and the files it references (if the -s option is given).

The program recognizes the following command line options:

The sources

There are several versions available, which are given below. For all versions: no warranty whatsoever is implied! Each version has a version number and a date at the top of the source file. Please use these for bug reports. I try to fix small bugs as soon as possible.

Please check the revision history in the source for more information. (What happened to version 2.3? I guess I skipped that number by accident.)

Acknowledgements

I would like to thank the following people for their contributions:

Future plans

There are a number of things that still have to be fixed: