A markup language combines text and extra information about the text. The extra information, for example about the text's structure or presentation, is expressed using markup, which is intermingled with the primary text. The best-known markup language in modern use is HTML (Hypertext Markup Language), one of the foundations of the World Wide Web. Historically, markup was (and is) used in the publishing industry in the communication of printed work between authors, editors, and printers.
Classes of markup languages
Markup languages are often divided into three classes: presentational, procedural, and descriptive.
Presentational markup describes the visual appearance of the whole text or of a particular fragment. For example, in a word processor file, the title of a document might have associated markup asserting that the text is centered, in boldface, and in a larger typeface. Virtually all word-processing and desktop publishing products support presentational markup; in normal operation it is hidden from the user to produce the "WYSIWYG" effect.
Procedural markup is typically also concerned with the presentation of text, but is usually visible to the user editing the text file, and is expected to be interpreted by software in the order in which it appears. To format a title, a succession of formatting directives would be inserted into the file immediately before the title's text, instructing software to switch into centered display mode, then enlarge and embolden the typeface. The title text would be followed by directives to reverse these effects. In most cases, the procedural markup capabilities comprise a Turing-complete programming language. Examples of procedural-markup systems include nroff, troff, TeX, and PostScript. Procedural markup has been widely used in professional publishing applications.
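The title example above can be sketched as a toy interpreter in Python. The directive names (".center" and ".bold") are invented for illustration and are not real troff or TeX commands; upper case stands in for boldface because the output is plain text. The point is only that directives change the interpreter's state and that each line of text is formatted according to whatever state is in force when it is reached.

```python
# Hypothetical directives, invented for this sketch (not real troff or TeX):
#   .center on|off     .bold on|off
SOURCE = """.center on
.bold on
Anatidae
.bold off
.center off
The family includes ducks, geese, and swans."""


def interpret(source, width=40):
    """Process lines in order; directives switch formatting modes on and off."""
    state = {"center": False, "bold": False}
    output = []
    for line in source.splitlines():
        if line.startswith("."):
            name, value = line[1:].split()
            if name not in state or value not in ("on", "off"):
                raise ValueError(f"unknown directive: {line}")
            state[name] = (value == "on")
            continue
        # Upper case is a plain-text stand-in for boldface.
        text = line.upper() if state["bold"] else line
        if state["center"]:
            text = text.center(width).rstrip()
        output.append(text)
    return output


rendered = interpret(SOURCE)
print("\n".join(rendered))
```

Moving the ".bold off" directive to a different line would change the result, which is the defining property of this class of markup: meaning depends on position in the stream.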
Descriptive markup applies labels to fragments of text without necessarily mandating any particular display or other processing semantics. For example, the Atom syndication language provides markup to label the "updated" timestamp, which is an assertion from the publisher as to when some item of information was last changed. While the Atom specification discusses the meaning of the "updated" timestamp, and the markup used to identify it, in great detail, it makes no assertions about whether or how it might be presented to a user. Software might put this markup to a variety of uses, including many not foreseen by the designers of the Atom language. SGML and XML are systems explicitly designed to support the design of descriptive markup languages; examples of such languages include Atom, MathML, and XBRL.
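As a minimal sketch of this idea, the following Python fragment reads the "updated" timestamp from an Atom entry using only the standard library. The entry itself (its title, identifier, and date) is invented for illustration. The same extracted value is then put to two different uses, neither of which is dictated by the markup.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical entry, invented for illustration.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example item</title>
  <id>urn:example:item-1</id>
  <updated>2005-07-31T12:29:29Z</updated>
</entry>"""


def read_updated(entry_xml):
    """Return the entry's 'updated' timestamp as a timezone-aware datetime."""
    root = ET.fromstring(entry_xml)
    text = root.findtext(ATOM_NS + "updated")
    if text is None:
        raise ValueError("entry has no <updated> element")
    # Rewrite the 'Z' suffix so that older Python versions can parse it.
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


updated = read_updated(ENTRY)

# Use 1: show the date to a reader, in whatever format the software prefers.
print(updated.strftime("%Y-%m-%d"))

# Use 2: decide whether the item changed before some cut-off.
print(updated < datetime(2006, 1, 1, tzinfo=timezone.utc))
```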
The dividing line between classes of markup is often blurred. For example, HTML contains markup elements which are purely presentational (for example <b> for bold) and others which are purely descriptive (the "href=" attribute).
The main virtue of descriptive markup is considered to be its flexibility: if the fragments of text are labeled as to "what they are" as opposed to "how they should be displayed", software may be written to process these fragments in useful ways not anticipated by the designers of the languages. For example, HTML's hyperlinks, originally designed for activation by a human following a link, are also widely used by Web search engines both in discovering new material to index and in estimating the popularity of Web resources.
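The hyperlink case can be illustrated with a short Python sketch that collects link targets from a page without displaying anything, roughly as an indexing program might do as a first step. The page content and URLs are invented for illustration, and real crawlers are far more elaborate.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> element encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content and URLs, invented for illustration.
PAGE = (
    '<p>See <a href="http://example.org/ducks">ducks</a> and '
    '<a href="http://example.org/geese">geese</a>.</p>'
)

collector = LinkCollector()
collector.feed(PAGE)
collector.close()
print(collector.links)
```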
Presentational-markup systems usually include "named styles" or an equivalent, which to some degree replicate the effect of descriptive markup. Similarly, procedural-markup languages usually include "macros", to a similar end.
History
The term "markup" is derived from the traditional publishing practice of "marking up" a manuscript, that is, adding printer's instructions in the margins of a paper manuscript. For centuries, this task was done by specialists known as "markup men" who marked up text to indicate what typeface, font, style, and size should be applied to each part, and then handed off the manuscript to someone else for the tedious task of typesetting by hand.
The idea of "markup languages" was apparently first presented by publishing executive William W. Tunnicliffe at a conference in 1967, although he preferred to call it "generic coding." Tunnicliffe would later lead the development of a standard called GenCode for the publishing industry. Book designer Stanley Rice also published speculation along similar lines in 1970. However, IBM researcher Charles Goldfarb is more commonly seen today as the "father" of markup languages, because his research from 1969 onward transformed the idea into an actual working product. Independently of Tunnicliffe and Rice, Goldfarb hit upon the same basic idea while working on an early project to help a newspaper computerize its workflow; he would later become familiar with the work of Tunnicliffe and Rice, and today he is always careful to share credit.
Some early examples of markup languages available outside the publishing industry can be found in typesetting tools on Unix systems such as troff and nroff. In these systems, formatting commands were inserted into the document text so that typesetting software could format the text according to the editor's specifications. Getting a document to print correctly was an iterative, trial-and-error process. The availability of WYSIWYG ("what you see is what you get") publishing software supplanted much use of these languages among casual users, though serious publishing work still uses markup to specify the non-visual structure of texts.
Another major publishing standard was TeX, created and continuously refined by Donald Knuth in the 1970s and 1980s. TeX concentrated on detailed layout of text and font descriptions in order to typeset mathematical books to professional quality. This required Knuth to spend considerable time investigating the art of typesetting. However, TeX requires considerable skill from the user, so it is mainly used in academia.
The first language to make a clear and clean distinction between structure and presentation was Scribe, developed by Brian Reid and described in his doctoral thesis in 1980. Scribe was revolutionary in a number of ways, not least in that it introduced the idea of styles separated from the marked-up document. Scribe influenced the development of Generalized Markup Language (later SGML) and is a direct ancestor of HTML and LaTeX. LaTeX is a de facto standard in many scientific disciplines.
In the early 1980s, the idea that markup should be focused on the structural aspects of a document and leave the visual presentation of that structure to the interpreter led to the creation of SGML (Standard Generalized Markup Language). The language was developed by a committee chaired by Goldfarb. It incorporated ideas from many different sources, including Tunnicliffe's project, GenCode.
SGML specified a syntax for including the markup in documents, as well as another system (a so-called "metalanguage") for separately describing what the markup meant. This allowed authors to create and use any markup they wished, selecting tags that made the most sense to them. Examples of such markup languages based on the SGML system are TEI and DocBook. SGML was promulgated as an International Standard by ISO in 1986.
However, SGML was generally found to be cumbersome, a side effect of attempting to do too much and be too flexible. For example, SGML made end tags optional in certain contexts, because it was thought that markup would be done by overworked support staff who would appreciate saving a few keystrokes here and there.
By 1991, it appeared that SGML would be limited to niche uses while WYSIWYG tools (storing documents in proprietary binary formats) would take over the vast majority of document processing.
The situation changed dramatically when Sir Tim Berners-Lee used some of the SGML syntax, without the meta-language, to create HTML. In HTML the markup consists of a set of "known" tags that handle common formatting tasks. However, the language was originally created to mark up simple scientific papers and therefore had to be greatly expanded in order to offer the rich content the web has today. For this reason the additions often follow no logical design, although recent efforts have attempted to address this. HTML is likely to be the most used document format in the world today.
Another, newer markup language that is currently growing in importance is XML (Extensible Markup Language). Unlike HTML, which uses a set of "known" tags, XML allows users to create any tags they wish (hence "extensible") and then describe those tags in a meta-language known as the "DTD" (Document Type Definition). However, DTDs were difficult to write because their syntax was different from XML, so they have recently been supplemented by XML Schema, which is a meta-language defined in terms of XML itself.
XML is similar in concept to SGML, and in fact, in general terms, XML is a simplified subset of SGML. It is not strictly a superset of HTML, since classic HTML permits constructs, such as unclosed tags, that XML forbids. The main purpose of XML (as opposed to using SGML) is to keep the system simpler by focusing on a particular problem: documents on the Internet. By doing so, its designers hoped to avoid the feature creep that complicated SGML. The newest incarnation of HTML is XHTML (eXtensible Hypertext Markup Language), a more rigorous and robust version that is in fact XML and, like any XML, requires documents to be "well-formed", but which uses mostly the familiar HTML tags. The main difference between HTML and XHTML from the standpoint of coding the language is that all tags must be closed, including so-called 'empty' tags such as <br> which, not being a 'container tag', must be 'closed' in every instance like: <br />.
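The well-formedness requirement can be checked mechanically. The following Python sketch uses the standard library's XML parser on two invented fragments that differ only in how the line break tag is written; the helper name is_well_formed is an assumption of this example, not part of any standard.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Invented fragments: the first uses an unclosed HTML-style <br>,
# the second uses the closed XHTML-style <br />.
html_style = "<p>one line<br>next line</p>"
xhtml_style = "<p>one line<br />next line</p>"

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```

An XML parser rejects the first fragment because the br element is never closed before the closing tag of the paragraph, whereas an HTML parser would normally accept it.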
Features
A common feature of many markup languages is that they intermix the text of a document with markup instructions in the same data stream or file. Here, for example, is a small section of text marked up in HTML:
<h1> Anatidae </h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely-related screamers.
</p>
The codes enclosed in angle-brackets <like this> are markup instructions (known as tags), while the text between these instructions is the actual text of the document. The codes "h1", "p", and "em" are examples of structural markup, in that they describe the intended purpose or meaning of the text they include. Specifically, "h1" means "this is a first-level heading", "p" means "this is a paragraph", and "em" means "this is an emphasized word". A device reading such structural markup may apply its own rules or styles for presenting it, using larger type, boldface, indentation, or whatever style it prefers. The "i" instruction is an example of presentational markup. It specifies the exact appearance of the text (in this case, the use of an italic typeface) without specifying the reason for that appearance.
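The difference can be made concrete with a small Python sketch that renders the fragment above to plain text under two different sets of style rules. The two "devices" and their rules are invented for illustration, underscores stand in for italics, and the renderer assumes properly nested tags. The structural tags "h1" and "em" come out differently on each device, while the presentational "i" is rendered the same way on both.

```python
from html.parser import HTMLParser

SOURCE = """<h1> Anatidae </h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely-related screamers.
</p>"""


class PlainTextRenderer(HTMLParser):
    """Render nested tags to plain text using a tag -> function style map."""

    def __init__(self, styles):
        super().__init__()
        self.styles = styles
        self.stack = []   # one (tag, collected parts) pair per open element
        self.out = []     # finished top-level output

    def _target(self):
        return self.stack[-1][1] if self.stack else self.out

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, []))

    def handle_data(self, data):
        self._target().append(data)

    def handle_endtag(self, tag):
        if not self.stack:
            return  # stray end tag; this sketch assumes well-nested input
        open_tag, parts = self.stack.pop()
        text = " ".join("".join(parts).split())  # collapse whitespace
        style = self.styles.get(open_tag, lambda s: s)
        self._target().append(style(text))


def render(source, styles):
    renderer = PlainTextRenderer(styles)
    renderer.feed(source)
    renderer.close()
    return "".join(renderer.out)


def italic(s):
    # Presentational: the appearance is fixed, so both devices share it.
    return f"_{s}_"


# Two hypothetical devices with their own rules for the structural tags.
DEVICE_A = {"h1": str.upper, "em": lambda s: f"*{s}*", "i": italic}
DEVICE_B = {"h1": lambda s: f"== {s} ==", "em": str.upper, "i": italic}

text_a = render(SOURCE, DEVICE_A)
text_b = render(SOURCE, DEVICE_B)
print(text_a)
print(text_b)
```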
For the humanities, the Text Encoding Initiative (TEI) has published some guidelines about how to encode texts.
Alternative usage
While the idea of markup languages originated with text documents, markup languages are increasingly used in areas such as vector graphics, web services, content syndication, and user interfaces. Most of these are applications of XML, as XML is a clean, well-formatted, and extensible markup language. The use of XML has also led to the possibility of combining multiple markup languages into a single profile, like XHTML+SMIL and XHTML+MathML+SVG [1].
See also
References