Processing HTML with AWS HTML Producer .Net
The following chapter describes the main concepts of HTML parsing with
AWS HTML Producer. This information matters most if you are
going to use the component primarily as an HTML template parser. If you only want to
process ordinary web pages, you do not have to know the details of how the
component parses them, which are described below. In that case, it is
enough to read only the "Tag using
conventions" topic.
Tag using conventions
Because AWS HTML Producer is a highly universal parsing component designed
to parse any type of HTML-like text, it does not know how the tags it finds
are used, for example, whether they require corresponding end tags or not. All
XML parsers face the same problem, because XML parsers know nothing about
the tags they parse. The solution here is the same as with XML parsers:
the developer adheres to a standard set of conventions that resolves
most such problems. Here is the list of these conventions.
End tags. The problem of end tags is solved the same way as in
XML, but several additional features give you more flexibility in
telling the parser when it must look for an end tag and when it must
not.
- By default, every tag to be processed by HTML Producer must
have a corresponding end tag (<pagetitle>…</pagetitle>). If it is not
convenient to use an end tag, place a slash (/) before ">" in the
start tag. This is the same as in the XML standard.
- You can change this default by setting the RequireCloseTags
property of the HTMLParser object to False. In HTML Producer
Console this property is reflected by the Require close tags check
box.
- If most tags require end tags and a few don't, or most tags
don't require them and a few do, you can put unconditional
specifiers before tag names in the list of tags to be found, to state
explicitly whether each tag needs to be closed. There are two
specifiers that can be placed before a tag name:
  - & - the tag requires a corresponding end tag (e.g. &DIV,
  &TITLE, &P, &PHONE)
  - * - the tag does not require a corresponding end tag (e.g. *IMG, *META,
  *CUSTOMER)
The meaning of these specifiers does not depend on the state of the Require
close tags property. However, if a tag found in the text has a slash
(/) before ">" (that is, it is marked as already closed), an
end tag will not be sought regardless of any specifier.
> So, if most of your tags don't require end tags (<BR>, <IMG>,
<META>) but one does (<P>), set the Require
close tags property to False and put the & specifier before the P tag name in
the tag name list: BR, IMG, META, &P.
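The end-tag rules above can be summarised as a small decision function. This is an illustrative Python model of the conventions, not HTML Producer's own code; the function name needs_end_tag and the case-insensitive matching of tag names are assumptions made for the sketch.

```python
def needs_end_tag(tag_names, require_close_tags, found_name, self_closed=False):
    """Decide whether an end tag should be sought for a found start tag.

    tag_names          -- comma-separated list such as "BR, IMG, META, &P"
    require_close_tags -- models the Require close tags setting
    found_name         -- name of the start tag found in the text
    self_closed        -- True if the tag was written as <name/>
    """
    if self_closed:
        # A tag marked as already closed never gets an end-tag search,
        # whatever specifier it carries.
        return False
    for entry in tag_names.split(","):
        entry = entry.strip()
        specifier = entry[0] if entry[:1] in ("&", "*") else ""
        name = entry[len(specifier):]
        if name.lower() == found_name.lower():  # assumption: case-insensitive
            if specifier == "&":
                return True    # always requires an end tag
            if specifier == "*":
                return False   # never requires an end tag
            return require_close_tags  # no specifier: fall back to the default
    raise ValueError(f"{found_name!r} is not in the tag name list")


# The situation from the note above: only <P> needs an end tag.
print(needs_end_tag("BR, IMG, META, &P", False, "P"))
print(needs_end_tag("BR, IMG, META, &P", False, "BR"))
```

The specifier is checked before the default so that it overrides the Require close tags setting in both directions.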
Tag name. A tag name can contain letters, digits,
and any other symbols except white space and the symbols defined as the tag
start and end (by
default < and >). If you want to process all tags on a page,
put an asterisk (*) into the TagNames property (in HTML Producer
Console this property is reflected by the Tag Names text box).
Tag parameters. Tags can have parameters written in this style:
ParamName1=ParamValue1 ParamName2=ParamValue2 … ParamNameN=ParamValueN.
If a parameter value contains white space, it must be enclosed in single (') or double (")
quotation marks. If a parameter with the same name is repeated in one tag, only the last
instance is used as its value. Valueless attributes, such as CHECKED,
NOSHADE or SELECTED, that some HTML tags use are recognized by HTML Producer
but not captured, so you cannot read them.
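The parameter rules above (quoted values, last instance wins, valueless attributes skipped) can be modelled in a few lines. This is a hedged Python sketch of the convention only; parse_params and its regular expression are hypothetical and do not reflect how HTML Producer is implemented.

```python
import re

# name=value, where value is "double quoted", 'single quoted', or a bare word.
_PARAM = re.compile(r"""([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|(\S+))""")


def parse_params(tag_body):
    """Return the parameters of a tag as a dict (tag_body excludes the brackets)."""
    params = {}
    for match in _PARAM.finditer(tag_body):
        name = match.group(1)
        # Exactly one of the three value groups took part in the match.
        value = next(g for g in match.group(2, 3, 4) if g is not None)
        params[name] = value  # a repeated name overwrites the earlier value
    return params


# CHECKED has no value, so it is not captured; the second "name" wins.
print(parse_params('type=checkbox name="agree box" CHECKED name=last'))
```

A plain dict is enough here because the convention keeps only one value per parameter name.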
Fundamental tag processing
Now that you know a little about HTML Producer, you can work through the
first example, which shows the features described above in action.
Imagine that there is an HTML template where a page title must be inserted
dynamically when a user requests the page. One possible solution is to add a
special tag that will be replaced with the page title text during parsing of the
template. Let's name it <pagetitle>.
Having chosen a name for the tag, we must decide whether it requires a
corresponding end tag. End tags are usually useful when the text between the start
and end tags is important. In our case there is no need for one, and we can use
only the start tag. However, the parser requires tags to be closed by default,
so we can proceed in several ways. The first way is to clear the Require Close
Tags box in the HTML Producer Console window; the second, and
maybe the best in this situation, is to put a slash before
">", that is <pagetitle/>.
<html>
<head>
<title><pagetitle/></title>
</head>
</html>
Open HTML Producer Console, copy the text above and paste it into the
field titled Insert HTML text here.
Next, specify the tag that must be found in this text by
typing the tag name into the Tag Names field. Note that you can type
either the bare name of the tag, pagetitle, or the full
tag form with "<" and ">", <pagetitle>.
We could also specify that the tag does not require closure by adding an asterisk (*)
before the tag name, like this: *pagetitle.
Now press the Parse button. The <pagetitle/>
tag will be highlighted, which means that the Producer found the tag; you can see
its name and its text (in this example they are the same because we did not
specify parameters for this tag). Now enter the text to replace this
tag. Type it in the Covered Text field, for example, Welcome to my
Home Page! To continue parsing, click More >> and your text will
be inserted in place of the tag. You will see text like this:
<html>
<head>
<title>Welcome to my Home Page!</title>
</head>
</html>
In this example we became familiar with the main function of HTML Producer:
finding and replacing tags. Try completing this example again using the other
ways of defining whether a tag needs to be closed.
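For readers who prefer to see the find-and-replace step as code, here is a minimal Python sketch that produces the same result as the walkthrough for the closed <pagetitle/> tag. It only imitates the outcome; replace_closed_tag is a hypothetical helper, not part of HTML Producer.

```python
import re


def replace_closed_tag(template, tag_name, replacement):
    """Replace every self-closed <tag_name ... /> in template with replacement."""
    pattern = re.compile(
        r"<\s*" + re.escape(tag_name) + r"\b[^<>]*/>", re.IGNORECASE
    )
    # A function is used as the replacement so that backslashes in the
    # inserted text are never interpreted as regex escapes.
    return pattern.sub(lambda _match: replacement, template)


template = "<html>\n<head>\n<title><pagetitle/></title>\n</head>\n</html>"
print(replace_closed_tag(template, "pagetitle", "Welcome to my Home Page!"))
```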
Advanced tag processing
In the previous section you saw that HTML Producer is quite
flexible at tag processing. In this section you will learn about other
features that help you create reliable and powerful web applications.
Tag parameters
The first thing to pay attention to is the recognition of tag
parameters. This feature is implemented in XML parsers, but unfortunately few
server-side template parsers support it. Parameters are a very effective
way to pass data to the code that processes the tag. For example, if
you want an article taken from a database to be inserted when a special tag is
found, you can specify the identifier of the article to show. You can also
specify the location of the database server, the user name and password
for access to the database, the name of the table to retrieve the article from,
and so on.
RunAt parameter
The second thing to consider is the RunAt parameter, which adds more
flexibility in specifying which tags should be processed and which should not. If
you pass a desired value of the RunAt parameter to the parser (the RunAt
property of the HTMLParser object or the RUNAT text box in HTML
Producer Console), only the tags that have a RunAt parameter with the
same value will be processed. Other tags will not be processed, even if they
have a requested name. This parameter can be used in any tag to be parsed, like
any other parameter. The same technique is used in Active Server Pages (ASP),
where the RunAt attribute is added to tags (<SCRIPT>, <OBJECT>,
etc.) that must be processed on the server side, not on the client. The RunAt
parameter is the better criterion for selecting tags to parse when you need to
process native HTML tags on the server. The value of the RunAt parameter in ordinary
HTML text lets HTML Producer recognize that the tag must be processed by
your application, not by a web browser.
Note that you cannot select on the value of the RunAt parameter alone; you can
only combine this method with selection by tag name, or names if you need
to parse several tags. If you use closed tags, the RunAt parameter with the same
value must also appear in the end tag, like this: <BODY RunAt="webapp">…bla…</BODY
RunAt="webapp">. Otherwise the close tag will not be treated
as the end tag for the start tag.
If, during processing, the parser meets unexpected end tags (end tags without
a matching start tag) that carry the requested RunAt value, it will remove
them anyway.
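The selection rule described in this topic (tag name first, then the RunAt value) can be written down as a small predicate. The Python below is only a sketch of that rule under stated assumptions: should_process is a hypothetical name, names are compared case-insensitively, RunAt values are compared exactly, and a tag is selected by name alone when no RunAt value was requested.

```python
def should_process(tag_name, params, wanted_names, runat=None):
    """Return True if a found tag is selected for processing.

    tag_name     -- name of the tag found in the text
    params       -- dict of the tag's parameters
    wanted_names -- tag names requested for parsing
    runat        -- requested RunAt value, or None if none was requested
    """
    # RunAt never selects a tag on its own: the name must match first.
    if tag_name.lower() not in {name.lower() for name in wanted_names}:
        return False
    if runat is None:
        return True  # assumption: without a RunAt filter the name decides
    actual = {key.lower(): value for key, value in params.items()}.get("runat")
    return actual == runat


print(should_process("BODY", {"RunAt": "webapp"}, ["body"], runat="webapp"))
print(should_process("BODY", {}, ["body"], runat="webapp"))
```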
Generating debug information
The third main feature is the ability of HTML
Producer to generate debug information for each processed tag. This means
that each block of text inserted while processing a tag is wrapped in
comments that mark the start and the end of the block, so a developer can see
the changes made by the parser. Besides these comments, information about the processed
tag that caused the insertion is generated, including the tag
name, its parameters and the initial text between the start and end tags. This gives you
full control over the parsing process and makes it much easier to
debug your code.
Recursive HTML parsing
When you parse HTML text, you can replace some tags with other tags. By default,
these newly inserted tags are not noticed by HTML Producer in the current
parsing process. However, sometimes you need to process all the tags, including
the new ones, in a single parsing process. This can be done with the recursive
parsing mode, which is enabled by setting the ParseRecursive property to True.
This property is reflected by the Parse Recursive check box in HTML
Producer Console. If new suitable tags (those
specified in the TagNames property) are inserted after the second parsing cycle, they will be parsed
as well. This loop repeats as long as suitable tags are inserted.
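The difference between the default mode and recursive mode can be illustrated with a toy replacement loop. This Python sketch models only the behaviour described above; parse_once, parse, the replacements mapping and the max_passes safety limit are all assumptions of the example and are not HTML Producer features.

```python
import re


def parse_once(text, replacements):
    """One parsing cycle: inserted text is not rescanned within the cycle."""
    pattern = re.compile("|".join(re.escape(tag) for tag in replacements))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


def parse(text, replacements, recursive=False, max_passes=10):
    """Replace closed tags; in recursive mode repeat until nothing changes.

    replacements must be a non-empty dict mapping a closed tag such as
    "<page/>" to the text inserted in its place.
    """
    result = parse_once(text, replacements)
    if not recursive:
        return result
    for _ in range(max_passes):  # hypothetical guard against endless growth
        next_result = parse_once(result, replacements)
        if next_result == result:
            return result
        result = next_result
    raise RuntimeError("replacements keep inserting new tags")


rules = {"<page/>": "<h1><sitename/></h1>", "<sitename/>": "My Site"}
print(parse("<page/>", rules))                  # inserted tag left untouched
print(parse("<page/>", rules, recursive=True))  # inserted tag parsed as well
```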
ASP and PHP script islands parsing
Starting with version 2.0, AWS HTML Producer works correctly with inline ASP and PHP
server-side scripts on an HTML page. If the parser meets a script block enclosed in
<% ... %> (ASP) or <? ... ?> (PHP), it extracts the
text of the script and puts it into the TagText property.
The "%" and "?" signs that enclose the script body are removed. To
enable ASP or PHP script parsing, put "%" (for ASP) or
"?" (for PHP) into the TagNames property just like any
tag name. Both signs are written without quotes. You can specify
other tag names that you also want to parse at the same time. Here are examples for Visual
Basic:
Dim Parser As HTMLProducer.HTMLParser
' ...
Parser.TagNames = "?"                 ' Parse only PHP script islands, _OR_
Parser.TagNames = "?, Meta, %, Font"  ' parse ASP and PHP script islands and some HTML tags
Needless to say, you can set all of these values in HTML Producer
Console as well, in the Tag Names text box.
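To make the extraction rule concrete, the following Python sketch pulls the bodies of <% ... %> and <? ... ?> blocks out of a page, selecting them by the "%" and "?" entries of a tag name list. It imitates the described behaviour only; script_islands is a hypothetical helper, and whether HTML Producer trims the surrounding white space is not stated in this chapter, so the sketch leaves the body untouched.

```python
import re

# "<%" ... "%>" or "<?" ... "?>"; the closing sign must match the opening one.
_ISLAND = re.compile(r"<(%|\?)(.*?)\1>", re.DOTALL)


def script_islands(html, tag_names):
    """Return (sign, body) pairs for the script islands selected in tag_names."""
    wanted = {name.strip() for name in tag_names.split(",")}
    return [
        (match.group(1), match.group(2))
        for match in _ISLAND.finditer(html)
        if match.group(1) in wanted
    ]


page = '<p><% Response.Write("hi") %></p><? echo 1; ?>'
print(script_islands(page, "?"))                 # PHP islands only
print(script_islands(page, "?, Meta, %, Font"))  # ASP and PHP islands
```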
Parsing all tags in HTML text
You can process all the tags in HTML text without specifying them
explicitly. To do this, put an asterisk (*) in the TagNames
property. All tags and inline server script islands will be parsed. However,
HTML comments (<!-- ... -->) will not be parsed in this mode.
Using different brackets in tags
Sometimes you need to parse non-standard, custom tags that are
written in a special style, such as [Product] or {Device}. HTML Producer lets you
change the default tag brackets: use the TagStartSign and TagEndSign properties
to specify the appropriate symbols. Note that you cannot specify
several alternative symbol sets for one property; a value like {,( will be
treated as a single symbol set. An example of using these properties in Visual Basic
looks like this:
Dim Parser As HTMLProducer.HTMLParser
' ...
Parser.TagStartSign = "["  ' Tag starts with [
Parser.TagEndSign = "]"    ' and ends with ]
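The effect of changing the bracket signs can be pictured with a short Python model. It is a sketch of the idea, not of the component: find_tags is a hypothetical function, and it reads the rule above as meaning that a multi-character value such as {,( is matched literally as one opening sign.

```python
import re


def find_tags(text, start_sign="<", end_sign=">"):
    """Return the contents of every tag delimited by start_sign and end_sign."""
    pattern = re.compile(
        re.escape(start_sign) + r"(.*?)" + re.escape(end_sign), re.DOTALL
    )
    return [match.group(1) for match in pattern.finditer(text)]


text = "Buy [Product] for your {Device}"
print(find_tags(text, "[", "]"))
print(find_tags(text, "{", "}"))
```

Escaping the signs with re.escape matters because brackets such as [ and { have a special meaning in regular expressions.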
Advanced tag processing: example
Let's expand our previous example so that we can see these key features in
action. Now we will model the situation where we need to insert the page body
between the <body> and </body> tags.
Imagine that we need to insert an article from a database into the web page, and
that all the data needed to connect to the database server is
variable and must be specified directly in the tag. Also imagine that this tag
is processed by a custom program, which is the HTML Producer client in this
situation, and that HTML Producer does not query the database itself.
The article will not actually be inserted; the parser will only show you
the different characteristics of the tag and remove it.
<html>
<head>
<title>Welcome to my Home Page!</title>
</head>
<body>
<QueryDatabase runat=webapp
datasource=mssql server=dbserver username=sa password=""
sql="SELECT PageBody FROM PagesTable WHERE PageID=1">
Page body will be inserted here
</QueryDatabase runat="webapp">
</body>
</html>
Open HTML Producer Console if it is not already open. Copy this fragment and
paste it into the text box titled "Insert HTML text here". Next, type
the name of the tag (QueryDatabase) into the "Tag names" field. Because
we also select by the RunAt parameter, we need to specify its value in the
box called "RUNAT"; set it to webapp. We
will also generate debug information for this tag, so select the "Generate
debug information" check box.
Now we are ready for parsing. Press the "Parse" button and, if
everything was set correctly, all expressions with QueryDatabase tags will be
highlighted. All tag parameters (runat, datasource, server, username,
password, sql) are listed in the table with their names and values.
You will also see the text between the start and end tags in the text
box called "Covered text". Delete this text and type
"It is my article!" there; then click the "More"
button and parsing will finish. Look through the HTML text and you will
see that the tags have been replaced by "It is my article!",
wrapped in HTML comments that mark the beginning and the end of the inserted
text. In addition, the upper comment block shows all the parameters of
the tag and the initial text between the tags.