About the Table Processor Object
Almost every web page today organizes its information with the help of HTML tables. AWS HTML Producer has a robust built-in HTML table processor that helps you process HTML tables and extract the information they hold. With this component you can retrieve the information you need from almost any web site. The TableProcessor object parses whole HTML tables, recognizes cells, and extracts their values. Along with the cell values, TableProcessor also provides information about the structure of the table. This chapter gives an overview of the TableProcessor object and shows how it can be used to extract information from web pages.
The table processing feature of AWS HTML Producer is built in several levels, which keeps the design clear and ready for future extension:
1. HTML tags level. This is the low level of table processing. It parses the HTML input, finds table-related HTML tags (<table>, <tr>, <td> and <th>), and extracts their values and attributes. The HTMLParserASP object operates on this level.
2. HTML table level. This is the middle level. It catches the HTML parser’s notifications about the tags found and assembles whole rows and cells. It sends notifications to the high level when the table itself, its rows, and its cells are found. The TableParser object was developed for these tasks. It is private to the component, because only the TableProcessor object needs its services.
3. Table cells level. This is the high level. It processes the table as a whole: it populates the table cell collection, supports merged cells, and provides other features.
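The three levels can be pictured with a small Python model. This is only a sketch of the layering idea, not the component's code: the class names (TagLevel, TableLevel, CellLevel), the callback names, and the 1-based indexes are all assumptions made for illustration. It uses Python's standard html.parser module in place of HTMLParserASP and ignores nested tables and merged cells.

```python
from html.parser import HTMLParser

TABLE_TAGS = {"table", "tr", "td", "th"}


class TagLevel(HTMLParser):
    """Low level: reports table-related tags and text to a listener."""

    def __init__(self, listener):
        super().__init__()
        self.listener = listener

    def handle_starttag(self, tag, attrs):
        if tag in TABLE_TAGS:
            self.listener.on_open(tag, dict(attrs))

    def handle_endtag(self, tag):
        if tag in TABLE_TAGS:
            self.listener.on_close(tag)

    def handle_data(self, data):
        self.listener.on_text(data)


class TableLevel:
    """Middle level: turns tag events into row and cell notifications."""

    def __init__(self, sink):
        self.sink = sink
        self.text = None  # None while the parser is outside a cell

    def on_open(self, tag, attrs):
        if tag == "tr":
            self.sink.row_found()
        elif tag in ("td", "th"):
            self.text = []

    def on_close(self, tag):
        if tag in ("td", "th") and self.text is not None:
            self.sink.cell_found("".join(self.text).strip())
            self.text = None

    def on_text(self, data):
        if self.text is not None:
            self.text.append(data)


class CellLevel:
    """High level: collects cells into a flat list with coordinates."""

    def __init__(self):
        self.cells = []  # (col_index, row_index, value), 1-based (assumed)
        self.row = 0
        self.col = 0

    def row_found(self):
        self.row += 1
        self.col = 0

    def cell_found(self, value):
        self.col += 1
        self.cells.append((self.col, self.row, value))


def parse_table(html):
    top = CellLevel()
    parser = TagLevel(TableLevel(top))
    parser.feed(html)
    parser.close()  # flush any buffered text
    return top.cells


cells = parse_table(
    "<table><tr><th>City</th><th>Temp</th></tr>"
    "<tr><td>Rome</td><td>77 F</td></tr></table>"
)
print(cells)
```

Each level only talks to the one directly above it, so the tag parser could be swapped out without touching the code that builds the cell collection.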
The high-level procedures work with an HTML table as a set of objects (rows and cells), not as separate HTML tags (<table>, <tr>, <td>, <th>). However, in contrast to the DOM (Document Object Model) implemented in web browsers, TableProcessor does not build a table-rows-cells tree representation of a table. The tree model can be a good solution for applications that support scripting, where a clear hierarchical model of the table makes scripts easier to write. However, if you only want to point at the cell that contains the data you need, building a tree is a heavy and inconvenient solution. Imagine that you only want to get the values from a column with revenue rates for each month in a yearly balance table. With a tree, you would first have to select a row, then a cell, and only then could you get a value. This is a fully object-oriented and clear model, but it is too heavy for simple data-gathering tasks.
TableProcessor features a 1-D model that maintains a plain collection of all table cells. Each table cell is represented by a TableCell object, which has two special properties that give its position within the table. The first property (ColIndex) holds the index of the column that contains the cell, and the second (RowIndex) holds the index of the cell's row. This lets you point to any cell within a table. The image on the left depicts this.

TableProcessor recognizes table cells and creates a TableCell object for each one. A TableCell object has no methods and stores only properties, the parameters of a cell. As the cells are found, they are put into the TableCells collection, which stores all the table cells linearly, in 1-D form. To retrieve a cell, use the GetCell method of the TableCells collection, which returns the cell at the specified column index (ColIndex argument) and row index (RowIndex argument):
Set Cell = Cells.GetCell(1,2)
You can find the coordinates of a cell in a known table using the TableProcessor Console (see the tutorial below).
Once you have a cell, its value is easy to access:
Value = Cell.Value
And that’s all!
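To make the flat cell model concrete, here is a minimal Python sketch of a 1-D cell collection addressed by column and row index. It is a model of the idea, not the component's API: the Python class and method names mirror TableCell, TableCells and GetCell only for readability, the sample revenue data is invented, and 1-based indexes are an assumption.

```python
from dataclasses import dataclass


@dataclass
class TableCell:
    col_index: int  # column that contains the cell (1-based, assumed)
    row_index: int  # row that contains the cell (1-based, assumed)
    value: str


class TableCells:
    """Flat (1-D) collection of every cell in a table."""

    def __init__(self, cells):
        self._cells = list(cells)

    def __iter__(self):
        return iter(self._cells)

    def get_cell(self, col_index, row_index):
        # Linear scan; no table/row/cell tree is ever built.
        for cell in self._cells:
            if cell.col_index == col_index and cell.row_index == row_index:
                return cell
        return None


# Hypothetical sample data: a small balance table.
rows = [["Month", "Revenue"], ["Jan", "120"], ["Feb", "135"]]
cells = TableCells(
    TableCell(c, r, value)
    for r, row in enumerate(rows, start=1)
    for c, value in enumerate(row, start=1)
)

cell = cells.get_cell(2, 3)  # column 2, row 3
revenue = [c.value for c in cells if c.col_index == 2 and c.row_index > 1]
print(cell.value, revenue)
```

Reading a whole column is a single filter on the column index, with no need to walk rows first as a tree model would require.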
TableProcessor can process not only simple tables like the one above, but also tables with multiple merged (spanned) cells. However, because each cell is represented by one TableCell object, which cannot be merged, a trick is used to preserve the original table structure.
Look at the following weather table:
| | Moscow | New York | Rome |
| Temperature | 63 F (17 C) | 92 F (33 C) | 77 F (25 C) |
| Humidity | 52% |
| Wind | N at 5 mph (8 kph) | N at 3 mph (5 kph) | N at 3 mph (5 kph) |
You can see that the Humidity row has three merged cells (Moscow, New York and Rome all have the same humidity level). How is this table represented in the TableCells collection? TableProcessor creates a separate TableCell object (that is, a separate cell in the resulting table) for each cell covered by the merged cell, and they all get the same value (“52%” in this case). However, these cells are not equal.
Usually, the HTML source contains one cell that holds a value and has a Rowspan (or Colspan) attribute, which extends the boundary of this cell and gives its value to the neighboring cells merged with it. The cell that gives its value to the others is known as the master cell, and the cells that receive this value are known as merged, or spanned, cells. The TableCell object has the IsSpanned property, which shows whether the cell is a merged cell that belongs to a master cell. If it is set to True, you can get the coordinates of the master cell from the MasterCellColIndex and MasterCellRowIndex properties.
This is the weather table as it is represented in the TableCells collection. The master cell and the merged cells are labeled:
| | Moscow | New York | Rome |
| Temperature | 63 F (17 C) | 92 F (33 C) | 77 F (25 C) |
| Humidity | 52% (master) | 52% (merged) | 52% (merged) |
| Wind | N at 5 mph (8 kph) | N at 3 mph (5 kph) | N at 3 mph (5 kph) |
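The expansion of a merged cell into one master cell plus spanned copies can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the component's implementation: the expand_row helper is hypothetical, the field names only mirror IsSpanned, MasterCellColIndex and MasterCellRowIndex, indexes are assumed to be 1-based, and only Colspan is handled (Rowspan would need bookkeeping across rows).

```python
from dataclasses import dataclass


@dataclass
class TableCell:
    col_index: int
    row_index: int
    value: str
    is_spanned: bool = False
    master_col_index: int = 0  # meaningful only when is_spanned is True
    master_row_index: int = 0  # meaningful only when is_spanned is True


def expand_row(row_index, source_cells):
    """Expand one source row, given as (value, colspan) pairs, into flat cells."""
    result = []
    col = 1
    for value, colspan in source_cells:
        master_col = col  # the first covered column holds the master cell
        for offset in range(colspan):
            spanned = offset > 0
            result.append(
                TableCell(
                    col_index=col,
                    row_index=row_index,
                    value=value,  # every covered cell gets the master's value
                    is_spanned=spanned,
                    master_col_index=master_col if spanned else 0,
                    master_row_index=row_index if spanned else 0,
                )
            )
            col += 1
    return result


# The Humidity row: a label cell, then one cell with a column span of 3.
humidity = expand_row(3, [("Humidity", 1), ("52%", 3)])
for cell in humidity:
    print(cell)
```

In this model, code that only needs the value can read any of the three cells, while code that cares about the original structure can follow the master coordinates back to the cell that carried the span attribute.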
TableProcessor also offers great flexibility in choosing which HTML table to process among the tables on a page. You can select a table in five ways:
- By its number on the page. Just count the tables from the top of the page.
- By table name. If the table has the Name attribute specified, you can use it to select the table.
- By custom attribute. If the table has any attribute with a unique value, you can specify the name of this attribute together with its value, and the table will be found.
- By search string. You can specify a text string that matches HTML text within the table you want to process. Use this to find a table by a text string it contains.
- By regular expression pattern. This works like the previous method, but lets you take advantage of the power of regular expressions when matching text inside the table to be selected. In this mode you need to provide a valid regular expression pattern.
Take a look at an example:
MyTableProcessor.SelectCriterion = SelectCriteria.SelectBySearchString
MyTableProcessor.SearchString = "World Weather"
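The selection modes can be modeled in a few lines of Python. This is a rough sketch of the selection logic only, not how the component implements it: the function names and sample page are hypothetical, tables are found with a simple regular expression that does not handle nested tables, attribute values must be quoted, and table numbers are assumed to start at 1.

```python
import re

# Naive table finder: each match runs from "<table" to the next "</table>".
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def find_tables(html):
    return TABLE_RE.findall(html)


def select_by_number(html, number):
    tables = find_tables(html)
    return tables[number - 1] if 1 <= number <= len(tables) else None


def select_by_attribute(html, name, value):
    # Case-insensitive match of name="value" inside the opening <table> tag.
    attr = re.compile(
        r"\b" + re.escape(name) + r"\s*=\s*[\"']" + re.escape(value) + r"[\"']",
        re.IGNORECASE,
    )
    for table in find_tables(html):
        opening_tag = table[: table.index(">") + 1]
        if attr.search(opening_tag):
            return table
    return None


def select_by_name(html, value):
    return select_by_attribute(html, "name", value)


def select_by_search_string(html, search_string):
    for table in find_tables(html):
        if search_string in table:
            return table
    return None


def select_by_pattern(html, pattern):
    compiled = re.compile(pattern)
    for table in find_tables(html):
        if compiled.search(table):
            return table
    return None


# Hypothetical page with two tables.
page = (
    '<table name="stocks"><tr><td>Index</td></tr></table>'
    '<table id="w1"><tr><th>World Weather</th></tr>'
    "<tr><td>52%</td></tr></table>"
)
weather = select_by_search_string(page, "World Weather")
print(weather)
```

Selecting by search string or pattern is useful when the position of a table on the page may change, because the match depends on the table's content rather than on its order.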