Mark Smith

ASP.NET, SQL Server, HTML, CSS and other random thoughts!

Convert HTML tables to a DataSet

I've just had a requirement to extract some data from a web page and manipulate it before sending it off elsewhere. The data in the web page was all included inside <table> tags, so I wrote a few Regular Expressions to extract this data and place it in a DataSet object.

Read More...

If you have any comments on this method, or have any other methods you would recommend, please let me know.
 


 

Comments

Convert HTML tables to a DataSet - Mark Smith said:

Pingback from  Convert HTML tables to a DataSet - Mark Smith

# December 6, 2007 5:07 AM

rrobbins said:

Hmm, using regular expressions to parse HTML is always problematic. I think I would try to generate an Excel spreadsheet using Response.ContentType = "application/vnd.ms-excel", save the spreadsheet to disk, and then use it as a data source for a OleDb connection although you really need a named data range to query for data in a spreadsheet.

# December 6, 2007 10:29 AM

ca8msm said:

Yes, maybe the Regular Expression approach may be problematic if the HTML is invalid (which was a disclaimer in my article) but otherwise it seemed to work correctly.

I don't think I'm so keen on the Excel based solution though as it seems to have a lot more overhead (i.e you'd have to try and work out where each table started and ended in the resulting file, and then create a named range for each table via the excel object model, which would then also require excel to be installed to the server). Unless of course I'm mistaken and there's another way to utilise excel in this scenario?

# December 6, 2007 11:12 AM

Dave said:

You can save straight table data as an Excel file and then open it with the OleDb provider, without Excel being installed.

# December 6, 2007 1:10 PM

ca8msm said:

How would you do that Dave? I thought you had to have named ranges to query an Excel file via OleDb? If I'm correct then the only way to create these named ranges would be to open the file with the object model hence needing Excel to be installed.

# December 6, 2007 1:26 PM

Charles Owen said:

You can query the spreadsheet as if it were a database table.

string strConn = @"Provider=Microsoft.Jet.OLEDB.4.0;    Data Source=C:\data\mydata.xls;Extended Properties=    ""Excel 8.0;HDR=YES;""";

SELECT * FROM [Sheet1$];

support.microsoft.com/.../306572

# December 14, 2007 6:59 PM

ca8msm said:

Charles,

I understand that you can query a spreadsheet, that is not the issue (see the comments above which already go into this). The problem I would have had using that approach is that there would have been multiple tables of completely unrelated data contained within "Sheet1". Therefore, you would end up with a DataTable (or equivilant object) with columns of data that bear no relevance from one table to another and no way of knowing when one table finished and another started. For example, the HTML page could have contained this data:

Table 1:

Field1 Field2

1      2

3      4

Table 2:

Name   Country

Mark   England

Dave   France

Using your approach, I would end up with one DataTable which contains two DataColumns. However, the rows within that DataTable would contain the headers from table 2 along with the data from table 2. Table 1's data has no relevance to Table 2's data, yet you would have them both within one DataTable and therefore no way of knowing when the second table's data started.

The only way around this would be to use the named ranges approach as mentioned above, but this brings in the overhead of having Excel installed on the server.

The method I used, means none of this is necessary so with valid HTML, I think this is the best method I've seen so far.

# December 16, 2007 12:23 PM

Mikesdotnetting said:

The problem with trying to query an "Excel" file that was generated via html is that it isn't an Excel file.  It's an html file with a .xls extension.  Jet doesn't know what to do with it, and will complain of unrecognised file formats if you try to use it to open and read the file.

I would have used regular expressions too.

# December 18, 2007 11:20 AM
Leave a Comment

(required) 

(required) 

(optional)

(required)