Extracting data from Web Pages

Partho Sarathi

29 Oct, 2013 · 3 minutes read

Considering the volume of data available on the World Wide Web today, it is a no-brainer that people will often need to extract data from it. This is the first part of a series of posts that will show you how to extract data from web pages using Data Utensil.

For those of you who are new here, Data Utensil is our product which aims to be “The single tool for all your data needs”. Now that is a big goal, but we are working towards it, albeit in tiny steps. As of writing this article, you can use Data Utensil to explore & manage databases, compare schemas & data, import & export tabular data in various formats, and crawl websites. I will stop my bantering there and get right to the topic.

Extracting data from web pages comprises two main activities: crawling web pages & importing the data from the HTML markup. Data Utensil splits these two activities into different Jobs. A job is a long-running activity which runs in the background. Different types of jobs accomplish different things. For instance, the data comparison job compares data in the tables of two schemas, while the copy schema job copies the data of tables in one schema to another.

The Crawl Website job crawls web pages and dumps the resulting HTML as files. Crawling starts with one or more URLs, from which Data Utensil discovers new URLs to crawl, dumping the HTML markup of each web page as it goes. You can specify exclude path filters to keep URLs from being crawled, or include path filters to restrict the crawling to matching URLs only. All the HTML dumps are saved in a folder of your choice. The dumps are stored in folders and files that try to mimic the path hierarchy of the URL. So, the following page: http://maxotek.com/products/data_utensil will be saved to C:\My Crawls\maxotek.com\products\data_utensil.html
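To make the URL-to-file mapping and the path filters concrete, here is a minimal Python sketch. It is not Data Utensil's actual code: the function names dump_path and is_allowed, the prefix-based filter matching and the "index" name for a site's root page are all assumptions made for illustration.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlsplit


def dump_path(url: str, root: str) -> PureWindowsPath:
    """Map a URL to a dump file path that mirrors the URL's path hierarchy."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        segments = ["index"]  # assumption: file name used for the site root
    *folders, leaf = segments
    # Host name first, then one folder per path segment, then the page file.
    return PureWindowsPath(root, parts.netloc, *folders, leaf + ".html")


def is_allowed(url: str, include=(), exclude=()) -> bool:
    """Decide whether a URL should be crawled (assumed prefix-based filters)."""
    path = urlsplit(url).path
    if any(path.startswith(prefix) for prefix in exclude):
        return False
    # With no include filters every non-excluded URL is crawled.
    return not include or any(path.startswith(prefix) for prefix in include)


print(dump_path("http://maxotek.com/products/data_utensil", r"C:\My Crawls"))
```

With the example page from above, the sketch produces the same location, C:\My Crawls\maxotek.com\products\data_utensil.html.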

After the crawl completes, you end up with a bunch of HTML files. These serve as the input to the next step, Import Table from HTML. There can be multiple types of tabular datasets in these files. To choose the correct one, you can select a file from the dump & see a preview of the tabular datasets in it. You can easily switch between the tabular datasets contained in the file to locate the correct one. The software identifies the table through its XPath and column names. Data from all tables matching these two criteria will be appended to a new table. In addition to the columns in the table, you can specify virtual columns, which can extract data from any HTML node/attribute in the dump file using XPaths. Virtual columns can also extract the name of the HTML file or of any folder in its path.
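The matching rule described above (same XPath, same column names, rows appended to one table, plus virtual columns) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Data Utensil is implemented: import_tables and its parameters are hypothetical names, the markup is assumed to be well-formed so that the standard library's ElementTree can parse it, and only the limited XPath subset that ElementTree supports is used.

```python
import xml.etree.ElementTree as ET


def import_tables(pages, table_xpath, expected_columns, virtual_columns=None):
    """Collect rows from every table that matches both the XPath and the header.

    pages maps a dump file name to its (well-formed) markup.
    virtual_columns maps a column name to an XPath evaluated on the whole page.
    """
    virtual_columns = virtual_columns or {}
    rows = []
    for file_name, markup in pages.items():
        root = ET.fromstring(markup)
        for table in root.findall(table_xpath):
            trs = table.findall(".//tr")
            if not trs:
                continue
            header = [(th.text or "").strip() for th in trs[0].findall("th")]
            if header != expected_columns:
                continue  # same XPath but different columns: not our dataset
            for tr in trs[1:]:
                cells = [(td.text or "").strip() for td in tr.findall("td")]
                record = dict(zip(header, cells))
                # Virtual columns: values taken from elsewhere in the page.
                for name, xpath in virtual_columns.items():
                    record[name] = (root.findtext(xpath) or "").strip()
                # Virtual column holding the name of the dump file.
                record["source_file"] = file_name
                rows.append(record)
    return rows
```

Calling it with a table XPath such as ".//table[@class='data']", the expected column names and a virtual column like {"category": ".//h1"} returns one dictionary per table row, each carrying the extra category and source_file values.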

After configuring the columns & virtual columns, you choose the schema & specify a name for the new database table that will be created. Finally, you specify a Name & Description for this Job.

When you run the job, it uses this configuration to import the data from the HTML markup.

I am working on a new type of Job that can automatically extract multiple datasets from HTML dumps. The next article will focus on how this automation can save time, while accomplishing the same results.

web scraping