How to create your own Web Scraper

There are two different steps involved in Web Scraping:-

1. Crawling web pages
2. Extracting data from these web pages

Typically, crawling begins at a starting URL, from which other links are discovered. You can filter these links to restrict which pages are crawled. XPath is a good fit for this filtering.

Once you have the HTML content of a page, you can extract any piece of information from it. Once again, XPath comes to the rescue for extracting the data.
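As a rough sketch of both steps, the Python snippet below discovers links on a page, filters them, and then extracts two fields with XPath. The page content, the example.com URLs, the class names and the helper functions discover_links and extract_data are all made up for illustration. It also leans on the limited XPath subset in Python's standard xml.etree.ElementTree module, which only parses well-formed markup, so real-world HTML would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed page standing in for fetched HTML content.
PAGE = """
<html>
  <body>
    <h1 class="title">Sample product</h1>
    <span class="price">19.99</span>
    <a href="/products/2">Next product</a>
    <a href="/about">About us</a>
    <a href="/products/3">Another product</a>
  </body>
</html>
""".strip()


def discover_links(root, base_url, must_contain):
    """Step 1 (crawling): collect links, keeping only those that pass the filter."""
    links = []
    for anchor in root.findall(".//a[@href]"):
        url = urljoin(base_url, anchor.get("href"))
        if must_contain in url:
            links.append(url)
    return links


def extract_data(root):
    """Step 2 (extraction): pull individual fields out with XPath expressions."""
    title = root.find(".//h1[@class='title']")
    price = root.find(".//span[@class='price']")
    return {"title": title.text, "price": float(price.text)}


root = ET.fromstring(PAGE)
links = discover_links(root, "http://example.com/products/1", "/products/")
data = extract_data(root)
print(links)
print(data)
```

A real crawler would feed the filtered links back into a queue of pages to visit and keep track of the ones it has already seen.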

Pay attention to the terms of use of the website you are trying to crawl. Don’t extract data from websites that forbid it.

On a side note, you can test out our Web Scraper to see how it is done.

Storing GUID/UUID in a Firebird database

Use the CHAR data type with a size of 16 and the OCTETS character set. This stores the value as 16 raw bytes instead of a 36-character string, thus saving a few bytes per value.
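To see where the saving comes from, here is a small Python sketch that compares the two representations. The GUID value is made up, and the snippet says nothing about the byte order a particular database driver uses. It only compares sizes.

```python
import uuid

# Hypothetical GUID value used only for illustration.
guid = uuid.UUID("6f9619ff-8b86-d011-b42d-00c04fc964ff")

as_text = str(guid)     # hyphenated text form, 32 hex digits plus 4 hyphens
as_binary = guid.bytes  # raw form, the size a CHAR(16) OCTETS column holds

print(len(as_text))
print(len(as_binary))

# The binary form carries the full value and converts back without loss.
restored = uuid.UUID(bytes=as_binary)
print(restored == guid)
```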

Use the following if you are trying to insert values into a GUID column using ADO.NET:-

((FbParameter)parameter).FbDbType = FbDbType.Guid;

Convert the Guid value to a string representation using the following code:-

((Guid)value).ToString("D")

Search & Remove Duplicate Files

Unsurrogater is a product of ours that lets you search for duplicate files on hard disk drives, CD/DVD-ROMs, USB flash drives, FTP servers, SharePoint servers & inside compressed files.

Armed with a plethora of features, Unsurrogater sports a clean user interface built around Job Category based navigation. Starting a duplicate search is straightforward, as wizards guide you through setting up the job.

Search Result

Results are viewed hierarchically in a Tree Table display.

Auto Marking

Files can be marked manually or using the Auto Mark feature that takes hints from the user & automatically selects files across the entire result.

Actions

After marking the files, users can choose to perform operations such as copy, move, delete, archive, replace by hard links, etc.

Reports

Results can be stored historically for viewing later or exported as an HTML/XML Report.

Result Combination

A key feature of Unsurrogater is the ability to combine the search results of multiple jobs & discover new duplicates. This saves a lot of time as you don’t have to run the job again for the files already scanned.

Reflector to the Rescue

I was trying to get a list of all unversioned C# code files in my working copy. Some examples on the internet pointed to Windows PowerShell, which can pipe the output of one command to another (much like what we have in Linux).

The command was:-

(svn stat) -match '^\?.*\.cs$'

The following command gave much cleaner output (just the paths of the files):-

(svn stat "--no-ignore") -match '^\?.*\.cs$' -replace '^.\s+',''
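To make it clearer what the -match and -replace patterns select, here is a rough Python sketch that applies the same two regular expressions to some status output and only lists the result. The sample output, the paths in it and the helper unversioned_cs_files are made up. The IGNORECASE flag is there because PowerShell's -match is case-insensitive by default.

```python
import re

# Hypothetical `svn stat --no-ignore` output; the paths are made up.
STATUS_OUTPUT = """\
?       Controls\\GridView.cs
M       Program.cs
?       notes.txt
I       bin\\Debug\\App.exe
?       Forms\\MainForm.cs
"""

# Same pattern as -match: unversioned ("?") lines that end in .cs
UNVERSIONED_CS = re.compile(r"^\?.*\.cs$", re.IGNORECASE)
# Same pattern as -replace: the status character and the whitespace after it
STATUS_PREFIX = re.compile(r"^.\s+")


def unversioned_cs_files(status_output):
    """Return the paths of unversioned .cs files without touching the disk."""
    paths = []
    for line in status_output.splitlines():
        if UNVERSIONED_CS.search(line):
            paths.append(STATUS_PREFIX.sub("", line))
    return paths


for path in unversioned_cs_files(STATUS_OUTPUT):
    print(path)
```

Note that every file in the list is one Subversion has never stored, so the repository holds no copy of it.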

The next command deleted those files from the PC.

(svn stat "--no-ignore") -match '^\?.*\.cs$' -replace '^.\s+','' | rm

I executed it without paying attention to what it did. The net result was that my precious source files were gone. I looked for them in the Recycle Bin but, being a command-line program, rm does not seem to use it. Then I tried NTFS undelete, which can restore files that were deleted ‘permanently’ too. When you delete a file (SHIFT + DELETE, skipping the Recycle Bin), Windows merely marks the disk space the file occupied as free. This is done for performance reasons: actually erasing the data would take about as long as writing the file.

However, to my dismay, NTFS undelete was unable to find the files. Perhaps it was because of Windows 7 and its ways of handling different versions of a file.

It seemed like I would have to re-create the files. But then it occurred to me that Reflector could come in handy. I had compiled the project previously, so the debug directory had the compiled assemblies. I used Reflector to get the source files back. Although not identical to the original, the decompiled code was good enough for me to re-create the custom user control.

Subversion Bug – Case Insensitive Username

I spent an entire day trying to find the cause of an access denied error while committing changes to the Maxotek Repository. At first it seemed like a program was using one of the files. However, even restarting the system had no effect. Then I tried leaving the culprit file out of the commit and was greeted by the “Access Denied” message again, only this time without any file being named as the cause.

I tried deleting the cached passwords from %AppData%\Subversion\Auth\svn.simple. The next time I got the dialog asking for the account information. I typed the username & password in but still got the error.

While at work, I tried doing the commit and it worked. So, it seemed like an IP address ban. But, then again I don’t have a static IP address. Also, the Update, Revert, Check Out commands worked without a hitch.

Then it occurred to me that it could be a ban by domain name. I use No-IP to map my dynamic IP address to a domain. So I changed the domain, but the results were the same.

I finally tracked it down to a case-sensitivity issue in the username. Check Out, Revert and Update were allowed for the username User, but Commit was not. The actual username was user.

What’s up with Maxotek?

Some of you may know that Maxotek is a one-man company. It was always my dream to set up a software company and distribute software. During the last few years, I have come closer to that dream, with Maxotek making a bit of a name for itself. Creating intuitive software for everyday users has been the main goal.

It has been difficult to do all this while still pursuing my degrees. Support from the users has been a great encouragement. Last month (25th of May, to be precise), I started working at Global IDs as a Trainee Software Developer using Java technologies. Global IDs specializes in Data Integration on a large scale. As a result, Maxotek takes a step back in my priorities.

The chances of new software being released are slim. But we will continue to provide technical support to our valued customers. One day, Maxotek will be back as my main goal. Until then, there is a lot I must learn, especially the software process models and the business aspects of running a company.

Maxotek.com obtained

I had been waiting a long time for this domain to expire. The 75-day grace period plus the 5-day redemption period felt like an eternity. Along the way, I had my doubts about others backordering the domain through GoDaddy. I even tried ordering it myself, but thanks to their requirement of a verified PayPal account, I saved some bucks.

A few days ago, I received an email from a company offering me the opportunity to obtain maxotek.com at a “premium price” before anyone else could, because I am the owner of maxotek.net. It didn’t seem right to me, so I checked the net for others who had experienced the same. From what I read, it was clear that it was a scam, so I played the waiting game and boy, was I rewarded.

What’s next?

Maybe I’ll use the two domains for different activities: .COM for business and .NET for blogging and other internet services. Only time will tell.

Test Drive – Windows 7

I got hold of the Beta version of Microsoft’s forthcoming Operating System – Windows 7. In terms of looks it is very similar to Windows Vista. The first change you will notice is the new Taskbar which merges the Quick Launch toolbar and gets rid of text labels for the items. When I first heard about the new Taskbar, I didn’t think it would do well because icons might not be enough to recognize the applications. This is where the increase in the taskbar height comes in handy. The larger icons provide easy switching and save a lot of horizontal space.

The taskbar is smart enough to detect an ongoing operation and shows a small progress bar behind the icon. Say you are downloading a file using Internet Explorer in the background and working on a Word document. You don’t even need to switch to the download dialog window to see how much of the file has been downloaded. Just glance at the taskbar icon. Handy, eh?

The taskbar also shows multiple thumbnails of grouped items, all of which can be previewed full screen without even having to switch to the windows (no clicks or keypresses required either).

Another improvement is the desktop preview feature which can be activated by moving the mouse inside the vertical rectangle at the extreme right of the taskbar. Again, no clicks required. To get back to the current screen, just move your mouse away.

Coming back to the taskbar icons, I noticed that pinned items which are not open have a whitish glow below their icons. For open windows, the background is a gradient based on the color which forms the majority of the icon (looks really cool, I must say).

Speaking of looks, the boot screen of Windows 7 is what Windows Vista’s should have been.

I guess they rushed Vista’s release and left a lot of things incomplete. Windows 7 will most certainly address these issues and be what Windows Vista should have been after such a long development period. The naming of Windows 7 seems to be more about marketing than anything technical. If you look at the versioning of Windows, you’ll find the major version has been updated only when there has been a substantial change in the Operating System core. Windows 9X systems (95, 98, ME) were all version 4; 2000 & XP were version 5. By the same logic, Vista & Vienna should have been version 6. But, as I read in an article somewhere, Vista is a tainted brand and so Microsoft decided to get the V out. Say bye to Vista/Vienna & hello to Windows 7.