A little crawling experiment

Almost a year ago I wrote an article about TYPO3 market share statistics, and I’m preparing to compile a new set of statistics in the early days of 2012. Having had too much time on my hands these last few days, I thought: why not write my own crawler that searches for TYPO3 sites?

I know this is quite a difficult task. Actually, writing the crawler is not that difficult; it’s much more difficult to scale it once the number of domains in the database starts to grow. But I like challenges.

So I started working on it, and after a few hours a basic crawler was born. It scans a website and tries to identify whether it is a TYPO3 website (based on the TYPO3 header comment). Then it extracts all distinct domains from the links and moves on to the next domain.
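To make the two steps concrete, here is a minimal sketch in Python (not the actual crawler code). It assumes the TYPO3 header comment contains the phrase "powered by TYPO3", and the names `PageScanner` and `scan_page` are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: the TYPO3 header comment contains this phrase.
TYPO3_MARKER = "powered by typo3"


class PageScanner(HTMLParser):
    """Collects the TYPO3 marker and the distinct hosts linked from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.is_typo3 = False
        self.domains = set()

    def handle_comment(self, data):
        # Only HTML comments are inspected, not the visible page text.
        if TYPO3_MARKER in data.lower():
            self.is_typo3 = True

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        try:
            # Resolve relative links against the page URL first.
            parsed = urlparse(urljoin(self.base_url, href))
        except ValueError:
            return  # malformed URL, skip it
        if parsed.scheme in ("http", "https") and parsed.hostname:
            self.domains.add(parsed.hostname)


def scan_page(base_url, html):
    """Return (is_typo3, set_of_linked_hosts) for one downloaded page."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    return scanner.is_typo3, scanner.domains
```

Calling `scan_page("http://start.ro/", html)` returns a flag and a set of hosts; the set takes care of the "distinct" part, and each new host can then be queued for its own visit.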

The early data is available here: http://crawler.lacisoft.com/ The crawler is not very fast because I limited its speed (I don’t want to kill my VPS). For the same reason I limited its scope to .ro domains (domains from Romania).
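The two limits could look roughly like this. It is only a sketch under my own assumptions: the one-second default interval is an arbitrary value, and `in_scope`, `Frontier` and `Throttle` are hypothetical names, not parts of the real crawler.

```python
import time
from collections import deque


def in_scope(host, suffix=".ro"):
    """True if the host belongs to the allowed top-level domain."""
    return host.lower().rstrip(".").endswith(suffix)


class Frontier:
    """FIFO queue of hosts to visit, deduplicated and limited to one TLD."""

    def __init__(self, suffix=".ro"):
        self.suffix = suffix
        self.queue = deque()
        self.seen = set()

    def add(self, host):
        host = host.lower().rstrip(".")
        if in_scope(host, self.suffix) and host not in self.seen:
            self.seen.add(host)
            self.queue.append(host)

    def pop(self):
        """Return the next host, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


class Throttle:
    """Sleeps as needed so that calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval=1.0):  # assumed value, tune for your server
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

The main loop would call `throttle.wait()` before every request and feed every discovered host through `frontier.add()`, so out-of-scope and already-seen hosts never reach the queue.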

Another issue is that, for now, it counts a domain twice if it has both a www and a non-www version. Technically I could make it count only one, but I’m not convinced that this would be the right step: theoretically www is a subdomain of the main domain, and it is possible that the www and non-www versions host different websites. I will think more about it.
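Just to illustrate the two counting policies side by side, here is a small sketch. It does not settle the question; `canonical_host` and `count_sites` are hypothetical helpers, and merging is switched by a flag so both numbers can be reported.

```python
def canonical_host(host, merge_www=True):
    """Normalize a host name; optionally fold 'www.example.ro' into 'example.ro'."""
    host = host.lower().rstrip(".")
    # Require at least two dots so a host like 'www.ro' is left alone.
    if merge_www and host.startswith("www.") and host.count(".") >= 2:
        return host[4:]
    return host


def count_sites(hosts, merge_www=True):
    """Count distinct sites under the chosen www policy."""
    return len({canonical_host(h, merge_www) for h in hosts})
```

With the hosts `example.ro`, `www.example.ro` and `shop.example.ro`, the merged count is 2 and the unmerged count is 3. Other subdomains are never merged, only the www prefix.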

For now this is only an experiment; let’s see how it works out.

Later edit: The crawler is still scanning, so the data is not final!

Laszlo Bodor
