A little crawling experiment


Almost a year ago I wrote an article about TYPO3 market share statistics, and I’m preparing a new round of statistics for the early days of 2012. Having had too much time on my hands these last few days, I thought: why not write my own crawler that searches for TYPO3 sites?

I know this is quite a difficult task. Actually, writing the crawler is not that difficult; it is much harder to scale it once the number of domains in the database starts to grow. But I like challenges.

So I started working on it, and after a few hours a basic crawler was born. It scans a website and tries to identify whether it is a TYPO3 website (based on the TYPO3 header comment). Then it extracts all distinct domains from the links and moves on to the next domain.
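A minimal sketch of those two steps in Python, using only the standard library. This is an illustration, not the crawler's actual code: the marker phrase, the function names `is_typo3` and `extract_domains`, and the `LinkCollector` class are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: TYPO3 pages usually carry a header comment containing this phrase.
TYPO3_MARKER = "this website is powered by typo3"


def is_typo3(html):
    """Return True if the page source contains the assumed TYPO3 marker phrase."""
    return TYPO3_MARKER in html.lower()


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_domains(html, base_url):
    """Return the set of distinct host names found in the page's links."""
    collector = LinkCollector()
    collector.feed(html)
    domains = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before parsing.
        parsed = urlparse(urljoin(base_url, href))
        if parsed.scheme in ("http", "https") and parsed.hostname:
            domains.add(parsed.hostname)
    return domains
```

Non-HTTP links such as `mailto:` are skipped, and relative links resolve to the domain of the page being scanned.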

The early data is available here: http://crawler.lacisoft.com/ The crawler is not very fast because I limited its speed (I don’t want to kill my VPS). For the same reason I also limited its scope to .ro domains (domains from Romania).
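The throttling and the .ro restriction could look roughly like the sketch below. It is not the real crawler loop: `crawl`, `in_scope`, the `fetch_domains` callback, and the one-second default delay are hypothetical names and values chosen for the example.

```python
import time
from collections import deque


def in_scope(domain, tld=".ro"):
    """Return True if the domain belongs to the allowed top-level domain."""
    return domain.lower().rstrip(".").endswith(tld)


def crawl(seeds, fetch_domains, delay_seconds=1.0, max_domains=100, sleep=time.sleep):
    """Visit domains breadth-first, pausing after each one.

    fetch_domains(domain) is a caller-supplied function (hypothetical here)
    that scans one domain and returns the domains it links to.
    """
    queue = deque()
    seen = set()
    for domain in seeds:
        if in_scope(domain) and domain not in seen:
            seen.add(domain)
            queue.append(domain)

    visited = []
    while queue and len(visited) < max_domains:
        domain = queue.popleft()
        visited.append(domain)
        for found in fetch_domains(domain):
            # Out-of-scope and already-known domains are never queued.
            if in_scope(found) and found not in seen:
                seen.add(found)
                queue.append(found)
        # Fixed pause between domains keeps the load on the server low.
        sleep(delay_seconds)
    return visited
```

Passing `sleep` as a parameter makes it possible to try the loop out without actually waiting, and `max_domains` puts a hard cap on a single run.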

Another issue is that, for now, it counts a domain twice if it has both a www and a non-www version. Technically I could make it count only one, but I’m not convinced this would be the right step: www is a subdomain of the main domain, so it is possible that the www and non-www versions host different websites. I will think more about it.
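If merging turns out to be the right choice, it could be as small as the sketch below. The function names and the `merge_www` switch are assumptions for the example; leaving the switch off reproduces the current behaviour of counting both versions.

```python
def canonical_domain(host, merge_www=True):
    """Normalise a host name, optionally folding 'www.' into the bare domain."""
    host = host.lower().rstrip(".")
    # Only strip 'www.' when something domain-like remains afterwards.
    if merge_www and host.startswith("www.") and host.count(".") >= 2:
        return host[len("www."):]
    return host


def count_sites(hosts, merge_www=True):
    """Count distinct sites after normalisation."""
    return len({canonical_domain(h, merge_www) for h in hosts})
```

For example, `count_sites(["example.ro", "www.example.ro", "other.ro"])` gives 2 with merging and 3 with `merge_www=False`. Note that this merges purely by name, so it would hide exactly the case described above, where the www and non-www versions host different websites.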

For now this is only an experiment; let’s see how it works out.

Later edit: The crawler is still scanning, the data is not final!
