
Analysis of the .ai, .ml and .io domains

Introduction:
This article clears the fog around the domains that have recently become popular. In simple terms, a domain is a string (a sequence of characters) that helps a browser, using DNS services, locate the server hosting the requested website (via DNS-to-IP mapping). These strings (domain names) are maintained by a number of DNS hosting companies. A typical example of a domain is .com.

One formal definition of a domain, from Wikipedia:
“A domain name is an identification string that defines a realm of administrative autonomy, authority or control within the Internet.”

With the rise of the Artificial Intelligence / Machine and Deep Learning / Blockchain wave, some domains have gained more prominence than they had before. These domains include .ai, .ml and .io. A one-line introduction to each:

.ai: Administered by Anguilla
.ml: Administered by Mali
.io: Administered by the British Indian Ocean Territory

While .ai and .ml have gained recent prominence because of their phonetic similarity to the abbreviations for Artificial Intelligence (AI) and Machine Learning (ML), the .io domain has its own set of associated technologies. The .ai domain has garnered so much attention that most of the major tech companies (Facebook, Google, Amazon, Microsoft, etc.) have established their brands on it.

This article presents some insights into these domains.

Idea:
Following is the outline of the idea that guided the analysis:
1. Get the list of websites hosted on each of the 3 domains.
2. For each website, gather a few properties that are useful for categorization.

Implementation:

While any language would serve equally well, I used Python to simplify automating and parallelising the tasks (using threads).

1. Getting the list of valid websites:
To get the list of valid names, I had to scrape the search results of most of the major search engines (Google/Bing/Baidu/Yahoo/Ask). To keep things quick and simple, I collected 1000+ website names per domain.
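A minimal sketch of this step, assuming the search-result pages were first saved locally as HTML (the extract_domains helper and the results_ai.html file are placeholders; scraping live search engines also needs pagination, rate limiting and terms-of-service care):

from urllib.parse import urlparse
from bs4 import BeautifulSoup

def extract_domains(html, tld):
    # Collect hostnames ending in the given TLD from anchor tags
    soup = BeautifulSoup(html, 'html.parser')
    domains = set()
    for a in soup.find_all('a', href=True):
        host = urlparse(a['href']).netloc
        if host.endswith('.' + tld):
            domains.add(host)
    return domains

# Hypothetical saved search-results page
with open('results_ai.html') as f:
    ai_sites = extract_domains(f.read(), 'ai')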

2. Gathering different properties of websites:
For each website, I retrieved the content and read the site's properties to produce the required analyses. Furthermore, I pulled Alexa information as well to extract more insights.
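A hedged sketch of this gathering step, fetching each site's meta keywords in parallel with threads (get_keywords is illustrative and reuses the ai_sites set from the previous sketch; the Alexa lookup itself is omitted here):

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def get_keywords(domain):
    # Fetch a site's meta keywords; return an empty string on failure
    try:
        r = requests.get('http://' + domain, timeout=10)
        soup = BeautifulSoup(r.text, 'html.parser')
        tag = soup.find('meta', attrs={'name': 'keywords'})
        return domain, tag.get('content', '') if tag else ''
    except requests.RequestException:
        return domain, ''

with ThreadPoolExecutor(max_workers=20) as pool:
    keywords = dict(pool.map(get_keywords, ai_sites))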

Results and Analysis:

“A picture is worth a thousand words,” complemented by “The drawing shows me at a glance what would be spread over ten pages in a book” (Ivan Turgenev, Fathers and Sons, 1862).

Keeping the above quote in mind, I have presented most of the analysis as graphs and pictures rather than words. If you are still more interested in words, please scroll down to the Conclusions section.

Analysis 1: Website hosting locations

The locations of the hosting servers were obtained from Alexa. I consolidated the numbers (websites hosted per country) across all the domains. Each bubble shows the number of websites hosted on servers in that country. For example, USA (438) means that 438 websites are hosted on servers located in the USA.
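For the curious, a tiny sketch of the tallying behind the bubble chart (site_country is a hypothetical mapping of website to hosting country, as gathered from the Alexa lookups):

from collections import Counter

# site_country: {'example.ai': 'USA', ...}
country_counts = Counter(site_country.values())
for country, n in country_counts.most_common(10):
    print('{}: {} websites'.format(country, n))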

Analysis 2: Rank Analysis

There are 2 good ways of getting the rank (popularity) of a website:

a. Google’s PageRank: It has been some years (since around 2013) since Google stopped updating PageRank. Though I couldn’t find an official source to confirm this, the web was flooded with blogs and news about it. Please refer to this unofficial, but credible, link for more information.

b. Alexa’s Rank: Alexa, from Amazon, provides an alternative to Google’s PageRank, and I used Alexa’s traffic rank. If you are interested in how Alexa ranks each website, please take a look at this. For each website I obtained the Alexa rank and scatter-plotted it to get the plot below:

Remember: the lower the rank, the better.

Fig 2: Alexa Rank comparison among domains
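A minimal sketch of how such a plot could be drawn (ranks is a hypothetical dict mapping each TLD to the list of Alexa ranks collected earlier):

import matplotlib.pyplot as plt

for tld, r in ranks.items():
    plt.scatter(range(len(r)), r, s=10, label='.' + tld)
plt.yscale('log')  # ranks span several orders of magnitude
plt.xlabel('Website index')
plt.ylabel('Alexa traffic rank (lower is better)')
plt.legend()
plt.show()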

Analysis 3: Keyword Analysis

Each website comes with an optional list of keywords that describes its content. These keywords help search engines easily identify the information present on the site.

A WordCloud (or tag cloud) is a great way to visually represent keywords: the font size of a word grows with its frequency in the list. To draw these, I used Mueller’s WordCloud:

Since the keywords differed considerably across the domains, I generated a separate image for each domain:
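A sketch of the generation step using the wordcloud package (keywords_by_domain is a hypothetical dict mapping each TLD to its concatenated keyword text):

from wordcloud import WordCloud

for tld, text in keywords_by_domain.items():
    wc = WordCloud(width=800, height=400, background_color='white')
    wc.generate(text)
    wc.to_file('wordcloud_{}.png'.format(tld))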

Conclusions:
The following points might add to your own insights from the pictures above:

AI domain:
As expected, most of the hot technologies like data science, AI, etc., live on the .ai domain. A quick glance reveals that the hosted websites belong to AI/Analytics/Recruiting/Security.

IO domain:
While most of the hot technologies/keywords appeared in the .ai domain, blockchain has moved to the .io domain.
Interestingly, keywords such as “blockchain/bitcoin/startup” also appeared more frequently in the .io domain than in the .ai domain.

Ranking-wise (Figure 2), the .io domain has surpassed the .ai domain because of the relatively earlier entry of cryptocurrency and blockchain and their popularity.

ML Domain: Movie Land 🙂
Surprisingly, at least to novices like me, most of the popular “woods” (Tollywood, Hollywood, Bollywood) appear in this domain with high frequency. It seems to be the favorite domain for movie-industry-related websites.

This (entertainment quotient) may be the reason the .ml domain has the best rank (refer to Fig 2), surpassing the .ai and .io domains.
Common to all the 3 domains:
Most of the websites are hosted in the US, China and India, which take the top 3 places.

In case you are interested in taking a look at the code, please visit my GitHub repo: Github link

Hope you find the results interesting. Do leave a comment.

PS: Since all results are dynamic, they are subject to change.


Success story of Decision Tree Regressor

Though we have a multitude of regressors to predict/approximate target variable(s), the advanced ones do not always win. This article shows the Decision Tree Regressor (DTR) winning over more advanced ones like Random Forest, etc.

We compared the DTR with the following regressors:

  1. Linear Regressor
  2. ADABoost
  3. Bagging Regressor
  4. Random Forest

Each regressor was tried individually to find its best score before comparison. For example, ADABoost gave its best with 200 boosts, while the Bagging Regressor peaked with 100 bags.

Dataset Introduction:

This is the diabetes dataset that comes with the scikit-learn datasets package. It has 442 rows and 11 columns (10 features plus the target):

Load the dataset:

from sklearn.datasets import load_diabetes

diaData = load_diabetes()

X = diaData.data  # 442 samples x 10 features
Y = diaData.target

print('Shape: {}'.format(X.shape))
Shape: (442, 10)

Regression Trials:

All of the regressors were run with a 20% test split and random seed 123.

from sklearn.model_selection import train_test_split

(XTrain, XTest, YTrain, YTest) = train_test_split(X, Y, test_size=0.2, random_state=123)

1. Linear Regression:

from sklearn.linear_model import LinearRegression

lreg = LinearRegression()
lreg.fit(XTrain, YTrain)

print('Score: {:.2f}'.format(lreg.score(XTest, YTest)))

Score: 0.60

2. ADA Boost:

We tried different numbers of boosts, from 10 to 300. Though the best number of boosts changed between runs (because of the inherent random sampling in the model), the best score stayed roughly the same:

from sklearn.ensemble import AdaBoostRegressor

sen = AdaBoostRegressor(n_estimators=200)
sen.fit(XTrain, YTrain)

Max score 0.61 achieved with 260 boosts

ADA Boost with 260 boosts
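A sketch of the sweep described above (the 10-to-300 range comes from the text; the step size is an assumption):

# Try different boost counts and keep the best test score
best_score, best_n = float('-inf'), None
for n in range(10, 310, 10):
    model = AdaBoostRegressor(n_estimators=n).fit(XTrain, YTrain)
    score = model.score(XTest, YTest)
    if score > best_score:
        best_score, best_n = score, n
print('Max score {:.2f} achieved with {} boosts'.format(best_score, best_n))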

3. Bagged Regressor:

There was a minimal improvement with this regressor:

from sklearn.ensemble import BaggingRegressor

breg = BaggingRegressor(n_estimators=100)
breg.fit(XTrain, YTrain)

Max score 0.64 achieved with 60 bags:

4. Random Forest Regressor:
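No snippet appeared here in the original; a minimal sketch mirroring the other regressors (n_estimators=10 is taken from the result below):

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=10)
rfr.fit(XTrain, YTrain)
print('Score: {:.2f}'.format(rfr.score(XTest, YTest)))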


Max score 0.61 achieved with 10 trees

5. Decision Tree Regressor:
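The original showed no snippet for the DTR either; a minimal sketch under the same setup:

from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor()
dtr.fit(XTrain, YTrain)
print('Score: {:.2f}'.format(dtr.score(XTest, YTest)))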

Comparison of regressors in a single run:

Though the above picture clearly places the DTR as the winner, the intrinsic randomness in the regressors (during bagging, boosting, etc.) can change the whole picture across multiple runs. Hence, to find whether there is variance in these results, all regressors were run over 1000 times, including the training/test data splitting.
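A hedged sketch of the repeated-runs experiment, re-splitting the data on each iteration (the per-run regressor settings are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

wins, runs = 0, 1000
for seed in range(runs):
    XTr, XTe, YTr, YTe = train_test_split(X, Y, test_size=0.2, random_state=seed)
    dtr_score = DecisionTreeRegressor().fit(XTr, YTr).score(XTe, YTe)
    rf_score = RandomForestRegressor(n_estimators=10).fit(XTr, YTr).score(XTe, YTe)
    if dtr_score >= rf_score:
        wins += 1
print('DTR matched or beat Random Forest in {:.1%} of runs'.format(wins / runs))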

Here is the result:

As expected, the DTR was the winner 99.995% of the time; only 0.005% of the time was the Random Forest able to surpass the DTR's score. One more interesting observation: only the DTR and linear regressors were stable across all the runs.

I will cover the “why” in the next article.

Thanks for reading.