In this article, we will focus mainly on how search engines work. Thereafter, we will discuss indexability and the differences between the surface web, the deep web and the darknet.
There was a time when goods and services important to us would appear either in the newspaper classifieds section or in the Yellow Pages. There were also those bulky BSNL/MTNL telephone directories, offered free on taking a new connection.
These indices used to have exhaustive contact information about all sorts of legitimate businesses. Needless to say, they did not contain contact information of businesses working outside the purview of public policy. Guns, pornography, drugs, etc. had no place in these public directories.
Worth noting is the fact that there was no mechanism to collect user preferences or user data. Users of these directories had no role to play in the economy other than paying money to buy one.
Then came the millennium: businesses started getting online, and directories went online as well. Starting in 1995, Google indexed close to 60 million pages in a span of three years,1 as did many other search engines that shared the competitive advantage of starting early.
Google was nothing innovative when it started. It was based on the same approach its competitors were using: scraping sites and counting hyperlinks.
Scraping is the use of automated software to read websites and store their information. To scrape a website, you need to know its link; there is no other way to find a website if you do not know the link.
The algorithm takes as its starting point a popular website where people share links to their own websites. It scrapes that website for unique links, then visits each of those links in turn to scrape them and find further links to other websites.
Over time, it builds a list of websites through this method and allocates each a rank, calculated on the basis of how many other websites link to that specific website.
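The crawl loop described above can be sketched in a few lines of Python. The tiny in-memory "web" below stands in for real HTTP fetches and link extraction, and the site names are invented for illustration:

```python
from collections import deque

# A toy "web": each site maps to the links found on its pages.
# A real crawler would fetch pages over HTTP and parse out the links.
WEB = {
    "seed.example":  ["alpha.example", "beta.example"],
    "alpha.example": ["beta.example", "gamma.example"],
    "beta.example":  ["alpha.example"],
    "gamma.example": [],
}

def crawl(start):
    """Breadth-first crawl: visit the seed, collect its links,
    then visit each newly discovered site exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        site = queue.popleft()
        order.append(site)
        for link in WEB.get(site, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("seed.example"))
# → ['seed.example', 'alpha.example', 'beta.example', 'gamma.example']
```

The `seen` set is what stops the crawler from looping forever when two sites link to each other, as alpha.example and beta.example do here.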
For example, if three people link to IndiaTechLaw.com and ten people link to IndiaCorpLaw.in, then IndiaCorpLaw.in gets the higher rank.
This rank is useful when websites contain similar information: the search engine shows the highest-ranked websites at the top of its search results.
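A minimal sketch of this link-counting rank, using the example above (the names of the linking sites are made up for illustration):

```python
from collections import Counter

# Each entry is one hyperlink: (linking site, linked-to site).
# Three sites link to IndiaTechLaw.com and ten to IndiaCorpLaw.in,
# mirroring the example above.
links = [(f"blog{i}.example", "IndiaTechLaw.com") for i in range(3)]
links += [(f"firm{i}.example", "IndiaCorpLaw.in") for i in range(10)]

def rank(links):
    """Rank sites by how many other sites link to them."""
    counts = Counter(target for _source, target in links)
    return counts.most_common()  # highest inbound-link count first

print(rank(links))
# → [('IndiaCorpLaw.in', 10), ('IndiaTechLaw.com', 3)]
```

Real ranking algorithms such as PageRank also weight each inbound link by the rank of the site it comes from, but counting is the core idea.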
Search engines themselves do not like to be scraped, though; they use technologies like rate limiting and browser detection to defend against automated software. This is one reason the internet is not actually free.
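Rate limiting of the kind mentioned above can be sketched as a sliding-window check on each client's recent requests; this is one common approach, and the limit and window sizes below are arbitrary:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per client in any `window` seconds."""
    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client -> recent request timestamps

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client]
        # Discard timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # too many requests: likely automated, so refuse
        q.append(now)
        return True

# Four rapid requests, then one after the window has passed.
limiter = RateLimiter(limit=3, window=1.0)
results = [limiter.allow("scraper", now=t) for t in (0.0, 0.1, 0.2, 0.3, 1.5)]
print(results)
# → [True, True, True, False, True]
```

The fourth request is refused because three requests already landed within the one-second window; the fifth succeeds because the earlier ones have expired.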
The more information you scrape, the more business you get, and here we are talking of petabytes of information. The largest search databases attract the most visitors, and with them the opportunity to show ads. The ads come from websites that would not otherwise have appeared at the top of the search results. They earn revenue for the search engine and drive footfall to the advertiser.
That is how the internet works. The moment you visit a popular search engine like Google or Bing, you become part of this economy, where the goods being sold are you (and your search preferences).
However, in a bid to preserve the pre-internet state of affairs, some search engines do not collect user information. DuckDuckGo (the largest and most secure), StartPage.com (private, but fetches its results from Google) and WolframAlpha (popular for scientific use) are among the few.
On the basis of search-engine indexability, the internet is classified into three areas: the surface web, the deep web and the darknet. This classification is intentionally made analogous to the oceans.
Pages and websites which can be found and indexed by search engines are collectively known as the surface web. Their links are popular, documented on other indexable websites, freely accessible, and not yet shut down by law enforcement agencies.
The surface web, easily visible and indexed by most search engines, is only the tip of the iceberg. The rest of the internet is made up of the deep web, where probing is difficult.
The deep web makes up some 99% of the internet. Its owners do not want their web properties to be publicly available, and have either locked them behind authentication or asked search engines to remove their links from search results.
For example, research universities keep their databases online, but protected by user IDs and passwords. There is a treasure of information in such places that is simply not freely accessible, and search engines cannot index it for the same reason.
Estimates say that the deep web is several orders of magnitude larger than the surface web. One estimate from 2001 puts the deep web at 400 to 550 times the size of the surface web.2
One thing worth noting is that deep web properties do not need to keep changing their links or server addresses, as they do not host illegal or criminal material.
That’s about the deep web.
The last bit of the internet justifies the name used to describe it: it is as dark and inaccessible as the lower reaches of an ocean.
“Darknet” is a term coined in the ARPANET era to mean networks that exist but are unresponsive to the prevalent network protocols. Unlike the deep web, these are inaccessible in addition to being unindexed.
They can be accessed only through frequently changing domain names, IP addresses, network protocols and so on. They have to keep changing their configurations to avoid detection by government agencies.
The Tor Network
The Tor network consists of more than 7,000 nodes spread across the globe that anonymise user access data. It defends against organised surveillance and censorship, and helps protect freedom of speech and expression.
Incidentally, it was a US Government-funded project, which, given its tendencies, may surprise you. It started out as a network-anonymising tool for confidential communication at DARPA and the US Navy.
Even now, a branch of the U.S. Navy uses Tor for visiting and studying web resources during intelligence gathering.3 Law enforcement agencies use Tor to keep their footprints clean and leave no government IP addresses in the hosts’ web logs.
Eventually it was open-sourced, and it has grown into a liberator in the face of surveillance and censorship.
Upon installing the Tor hidden-service software on a server, the host gets a seemingly random domain derived from its cryptographic keys, which may look like this:
- http://3g2upl4pq6kufc4m.onion/ (DuckDuckGo’s onion site)
- https://www.facebookcorewwwi.onion/ (Facebook’s onion site)
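These 16-character names are not assigned by any registrar. Under the older (v2) Tor hidden-service design, the name is derived from the service’s public key: the first 80 bits of the SHA-1 hash of the key, base32-encoded. A sketch of that derivation, using stand-in bytes rather than a real DER-encoded key:

```python
import base64
import hashlib

def onion_name(public_key_der: bytes) -> str:
    """Derive a v2-style .onion name: base32 of the first 10 bytes
    (80 bits) of the SHA-1 digest of the service's public key."""
    digest = hashlib.sha1(public_key_der).digest()
    return base64.b32encode(digest[:10]).decode("ascii").lower() + ".onion"

# Stand-in bytes for illustration; a real service would pass in
# its actual DER-encoded RSA public key.
print(onion_name(b"example-public-key"))
```

This is why the names look randomised: they are hashes, not chosen words. Vanity names like facebookcorewwwi were found by generating enormous numbers of keys until one hashed to a readable prefix.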
Governments of countries like China and Egypt do not want the outside internet to influence their citizens, so they constantly track down and block these websites. This is another advantage of the randomised domain names: if one is blocked, the service can simply generate and publish a new one.
And the best part is that the entire network is usable free of cost.
Drugs, Demons and the Darknet
While the earlier section introduced the Tor network and the good work it has done for the global internet, a lot remains unsaid.
This is best expressed in pictures:
Yeah, is that about it? No. The darknet has also got new AK-47s, rocket launchers, shotguns, etc.
These goods are delivered in secretive packaging at a high delivery cost. However, police forces across the globe are fairly adept at tracking such suspicious packages, keeping society safe.4
And although I have personally not come across anything nastier than this, some quarters say that slavery, human trafficking, illegal organ markets and child pornography are also available on some domains.
Accessing the Darknet
Irrespective of your reason for accessing banned websites, you should know that they are banned because they do not align with the current laws of our countries.
They are not banned unreasonably: most of these websites harbor all sorts of malware, which can remotely access your computing device and use it in furtherance of all sorts of cybercrime. And you would never know about it until it is too late.
However if you are in China, and you are missing Facebook, here is what you have to do:
- Install Orbot and Orfox on your phone from Google Play store
- Start Orbot and click Browse
- Once you see the “Congratulations. This browser is configured to use Tor” message you may visit this onion URL: https://www.facebookcorewwwi.onion/
- You can also search DuckDuckGo for websites which have indexed useful .onion sites
Similarly, you may visit other onion URLs as and when you discover them. Most of these URLs keep changing over time.
If you liked the article, please share it with your followers. If you have doubts or questions about any part of it, feel free to leave a comment below or ask the author directly.