LET’S TALK ABOUT - DEEP WEB
DISCLAIMER: This information is public and for educational purposes only. I have no knowledge of how to use it or of more technical aspects such as software and hardware. If you are looking for that, you are in the wrong place.
MAIN CONCEPT AND WHAT IT IS
The deep web, invisible web, or hidden web are parts
of the World Wide Web whose contents are not indexed by standard search engines
for any reason. The content is hidden behind HTML forms. The opposite term to
the deep web is the surface web, which is accessible to anyone using the
Internet. The deep web includes many very common uses such as web mail and
online banking, but it also includes services that users must pay for and which
are protected by a paywall, such as video on demand, some online magazines and
newspapers, and many more.
TERMINOLOGY
The first conflation of the terms "deep web"
and "dark web" came about in 2009 when the deep web search
terminology was discussed alongside illegal activities taking place on the
Freenet darknet.
Since then, and especially after its use in media reporting on the Silk Road,
many people and media outlets have taken to using "deep web" synonymously with
the dark web or darknet, a comparison many reject as inaccurate and which is
consequently an ongoing source of confusion. Wired reporters
Kim Zetter and Andy Greenberg recommend the terms be used in distinct fashions.
While the deep web is a reference to any site that cannot be accessed through a
traditional search engine, the dark web is a small portion of the deep web that
has been intentionally hidden and is inaccessible through standard browsers and
methods.
SIZE
In 2001, Michael K. Bergman described how
searching on the Internet can be compared to dragging a net across the surface
of the ocean: a great deal may be caught in the net, but there is a wealth of
information that is deep and therefore missed. Most of the web's information is
buried far down on sites, and standard search engines do not find it.
Traditional search engines cannot see or retrieve content in the deep web. The
portion of the web that is indexed by standard search engines is known as the surface
web. As of 2001, the deep web was several orders of magnitude larger than the
surface web. An iceberg analogy used by Denis Shestakov represents the
division between the surface web and the deep web:
- It is impossible to measure, and hard to put estimates on, the size of the deep web, because the majority of the information is hidden or locked inside databases. Early estimates suggested that the deep web is 400 to 550 times larger than the surface web. However, since more information and sites are always being added, it can be assumed that the deep web is growing at a rate that cannot be quantified.
- Estimates based on extrapolations from a study done at University of California, Berkeley in 2001 speculate that the deep web consists of about 7.5 petabytes. More accurate estimates are available for the number of resources in the deep web: research of He et al. detected around 300,000 deep web sites in the entire web in 2004, and, according to Shestakov, around 14,000 deep web sites existed in the Russian part of the Web in 2006.
NON-INDEXED CONTENT
As Frank Garcia put it in a 1996 article: "It would be a site that's possibly
reasonably designed, but they didn't bother to register it with any of the
search engines. So, no one can find them! You're hidden. I call that the
invisible Web."
CONTENT TYPES
Methods which prevent web pages from being indexed by
traditional search engines may be categorized as one or more of the following:
1. Contextual web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
2. Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
3. Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard or CAPTCHAs, or a no-store directive, which prohibits search engines from browsing them and creating cached copies).
4. Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
5. Private web: sites that require registration and login (password-protected resources).
6. Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from web servers via Flash or Ajax solutions.
7. Software: certain content is intentionally hidden from the regular Internet, accessible only with special software such as Tor, I2P, or other darknet software. For example, Tor allows users to access websites using the .onion server address anonymously, hiding their IP address.
8. Unlinked content: pages which are not linked to by other pages, which may prevent web-crawling programs from accessing the content. This content is referred to as pages without backlinks (also known as inlinks). Also, search engines do not always detect all backlinks from searched web pages.
9. Web archives: web archival services such as the Wayback Machine enable users to see archived versions of web pages across time, including websites which have become inaccessible and are not indexed by search engines such as Google.
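The Robots Exclusion Standard mentioned in item 3 can be seen in action with Python's standard library. This is a minimal sketch using a hypothetical robots.txt, just to show how a well-behaved crawler decides which paths it may index (the paths and user-agent name are made up for illustration):

```python
from urllib import robotparser

# Hypothetical robots.txt illustrating the Robots Exclusion Standard:
# everything under /private/ is off-limits to all crawlers.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /public/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler checks before fetching each URL:
print(rp.can_fetch("MyCrawler", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/data.html"))  # False
```

Anything a crawler skips this way never reaches the search index, which is exactly how such "limited access" content ends up in the deep web.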
INDEXING METHODS
While it is not always possible to directly discover a
specific web server's content so that it may be indexed, a site potentially can
be accessed indirectly (due to computer vulnerabilities).
To discover content on the web, search engines use web
crawlers that follow hyperlinks through known protocol virtual port numbers.
This technique is ideal for discovering content on the surface web but is often
ineffective at finding deep web content. For example, these crawlers do not
attempt to find dynamic pages that are the result of database queries due to
the indeterminate number of queries that are possible. It has been noted that
this can be (partially) overcome by providing links to query results, but this
could unintentionally inflate the popularity of a member of the deep web.
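The limitation described above can be sketched in a few lines. A traditional crawler extracts <a href> links from a page and follows them, but it never fills in and submits a <form>, so any pages generated from that form's database queries stay invisible. The sample page below is invented for illustration:

```python
from html.parser import HTMLParser

# Hypothetical page: two plain hyperlinks plus a search form
# whose results a link-following crawler will never see.
sample_html = """
<html><body>
  <a href="/about.html">About</a>
  <a href="/contact.html">Contact</a>
  <form action="/search"><input name="q"></form>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags, the way a basic crawler discovers URLs."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)  # ['/about.html', '/contact.html'] — /search is never queried
```

The crawler discovers only the two static pages; the indeterminate space of queries behind /search is exactly the dynamic content the surrounding text describes.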
DeepPeep, Intute, Deep Web Technologies, Scirus, and
Ahmia.fi are a few search engines that have accessed the deep web. Intute ran
out of funding and is now a temporary static archive as of July 2011. Scirus
retired near the end of January 2013.
Researchers have been exploring how the deep web can
be crawled in an automatic fashion, including content that can be accessed only
by special software such as Tor. In 2001, Sriram Raghavan and Hector
Garcia-Molina (Stanford Computer Science Department, Stanford University)
presented an architectural model for a hidden-Web crawler that used key terms
provided by users or collected from the query interfaces to query a Web form
and crawl the Deep Web content. Alexandros Ntoulas, Petros Zerfos, and Junghoo
Cho of UCLA created a hidden-Web crawler that automatically generated
meaningful queries to issue against search forms. Several form query languages
(e.g., DEQUEL) have been proposed that, besides issuing a query, also allow
extraction of structured data from result pages. Another effort is DeepPeep, a
project of the University of Utah sponsored by the National Science Foundation,
which gathered hidden-web sources (web forms) in different domains based on
novel focused crawler techniques.
Commercial search engines have begun exploring
alternative methods to crawl the deep web. The Sitemap Protocol (first
developed and introduced by Google in 2005) and mod_oai are mechanisms that
allow search engines and other interested parties to discover deep web
resources on particular web servers. Both mechanisms allow web servers to
advertise the URLs that are accessible on them, thereby allowing automatic
discovery of resources that are not directly linked to the surface web. Google's
deep web surfacing system computes submissions for each HTML form and adds the
resulting HTML pages into the Google search engine index. The surfaced results
account for a thousand queries per second to deep web content. In this system,
the pre-computation of submissions is done using three algorithms:
- Selecting input values for text search inputs that accept keywords,
- Identifying inputs which accept only values of a specific type (e.g., date), and
- Selecting a small number of input combinations that generate URLs suitable for inclusion into the web search index.
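The Sitemap Protocol mentioned above is just an XML file in which a server lists the URLs it wants crawlers to know about, including pages with no inbound links. A minimal sketch of building one, with invented example URLs standing in for database-backed pages:

```python
import xml.etree.ElementTree as ET

# Sitemap XML namespace defined by the Sitemap Protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical query-result pages the server wants indexed even though
# nothing on the surface web links to them.
urlset = ET.Element("urlset", xmlns=NS)
for loc in ["https://example.com/db-report?id=1",
            "https://example.com/db-report?id=2"]:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc

sitemap_xml = ET.tostring(urlset, encoding="unicode")
print(sitemap_xml)
```

A crawler that fetches this file can index the listed pages directly, which is how the protocol surfaces resources that hyperlink-following alone would miss.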
In 2008, to make it easier for users of Tor hidden services to access and
search .onion sites, Aaron Swartz designed Tor2web, a proxy application able
to provide access by means of common web browsers. Using this application,
deep web links appear as a random string of letters followed by the .onion
TLD.
LET ME KNOW YOUR COMMENTS