Web crawling is pretty fascinating, but reading about it is typically super boring.
This article is for non-developers who need to understand why web crawling matters because they want their websites to get better visibility, acquire more organic traffic, and make more money.
I wholeheartedly promise to make it not boring. And to prove that to you, I will start it off with a semi-relevant joke:
I wonder what my parents did to fight boredom before the internet? I asked my 15 brothers and sisters and they didn’t know either.
In this article we cover:
Let's start by defining web crawler.
Web crawlers (also called 'spiders', 'bots', 'spiderbots', etc.) are software applications whose primary directive in life is to navigate (crawl) around the internet and collect information, most commonly for the purpose of indexing that information somewhere.
They're called "web crawlers" because crawling is actually the technical term for automatically accessing a website to obtain data using software. Essentially, a crawler is kind of like a virtual librarian. It looks for info on the internet, and then sends it to a database for organizing, cataloguing, etc. so that the crawled information is quickly & easily retrievable by search engines when needed (like when you perform a search).
Most people call web crawlers either crawlers, bots, or spiders.
I really don't think many people call them spiderbots, but it's fun to say.
Spiderbot! Your mission, should you choose to accept it, is to never-endingly roam the extraordinarily large (and continuously expanding) internet and collect all its information, and put it into our index. Now go forth, acquire & extract!
That's pretty much how it works.
Now, who the F in their right mind would want to go through the internet and catalogue all that information? It sounds like the punishment that would be given to the utmost of sinners, by Lucifer himself.
That's Googlebot: he/she/it is a tireless robot (piece of software) that runs around the internet, takes the publicly available information it finds (the information on your website, the public posts you load onto social media websites, etc.), and sends it into the Google index. This is how search engines work.
Google isn't the only one though – other search engine companies (like Yahoo, Bing, etc.) make their money by providing information to us people glued to our computers and phones searching things 24/7 – but they need to acquire that information somehow. They do this with web crawlers.
The primary goal of a web crawler is to create an index (more on this later) and to learn what every web page on the internet is about, so the information can be retrieved by search engines and provided to you (the searcher) extremely quickly, and with great accuracy – meaning providing you results that answer the search intent of whatever it is you typed (or spoke) into the search engine.
The internet is like a continuously growing library with billions of books (websites), but no official/central filing system. So, search engine companies use internet-based software known as web crawlers to discover publicly available webpages - like your website.
Web crawlers systematically browse the internet to find websites. But how do they find all different websites? And how do they find all the pages on your website?
Then, they copy the information on the web pages they find (text, HTML, hyperlinks, metadata, etc.) and send it to their search engine mothership (the web crawler's company servers), which downloads the webpages into its enormous databases and organizes / indexes the information so that it can be searched and referenced very quickly.
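If you're curious what "find a page, copy it, follow its links" looks like in practice, here is a minimal sketch in Python. It is not how Googlebot actually works (nobody outside Google knows the details). The four-page FAKE_WEB below is a made-up mini-internet so the example runs without touching any real website, and the names LinkCollector and crawl are invented for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A pretend mini-internet (made-up URLs and content) so nothing real is fetched.
FAKE_WEB = {
    "https://example.com/": '<title>Home</title><a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<title>About</title><a href="/">Home</a>',
    "https://example.com/blog": '<title>Blog</title><a href="/blog/post-1">Post 1</a>',
    "https://example.com/blog/post-1": '<title>Post 1</title><a href="/blog">Back</a>',
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit a page, store a copy, then queue every new link found on it."""
    to_visit = deque([start_url])
    seen = {start_url}   # pages already queued, so we never loop forever
    stored = {}          # the "mothership" copy: url -> html
    while to_visit:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:  # page does not exist in our pretend web
            continue
        stored[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)  # turn "/about" into a full URL
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return stored


pages = crawl("https://example.com/", FAKE_WEB.get)
print(sorted(pages))  # all four pages, found by following links from the home page
```

Notice the crawler only ever started with the home page. It found everything else by following links, which is why pages that nothing links to are so hard for crawlers to discover.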
Was that definitely 100% technically accurate? I don't know. I'm not a web developer. But it's close enough for you to get the overall idea of how it works without you having to re-read the definition 17 times and still be confused.
We aren't trying to be that website where you read something like:
Anyways – web crawlers send information into Google's database in a way that it can be accessed by you (searchers) very quickly.
When crawlers find a webpage, the search engine's systems render the page content, take note of key elements like keywords, and keep track of everything in the Search index.
This process is called "indexing".
Historically, Google's entire search engine index / algorithm ran on using keywords to understand, index, organize and serve pages (when someone performed a search).
That's why when you search for something on Google it can somehow return 4,220,000,000 results of information in less than half a second... Absolute Insanity.
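Why is looking things up in an index so fast? Here is a rough sketch of a keyword index (often called an inverted index) in Python. The three pages and the function names build_index and search are made up for illustration, and a real search index is vastly more sophisticated, but the core trick is the same: do the slow work of reading every page once, ahead of time, so that answering a search is just a quick lookup.

```python
from collections import defaultdict

# Made-up pages standing in for content a crawler already collected.
PAGES = {
    "site-a.com/coffee": "how to brew great coffee at home",
    "site-b.com/tea": "how to brew green tea",
    "site-c.com/beans": "best coffee beans for espresso",
}


def build_index(pages):
    """Map every word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())  # keep only pages with this word too
    return results


index = build_index(PAGES)
print(sorted(search(index, "brew coffee")))  # ['site-a.com/coffee']
```

The search never re-reads a single page. It just looks up "brew" and "coffee" in the index and keeps the pages that appear under both.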
Note: this process of visiting pages, crawling around all of the links, downloading the information, etc. is all happening on your website. That means your web server (aka web host) is the one that has to process those requests, and that uses your resources, which web hosts will charge you for.
So, not only is Google making you spend money to essentially "steal and organize" your information, they then make you pay them to advertise if you want your website to show up at the top of the search page. Think about that for a second...
That's why here at SERP Co, we provide SEO Services with pride - and see it as a battle against the giants. A way for us to help the little guys take back what is theirs – organic search engine real estate, where you don't have to PAY for clicks.
I digress. Back to web crawling.
Now, however, Google is evolving and can build a more sophisticated and complex understanding of information.
Instead of simply organizing information on webpages by keywords, it is now able to understand entities – the same way we humans do.
For example, the keyword phrase "nicholas cage" used to be simply a string of 12 letters separated by a space.
Now, Google understands more about this keyword, the reasons people search for it, and that Nicholas Cage is an entity – specifically a person entity.
So when you search for nicholas cage you are provided with more information about him, as a person.
You can read more about this process in our article about SERP features.
Since web crawlers are software, they follow rules, known as policies.
There are more policies but ... I'm already getting bored talking about them, so let's get back to what's important here.
Without web crawlers your site would never be found, and thus could never be presented on search engines.
Most crawlers don't attempt to crawl the entirety of the Internet, because let's face it – some websites are more important than others, and the internet is just way too big.
Web crawlers (remember they are software) require resources (aka money) to run, so companies want to use those resources as efficiently as possible, which means they must be selective.
These bots decide which pages to crawl first based on factors they deem to be important:
A web crawler's "crawl budget" is basically the number of pages it will crawl (and index) on any given website during a given timeframe.
What does this mean for you? If your site is too slow, too hard to crawl, deemed not important enough, etc. you will run out of your budget and the crawler will leave. It will miss finding pages, and thus your pages won't be indexed in search engines.
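To make the idea concrete, here is a toy simulation in Python. Search engines don't publish how they score pages or set budgets, so the importance numbers, the page list, and the function name crawl_with_budget are all assumptions made up for this sketch. It only shows the general shape of the problem: most important pages first, and whatever is left when the budget runs out gets skipped.

```python
import heapq


def crawl_with_budget(pages, budget):
    """pages maps url -> importance score (higher gets crawled sooner).

    Returns (crawled, missed): the pages visited before the budget
    ran out, and the pages the crawler never got to.
    """
    # heapq always pops the smallest item, so store negative scores
    # to get the most important page first.
    queue = [(-importance, url) for url, importance in pages.items()]
    heapq.heapify(queue)

    crawled = []
    while queue and len(crawled) < budget:
        _, url = heapq.heappop(queue)
        crawled.append(url)

    missed = sorted(url for _, url in queue)
    return crawled, missed


# Hypothetical site with made-up importance scores.
SITE = {
    "/": 10,
    "/services": 8,
    "/blog/new-post": 6,
    "/old-promo-2014": 2,
    "/tag/misc?page=7": 1,
}

crawled, missed = crawl_with_budget(SITE, budget=3)
print("crawled:", crawled)  # ['/', '/services', '/blog/new-post']
print("missed: ", missed)   # ['/old-promo-2014', '/tag/misc?page=7']
```

In this example the junk pages are the ones that get skipped, which is the outcome you want. The trouble starts when a site has so many low-value pages that the important ones end up competing with them for the same limited budget.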
So, as an SEO specialist, you want to make sure you optimize your website to maximize crawl budget.
Do this by having:
Your robots.txt file is a file on your website that crawlers look at for directives – you can invite the spiders in, or keep them out – the choice is yours.
We have an article extensively covering robots.txt, but just to recap – you may not want bots visiting certain pages (maximize your crawl budget on your more important website sections) or maybe you just want to block certain bots.
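Here is a small sketch of how a well-behaved crawler reads those directives, using the robots.txt parser that ships with Python (urllib.robotparser). The robots.txt content, the bot name "BadBot", the example.com URLs, and the helper function allowed are all made up for illustration. One caveat worth knowing: robots.txt is a polite request, not a lock. Reputable crawlers honor it, but a malicious bot can simply ignore it.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: block one specific bot entirely, and keep
# everyone else out of the cart and admin sections.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /cart/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(bot, path):
    """Would this bot be permitted to fetch this path on our pretend site?"""
    return parser.can_fetch(bot, "https://example.com" + path)


print(allowed("Googlebot", "/blog/"))          # True
print(allowed("Googlebot", "/cart/checkout"))  # False
print(allowed("BadBot", "/blog/"))             # False
```

Blocking sections like /cart/ is also one way to spend your crawl budget on the pages that matter, since bots that follow the rules won't waste visits there.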
What certain bots you ask? The bad kind of bots.
So we want our website to be found by Google, Bing, Yahoo, etc. so our business can be found by customers and grow. Great.
And now we know that in order for our website to be found we must make sure that these crawler bots are finding it. Great.
But not all web crawlers are programs created by the search engine companies, and not all bots are deployed around the internet to INDEX content – some are here to scrape content.
Ever gotten spam phone calls? Spam emails? How did these people get your contact information? Well, one way is that it was scraped off your website, or some other website, on the internet.
Ever wonder how your business/personal information ends up on websites where you know for certain you didn't add it? Might have been scraped.
Bots can scrape anything posted publicly on the Internet. That includes text, images, HTML, CSS, etc.
Malicious bots can collect all sorts of information that hackers/attackers use for a variety of purposes:
Personal information can be scraped in bulk to collect databases of people in a specific cohort, and used for marketing purposes. Admittedly, this is not nearly as malicious as the previous examples but it still illustrates the point – not all bots are here to index your content for search engines.
Fun Fact: Bots are believed to make up over 40% of all internet traffic!
Not only is that a staggering amount of bot-related activity, it has real implications for you as a website owner. It affects your analytics, your server resources, etc.
Since this article isn't about malicious bots (we could have an entire series on that) I will stop there with it.
Hate it or love it, bots are everywhere. They make up almost half of all internet traffic, so to be a responsible business owner, website owner, SEO consultant, etc. it is critical that we understand them and continue to learn about what we can do to let the good ones in, and keep the bad ones out.