Letting bots crawl your dynamically rendered page

Deepjyoti Barman @deepjyoti30
Nov 13, 2020 12:17 PM UTC

Meta tags are one of the most important aspects of a blog. In recent years, crawlers such as Google's and Bing's have advanced a lot; they are able to crawl pages that render content dynamically.

One important role meta tags play is to give the page an identity for the crawlers from social share sites like Twitter. These crawlers are not yet advanced enough to recognize dynamically rendered content.

How do they work?

Whenever you share a post on Twitter, Facebook etc., you'll see a nice little card showing the title of the post, some description and (possibly) the cover image. Ever wondered how they are able to show that? Yep, they use the meta tags from the post's URL.

Technically, how do they work?

All right, let's get a bit technical here: how does a crawler actually get all that data?

Well, when you share a URL, the crawler basically sends a GET request to that URL and the server sends back the HTML content of the page. Once the page is received, the crawler extracts the relevant content from it, and that's the data you see on your Twitter feed.
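The extraction step can be sketched with Python's standard library. Given the HTML the GET request returned, a crawler pulls out tags such as og:title and og:image; the page below is a made-up example, not what any real crawler runs:

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs, the way a share-card crawler would."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name")
        if key and "content" in attrs:
            self.meta[key] = attrs["content"]

# A made-up response body standing in for what the GET request returned
html = """
<html><head>
<meta property="og:title" content="Letting bots crawl your page">
<meta property="og:image" content="https://example.com/cover.png">
</head><body></body></html>
"""

parser = MetaTagParser()
parser.feed(html)
print(parser.meta["og:title"])  # the title that ends up on the share card
```

The meta dictionary is all a crawler needs to build the card: title, description and cover image.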

What is dynamic rendering of content?

Okay, so what is this dynamic rendering I'm talking about? In order to understand how dynamic rendering works, you need to know how static pages work.

What are static pages?

So, we call some pages static because, well, they are static. They are just HTML pages that were created by someone, and when you try to open them, your browser fetches the HTML content from their server and shows it to you.

Now here's the deal: if you want to create a personal blog, can you imagine how tedious it would become to manually keep creating HTML pages for each blog post you write? Believe me when I say it becomes a mess real quick.

What is dynamic rendering?

Well, since static pages become a mess for something like a personal blog, dynamic rendering is very useful. You store all the posts in the backend and have just one HTML page (not literally). This page renders the content based on the post the user wants to read.

This takes away a lot of the issues of static pages: whenever you feel like creating a new post, just add it to your backend and your frontend will take care of it automatically.

This is how this blog is handled as well (I recently made the move to this new blog page; if you had checked my site before, you'd be aware of the old static page).

The problem

It's no secret that dynamic rendering is pretty useful. However, it brings its own issues. One of them: since the content is rendered dynamically, the meta tags are also rendered dynamically.

But, as I mentioned in the beginning, crawlers like Twitter's do not support that yet. So the problem is: how do you make your posts crawlable by bots so that whenever you share them somewhere they show up real nice?

The solution

After I faced this issue, I started wondering what a nice solution for this could be.

I came across a post on dev.to that basically said that in order to let bots crawl, we need to use a middleware in our server.

The above post is basically about creating a static file server in Express that serves the files from the dist directory. However, when the request is made by some bot, it loads the page using Puppeteer, which is roughly the Node.js equivalent of Selenium. It drives a headless Chromium instance to load the page and then returns the rendered content.

The check for whether it is a bot is done by looking at the User-Agent header, which tells us who is making the request.
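As a rough illustration of that check (the user-agents library used later does this far more robustly), a naive version might just look for crawler keywords in the header; the token list here is made up for the example, not exhaustive:

```python
# Naive User-Agent sniffing -- a real check should use a library like
# user-agents; these tokens are illustrative only.
BOT_TOKENS = ("bot", "crawler", "spider", "facebookexternalhit", "twitterbot")

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)

print(looks_like_bot("Twitterbot/1.0"))       # True
print(looks_like_bot("Mozilla/5.0 (Linux)"))  # False
```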

Once the content is received from Puppeteer, it is returned to the bot that is trying to crawl the page.

Our solution will be somewhat similar to the above, but in Python. Yep, Python it is.

My solution

My idea was simple. I need to have a middleware that will do the following:

  1. Check if the request is being made by some bot
  2. If it is, fetch a skeleton HTML page from my backend that is rendered dynamically and contains all the necessary meta tags
  3. Otherwise, return the proper index.html

I made a basic static file server using Flask that does exactly the above. It checks whether the user agent is a bot using the user-agents library and returns the content accordingly.

Implementation

I'm not going to dive deep into the second point. What I did there was add an endpoint to my API that returns a dynamically rendered HTML page containing all the necessary meta tags.
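For illustration, that backend piece might look something like the sketch below. The template, the field names and the hard-coded post dictionary are all assumptions standing in for a real database lookup, not my actual backend code:

```python
from string import Template

# Hypothetical skeleton page: only the meta tags a crawler cares about.
SKELETON = Template("""<html><head>
<meta property="og:title" content="$title">
<meta property="og:description" content="$description">
<meta property="og:image" content="$cover">
</head><body></body></html>""")

def get_rendered_html(slug: str) -> str:
    # Stand-in for a real lookup of the post by its slug.
    post = {
        "title": "Letting bots crawl your dynamically rendered page",
        "description": "Serving crawlable meta tags from a Flask middleware.",
        "cover": "https://example.com/cover.png",
    }
    return SKELETON.substitute(post)

html = get_rendered_html("letting-bots-crawl")
print("og:title" in html)  # True
```

The crawler never sees the real page body; the skeleton with the right meta tags is enough for it to build the share card.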

Checking the user agent is done in the following code, using the user-agents library to determine whether the request is being made by a bot.

# Considering you have registered a Flask app
from flask import request, send_file
from user_agents import parse

@app.route("/<slug>")
def return_post(slug):
    # Parse the User-Agent string so we can tell bots apart
    user_agent = parse(request.user_agent.string)

    # Not a bot: serve the regular entry point
    if not user_agent.is_bot:
        app.logger.info("{}: Returning content for non-bot".format(request.full_path))
        return send_file("index.html")

In the above code chunk, Flask handles the route /<slug>, where slug is a string. Since all the posts on my page are of the form blog.deepjyoti30.dev/<slug>, it's only necessary to handle those paths.

Once the slug is matched, we parse the User-Agent string and check whether it is a bot. If it isn't, we return the index.html.

So we are now set to serve the files to a normal user; what should we do about the bots?

The following code chunk shows what to do if it is a bot.

    # Get proper meta and return
    content = get_rendered_html(slug)

    app.logger.info("{}: Returning content for bot.".format(request.full_path))
    return content

The above code is pretty self-explanatory: if the one making the request is a bot, we get the rendered HTML using the slug and return that.

This way the social share bots will be able to go through the content and accordingly show the image cover etc.

Now, putting it all together, the code becomes the following:

# Considering you have registered a Flask app
# Check the Flask docs to know how to do that
from flask import request, send_file
from user_agents import parse

@app.route("/<slug>")
def return_post(slug):
    # Parse the User-Agent string so we can tell bots apart
    user_agent = parse(request.user_agent.string)

    # Not a bot: serve the regular entry point
    if not user_agent.is_bot:
        app.logger.info("{}: Returning content for non-bot".format(request.full_path))
        return send_file("index.html")

    # Get proper meta and return
    content = get_rendered_html(slug)

    app.logger.info("{}: Returning content for bot.".format(request.full_path))
    return content

The above code serves only the basic index.html. You will need to serve the other files as well, such as JS, CSS etc. One way of doing that is the following:

# Serve the img directory
from pathlib import Path
from flask import send_from_directory

@app.route("/img/<path:file>")
def send_img(file):
    app.logger.info("{}".format(request.full_path))
    return send_from_directory(Path("dist").joinpath("img"), file)

You will have to do something like the above for other directories as well, like the js and css directories and any other files.

For production, use a WSGI HTTP server like Gunicorn.
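For example, assuming the Flask app above lives in a module named app.py with the application object called app (adjust app:app to your own layout), the invocation could look like:

```shell
# Hypothetical module:variable name (app:app); change to match your project.
gunicorn --workers 4 --bind 0.0.0.0:8000 app:app
```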

Conclusion

Dynamic rendering is fun and really useful; however, it has its own caveats. All in all, it is still worth it, since the issues are not that big a deal. You can see the above code in action in the following way.

Let's take this post as an example: https://blog.deepjyoti30.dev/using-mongo-with-heroku

Make a GET request with a regular browser User-Agent in the following way:

curl https://blog.deepjyoti30.dev/using-mongo-with-heroku \
    -H "User-Agent: Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit53519"

If everything works, you'll see some minimal JS-related code. This means the request was recognized as coming from a user and not a bot.

Let's now try to imitate a bot in the following way:

curl https://blog.deepjyoti30.dev/using-mongo-with-heroku \
    -H "User-Agent: just-a-bot"

You should now get some HTML content back. This clearly shows the request was recognized as coming from a bot, and so HTML with the proper metadata was returned.
