Improve our coverage of your site

Overview

More than 100,000 registered academics, researchers and students use our Commons indexes. Collectively, they generate millions of visits to the websites we index. This document outlines:

  • How to make your content easier to discover by Coherent and others

  • How to request additional indexing and enrichment

  • How to suggest individual sites and organizations for indexing

Our indexes are built manually by editors and from submissions and suggestions by Commons members. For larger sites, we use a web spider called CoherenceBot that visits the site and prepares Google-like snippets and links back to selected reports.

Please contact us at information@coherentdigital.net if you have more questions!


CoherenceBot Basics

CoherenceBot visits your website periodically, typically once every 30 days. It follows links on your website to discover the pages of your site and then determines whether each page is suitable for discovery from within a Commons.

Like any web spider, CoherenceBot benefits from the Search Engine Optimization (SEO) strategies that many websites use to improve discoverability on Google and other web search engines. This guide outlines the following:

  • How to ensure your reports are discoverable by Commons users.

  • How to optimize the metadata Commons users see when discovering a report from your site. There are several options here including custom crawls of your site involving no effort on your part.

CoherenceBot follows best practices when crawling and linking to your web content:

  • It honors all features of robot exclusion configured into your site.

  • It announces itself with a stable, recognizable user-agent so that you know from your web log analytics that CoherenceBot has visited.

  • It throttles requests over time to ensure it does not affect performance on your site.

  • It recognizes SEO strategies that improve the content in the index.
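
To confirm from your web logs that CoherenceBot has visited, you can filter log lines by user-agent. The sketch below assumes your server writes the common "combined" access log format and that the user-agent string contains the token coherencebot (the token used in the robots.txt examples in this guide). The sample log lines, IP addresses and the exact user-agent string "CoherenceBot/1.0" are hypothetical.

```python
import re
from collections import Counter

# Combined Log Format (an assumption about your server configuration):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def coherencebot_hits(lines, token="coherencebot"):
    """Count requests per path for log lines whose user-agent contains token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and token in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Hypothetical sample data; the real user-agent string may look different.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /reports/a.pdf HTTP/1.1" 200 5120 "-" "CoherenceBot/1.0"',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2023:13:56:40 +0000] "GET /reports/a.pdf HTTP/1.1" 304 0 "-" "CoherenceBot/1.0"',
]

if __name__ == "__main__":
    for path, count in coherencebot_hits(SAMPLE_LOG).items():
        print(path, count)
```

If your server uses a different log format, only the regular expression needs to change.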

Discovery Services

The discovery options for your content in a Commons range from a basic, no-effort option that simply permits visits from CoherenceBot to your site, to a completely custom, per-report metadata creation service that is usable for discovery on both Policy Commons and your own site. In short, the options are:

  • Basic Crawl

    • CoherenceBot visits every page and will index content that matches the editorial parameters for each Commons.

  • Focused Crawl

    • CoherenceBot visits specific sections of your site where the publications are located, ignoring other sections of the site.

  • Custom Crawl

    • Coherent Digital will develop a custom page analysis tool to extract value-added metadata for each of your reports. This works best on structured sites produced from a content management system.

  • Customization of your Directory Entry

    • Your organization directory entry can be enhanced to brand your content and feature selected reports.

  • Report Archival Service

    • By default, all indexed content leads Commons users to your site. If a report is no longer discoverable there, Commons can point citations of that report to an archived copy.


CoherenceBot Discovery Options

Basic Crawl

A basic crawl requires only that CoherenceBot be permitted to crawl your site and select content to link to. The main permission comes from your site’s robots.txt file. If it already allows visits from web spiders, there is nothing more you need to do. If it blocks some or all spiders, add a group to robots.txt that permits visits by CoherenceBot. It will look like this:

User-agent: coherencebot
Allow: /
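
Before publishing a robots.txt change, you may want to check how a parser reads it. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content and the example.org URL are hypothetical, and Python's parser is only a stand-in: CoherenceBot's own interpretation of the file may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks spiders in general but permits CoherenceBot.
ROBOTS_TXT = """\
User-agent: coherencebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/reports/annual-report.pdf"
    print("coherencebot:", is_allowed(ROBOTS_TXT, "coherencebot", url))
    print("some-other-bot:", is_allowed(ROBOTS_TXT, "some-other-bot", url))
```

Note that the lines of each group are kept together: in this parser, a blank line ends the group, so a blank line between the User-agent line and its rules would detach them.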

Focused Crawl

In a focused crawl, you instruct CoherenceBot to follow certain directories of content that you would like to be discoverable, for example:

User-agent: coherencebot
Allow: /publication-directory
Allow: /journal-directory

Please remember to update these rules if the location of the content changes.
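
Because the directory list has to be kept current, it can help to generate the group from a list of paths and verify it automatically. The sketch below is one way to do that with Python's standard library. The helper names are hypothetical, and the closing "Disallow: /" line is an assumption on our part: it makes the restriction explicit for generic parsers, whereas the example above shows only Allow lines.

```python
from urllib.robotparser import RobotFileParser


def focused_group(user_agent: str, allowed_dirs: list[str]) -> str:
    """Build a robots.txt group that permits only the given directories."""
    lines = [f"User-agent: {user_agent}"]
    # Allow lines come first: Python's parser applies the first matching rule.
    lines += [f"Allow: {path}" for path in allowed_dirs]
    # Assumption: close off the rest of the site explicitly.
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def can_crawl(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if robots_txt lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    rules = focused_group(
        "coherencebot", ["/publication-directory", "/journal-directory"]
    )
    print(rules)
    for path in ("/publication-directory/report.pdf", "/about-us"):
        print(path, can_crawl(rules, "coherencebot", path))
```

Running a check like this after each site reorganization is a cheap way to notice when a publication directory has moved out from under the rules.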

Custom Crawl

If your website is suitably structured and includes rich metadata, Coherent Digital can develop a custom crawler that parses your site structure and extracts displayed metadata typically not found by generic crawlers. Custom crawls can be configured to visit more frequently to ensure new content drops on your site are quickly discoverable on each Commons.

Contact us for more information on Custom Crawls.

Enhanced Directory Entry

The organization directory entry for your institution on each Commons can include details about branding and the topics you cover, and can feature selected reports. Contact us for information on how to enhance your organization directory entry.

Any Questions?

To discuss content discovery with our product team, please don’t hesitate to contact us.

 

Metadata Optimization

This section describes strategies that will help your content look great and be more easily discoverable in our Commons, and also in Google and other search engines.

HTML Header Metadata

CoherenceBot recognizes and will index the following information from the HTML header (this applies specifically to reports written in HTML):

<html lang="en">
<title>Report Title</title>
<meta name="keywords" content="Keyword1, Keyword2, Keyword3">
<meta name="description" content="Article summary…">
<link rel="canonical" href="http://example.com/best/url/to/this/report" />

CoherenceBot also recognizes Open Graph and Twitter Card headers. These features also make your content look great in social media posts.
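
To preview which of the header fields listed above a page actually exposes, you can parse the page yourself. The sketch below uses Python's standard html.parser module. It is an illustration of reading these tags, not CoherenceBot's implementation; the class and function names are hypothetical, and Open Graph and Twitter Card tags are left out for brevity.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect lang, title, keywords, description and canonical URL."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.metadata["lang"] = attrs["lang"]
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "keywords":
                # Split the comma-separated keyword list into clean items.
                self.metadata["keywords"] = [
                    k.strip() for k in content.split(",") if k.strip()
                ]
            elif name == "description":
                self.metadata["description"] = content
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.metadata["canonical"] = attrs.get("href") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_head_metadata(html: str) -> dict:
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    if "title" in parser.metadata:
        parser.metadata["title"] = parser.metadata["title"].strip()
    return parser.metadata


SAMPLE_HTML = """\
<html lang="en">
<head>
<title>Report Title</title>
<meta name="keywords" content="Keyword1, Keyword2, Keyword3">
<meta name="description" content="Article summary">
<link rel="canonical" href="http://example.com/best/url/to/this/report" />
</head>
<body></body>
</html>
"""

if __name__ == "__main__":
    for field, value in extract_head_metadata(SAMPLE_HTML).items():
        print(f"{field}: {value}")
```

A field that is missing from the output is a field that a generic crawler is unlikely to find either.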

PDF File Metadata

CoherenceBot recognizes and will index the following information from the PDF file metadata.

Title

Author

Subject

Language

Page count

Creation Date

Modification Date

If the PDF file metadata is missing a title, CoherenceBot will inspect page 1 for the text set in the largest font. You can therefore make the title of a report recognizable to CoherenceBot simply by giving it the largest font on page 1.

CoherenceBot will also recognize the anchor text of the link on your site to the PDF. Often the link to the document says ‘Download PDF’ or some other call to action. If, however, the anchor text is the title of the document, it can be used as the discovery title in the absence of other choices.

Also remember to embed your fonts in the PDF. This allows CoherenceBot to parse and index the full text of the PDF and to detect the content language.
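
The title fallbacks described above can be sketched as a simple precedence rule: metadata title first, then the largest text on page 1, then the anchor text if it is not a call to action. The code below is our reading of that order, not CoherenceBot's actual logic. The function name, the (text, font size) input shape and the list of call-to-action phrases are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical call-to-action phrases that would not make a useful title.
CALLS_TO_ACTION = {"download", "download pdf", "read more", "view report", "pdf"}


def choose_title(
    metadata_title: Optional[str],
    page_one_spans: Sequence[Tuple[str, float]],
    anchor_text: Optional[str],
) -> Optional[str]:
    """Pick a discovery title from the available sources, in order."""
    # 1. A title in the PDF file metadata wins if present.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    # 2. Otherwise use the page-1 text with the largest font size.
    spans = [(text.strip(), size) for text, size in page_one_spans if text.strip()]
    if spans:
        return max(spans, key=lambda span: span[1])[0]

    # 3. Last resort: the link's anchor text, unless it is a call to action.
    if anchor_text and anchor_text.strip().lower() not in CALLS_TO_ACTION:
        return anchor_text.strip()

    return None


if __name__ == "__main__":
    spans = [("Working Paper 12", 10.0), ("Climate Policy Outlook", 28.0)]
    print(choose_title("", spans, "Download PDF"))
    print(choose_title(None, [], "Climate Policy Outlook"))
```

The practical takeaway is the same as in the text: fill in the PDF title field where you can, and make the title visually dominant on page 1 where you cannot.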

Inclusion / Exclusion of Selected Reports

CoherenceBot will exclude reports whose title keywords suggest the PDF may not be suitable for indexing in a Commons. It excludes job openings, calls for papers, resumes and small documents.

Though we will make a best effort to include everything that is relevant, we cannot guarantee that CoherenceBot will index every document allowed. In these cases a custom crawl is the best option.
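
If you want to anticipate which of your documents might be filtered out, a rough pre-check along the following lines may help. This is only a sketch of the kind of rule described above. The keyword patterns and the three-page minimum are our assumptions for illustration; the actual keywords and size threshold used by CoherenceBot are not published in this guide.

```python
import re

# Hypothetical title patterns for the excluded categories named above.
EXCLUDED_TITLE_PATTERNS = [
    r"\bjob openings?\b",
    r"\bvacanc(?:y|ies)\b",
    r"\bcalls? for papers\b",
    r"\br[eé]sum[eé]s?\b",
    r"\bcurriculum vitae\b",
]

# Hypothetical cut-off for a "small document".
MIN_PAGES = 3


def is_excluded(title: str, page_count: int) -> bool:
    """Return True if the title or size suggests the PDF would be skipped."""
    if page_count < MIN_PAGES:
        return True
    lowered = title.lower()
    return any(re.search(pattern, lowered) for pattern in EXCLUDED_TITLE_PATTERNS)


if __name__ == "__main__":
    candidates = [
        ("Call for Papers: 2024 Workshop", 4),
        ("Policy Brief", 1),
        ("Energy Transition Outlook", 40),
    ]
    for title, pages in candidates:
        print(title, "->", "excluded" if is_excluded(title, pages) else "kept")
```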

Site Health Check

Other factors can influence the success of a CoherenceBot visit:

  1. Validity of your SSL certificate

  2. Speed of the site

  3. Heavy use of redirects

  4. Broken links

  5. Use of JavaScript to show or hide content and links

  6. Single-page applications, which make crawling more difficult
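
For the JavaScript item in the list above, one quick check is to look at the raw HTML of a page, without running any scripts, and see which links carry a real URL. The sketch below does this with Python's standard html.parser module. It is a heuristic of our own, with hypothetical names and sample markup; this guide does not state whether CoherenceBot executes JavaScript.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Separate links with a real URL from links that rely on scripts."""

    def __init__(self):
        super().__init__()
        self.crawlable = []
        self.script_only = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = (dict(attrs).get("href") or "").strip()
        # An empty href, a bare fragment or a javascript: URL gives a crawler
        # nothing to follow unless it runs the page's scripts.
        if not href or href.startswith("#") or href.lower().startswith("javascript:"):
            self.script_only.append(href)
        else:
            self.crawlable.append(href)


def audit_links(html: str) -> LinkAudit:
    audit = LinkAudit()
    audit.feed(html)
    audit.close()
    return audit


# Hypothetical page fragment.
SAMPLE_PAGE = """\
<a href="/reports/2023-outlook.pdf">2023 Outlook</a>
<a href="#" onclick="openReport(42)">2022 Outlook</a>
<a href="javascript:void(0)">More reports</a>
"""

if __name__ == "__main__":
    result = audit_links(SAMPLE_PAGE)
    print("crawlable:", result.crawlable)
    print("script-only:", result.script_only)
```

Links that land in the script-only list are good candidates for plain href URLs, which any crawler can follow.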