In this tab, you select which links will be crawled, specify URLs that you want the scan to crawl explicitly, choose how the scan handles robots.txt and sitemap.xml files, and add Selenium scripts to be used for crawling.
You've selected Limit to URL hostname. This means we'll limit crawling to the hostname within the URL, using HTTP or HTTPS and any port.
Let's say your starting URL is http://www.test.com.
What links WILL be crawled - All links discovered on the www.test.com host will be crawled, that is, anything matching http://www.test.com/* (where * is a wildcard). For example, links discovered at http://www.test.com/support and https://www.test.com:8080/logout will be crawled.
What links WILL NOT be crawled - Links to hosts other than www.test.com will not be followed. This means http://www2.test.com and http://sub1.test.com/ will not be crawled.
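For illustration only, here is a minimal Python sketch (not the scanner's actual implementation) of this rule, using the example URLs above: a link is in scope when its host matches the starting URL's host, regardless of scheme or port.

    from urllib.parse import urlparse

    def in_hostname_scope(start_url: str, link: str) -> bool:
        """Illustrative check: a link is in scope when its hostname matches
        the starting URL's hostname; scheme (http/https) and port are ignored."""
        return urlparse(link).hostname == urlparse(start_url).hostname

    start = "http://www.test.com"
    print(in_hostname_scope(start, "http://www.test.com/support"))       # True
    print(in_hostname_scope(start, "https://www.test.com:8080/logout"))  # True
    print(in_hostname_scope(start, "http://www2.test.com"))              # False
    print(in_hostname_scope(start, "http://sub1.test.com/"))             # False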
You've selected Limit to content located at or below URL subdirectory. This means we'll crawl all links located at or below the URL subdirectory, using HTTP or HTTPS and any port.
Let's say your starting URL is http://www.test.com/news.
What links WILL be crawled - All links starting with http://www.test.com/news will be crawled, for example http://www.test.com/news/headlines and https://www.test.com:8080/news/.
What links WILL NOT be crawled - Links outside the /news subdirectory, such as http://www.test.com/agenda and http://www2.test.com, will not be crawled.
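Again for illustration, a minimal sketch of the subdirectory rule, assuming in-scope links share the host and a path at or below the starting URL's subdirectory while scheme and port are ignored:

    from urllib.parse import urlparse

    def in_subdirectory_scope(start_url: str, link: str) -> bool:
        """Illustrative check: same host and a path at or below the
        starting URL's subdirectory; scheme and port are ignored."""
        start, target = urlparse(start_url), urlparse(link)
        prefix = start.path.rstrip("/") + "/"
        return (target.hostname == start.hostname
                and (target.path.rstrip("/") + "/").startswith(prefix))

    start = "http://www.test.com/news"
    print(in_subdirectory_scope(start, "http://www.test.com/news/headlines"))  # True
    print(in_subdirectory_scope(start, "https://www.test.com:8080/news/"))     # True
    print(in_subdirectory_scope(start, "http://www.test.com/agenda"))          # False
    print(in_subdirectory_scope(start, "http://www2.test.com"))                # False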
You've selected Limit to URL hostname and specified sub-domain. This means we'll crawl only the URL hostname and one specified sub-domain, using HTTP or HTTPS and any port.
Let's say your starting URL is http://www.test.com/news/ and the sub-domain is sub1.test.com.
What links WILL be crawled - All links discovered on www.test.com, on sub1.test.com and on any of its sub-domains will be crawled. For example, these links will be crawled: http://www.test.com/support, https://www.test.com:8080/logout, http://sub1.test.com/images/ and http://videos.sub1.test.com.
What links WILL NOT be crawled - Links whose host neither matches the web application URL hostname nor is sub1.test.com (or one of its sub-domains) will not be followed. This means http://videos.test.com will not be crawled.
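As an illustrative sketch of this rule (not the scanner's implementation): a link is in scope when its host matches the starting URL's host, or is the specified sub-domain or one of its sub-domains.

    from urllib.parse import urlparse

    def in_subdomain_scope(start_url: str, sub_domain: str, link: str) -> bool:
        """Illustrative check: the link's host must match the starting URL's
        hostname, or be the specified sub-domain or one of its sub-domains."""
        host = urlparse(link).hostname or ""
        start_host = urlparse(start_url).hostname
        return (host == start_host
                or host == sub_domain
                or host.endswith("." + sub_domain))

    start, sub = "http://www.test.com/news/", "sub1.test.com"
    print(in_subdomain_scope(start, sub, "http://www.test.com/support"))   # True
    print(in_subdomain_scope(start, sub, "http://sub1.test.com/images/"))  # True
    print(in_subdomain_scope(start, sub, "http://videos.sub1.test.com"))   # True
    print(in_subdomain_scope(start, sub, "http://videos.test.com"))        # False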
You've selected Limit to URL hostname and specified domains. This means we'll crawl only the URL hostname and the specified domains, using HTTP or HTTPS and any port.
Let's say your starting URL is http://www.test.com/news/ and the specified domains are sub1.test.com and site.test.com.
What links WILL be crawled - All links discovered on www.test.com and on the specified domains (sub1.test.com and site.test.com) will be crawled. For example, these links will be crawled: http://www.test.com/support, https://www.test.com:8080/logout and http://sub1.test.com/images/.
What links WILL NOT be crawled - Links whose host matches neither the web application URL hostname nor one of the specified domains will not be followed. Sub-domains of the specified domains are not included, so http://videos.test.com and http://videos.sub1.test.com will not be crawled.
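And a sketch of this last rule for comparison: the link's host must match the starting URL's host or one of the specified domains exactly, so sub-domains of the specified domains stay out of scope. Again, this is an illustration, not the scanner's implementation.

    from urllib.parse import urlparse

    def in_domains_scope(start_url: str, domains: set[str], link: str) -> bool:
        """Illustrative check: the link's host must match the starting URL's
        hostname or one of the specified domains exactly; sub-domains of the
        specified domains are not included."""
        host = urlparse(link).hostname
        return host == urlparse(start_url).hostname or host in domains

    start = "http://www.test.com/news/"
    allowed = {"sub1.test.com", "site.test.com"}
    print(in_domains_scope(start, allowed, "https://www.test.com:8080/logout"))  # True
    print(in_domains_scope(start, allowed, "http://sub1.test.com/images/"))      # True
    print(in_domains_scope(start, allowed, "http://videos.sub1.test.com"))       # False
    print(in_domains_scope(start, allowed, "http://videos.test.com"))            # False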
Specify URLs you want the scan to crawl. This is useful for pages that are not directly linked from other pages within the application; for example, a registration link is emailed to the user, who clicks through from the email to the application's registration page. You can also include WSDL URLs for web services you want our service to crawl. Enter each URL on a separate line. Each entry must be a valid HTTP or HTTPS URL of no more than 2048 characters. For an authenticated scan, make sure the login link is always the first URL in the list. The URLs you enter must be consistent with the selected scope (see the example after the list below):
Limit to hostname. If this scope is selected, additional URLs must have the same FQDN or IP address as the starting URL.
Limit to sub-directories of the web application URL. If this scope is selected, additional URLs must be in the same path as the web application URL.
Follow links in a specific sub-domain. If this scope is selected, additional URLs must have the domain name specified in the Domain Name field.
Follow links to specific hosts. If this scope is selected, additional URLs must have domain names specified in the Domains field.
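For example, assuming the Limit to hostname scope and the hypothetical URLs below, an explicit URL list for an authenticated scan could be checked like this (illustration only; the paths shown are made up):

    from urllib.parse import urlparse

    # Hypothetical explicit-URL list for an authenticated scan using the
    # "Limit to hostname" scope; the login link comes first.
    explicit_urls = [
        "https://www.test.com/login",         # login link first (authenticated scan)
        "https://www.test.com/register",      # page reached only via an emailed link
        "https://www.test.com/service?wsdl",  # WSDL URL for a web service
    ]

    start_host = urlparse("http://www.test.com").hostname
    for url in explicit_urls:
        parsed = urlparse(url)
        assert parsed.scheme in ("http", "https"), f"{url} is not an HTTP(S) URL"
        assert parsed.hostname == start_host, f"{url} is outside the selected scope"
        assert len(url) <= 2048, f"{url} exceeds the 2048-character limit"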
Select from the following options to instruct the scan to adhere to existing robots.txt and sitemap.xml configurations when scanning the web application (see the example after these options).
Crawl all links and directories found in the robots.txt file, if present. Robots.txt is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a web site that is otherwise publicly viewable.
Do not crawl links or directories that are excluded in the robots.txt file. Select to fully adhere to the robots.txt file, if present in the web application. Links and directories that are disallowed in the robots.txt file will not be crawled.
Crawl all links and directories found in the sitemap.xml file, if present. Select to adhere to the sitemap.xml file in the web application. Sitemap.xml is an XML file that lists URLs for a site to inform search engines about URLs that are available for crawling.
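To illustrate what these files contribute, the sketch below reads the Disallow rules from a robots.txt file and the URL entries from a sitemap.xml file using Python's standard library. The host is a placeholder, and this is not how the scanner itself processes the files.

    import urllib.request
    import urllib.robotparser
    import xml.etree.ElementTree as ET

    # Placeholder site; substitute a real, reachable host to run this.
    rp = urllib.robotparser.RobotFileParser("http://www.test.com/robots.txt")
    rp.read()
    # False when robots.txt disallows the path for all user agents.
    print(rp.can_fetch("*", "http://www.test.com/private/"))

    # List the URLs a sitemap.xml offers for crawling.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    with urllib.request.urlopen("http://www.test.com/sitemap.xml") as resp:
        tree = ET.parse(resp)
    for loc in tree.findall(".//sm:loc", ns):
        print(loc.text)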
Click Add Script to upload a Selenium script to be used for crawling the web application. Click Choose File to upload a script from your local file system, or drag and drop the file into the Import File window. You can add any number of scripts. Use Qualys Browser Recorder to create a Selenium script. To learn more about Qualys Browser Recorder, refer to the online help.
Specify a URL or regular expression to trigger this script. You must enter a URL or a regular expression (in PCRE format). If you enter a regular expression, select the Use Regex check box.
Run only after form authentication was successful. Select this option if form authentication is defined for the web application and you want this script to run only after form authentication has been successful.
Validation Regular Expression. Enter a valid regular expression to be used by our service to verify that the script ran successfully. The regular expression must be entered in PCRE format.
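As an illustration, assuming a hypothetical registration workflow, the trigger and validation expressions might look like the patterns below (shown with Python's re module, which accepts the same syntax for these simple PCRE-style patterns).

    import re

    # Hypothetical trigger regular expression (Use Regex selected): run the
    # Selenium script whenever the crawler reaches a URL under /register.
    trigger = re.compile(r"^https?://www\.test\.com/register(/.*)?$")
    print(bool(trigger.match("https://www.test.com/register/step2")))  # True

    # Hypothetical validation regular expression: the script is considered
    # successful if the resulting page contains this confirmation text.
    validation = re.compile(r"Registration\s+complete")
    print(bool(validation.search("<h1>Registration complete</h1>")))   # True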