Help


Searching the Web Archive


What can be found in the New Zealand Web Archive

The Web Archive Discovery Platform provides public access to the New Zealand Web Archive. The full text of web pages and documents (PDF, text, Word, Excel, PowerPoint and others) has been indexed and is searchable.

The index does not include the boilerplate HTML code that a web browser uses to render a web page. However, it can include text from rendered headers, footers, menus and navigation elements.

The Web Archive Discovery Platform does not currently index media such as images, audio and video. Such items can still be discovered by searching for text from the surrounding web pages that link to or embed them.

Figure 1. An example of searching for a photo. A photo of the Federal Hotel in Picton can be found within the NZETC collection by searching for descriptive text from the web page that embeds the image.


Different types of Search options

Searching all Text

Use this option to search for any text relating to a web page or document, including text content, title, URL and Domain.

Search Tips:

  • All text searches are case insensitive.
  • By default, the words in a search are treated as individual words. For example, searching for the words Queens Birthday Weekend may return results that:
    • contain all three words in sequence.
    • contain all three words, out of order, or separately and in different locations.
    • contain one or more of the words.
  • Search results are ranked by how close together those words appear and by how often they occur.
  • By default, the following stop words are ignored during searching: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.
  • To search for a phrase, enclose your text in double quotes, e.g. "Home of the Paekakariki Express".
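
The tips above can be sketched in code. The following Python example is an illustration only, not the platform's actual query parser: the function name parse_query and the choice to leave quoted phrases untouched (stop words included) are assumptions made for this sketch. It lower-cases the query, separates quoted phrases from individual words, and drops the listed stop words from the individual words.

```python
import re

# Stop words copied from the search tips above.
STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def parse_query(query):
    """Split a query into quoted phrases and individual lower-cased terms.

    Hypothetical helper: keeping stop words inside quoted phrases is an
    assumption of this sketch, not documented platform behaviour.
    """
    # Text inside double quotes is kept together as a phrase.
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    # Remove the quoted parts, then treat what is left as individual words.
    remainder = re.sub(r'"[^"]*"', " ", query)
    terms = [
        word
        for word in re.findall(r"\w+", remainder.lower())
        if word not in STOP_WORDS
    ]
    return {"phrases": phrases, "terms": terms}
```

For example, parse_query('The history of "Federal Hotel" Picton') yields the phrase "federal hotel" and the individual terms "history" and "picton".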

Searching by Title

Use this option to search for the title of a web page. Titles are the text content extracted from the HTML title tag of a page, i.e. "<title>". This is the text displayed in a web browser’s title bar or in the page’s tab.

While the title tag is required in HTML documents, its text does not always match what appears to be a title rendered within a web page, as this is at the web page author’s discretion.
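
As a rough illustration of where a title comes from, the Python sketch below pulls the text of the first <title> element out of an HTML string using the standard library's html.parser module. The class and function names are hypothetical, and this is not necessarily how the platform's indexer extracts titles.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> element is considered.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the text of the first <title> element, or an empty string."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title
```

Given a page whose head contains <title>A Romance of Lake Wakatipu</title> and whose body starts with a heading reading "Chapter One", extract_title returns the title tag text, not the visible heading.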

Search Tips:

  • All title searches are case insensitive.
  • By default, the words in a search are treated as individual words. For example, searching for the words Queens Birthday Weekend may return results that:
    • contain all three words, in sequence, or out of order, within a title.
    • contain one or more of the words within a title.
  • Search results are ranked by how close together those words appear and by how often they occur.
  • By default, the following stop words are ignored during searching: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.
  • To look for an exact match, enclose your text in double quotes, e.g. "A Romance of Lake Wakatipu".

Searching by URL

Use this option to search for any URL of a web page or document.

Figure 2. The URL components that are used in a URL search; the protocol, the primary domain and any sub-domains, and the top-level domain and any second-level domains.

Search Tips:

  • URL searches are case insensitive.
  • URL searches expect the protocol of the URL (http or https) and are sensitive to it. If you include the protocol in your URL search, only results with that same protocol will match.
  • URL searches perform an exact match. To search for part of a URL, add the wildcard character * to the start or end of your text (or both), for example:
    • *nzetc.victoria.ac.nz/tm/scholarly/tei-corpus-nhsj.html (find any URL that ends with this text)
    • https://nzetc.victoria.ac.nz/tm/scholarly/tei-corpus-nhsj* (find any URL that starts with this text)
    • *tei-NHSJ06_03* (find any URL that contains this text)
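
The matching rules above (case insensitive, exact unless wildcards are used, protocol sensitive) can be approximated as follows. This is a minimal sketch, assuming * stands for any run of characters and that every other character is literal; the function names are hypothetical and the platform's real matching logic may differ.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a search pattern into a compiled, case-insensitive regex.

    Only * is treated as a wildcard; all other characters are escaped so
    that dots, slashes and question marks in a URL match literally.
    """
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts), re.IGNORECASE)


def url_matches(pattern, url):
    """Return True if the whole URL matches the pattern (exact match)."""
    return wildcard_to_regex(pattern).fullmatch(url) is not None
```

With the URL https://nzetc.victoria.ac.nz/tm/scholarly/tei-corpus-nhsj.html, the first two patterns in the list above match, whereas the same "starts with" pattern written with http:// does not, because the protocol differs. A pattern without wildcards, such as nzetc.victoria.ac.nz, does not match either, since the comparison is against the whole URL.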

Searching by Domain

Use this option to search for any domain of a web page or document. The term domain is broadly used to cover domain, subdomain and hostname in the context of this search.

Figure 3. The URL components that are used in a Domain search; the primary domain and any sub-domains, and the top-level domain and any second-level domains.

Search Tips:

  • Domain searches are case insensitive.
  • Do not include a protocol (http:// or https://) in your domain search.
  • Domain searches perform an exact match. To search for part of a domain, add the wildcard character * to the start or end of your text (or both), for example:
    • *victoria.ac.nz
    • nzetc*
    • *dia.govt.nz
    • *govt.nz*
    • *school*
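
To make the difference from a URL search concrete, the sketch below compares a pattern against only the hostname of a URL, using urllib.parse from the Python standard library. The function names are hypothetical, and rejecting patterns that contain a protocol is an assumption of this sketch (the tips above only advise against including one).

```python
import re
from urllib.parse import urlsplit


def hostname_of(url):
    """Return the lower-cased hostname of a URL, without protocol or path."""
    return (urlsplit(url).hostname or "").lower()


def domain_matches(pattern, url):
    """Return True if the URL's hostname matches the pattern exactly.

    * is the only wildcard. Raising on a protocol is this sketch's choice.
    """
    if "://" in pattern:
        raise ValueError("domain searches must not include a protocol")
    regex = ".*".join(re.escape(part) for part in pattern.lower().split("*"))
    return re.fullmatch(regex, hostname_of(url)) is not None
```

For the URL https://nzetc.victoria.ac.nz/tm/scholarly/tei-corpus-nhsj.html, the patterns *victoria.ac.nz and nzetc* both match, while victoria.ac.nz without a wildcard does not, because the hostname is nzetc.victoria.ac.nz.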

Search Facets

On the search results page, the search facets appear on the left side of the page. Search facets allow you to narrow the results. There are six available facet types. Within each facet, you will see a summary of values that can be clicked to narrow your search.

When a facet has been applied, it will appear as a box above the search results. To remove an applied facet, simply click the 'x' button on it.

Figure 4. An example of facets applied to a search.

See below for a description of the available facets.

Collection

The Collection facet refers to a collection of harvested web pages or documents, for example the archived web pages that make up the New Zealand Electronic Text Collection (NZETC). Collections are added incrementally to this discovery platform.

Content Type

The Content Type facet refers to the general content type that has been determined when indexing a web page or document. Apache Tika is used to identify content types and formats in the indexing process.

The first part of a format identification is used to determine the general content type. See the formats below and their corresponding content type:

  • "text/html; charset=UTF-8" : "html"
  • "application/pdf; version=1.4" : "pdf"
  • "text/plain; charset=UTF-8" : "text"
  • "application/vnd.ms-powerpoint" : "powerpoint"

Be aware that web pages or documents are occasionally classified with the wrong content type. If a content type can’t be determined, a classification of "Other" is assigned. If you can’t find a document based on its content type, try performing a URL search for its file extension, e.g. *.doc.
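
The mapping described above can be sketched as a simple lookup on the first part of the format string. The table below contains only the four examples listed in this section, so it is deliberately incomplete, and the function name is hypothetical.

```python
# Mapping taken from the examples above; the real index covers more formats.
GENERAL_TYPES = {
    "text/html": "html",
    "application/pdf": "pdf",
    "text/plain": "text",
    "application/vnd.ms-powerpoint": "powerpoint",
}


def general_content_type(format_string):
    """Map a full format identification to a general content type.

    Only the part before the first ';' is used, so parameters such as
    charset or version are ignored. Unknown formats fall back to "Other".
    """
    media_type = format_string.split(";", 1)[0].strip().lower()
    return GENERAL_TYPES.get(media_type, "Other")
```

For instance, "application/pdf; version=1.4" and "application/pdf; version=1.7" both map to "pdf", while a format that is not in the table maps to "Other".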

Crawl Year

The Crawl Year facet refers to the year in which the web page or document was harvested by the National Library. This is not necessarily the year of publication, although the two may coincide.

Public Suffix

The Public Suffix facet refers to the combination of the top-level domain (TLD) and any second-level domain (SLD) in the URL of a web page or document. For example, in .govt.nz, "nz" is the TLD and "govt" is the SLD. Many URLs contain only a TLD, such as .com or .nz.

Figure 5. The URL components that are used in the Public Suffix facet; the top-level domain and any second-level domains.

See the examples below of common public suffixes:

  • .govt.nz
  • .ac.nz
  • .parliament.nz
  • .org.nz
  • .co.nz
  • .com
  • .net
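
A public suffix can be found by checking progressively shorter endings of a hostname against a list of known suffixes, so that the longest known suffix wins. The sketch below assumes a tiny hand-written suffix set built from the examples in this section; the set and the function name are hypothetical, and a real implementation would need a far more complete suffix list.

```python
# Illustrative subset only, based on the examples listed above.
KNOWN_SUFFIXES = {
    "govt.nz", "ac.nz", "parliament.nz", "org.nz", "co.nz",
    "nz", "com", "net",
}


def public_suffix(hostname):
    """Return the longest known public suffix of a hostname, or None."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate first, dropping one leading label each time.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in KNOWN_SUFFIXES:
            return candidate
    return None
```

Under these assumptions, natlib.govt.nz has the public suffix govt.nz (TLD plus SLD), while wordpress.com has the public suffix com (TLD only).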

Domain

The Domain facet refers to the registered domain name for a web page or document. A domain name consists of a hierarchical sequence of names separated by periods (dots) and ending with a top-level domain.

Figure 6. The URL components that are used in the Domain facet; the primary domain, and the top-level domain and any second-level domains.

See the examples below of common domain names:

  • natlib.govt.nz
  • elections.org.nz
  • scoop.co.nz
  • wordpress.com

Be aware that this facet does not include subdomains beyond the public suffix and initial domain. See the examples below of subdomains that would not show under the facet:

  • nzetc.victoria.ac.nz
  • news.google.co.nz
  • play.stuff.co.nz

To search specifically for subdomains, use the URL search with wildcards *, e.g. *nzetc.victoria.ac.nz*
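
The way subdomains collapse into a single Domain facet value can be sketched as "public suffix plus one label to its left". As before, the suffix set is a small illustrative subset and the function name is hypothetical; this is not the platform's actual implementation.

```python
# Illustrative subset only; a real suffix list is much longer.
KNOWN_SUFFIXES = {
    "govt.nz", "ac.nz", "parliament.nz", "org.nz", "co.nz",
    "nz", "com", "net",
}


def registered_domain(hostname):
    """Return the public suffix plus one label to its left, or None.

    Subdomains beyond that label are discarded, which mirrors how the
    Domain facet is described above.
    """
    labels = hostname.lower().rstrip(".").split(".")
    for start in range(len(labels)):
        if ".".join(labels[start:]) in KNOWN_SUFFIXES:
            if start == 0:
                # The hostname is itself a public suffix.
                return None
            return ".".join(labels[start - 1:])
    return None
```

Under these assumptions, nzetc.victoria.ac.nz is grouped under victoria.ac.nz, news.google.co.nz under google.co.nz, and play.stuff.co.nz under stuff.co.nz, which is why those subdomains do not appear as separate facet values.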

Search Results

Figure 7. An example search result.

Search results contain the following metadata fields: view, type, date collected, collections and sample. These are described below.

View

A link to view the web page or document within the National Digital Heritage Archive.

Type

The content type and language detected for the web page or document.

Date Collected

The date and time that the web page or document was harvested by the National Library.

Collections

Any collections that the web page or document belongs to within the New Zealand Web Archive.

Sample

A selection of text highlighting the first match of any search terms within the web page or document content.

National Library of New Zealand