Module 0342: Using cheerio and regex to change all hrefs to full URLs

Tak Auyeung, Ph.D.

November 30, 2021

Contents

 1 About this module
 2 Objectives
 3 How to find all the 'href' attributes
 4 Iterating elements in an array
 5 Find out which href to modify
 6 Turning a relative href into an absolute one
  6.1 Finding the path of the container page
  6.2 Differentiating a file from a folder
  6.3 Change the href attributes and regenerate the HTML document

1 About this module

2 Objectives

In this module, we learn how to use cheerio together with regular expressions to find all href attributes that specify a relative path. Each relative path found should be changed to a fully qualified absolute URL.

To save you the trouble of starting from scratch, you can download appBS.js and use it as a starting point. Of course, if you have a working script from the "cheerio to the rescue" lab, you can use your own code as a starting point. Be sure to copy the old file to a new name so that you neither lose the old file nor change its timestamp.

You can also download getThisCheerio.js, make a copy, and modify this script for testing purposes. Generally speaking, it is more convenient to test features in a script that does not respond to HTTP requests.

3 How to find all the 'href' attributes

cheerio, like jQuery, has a 'selector' that can be used to get all the elements based on the presence of an attribute.

In the previous project, we utilized a selector to find an element based on the type of the element ('link'). This time, however, we are locating the elements based on the presence of an 'href' attribute.

In the previous project, to find all the 'link' elements, the selector was specified as 'link'. In this project, to find all elements that have an href attribute, the selector is '[href]'.

Modify getThisCheerio.js to locate elements with an href attribute, and print the number of such elements. Applying a selector returns an array-like collection of elements, and, like every array, it has a length member.

4 Iterating elements in an array

Unlike a more traditional programming language like C/C++, JavaScript provides a very handy way to go through elements in an array.

Every array has a forEach method. However, because forEach is a method, it is technically not a control structure, that is, a programming language feature that specifies the flow of execution of a program. As a result, forEach does not work well with asynchronous code: an await inside the callback pauses only that callback, and forEach itself does not wait for the callback to finish before moving on.

An alternative is the for(... of ...) control structure. For example, if elements is the name of an array, then the following code can be used to iterate over each item in the array:

for (let item of elements)
{
  // do something with item here
}

Because we know each item of the array returned by applying a selector is an element with an href attribute, you can test your code by specifying console.log(item.attribs['href']) as the "do something with item here".
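The difference between forEach and for(... of ...) shows up as soon as the loop body uses await. The sketch below is an illustration only: delayed is a hypothetical helper that stands in for any asynchronous operation, such as an HTTP request.

```javascript
// Hypothetical helper: resolves with the given value after a short delay.
function delayed(value)
{
  return new Promise((resolve) => setTimeout(() => resolve(value), 10))
}

async function compareLoops(items)
{
  const log = []

  // forEach starts every callback, but it does not wait for any of them.
  items.forEach(async (item) =>
  {
    log.push('forEach got ' + await delayed(item))
  })
  log.push('forEach returned')

  // for...of is a control structure, so await pauses the enclosing function
  // and the items are handled one after another.
  for (const item of items)
  {
    log.push('for-of got ' + await delayed(item))
  }
  log.push('for-of finished')
  return log
}

compareLoops(['a', 'b']).then((log) => console.log(log))
```

The entry 'forEach returned' is logged before any of the forEach callbacks has produced a value, whereas 'for-of finished' is logged only after every item has been processed in order.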

5 Find out which href to modify

In the previous step, you saw that not all the href attributes need to be turned into full URL paths. Some are already full URLs. Even more interesting, some specify a bookmark (beginning with a pound # symbol).

This is where we need to utilize the regular expression from the last step of the "just your regular regular expression" lab. Because the href attribute of an element is a string, the match method can be applied to check whether an href is a full URL or not.

Use a conditional statement to filter and only show the href attributes that are not full URLs.
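One possible shape for this filter is sketched below. It assumes the lab's regular expression is the same pattern that appears in section 6.1 of this module, and it treats bookmarks as a separate category; both are assumptions. The function name classifyHref and the sample hrefs are hypothetical.

```javascript
// Assumed to be the pattern from the earlier lab (it is the one used in section 6.1).
const fullUrlPattern = /^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/

// Hypothetical helper: returns 'full', 'bookmark', or 'relative'.
function classifyHref(href)
{
  if (href.match(fullUrlPattern) !== null)
  {
    return 'full'
  }
  if (href.startsWith('#'))
  {
    return 'bookmark'
  }
  return 'relative'
}

// Made-up hrefs, for illustration only.
const sampleHrefs = ['https://power.arc.losrios.edu/', '#top', 'style.css', '../images/logo.png']
for (const href of sampleHrefs)
{
  const kind = classifyHref(href)
  if (kind !== 'full')
  {
    console.log(kind + ': ' + href)
  }
}
```

Only the first sample is skipped, because it is already a full URL. Whether a bookmark should be rewritten is a design decision; this sketch merely labels it so that it can be handled separately.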

6 Turning a relative href into an absolute one

If we already know the full URL of the web page that contains the relative URL paths, then we can simply insert the path of the web page before every relative href. In this case, we can insert https://power.arc.losrios.edu/ in front of every relative href.
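A minimal sketch of this simple approach follows. The function name prependPagePath and the sample hrefs are hypothetical, and the sketch assumes the containing page lives at the root of the server.

```javascript
// Assumption: the containing page is known to be at the root of this server.
const pagePath = 'https://power.arc.losrios.edu/'

// Hypothetical helper: put the known page path in front of a relative href.
function prependPagePath(href)
{
  return pagePath + href
}

// Made-up relative hrefs, for illustration only.
for (const href of ['style.css', 'labs/intro.html'])
{
  console.log(prependPagePath(href))
}
```

This sketch does not treat an href that begins with a slash specially; such an href would end up with a doubled slash.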

6.1 Finding the path of the container page

However, this simplistic method only works when we make an assumption about the URL of the containing page. The containing page can be specified in many different ways. For example, the following are all equivalent:

What can be helpful is to apply the regular expression to each of these strings, and then try to figure out the logic to find the path (without the file index.pl).

"https://power.arc.losrios.edu".match(/^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/) yields the following outcome (comments added after the fact):

[
  'https://power.arc.losrios.edu',
  'power.arc.losrios.edu',
  '.edu',
  undefined, // port number
  undefined, // path
  undefined,
  undefined,
  undefined,
  undefined,
  index: 0,
  input: 'https://power.arc.losrios.edu',
  groups: undefined
]
  

Interestingly, the other two URLs give us exactly the same result. Note how element 4 (counting from zero) is undefined, indicating we are referring to the root of a server.

This means that if the result of match is captured in a variable cg (for captured groups), then the following logic can be used to recreate the full path:

  let url = ''
  if (cg !== null) // it is a match
  {
    url = cg[0]
    if (cg[4] === undefined)
    {
      url = url + '/'
    }
    // otherwise cg[0] already ends with the path captured in cg[4],
    // including its trailing slash, so nothing needs to be appended
  }

Unfortunately, this approach does not always work. Consider the URL https://power.arc.losrios.edu/~auyeunt. This path actually specifies a directory, and the index page is served implicitly. Our logic, however, produces a URL that is the parent of the target folder.

The problem is that, when a URL does not end with a slash (/), it is impossible to tell from the URL whether the last item is a folder. A URL that names a folder should end with a slash.
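The limitation can be reproduced without any network access. The sketch below wraps the match-based logic in a hypothetical function named naivePath; when a path is captured, it uses the whole match cg[0] as-is, because cg[0] already contains the path.

```javascript
const urlPattern = /^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/

// Hypothetical wrapper around the logic described above.
function naivePath(pageUrl)
{
  const cg = pageUrl.match(urlPattern)
  let url = ''
  if (cg !== null)
  {
    url = cg[0]
    if (cg[4] === undefined)
    {
      url = url + '/'   // no path was captured: assume the root of the server
    }
  }
  return url
}

console.log(naivePath('https://power.arc.losrios.edu'))
console.log(naivePath('https://power.arc.losrios.edu/~auyeunt/'))
console.log(naivePath('https://power.arc.losrios.edu/~auyeunt'))
```

The first two calls give the expected folder URLs. The third call returns https://power.arc.losrios.edu/, the parent of the intended folder, because the path group of the pattern only matches text that ends with a slash.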

6.2 Differentiating a file from a folder

There is no way to differentiate a file from a folder by looking at the URL alone. This is because a folder name may contain a period and an extension, just like a file name. The only way to tell them apart is to make a request.

In our example, https://power.arc.losrios.edu/~auyeunt is a folder, and the Apache web server responds with status code 301 and a redirect address that ends with a slash.

The following complete script makes such a request and checks the response.

 
"use strict";
const https = require('https')

// Send an HTTPS request and resolve with the response body as a string.
// postData must be a string or a Buffer (use '' when there is no body).
function httpsRequestAwait(url, options, postData)
{
  return new Promise(
    (resolve, reject) =>
    {
      let response = ''
      let req = https.request(url, options,
        (res) =>
        {
          res.on('data',
            (d) =>
            {
              response += d
            }
          )
          res.on('end',
            () =>
            {
              resolve(response)
            }
          )
        }
      )
      req.on('error',
        (e) =>
        {
          reject(e)
        }
      )
      req.write(postData)
      req.end()
    }
  )
}

// Resolve with true when the server answers with a 301 redirect whose
// location ends with a slash, which is how a folder is recognized here.
function httpIsFolderAwait(url)
{
  return new Promise(
    (resolve, reject) =>
    {
      let req = https.request(url, {},
        (res) =>
        {
          res.on('data',
            (d) =>
            {
              // the body is not needed, but it must be consumed for 'end' to fire
            }
          )
          res.on('end',
            () =>
            {
              resolve(
                res.statusCode == 301 &&
                'location' in res.headers &&
                res.headers['location'].match(/\/$/) != null
              )
            }
          )
        }
      )
      req.on('error',
        (e) =>
        {
          reject(e)
        }
      )
      req.write('')
      req.end()
    }
  )
}

async function main()
{
  console.log('start')
  console.log(await httpIsFolderAwait('https://power.arc.losrios.edu/~auyeunt'))
  console.log('end')
}

main()

You can also download getThisDir.js.

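The httpIsFolderAwait function returns a promise, so it can be used with await. The promise resolves to true if and only if the server answers with a 301 redirect to a location that ends with a slash, which indicates that the URL parameter is a folder.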
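The decision made inside httpIsFolderAwait can be isolated as a pure function, which makes it possible to try it out without contacting a server. This is only a sketch: the name isFolderRedirect is hypothetical, and the status codes and headers passed to it below are hand-made stand-ins rather than captured server responses.

```javascript
// Hypothetical helper: the same test that httpIsFolderAwait applies to a response.
function isFolderRedirect(statusCode, headers)
{
  return statusCode == 301 &&
    'location' in headers &&
    headers['location'].match(/\/$/) != null
}

// A redirect to the same path with a trailing slash counts as a folder.
console.log(isFolderRedirect(301, { location: 'https://power.arc.losrios.edu/~auyeunt/' }))

// An ordinary response without a redirect does not.
console.log(isFolderRedirect(200, {}))
```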

With httpIsFolderAwait, the logic to determine the path (excluding the file) of a URL is as follows, assuming the URL of the containing page is in the variable pageUrl and the resulting URL, without the file, is stored in the variable url. Because the code uses await, it must run inside an async function.

  let cg = pageUrl.match(/^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/)
  let url = ''
  if (cg !== null)
  {
    if (pageUrl.at(-1) == '/')
    {
      url = pageUrl // already ends with a slash, so it is a folder
    }
    else
    {
      if (cg[4] === undefined)
      {
        if (cg.input == cg[0])
        {
          url = cg.input + '/' // server name only
        }
        else
        {
          if (await httpIsFolderAwait(pageUrl))
          {
            url = cg.input + '/'
          }
          else
          {
            url = cg[0] + '/' // a file at the root of the server
          }
        }
      }
      else
      {
        if (await httpIsFolderAwait(pageUrl))
        {
          url = pageUrl + '/'
        }
        else
        {
          url = cg[0] // cg[0] ends with the slash that closes the path
        }
      }
    }
  }

Integrate this logic to find the prefix to add to all the relative href URLs. You can also turn this logic into an async function so it can be utilized from multiple places of a program.

async function findUrlPath(pageUrl)
{
  let cg = pageUrl.match(/^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/)
  let url = ''
  if (cg !== null)
  {
    if (pageUrl.at(-1) == '/')
    {
      url = pageUrl // already ends with a slash, so it is a folder
    }
    else
    {
      if (cg[4] === undefined)
      {
        if (cg.input == cg[0])
        {
          url = cg.input + '/' // server name only
        }
        else
        {
          if (await httpIsFolderAwait(pageUrl))
          {
            url = cg.input + '/'
          }
          else
          {
            url = cg[0] + '/' // a file at the root of the server
          }
        }
      }
      else
      {
        if (await httpIsFolderAwait(pageUrl))
        {
          url = pageUrl + '/'
        }
        else
        {
          url = cg[0] // cg[0] ends with the slash that closes the path
        }
      }
    }
  }
  return url // an empty string means pageUrl did not match the pattern
}
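To experiment with this logic offline, one option is a variant that receives the folder test as a parameter. The sketch below is such a variant, not the function above: findUrlPathWith and fakeIsFolder are hypothetical names, the file name index.html is made up, and the variant assumes that a file at the root of the server should yield the server URL followed by a slash.

```javascript
const urlPattern = /^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/

// Hypothetical variant: isFolder stands in for httpIsFolderAwait.
async function findUrlPathWith(pageUrl, isFolder)
{
  const cg = pageUrl.match(urlPattern)
  if (cg === null)
  {
    return ''                 // not a full URL
  }
  if (pageUrl.endsWith('/'))
  {
    return pageUrl            // already a folder URL
  }
  if (cg.input == cg[0])
  {
    return pageUrl + '/'      // server name only
  }
  if (await isFolder(pageUrl))
  {
    return pageUrl + '/'      // the last item is a folder
  }
  // The last item is a file: keep everything up to the slash before it.
  return cg[4] === undefined ? cg[0] + '/' : cg[0]
}

async function demo()
{
  // Stand-in for a real request: pretend only ~auyeunt is a folder.
  const fakeIsFolder = async (url) => url.endsWith('/~auyeunt')
  return [
    await findUrlPathWith('https://power.arc.losrios.edu', fakeIsFolder),
    await findUrlPathWith('https://power.arc.losrios.edu/~auyeunt', fakeIsFolder),
    await findUrlPathWith('https://power.arc.losrios.edu/~auyeunt/index.html', fakeIsFolder)
  ]
}

demo().then((paths) => console.log(paths))
```

With the stand-in folder test, the first URL resolves to the server root, and the other two both resolve to the ~auyeunt folder with a trailing slash.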

6.3 Change the href attributes and regenerate the HTML document

Now you have all the pieces to integrate into the copy of getThisCheerio.js. Generate the modified HTML document and check that all the href attributes are correct.