Module 0342: Using cheerio and regex to change all hrefs to full URLs

In this module, we are learning how to use cheerio to utilize regular expressions to ﬁnd all href attributes that specify a relative path. For each relative path found, the relative path should be changed to a fully qualiﬁed absolute path.

To save you the trouble of starting from scratch, you can download appBS.js and use it as a starting point. Of course, if you have a working script from the "cheerio to the rescue" lab, you can use your own code as a starting point. Be sure to copy the old ﬁle to a new name so you do not lose the old ﬁle or to change the timestamp of the old ﬁle.

You can also download getThisCheerio.js, make a copy, and modify this script for testing purposes. Generally speaking, it is more convenient to test features in a script that does not respond to HTTP requests.

3 How to ﬁnd all the ’href’ attributes

cheerio, like JQuery, has a ’selector’ that can be used to get all the elements based on the presence of an attribute.

In the previous porject, we utilized a selector to ﬁnd an element based on the type of the element (’link’). This time, however, we are locating the elements based on the presence of an ’href’ attribute.

In the previous project, to ﬁnd all the ’link’ elements, the select was speciﬁed as ’link’. In this project, to ﬁnd all elements that have a href attribute, the selector is ’[href]’.

Modify getThisCheerio.js to locate elements with an href attribute, and print the number of such elements. The application of a selector returns an array of elements, and every array has a length member.

4 Iterating elements in an array

Unlike a more traditional programming language like C/C++, Javascript provides a very handy way to go through elements in an array.

There is the forEach prototype method that is deﬁned for every array. However, because forEach is a method of an array, it is technically not a control structure. A control structure is a programming language feature that speciﬁes the ﬂow of execution of a program. Because of this, forEach cannot handle async expressions.

An alternative is the for(... of ...) control structure. For example, if elements is the name of an array, then the following code can be used to iterate each item in the array:

Because we know each item of the array returned by applying a selector is an element with a href attribute, you can test your code by specifying console.log(item.attribs[’href’]) as the “do something with item here”.

5 Find out which href to modify

In the previous step, you ﬁnd that not all the href attributes need to be turned into full URL paths. Some are already full URLs. Even more interesting, some specify a bookmark (beginning with a pound # symbol).

This is where we need to utilize the regular expression from the last step of the "just your regular regular expression" lab. Because the href attribute of an element is a string, the match method can be applied to check to see if a href is a full URL or not.

Use a conditional statement to ﬁlter and only show the href attributes that are not full URLs.

6 Turning a relative href into an absolute one

If we already know the full URL of the web page that contains the relative URL paths, then we can simply insert the path of the web page before every relative href. In this case, we can insert https://power.arc.losrios.edu/ in front of all the relative href.

6.1 Finding the path of the container page

However, this simplistic method only works when we make an assumption about the URL of the containing page. The containing page can be speciﬁed in many diﬀerent ways. For example, the following are all equivalent:

What can be helpful is to apply the regular expression to each of these strings, and then try to ﬁgure out the logic to ﬁnd the path (without the ﬁle index.pl).

"https://power.arc.losrios.edu".match(/^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/) yields the following outcome (comments added after the fact):

Interestingly, the other two URLs give us exactly the same result. Note how element 4 (counting from zero) is undefined, indicating we are referring to the root of a server.

This means that if the result of match is captured in a variable cg (for captured groups), then the following logic can be used to recreate the full path:

Unfortunately, this approach does not always work. Consider the URL https://power.arc.losrios.edu/~auyeunt. This path actually speciﬁes a directory, and the index page is implicitly replied. Using our logic, however, it will create an URL that is the parent of the target folder.

The problem is that it is impossible to tell from a URL that does not end with a slash (/) whether the last item is a folder or not. A folder should be ended with a slash.

6.2 Diﬀerentiating a ﬁle from a folder

There is no way to diﬀerentiate a ﬁle from a folder. This is because a folder name may contain a period and an extension, just like a ﬁle. The only way to diﬀerentiate is to make a request.

In our example, https://power.arc.losrios.edu/~auyeunt is a folder, the Apache web server actually returns a code of 302 and a redirect address.

001"use strict";
002const https = require('https')
003
004function httpsRequestAwait(url, options, postData)
005{
006  return new Promise(
007    (resolve, reject) =>
008    {
009      let response = ''
010      let req = https.request(url, options,
011        (res) =>
012        {
013          res.on('data',
014            (d) =>
015            {
016              response += d
017            }
018          )
019          res.on('end',
020            () =>
021            {
022              resolve(response)
023            }
024          )
025        }
026      )
027      req.on('error',
028        (e) =>
029        {
030          reject(e)
031        }
032      )
033      req.write(postData)
034      req.end()
035    }
036  )
037}
038
039function httpIsFolderAwait(url)
040{
041  return new Promise(
042    (resolve, reject) =>
043    {
044      let req = https.request(url, {},
045        (res) =>
046        {
047          res.on('data',
048            (d) =>
049            {
050            }
051          )
052          res.on('end',
053            () =>
054            {
055              resolve(
056                res.statusCode == 301 &&
057                'location' in res.headers &&
058                res.headers['location'].match(/\/$/) != null
059              )
060            }
061          )
062        }
063      )
064      req.on('error',
065        (e) =>
066        {
067          reject(e)
068        }
069      )
070      req.write('')
071      req.end()
072    }
073  )
074}
075
076async function main()
077{
078  console.log('start')
079  console.log(await httpIsFolderAwait('https://power.arc.losrios.edu/~auyeunt'))
080  console.log('end')
081}
082
083main()

The httpIsFolderAwait function is an async function that returns true if and only if the URL parameter is a folder.

With this function, the logic to determine the path (excluding the ﬁle) of an URL is as follows, assuming the URL of the containing page is in variable pageUrl, and the ﬁle excluding URL will be in variable url:

001  let cg = pageUrl.match(/^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/)
002  let url = ''
003  if (cg !== null)
004  {
005   if (pageUrl.at(-1) == '/')
006   {
007     url = pageUrl
008   }
009   else
010   {
011     if (cg[4] === undefined)
012     {
013       if (cg.input == cg[0])
014       {
015         url = cg.input + '/'
016       }
017       else
018       {
019         if (await httpIsFolderAwait(pageUrl))
020         {
021           url = cg.input + '/'
022         }
023       }
024     }
025     else
026     {
027       if (await httpIsFolderAwait(pageUrl))
028       {
029         url = pageUrl + '/'
030       }
031       else
032       {
033         url = cg[0]
034       }
035     }
036   }
037  }

Integrate this logic to ﬁnd the preﬁx to add to all the relative href URLs. You can also turn this logic into an async function so it can be utilized from multiple places of a program.

001async function findUrlPath(pageUrl)
002{
003  let cg = pageUrl.match(/^https?:\/\/(\w+(\.\w+)*)(:\d+)?((\/\~?(\w|\.)+)(\/(\w|\.)+)*\/)?/)
004  let url = ''
005  if (cg !== null)
006  {
007   if (pageUrl.at(-1) == '/')
008   {
009     url = pageUrl
010   }
011   else
012   {
013     if (cg[4] === undefined)
014     {
015       if (cg.input == cg[0])
016       {
017         url = cg.input + '/'
018       }
019       else
020       {
021         if (await httpIsFolderAwait(pageUrl))
022         {
023           url = cg.input + '/'
024         }
025       }
026     }
027     else
028     {
029       if (await httpIsFolderAwait(pageUrl))
030       {
031         url = pageUrl + '/'
032       }
033       else
034       {
035         url = cg[0]
036       }
037     }
038   }
039  }
040  return url
041}

6.3 Change the href attributes and regenerate the HTML document

Now you have all the pieces to integrate into the copy of getThisCheerio.js. Generate the modiﬁed HTML document and check that all the href attributes are correct.