From 46702edd1564b98ec11e51b1acd4498f1b893b1f Mon Sep 17 00:00:00 2001 From: Tushar Banik Date: Thu, 28 Mar 2024 19:38:52 +0530 Subject: [PATCH] update medium test analysis --- output/medium/README.md | 438 ++++++++++++++++++++++++++++--------- output/medium/analysis.Rmd | 210 +++++++++++++----- 2 files changed, 483 insertions(+), 165 deletions(-) diff --git a/output/medium/README.md b/output/medium/README.md index e8d8e5a..d3bda92 100644 --- a/output/medium/README.md +++ b/output/medium/README.md @@ -3,55 +3,62 @@ Medium Test Analysis ## Introduction -This document demonstrates the analysis of XML data using R, focusing on -extracting information about movies from an XML string. The analysis -leverages the `XML` library in R to parse and manipulate XML data. +The medium test aims to **replicate** the **analysis** conducted in +Section 41.1.3 using functions from the **XML** package. This test +focuses on transforming XML data into a **structured** **data frame** +for further analysis and reporting. The XML package, being an older +package, offers a straightforward way to convert XML data into data +frames, which is particularly useful for handling nested XML +structures. The XML document provided in the test is a nested structure, +with each \<**node**\> element containing potentially **nested** +\<**node**\> elements (and some containing attributes). The goal is to +extract specific information from this XML document, such as the +**text** values, **length**, **attributes**, **name**, **children**, etc. +of the nested elements under the root element. ## Setting Up the Environment -## XML Data - -The XML string contains information about two movies, including their -titles, directors, release years, and genres. The structure of the XML -string is hierarchical, with each movie enclosed within `` tags. 
+### **Section 1: Loading Libraries and parsing XML Content** ``` r library(XML) + xml_content <- c( - '', - "", - '', - "Good Will Hunting", - "", - "Gus", - "Van Sant", - "", - "1998", - "drama", - "", - '', - "Y tu mama tambien", - "", - "Alfonso", - "Cuaron", - "", - "2001", - "drama", - "", - "" + '', + '', + '', + 'Good Will Hunting', + '', + 'Gus', + 'Van Sant', + '', + '1998', + 'drama', + '', + '', + 'Y tu mama tambien', + '', + 'Alfonso', + 'Cuaron', + '', + '2001', + 'drama', + '', + '' ) ``` -## Parsing XML Data +**Explanation** -To analyze the XML data, we first need to parse it into an R object. The -`xmlTreeParse` function from the `XML` library is used for this purpose. -This function converts the XML string into an XML document object, which -can then be manipulated using R. +- The **XML** library is loaded to handle XML data in R. + +- An XML document representing a list of movies is defined as a + character vector, including details like **title**, **director**, + **year**, and **genre**. ``` r -xml_doc <- xmlTreeParse(paste(xml_content, collapse = ""), useInternalNodes = TRUE) -print(xml_doc) +doc <- xmlTreeParse(paste(xml_content, collapse = ''), useInternalNodes = TRUE) +doc ``` ## @@ -77,110 +84,327 @@ print(xml_doc) ## ## -## Extracting Movie Information +**Explanation** + +- The **xmlTreeParse** function from the XML package is used to parse + the XML string into an XML document object. + +- The **paste** function with **collapse = ''** concatenates the + elements of `xml_content` into a single string before parsing. + +- The **useInternalNodes** = **TRUE** argument specifies that the + function should build the tree from internal (C-level) nodes instead + of R-level list objects, which is more efficient for extracting parts of the XML + document[1](https://stackoverflow.com/questions/20684507/in-r-xml-package-what-is-the-difference-between-xmlparse-and-xmltreeparse). + +- The **parsed** XML document is stored in the variable `doc`. 
-To extract information about the movies, we use the `xmlRoot` function -to access the root node of the XML document. We then iterate over the -child nodes of the root node, which represent the movies, and extract -their information. +### **Section 2:** Navigation of XML Tree + +#### 2.1 Access the root Node ``` r -movies_node <- xmlRoot(xml_doc) +# Get the root node of the XML document +movies <- xmlRoot(doc) +movies +``` -cat("Root Node Name:", xmlName(movies_node), "\n") + ## + ## + ## Good Will Hunting + ## + ## Gus + ## Van Sant + ## + ## 1998 + ## drama + ## + ## + ## Y tu mama tambien + ## + ## Alfonso + ## Cuaron + ## + ## 2001 + ## drama + ## + ## + +``` r +# Check if the XML document and the root node are identical +identical(doc, movies) ``` - ## Root Node Name: movies + ## [1] FALSE + +It turns out that `doc` and `movies` are not actually identical. + +**Explanation** + +- The **xmlRoot** function extracts the root node of the XML document, + which is stored in `movies`. + +- The **identical** function checks whether the root node is the same + object as the parsed document; it returns `FALSE` because `doc` is + the whole document while `movies` is only its root node. 
+ +#### 2.2 Access the children of movies node ``` r -root_attrs <- xmlAttrs(movies_node) +# Access the child nodes of the root node +xmlChildren(movies) +``` + + ## $movie + ## + ## Good Will Hunting + ## + ## Gus + ## Van Sant + ## + ## 1998 + ## drama + ## + ## + ## $movie + ## + ## Y tu mama tambien + ## + ## Alfonso + ## Cuaron + ## + ## 2001 + ## drama + ## + ## + ## attr(,"class") + ## [1] "XMLInternalNodeList" "XMLNodeList" -cat("Root Node Attributes:", "\n") +``` r +# Access the first movie node +good_will <- xmlChildren(movies)[[1]] +good_will ``` - ## Root Node Attributes: + ## + ## Good Will Hunting + ## + ## Gus + ## Van Sant + ## + ## 1998 + ## drama + ## ``` r -print(root_attrs) +# Access the second movie node +tu_mama <- xmlChildren(movies)[[2]] +tu_mama ``` - ## NULL + ## + ## Y tu mama tambien + ## + ## Alfonso + ## Cuaron + ## + ## 2001 + ## drama + ## + +**Explanation** + +- **xmlChildren(movies)** retrieves the child nodes of the node + “movies”. + +- **xmlChildren(movies)\[\[1\]\]** accesses the first movie node from + the child nodes of “movies”. + +- **xmlChildren(movies)\[\[2\]\]** accesses the second movie node from + the child nodes of “movies”. 
+ +### **Section 3:** Inspecting first node + +#### 3.1 Inspecting contents of the children of movies node ``` r -movie_nodes <- xmlChildren(movies_node) +# Access the children nodes of 'good_will' +xmlChildren(good_will) ``` -## Iterate through each Movie child node and display Information + ## $title + ## Good Will Hunting + ## + ## $director + ## + ## Gus + ## Van Sant + ## + ## + ## $year + ## 1998 + ## + ## $genre + ## drama + ## + ## attr(,"class") + ## [1] "XMLInternalNodeList" "XMLNodeList" ``` r -for (i in seq_along(movie_nodes)) { - movie_node <- movie_nodes[[i]] +# Access the children nodes of 'tu_mama' +xmlChildren(tu_mama) +``` - cat("Movie Node", i, "Name:", xmlName(movie_node), "\n") + ## $title + ## Y tu mama tambien + ## + ## $director + ## + ## Alfonso + ## Cuaron + ## + ## + ## $year + ## 2001 + ## + ## $genre + ## drama + ## + ## attr(,"class") + ## [1] "XMLInternalNodeList" "XMLNodeList" - movie_attrs <- xmlAttrs(movie_node) +``` r +# Get the name of the 'good_will' node +xmlName(good_will) +``` - cat("Movie Node", i, "Attributes:", "\n") + ## [1] "movie" - print(movie_attrs) +``` r +# Get the attributes of the 'good_will' node +xmlAttrs(good_will) +``` - movie_children <- xmlChildren(movie_node) + ## mins lang + ## "126" "eng" + +``` r +# Get the size (number of children) of the 'good_will' node +xmlSize(good_will) +``` + + ## [1] 4 - for (j in seq_along(movie_children)) { - child_node <- movie_children[[j]] +**Explanation** - cat("Child Node", j, "Name:", xmlName(child_node), "\n") +- The **xmlName** function is used to get the name of the **good_will** + node. - cat("Child Node", j, "Content:", xmlValue(child_node), "\n") +- The **xmlAttrs** function is used to get the attributes of the + **good_will** node (`mins` and `lang`). - child_attrs <- xmlAttrs(child_node) +- The **xmlChildren** function lists the child nodes of each movie node: + `title`, `director`, `year`, and `genre`. 
- cat("Child Node", j, "Attributes:", "\n") - print(child_attrs) - } +#### 3.2 Inspecting contents of good_will node - cat("\n") +``` r +# Iterate over each child node of 'good_will' and print their names +children_nodes <- xmlChildren(good_will) +for (node in children_nodes) { + print(xmlName(node)) } ``` - ## Movie Node 1 Name: movie - ## Movie Node 1 Attributes: - ## mins lang - ## "126" "eng" - ## Child Node 1 Name: title - ## Child Node 1 Content: Good Will Hunting - ## Child Node 1 Attributes: - ## NULL - ## Child Node 2 Name: director - ## Child Node 2 Content: GusVan Sant - ## Child Node 2 Attributes: - ## NULL - ## Child Node 3 Name: year - ## Child Node 3 Content: 1998 - ## Child Node 3 Attributes: - ## NULL - ## Child Node 4 Name: genre - ## Child Node 4 Content: drama - ## Child Node 4 Attributes: - ## NULL + ## [1] "title" + ## [1] "director" + ## [1] "year" + ## [1] "genre" + +``` r +# Access the title node of 'good_will' +title1 <- xmlChildren(good_will)[["title"]] +title1 +``` + + ## Good Will Hunting + +``` r +# Access the children nodes of 'title1' +xmlChildren(title1) +``` + + ## $text + ## Good Will Hunting ## - ## Movie Node 2 Name: movie - ## Movie Node 2 Attributes: - ## mins lang - ## "106" "spa" - ## Child Node 1 Name: title - ## Child Node 1 Content: Y tu mama tambien - ## Child Node 1 Attributes: - ## NULL - ## Child Node 2 Name: director - ## Child Node 2 Content: AlfonsoCuaron - ## Child Node 2 Attributes: - ## NULL - ## Child Node 3 Name: year - ## Child Node 3 Content: 2001 - ## Child Node 3 Attributes: - ## NULL - ## Child Node 4 Name: genre - ## Child Node 4 Content: drama - ## Child Node 4 Attributes: - ## NULL + ## attr(,"class") + ## [1] "XMLInternalNodeList" "XMLNodeList" + +``` r +# Get the text content of 'title1' +xmlValue(title1) +``` + + ## [1] "Good Will Hunting" + +**Explanation** + +- **xmlChildren(good_will)** retrieves the child nodes of the + ‘good_will’ node. 
+ +- **xmlChildren(good_will)\[\["title"\]\]** accesses the ‘title’ node + within the ‘good_will’ node. + +- **xmlChildren(title1)** accesses the child nodes of the ‘title1’ node. + +- **xmlValue(title1)** extracts the text content of the ‘title1’ node, + representing the title of the movie. + +### **Section 4:** Inspecting director node + +``` r +# Access the director node of 'good_will' +dir1 <- xmlChildren(good_will)[["director"]] +dir1 +``` + + ## + ## Gus + ## Van Sant + ## + +``` r +# Access the children nodes of 'dir1' +xmlChildren(dir1) +``` + + ## $first_name + ## Gus + ## + ## $last_name + ## Van Sant + ## + ## attr(,"class") + ## [1] "XMLInternalNodeList" "XMLNodeList" + +``` r +# Get the text content of 'dir1' +xmlValue(dir1) +``` + + ## [1] "GusVan Sant" + +**Explanation** + +- **xmlChildren(good_will)\[\["director"\]\]** accesses the ‘director’ + node within the ‘good_will’ node. + +- **xmlChildren(dir1)** accesses the child nodes of the ‘dir1’ node. + +- **xmlValue(dir1)** extracts the concatenated text content of the ‘dir1’ node; + the first and last names are joined without a separator ("GusVan Sant"). + +The **results** obtained from the code above can be compared with +the example outlined in [Section 41.1.3 of +Computing with +Data](https://www.gastonsanchez.com/intro2cwd/parsing.html#navigation-of-xml-html-tree) diff --git a/output/medium/analysis.Rmd b/output/medium/analysis.Rmd index dc4148d..9f54363 100644 --- a/output/medium/analysis.Rmd +++ b/output/medium/analysis.Rmd @@ -5,7 +5,7 @@ output: github_document ## Introduction -This document demonstrates the analysis of XML data using R, focusing on extracting information about movies from an XML string. The analysis leverages the `XML` library in R to parse and manipulate XML data. +The medium test aims to **replicate** the **analysis** conducted in Section 41.1.3 using functions from the **XML** package. 
This test focuses on transforming XML data into a **structured** **data frame** for further analysis and reporting. The XML package, being an older package, offers a straightforward way to convert XML data into data frames, which is particularly useful for handling nested XML structures. The XML document provided in the test is a nested structure, with each \<**node**\> element containing potentially **nested** \<**node**\> elements (and some containing attributes). The goal is to extract specific information from this XML document, such as the **text** values, **length**, **attributes**, **name**, **children**, etc. of the nested elements under the root element. ## Setting Up the Environment @@ -13,93 +13,187 @@ This document demonstrates the analysis of XML data using R, focusing on extract knitr::opts_chunk$set(echo = TRUE) ``` -## XML Data +### [**Section 1: Loading Libraries and parsing XML Content**]{.underline} -The XML string contains information about two movies, including their titles, directors, release years, and genres. The structure of the XML string is hierarchical, with each movie enclosed within `` tags. +```{r message=FALSE} + -```{r xml_string, echo=TRUE} library(XML) + xml_content <- c( - '', - "", - '', - "Good Will Hunting", - "", - "Gus", - "Van Sant", - "", - "1998", - "drama", - "", - '', - "Y tu mama tambien", - "", - "Alfonso", - "Cuaron", - "", - "2001", - "drama", - "", - "" + '', + '', + '', + 'Good Will Hunting', + '', + 'Gus', + 'Van Sant', + '', + '1998', + 'drama', + '', + '', + 'Y tu mama tambien', + '', + 'Alfonso', + 'Cuaron', + '', + '2001', + 'drama', + '', + '' ) ``` -## Parsing XML Data +**Explanation** + +- The **XML** library is loaded to handle XML data in R. + +- An XML document representing a list of movies is defined as a character vector, including details like **title**, **director**, **year**, and **genre**. -To analyze the XML data, we first need to parse it into an R object. 
The `xmlTreeParse` function from the `XML` library is used for this purpose. This function converts the XML string into an XML document object, which can then be manipulated using R. +```{r message=FALSE} -```{r read_xml, echo=TRUE} -xml_doc <- xmlTreeParse(paste(xml_content, collapse = ""), useInternalNodes = TRUE) -print(xml_doc) +doc <- xmlTreeParse(paste(xml_content, collapse = ''), useInternalNodes = TRUE) +doc ``` -## Extracting Movie Information +**Explanation** -To extract information about the movies, we use the `xmlRoot` function to access the root node of the XML document. We then iterate over the child nodes of the root node, which represent the movies, and extract their information. +- The **xmlTreeParse** function from the XML package is used to parse the XML string into an XML document object. -```{r extract_movie, echo=TRUE} -movies_node <- xmlRoot(xml_doc) +- The **paste** function with **collapse = ''** concatenates the elements of `xml_content` into a single string before parsing. -cat("Root Node Name:", xmlName(movies_node), "\n") +- The **useInternalNodes** = **TRUE** argument specifies that the function should build the tree from internal (C-level) nodes instead of R-level list objects, which is more efficient for extracting parts of the XML document[1](https://stackoverflow.com/questions/20684507/in-r-xml-package-what-is-the-difference-between-xmlparse-and-xmltreeparse). 
-root_attrs <- xmlAttrs(movies_node) +- The **parsed** XML document is stored in the variable `doc`.\ -cat("Root Node Attributes:", "\n") +### [**Section 2:** Navigation of XML Tree]{.underline} -print(root_attrs) +#### [2.1 Access the root Node]{.underline} -movie_nodes <- xmlChildren(movies_node) +```{r message=FALSE} + +# Get the root node of the XML document +movies <- xmlRoot(doc) +movies + +# Check if the XML document and the root node are identical +identical(doc, movies) ``` -## Iterate through each Movie child node and display Information +It turns out that `doc` and `movies` are not actually identical. + +**Explanation** + +- The **xmlRoot** function extracts the root node of the XML document, which is stored in `movies`. + +- The **identical** function checks whether the root node is the same object as the parsed document; it returns `FALSE` because `doc` is the whole document while `movies` is only its root node. + +#### [2.2 Access the children of movies node]{.underline} + +```{r message=FALSE} + +# Access the child nodes of the root node +xmlChildren(movies) + +# Access the first movie node +good_will <- xmlChildren(movies)[[1]] +good_will + +# Access the second movie node +tu_mama <- xmlChildren(movies)[[2]] +tu_mama +``` + +**Explanation** + +- **xmlChildren(movies)** retrieves the child nodes of the node "movies". + +- **xmlChildren(movies)\[\[1\]\]** accesses the first movie node from the child nodes of "movies". + +- **xmlChildren(movies)\[\[2\]\]** accesses the second movie node from the child nodes of "movies". 
+ +### [**Section 3:** Inspecting first node]{.underline} + +#### [3.1 Inspecting contents of the children of movies node]{.underline} + +```{r message=FALSE} -```{r display_movie_info, echo=TRUE} -for (i in seq_along(movie_nodes)) { - movie_node <- movie_nodes[[i]] +# Access the children nodes of 'good_will' +xmlChildren(good_will) - cat("Movie Node", i, "Name:", xmlName(movie_node), "\n") +# Access the children nodes of 'tu_mama' +xmlChildren(tu_mama) - movie_attrs <- xmlAttrs(movie_node) +# Get the name of the 'good_will' node +xmlName(good_will) - cat("Movie Node", i, "Attributes:", "\n") +# Get the attributes of the 'good_will' node +xmlAttrs(good_will) - print(movie_attrs) +# Get the size (number of children) of the 'good_will' node +xmlSize(good_will) +``` - movie_children <- xmlChildren(movie_node) +**Explanation** - for (j in seq_along(movie_children)) { - child_node <- movie_children[[j]] +- The **xmlName** function is used to get the name of the **good_will** node. - cat("Child Node", j, "Name:", xmlName(child_node), "\n") +- The **xmlAttrs** function is used to get the attributes of the **good_will** node (`mins` and `lang`). - cat("Child Node", j, "Content:", xmlValue(child_node), "\n") +- The **xmlChildren** function lists the child nodes of each movie node: `title`, `director`, `year`, and `genre`. 
- child_attrs <- xmlAttrs(child_node) +#### [3.2 Inspecting contents of good_will node]{.underline} - cat("Child Node", j, "Attributes:", "\n") - print(child_attrs) - } +```{r message=FALSE} - cat("\n") +# Iterate over each child node of 'good_will' and print their names +children_nodes <- xmlChildren(good_will) +for (node in children_nodes) { + print(xmlName(node)) } -``` \ No newline at end of file + +# Access the title node of 'good_will' +title1 <- xmlChildren(good_will)[["title"]] +title1 + +# Access the children nodes of 'title1' +xmlChildren(title1) + +# Get the text content of 'title1' +xmlValue(title1) +``` + +**Explanation** + +- **xmlChildren(good_will)** retrieves the child nodes of the 'good_will' node. + +- **xmlChildren(good_will)\[\["title"\]\]** accesses the 'title' node within the 'good_will' node. + +- **xmlChildren(title1)** accesses the child nodes of the 'title1' node. + +- **xmlValue(title1)** extracts the text content of the 'title1' node, representing the title of the movie. + +### [**Section 4:** Inspecting director node]{.underline} + +```{r message=FALSE} + +# Access the director node of 'good_will' +dir1 <- xmlChildren(good_will)[["director"]] +dir1 + +# Access the children nodes of 'dir1' +xmlChildren(dir1) + +# Get the text content of 'dir1' +xmlValue(dir1) +``` + +**Explanation** + +- **xmlChildren(good_will)\[\["director"\]\]** accesses the 'director' node within the 'good_will' node. + +- **xmlChildren(dir1)** accesses the child nodes of the 'dir1' node. + +- **xmlValue(dir1)** extracts the concatenated text content of the 'dir1' node; the first and last names are joined without a separator ("GusVan Sant"). + +The **results** obtained from the code above can be compared with the example outlined in [Section 41.1.3 of Computing with Data](https://www.gastonsanchez.com/intro2cwd/parsing.html#navigation-of-xml-html-tree) \ No newline at end of file