From 84aeef5aa26b588e43c92e79b126ce2b21e3f3cc Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlos=20Arcila=20Calder=C3=B3n?=
Date: Mon, 4 Dec 2023 09:37:03 -0600
Subject: [PATCH] Update chapter15.qmd

minor correction of cross reference
---
 content/chapter15.qmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/chapter15.qmd b/content/chapter15.qmd
index 36dee11..4f9fbd5 100644
--- a/content/chapter15.qmd
+++ b/content/chapter15.qmd
@@ -346,7 +346,7 @@ for instance using *letsencrypt*.

 Finally, option 4 must be selected when the scale of your data and the complexity of your tasks cannot be handled by a single server or virtual machine. For example, building a classification model by training a complex deep convolutional neural network on millions of images, and updating this model constantly, may require the use of several computers at the same time. In fact, on modern computers with multiple cores or processors you normally run parallel computing within a single machine. But when working at scale you will probably need to set up an infrastructure of multiple computers, such as a *grid* or a *cluster*.

-Cloud services (e.g., AWS, Microsoft Azure) or scientific infrastructures (e.g., supercomputers) offer the possibility of setting up these architectures remotely. For instance, in a computer cluster you can configure a group of virtual computers, where one will act as the *main* node and the others as *workers*. With this logic the main node can distribute the storage and analysis of data among the workers and then combine the results: see for example the *MapReduce* or the *Resilient Distributed Dataset* (RDD) approaches used by the open-source software *Apache Hadoop* and *Apache Spark*, respectively. For a specific example of parallel computing in the computational analysis of communication, take a look at the implementation of distributed supervised sentiment analysis in which one of the authors of this book deployed supervised text classification in *Apache Spark* and connected this infrastructure with real-time analysis of tweets using *Apache Kafka* in order to perform streaming analytics@calderon2019distributed.
+Cloud services (e.g., AWS, Microsoft Azure) or scientific infrastructures (e.g., supercomputers) offer the possibility of setting up these architectures remotely. For instance, in a computer cluster you can configure a group of virtual computers, where one will act as the *main* node and the others as *workers*. With this logic the main node can distribute the storage and analysis of data among the workers and then combine the results: see for example the *MapReduce* or the *Resilient Distributed Dataset* (RDD) approaches used by the open-source software *Apache Hadoop* and *Apache Spark*, respectively. For a specific example of parallel computing in the computational analysis of communication, take a look at the implementation of distributed supervised sentiment analysis in which one of the authors of this book deployed supervised text classification in *Apache Spark* and connected this infrastructure with real-time analysis of tweets using *Apache Kafka* in order to perform streaming analytics (@calderon2019distributed).

 These architectures for parallel processing will significantly increase your computing capacity for big data problems, but the initial implementation will cost time and (most of the time) money, which is why you should consider in advance whether a simpler solution (such as a single but powerful machine) would suffice before implementing a more complex infrastructure in your analysis.

@@ -516,4 +516,4 @@ In short, the Docker image is rarely the *only* way in which you distribute your
 ```{bash}
 #| echo: false
 rm -f mydb.db mydb.sqlite
-```
\ No newline at end of file
+```
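The first hunk above notes that on a single machine, parallel computing normally runs across multiple cores or processors. A minimal sketch of that idea, using only Python's standard library; the `count_tokens` task and the toy document list are hypothetical placeholders:

```python
# Single-machine parallelism: spread a per-document task across cores.
from multiprocessing import Pool

def count_tokens(document: str) -> int:
    # Hypothetical per-document task: count whitespace-separated tokens.
    return len(document.split())

if __name__ == "__main__":
    documents = ["a first text", "a second, longer text", "more text"]
    # Four worker processes; map() distributes the documents among them.
    with Pool(processes=4) as pool:
        counts = pool.map(count_tokens, documents)
    print(counts)  # [3, 4, 2]
```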
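For the cluster case, the hunk points to the *MapReduce*/RDD approaches of *Apache Hadoop* and *Apache Spark*. Below is a minimal word-count sketch of that map-and-reduce pattern, assuming the `pyspark` package is installed; Spark runs here in local mode, but on an actual cluster the main node would distribute the same operations among the workers. This is only an illustration, not the setup used in @calderon2019distributed.

```python
# MapReduce-style word count on a Spark RDD.
from pyspark.sql import SparkSession

# Local session for illustration; on a cluster, master() would point to
# the cluster manager rather than the local machine.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("wordcount")
         .getOrCreate())
sc = spark.sparkContext

# Toy distributed dataset (RDD); in practice this would be loaded from
# distributed storage such as HDFS or S3.
lines = sc.parallelize(["big data needs big machines",
                        "or many small machines"])

counts = (lines
          .flatMap(lambda line: line.split())  # map: split lines into words
          .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per word

print(counts.collect())  # e.g. [('big', 2), ('machines', 2), ...]
spark.stop()
```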