

Update chapter15.qmd
minor correction of cross reference
carlosarcila authored Dec 4, 2023
1 parent d143d8b commit 84aeef5
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions content/chapter15.qmd
@@ -346,7 +346,7 @@ for instance using *letsencrypt*.

Finally, option 4 must be selected when the scale of your data and the complexity of your tasks cannot be handled by a single server or virtual machine. For example, building a classification model by training a complex and deep convolutional neural network with millions of images and updating this model constantly may require using several computers at the same time. In modern computers with multiple cores or processors you can already run parallel computations within a single machine. But when working at scale you will probably need to set up an infrastructure of several computers, such as a *grid* or a *cluster*.
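The within-machine case mentioned above can be sketched with Python's standard-library `multiprocessing` module. This is a minimal illustration only; the `square` task, the worker count, and the input range are invented for the example:

```python
# Single-machine parallelism: distribute a task over a pool of
# worker processes, one per core (here fixed at 4 for illustration).
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map() splits the inputs among the workers and gathers the results
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The same pattern applies to heavier tasks (e.g. preprocessing batches of documents or images), as long as each unit of work is independent of the others.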

Cloud services (e.g. AWS, Microsoft Azure, etc.) or scientific infrastructures (e.g. supercomputers) offer the possibility to set up these architectures remotely. For instance, in a computer cluster you can configure a group of virtual computers, where one will act as the *main* node and the others as *workers*. With this logic the main node can distribute the storage and analysis of data among the workers and then collect the results: see for example the *MapReduce* or the *Resilient Distributed Dataset* (RDD) approaches used by the open-source software *Apache Hadoop* and *Apache Spark*, respectively. For a specific example of parallel computing in the computational analysis of communication you can take a look at the implementation of distributed supervised sentiment analysis, in which one of the authors of this book deployed supervised text classification in *Apache Spark* and connected this infrastructure with real-time analysis of tweets using *Apache Kafka* in order to perform streaming analytics@calderon2019distributed.
Cloud services (e.g. AWS, Microsoft Azure, etc.) or scientific infrastructures (e.g. supercomputers) offer the possibility to set up these architectures remotely. For instance, in a computer cluster you can configure a group of virtual computers, where one will act as the *main* node and the others as *workers*. With this logic the main node can distribute the storage and analysis of data among the workers and then collect the results: see for example the *MapReduce* or the *Resilient Distributed Dataset* (RDD) approaches used by the open-source software *Apache Hadoop* and *Apache Spark*, respectively. For a specific example of parallel computing in the computational analysis of communication you can take a look at the implementation of distributed supervised sentiment analysis, in which one of the authors of this book deployed supervised text classification in *Apache Spark* and connected this infrastructure with real-time analysis of tweets using *Apache Kafka* in order to perform streaming analytics (@calderon2019distributed).
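The MapReduce logic mentioned above can be illustrated with a toy word count in plain Python. This is a conceptual sketch only, not actual Hadoop or Spark code; the shards and documents are invented:

```python
# Toy MapReduce word count: each "worker" maps its shard of documents
# to (word, 1) pairs, then the "main" node reduces the pairs into counts.
from collections import defaultdict

def map_phase(shard):
    # emit a (word, 1) pair for every word in every document of the shard
    return [(word, 1) for doc in shard for word in doc.split()]

def reduce_phase(pairs):
    # group the pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# two shards, as if stored on two different worker nodes
shards = [["big data big"], ["data analysis"]]
pairs = [pair for shard in shards for pair in map_phase(shard)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'analysis': 1}
```

In a real cluster the map phase would run on each worker in parallel and the framework would shuffle the pairs between nodes before reducing; here both phases run sequentially on one machine, but the division of labor is the same.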

These architectures for parallel processing will significantly increase your computing capacity for big data problems, but the initial implementation will cost time and (usually) money, which is why you should consider in advance whether a simpler solution (such as a single but powerful machine) would suffice before implementing a more complex infrastructure for your analysis.

@@ -516,4 +516,4 @@ In short, the Docker image is rarely the *only* way in which you distribute your
```{bash}
#| echo: false
rm -f mydb.db mydb.sqlite
```
