title
ana-vranic committed Nov 20, 2023
1 parent 7ce2135 commit 0fab198
Showing 15 changed files with 698 additions and 955 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -1,4 +1,4 @@
# Introduction
# Big Data analysis with Map-Reduce

In data science, we often deal with large amounts of data. In those cases, many standard approaches won't work as expected; to process big data, we need a different technique called MapReduce. The main problem when dealing with big data is that the data size is so large that filesystem access times become the dominant factor in execution time, so it is not efficient to process big data on standard MPI cluster machines. Distributed computing solutions such as Hadoop and Spark clusters, which rely on the MapReduce approach, process big volumes of data by dividing the work into independent tasks and performing them in parallel. The MapReduce approach was first formalized by Google in the paper [MapReduce: Simplified Data Processing on Large Clusters](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf), after they encountered the problem of indexing all pages on the WWW.
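The essence of the model fits in a few lines of plain Python. The sketch below is illustrative only (made-up names and data, no Hadoop involved); it shows the map, shuffle, and reduce phases for counting words:

```python
from collections import defaultdict

# Map phase: each input record is turned into (key, value) pairs independently.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

# Shuffle phase: pairs are grouped by key (the framework does this
# between the map and reduce phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are combined independently.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big clusters", "map reduce"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 2, 'data': 1, 'clusters': 1, 'map': 1, 'reduce': 1}
```

Because every map call and every reduce call is independent, each phase can be spread across many machines; that independence is what the rest of this tutorial exploits.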

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -48,8 +48,9 @@ plugins:
- with-pdf:
custom_template_path: custom_pdf
author: Ana Vranic <br/> Scientific Computing Laboratory, Institute of Physics Belgrade, Serbia, <br/> Social Physics and Complexity Laboratory, LIP, Portugal
cover_title: Big Data analysis with Map-Reduce

cover_subtitle: DSC 2023, Belgrade
cover_subtitle: DSC 2023, November, Belgrade
#render_js: true
#headless_chrome_path: /snap/bin/chromium

19 changes: 19 additions & 0 deletions site/404.html
@@ -214,6 +214,25 @@

  <li class="md-nav__item">
    <a href="/tutorial/" class="md-nav__link">
      <span class="md-ellipsis">
        Used Datasets and packages
      </span>
    </a>
  </li>

</ul>
</nav>
</div>
Expand Down
23 changes: 21 additions & 2 deletions site/MapReduce/index.html
@@ -281,6 +281,25 @@

  <li class="md-nav__item">
    <a href="../tutorial/" class="md-nav__link">
      <span class="md-ellipsis">
        Used Datasets and packages
      </span>
    </a>
  </li>

</ul>
</nav>
</div>
@@ -343,7 +362,7 @@
<h1 id="hadoop">Hadoop</h1>
<p>Apache Hadoop is an open-source implementation of a distributed MapReduce system. A Hadoop cluster consists of a name node and a number of data nodes. The name node holds the distributed file system metadata and layout and organizes the execution of jobs on the cluster. Data nodes hold the chunks of actual data and execute jobs on their local subset of the data.</p>
<p>Hadoop was developed in Java, and its primary use is through its Java API; however, Hadoop also offers a “streaming” API, which is more general and can work with map-reduce jobs written in any language that can read data from standard input and write data to standard output. In this tutorial, we will provide examples in Python.</p>
<h3 id="python-streaming">Python Streaming</h3>
<h2 id="python-streaming">Python Streaming</h2>
<p>If you prefer languages other than Java, Hadoop offers the streaming API. The term streaming here refers to how Hadoop uses the standard input and output streams of your non-Java mapper and reducer programs to pipe data between them. Relying on stdin and stdout enables easy integration with any other language.</p>
<p>In the first example, we will implement word count.
The mapper is defined as follows:</p>
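The scripts themselves are collapsed in this hunk of the diff; a minimal sketch of a typical Hadoop streaming word-count pair (assuming tab-separated key/value records, with reducer input sorted by key as streaming guarantees) could look like this:

```python
#!/usr/bin/env python3
# mapper.py: read lines from stdin, emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: input arrives sorted by key, so all counts for the same
# word are adjacent and can be summed with a single running total.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Because streaming relies only on stdin and stdout, the pair can be smoke-tested without a cluster: `cat input.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py`.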
@@ -389,7 +408,7 @@ <h3 id="python-streaming">Python Streaming</h3>
-file mapper.py -file reducer.py
The input and output parameters specify the locations of the input and output data on HDFS, the Hadoop Distributed File System. The mapper and reducer parameters specify the mapper and reducer programs, respectively. The -file parameters specify files on the local file system that will be uploaded to Hadoop and made available in the context of the job; here we pass our Python scripts so that they can be executed on the data nodes.</p>
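Only the last line of the command survives in this hunk. For orientation, a complete invocation might look like the sketch below; the streaming-jar path and the HDFS input/output paths are assumptions that depend on the installation, not values taken from this tutorial:

```sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/books \
    -output /user/hadoop/wordcount-out \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py
```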
<h3 id="calculating-elo-ratings">Calculating ELO ratings</h3>
<h2 id="calculating-elo-ratings">Calculating ELO ratings</h2>
<p>We can download the data from the <a href="https://github.com/JeffSackmann/tennis_wta">tennis_wta repository</a>:</p>
```sh
wget "https://github.com/JeffSackmann/tennis_wta/archive/refs/heads/master.zip"
```