Add files via upload
hope-data-science authored Oct 10, 2024
1 parent f916878 commit 68b5bb2
Showing 6 changed files with 59 additions and 25 deletions.
4 changes: 2 additions & 2 deletions docs/index.html
@@ -7,7 +7,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes">

<meta name="author" content="黄天元">
<meta name="dcterms.date" content="2024-07-31">
<meta name="dcterms.date" content="2024-10-10">

<title>实战大数据:基于R语言</title>
<style>
@@ -267,7 +267,7 @@ <h1 class="title">实战大数据:基于R语言</h1>
<div>
<div class="quarto-title-meta-heading">Published</div>
<div class="quarto-title-meta-contents">
<p class="date">July 31, 2024</p>
<p class="date">October 10, 2024</p>
</div>
</div>

32 changes: 26 additions & 6 deletions docs/search.json

Large diffs are not rendered by default.

11 changes: 4 additions & 7 deletions docs/从内存到外存:用数据库管理数据.html
@@ -311,7 +311,7 @@ <h3 data-number="12.2.1" class="anchored" data-anchor-id="基本环境配置"><s
</section>
<section id="数据库的连接" class="level3" data-number="12.2.2">
<h3 data-number="12.2.2" class="anchored" data-anchor-id="数据库的连接"><span class="header-section-number">12.2.2</span> Connecting to the database</h3>
<p>完事开头难,对数据库操作的第一步就是必须让R环境与数据库连接起来。在R中要与数据库连接,一般需要两个包:其一是<strong>DBI</strong>,这个包提供了用于数据库连接、数据传输、执行查询的通用函数;其二是针对用户连接数据库系统的定制包,这些包能够把<strong>DBI</strong>命令转化为特定数据库系统能够解读的命令,比如要使用SQLite就需要<strong>RSQLite</strong>包,使用PostgreSQL就需要使用<strong>PostgreSQL</strong>包。对于咱们的试验来说,需要使用<strong>duckdb</strong>包来完成这个操作,实现方法如下:</p>
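<p>Getting started is always the hardest part: the first step in working with a database is to connect the R environment to it. Connecting to a database from R generally takes two packages. The first is <strong>DBI</strong>, which provides generic functions for connecting to databases, transferring data, and running queries. The second is a backend package tailored to the specific database system, which translates <strong>DBI</strong> commands into commands that system can interpret: SQLite requires the <strong>RSQLite</strong> package, for example, and PostgreSQL requires the <strong>RPostgreSQL</strong> package. For our experiment, we use the <strong>duckdb</strong> package, as follows:</p>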
<div class="cell">
<div class="sourceCode cell-code" id="cb2"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a>con <span class="ot">=</span> <span class="fu">dbConnect</span>(<span class="fu">duckdb</span>())</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
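<p>As a rough illustration of how the generic <strong>DBI</strong> functions sit on top of the backend package, the sketch below writes a table and queries it over an in-memory DuckDB connection. The table name <code>mtcars</code> and the SQL query are assumptions chosen for illustration and are not part of the original text.</p>

```r
# A minimal sketch, assuming the DBI and duckdb packages are installed.
library(DBI)

# duckdb::duckdb() creates the driver; without a file path the database lives in memory.
con <- dbConnect(duckdb::duckdb())

# Copy a built-in data frame into the database as a table (illustrative name).
dbWriteTable(con, "mtcars", mtcars)

# Run a SQL query through DBI; the result comes back as an ordinary R data frame.
res <- dbGetQuery(
  con,
  "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl ORDER BY cyl"
)
print(res)

# Close the connection and shut down the in-memory database.
dbDisconnect(con, shutdown = TRUE)
```

<p>Only the <code>dbConnect</code> call names the backend; the remaining calls are generic <strong>DBI</strong> functions, so the same pattern should carry over to other database systems.</p>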
@@ -670,7 +670,7 @@ <h2 data-number="12.4" class="anchored" data-anchor-id="基于polars的大数据
<span id="cb38-2"><a href="#cb38-2" aria-hidden="true" tabindex="-1"></a><span class="fu">p_load</span>(polars,tidypolars,tidyverse,tidyfst)</span>
<span id="cb38-3"><a href="#cb38-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb38-4"><a href="#cb38-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Scan the data</span></span>
<span id="cb38-5"><a href="#cb38-5" aria-hidden="true" tabindex="-1"></a>pl<span class="sc">$</span><span class="fu">scan_parquet</span>(<span class="st">"df.parquet"</span>) <span class="ot">-&gt;</span> dat_pl</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb38-5"><a href="#cb38-5" aria-hidden="true" tabindex="-1"></a><span class="fu">scan_parquet_polars</span>(<span class="st">"temp/df.parquet"</span>) <span class="ot">-&gt;</span> dat_pl</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
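<p>Note that the operation above does not load the data into the R environment. We use the word “scan” because the call only establishes a connection to the data, much like the <code>open_dataset</code> operation introduced in the previous chapter. We can therefore run all kinds of operations on data that has never been loaded into the environment, and collect only the result into the environment for display, as follows:</p>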
<div class="cell">
@@ -694,11 +694,8 @@ <h2 data-number="12.4" class="anchored" data-anchor-id="基于polars的大数据
<span id="cb39-18"><a href="#cb39-18" aria-hidden="true" tabindex="-1"></a><span class="co"># Inspect the result</span></span>
<span id="cb39-19"><a href="#cb39-19" aria-hidden="true" tabindex="-1"></a>res</span>
<span id="cb39-20"><a href="#cb39-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-21"><a href="#cb39-21" aria-hidden="true" tabindex="-1"></a><span class="co"># Convert the result to an R data frame</span></span>
<span id="cb39-22"><a href="#cb39-22" aria-hidden="true" tabindex="-1"></a>res<span class="sc">$</span><span class="fu">to_data_frame</span>()</span>
<span id="cb39-23"><a href="#cb39-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb39-24"><a href="#cb39-24" aria-hidden="true" tabindex="-1"></a><span class="co"># Convert the result to a data frame and display it as a tibble</span></span>
<span id="cb39-25"><a href="#cb39-25" aria-hidden="true" tabindex="-1"></a>res <span class="sc">%&gt;%</span> <span class="fu">as_tibble</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<span id="cb39-21"><a href="#cb39-21" aria-hidden="true" tabindex="-1"></a><span class="co"># Convert the result to a data frame and display it as a tibble</span></span>
<span id="cb39-22"><a href="#cb39-22" aria-hidden="true" tabindex="-1"></a>res <span class="sc">%&gt;%</span> <span class="fu">as_tibble</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
</div>
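<p>The experiment above shows that once the data is saved in Parquet format and connected to with the <code>scan_parquet</code> method, we can apply the familiar <strong>dplyr</strong> and <strong>tidyr</strong> functions to data that stays on disk. This is a great convenience for big data analysis and one of the best approaches to out-of-memory computation, where the data is too large to fit in memory.</p>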
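<p>The same scan-then-collect pattern can be sketched with the <code>open_dataset</code> function that the text compares it to. The example below deliberately uses the <strong>arrow</strong> package rather than polars; the temporary file and the <code>mtcars</code> data are assumptions made purely for illustration.</p>

```r
# A minimal sketch, assuming the arrow and dplyr packages are installed.
library(arrow)
library(dplyr)

# Save a small data frame as Parquet (a stand-in for a large on-disk file).
path <- tempfile(fileext = ".parquet")
write_parquet(mtcars, path)

# Connect to the file without reading its contents into memory.
ds <- open_dataset(path)

# Describe the query with dplyr verbs; nothing is computed until collect().
res <- ds %>%
  filter(cyl == 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()

print(res)
```

<p>Only the summarised result reaches the R session, which is what keeps memory usage low when the underlying file is large.</p>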
</section>
1 change: 1 addition & 0 deletions docs/参考资料.html
@@ -235,6 +235,7 @@ <h1 class="title">参考资料</h1>
<li><a href="https://spark.posit.co/">R interface to Apache Spark</a></li>
<li><a href="https://pola-rs.github.io/r-polars/vignettes/polars.html">An Introduction to Polars from R</a></li>
<li><a href="https://tidypolars.etiennebacher.com/">tidypolars</a></li>
<li><a href="https://fastverse.github.io/fastverse/">fastverse</a></li>
</ol>


16 changes: 16 additions & 0 deletions docs/快速读写:大数据的导入与导出.html
@@ -245,6 +245,8 @@ <h2 id="toc-title">目录</h2>
<li><a href="#极限压缩" id="toc-极限压缩" class="nav-link" data-scroll-target="#极限压缩"><span class="header-section-number">4.3</span> Maximum compression</a></li>
<li><a href="#通用交流" id="toc-通用交流" class="nav-link" data-scroll-target="#通用交流"><span class="header-section-number">4.4</span> General-purpose exchange</a></li>
<li><a href="#小结" id="toc-小结" class="nav-link" data-scroll-target="#小结"><span class="header-section-number">4.5</span> Summary</a></li>
<li><a href="#练习" id="toc-练习" class="nav-link" data-scroll-target="#练习"><span class="header-section-number">4.6</span> Exercise</a></li>
<li><a href="#参考资料" id="toc-参考资料" class="nav-link" data-scroll-target="#参考资料"><span class="header-section-number">4.7</span> References</a></li>
</ul>
</nav>
</div>
@@ -482,6 +484,20 @@ <h2 data-number="4.4" class="anchored" data-anchor-id="通用交流"><span class
<section id="小结" class="level2" data-number="4.5">
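<h2 data-number="4.5" class="anchored" data-anchor-id="小结"><span class="header-section-number">4.5</span> Summary</h2>
<p>This chapter focused on read/write performance for big data and introduced three factors to consider: (1) read/write speed; (2) memory usage; (3) how widely the file format is supported. In tests on the R platform, fst was the fastest format to read and write, while Parquet was the most storage-efficient. When the data has to be shared, the deciding factor is which file formats your team members can read.</p>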
</section>
<section id="练习" class="level2" data-number="4.6">
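<h2 data-number="4.6" class="anchored" data-anchor-id="练习"><span class="header-section-number">4.6</span> Exercise</h2>
<p>Design an experiment that measures, for datasets of different sizes (no smaller than 100 MB), how much time and space it takes to read and write files in different formats (including but not limited to csv, parquet, qs, and fst). Present the results in charts and tables and state a clear conclusion. As an extension: do the conclusions change when the data is of different types?</p>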
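<p>One possible starting point for such an experiment is sketched below. It uses base R only and compares just csv and rds; the helper <code>measure</code>, the column names, and the small row count are hypothetical choices made to keep the sketch quick to run, so the data is far below the 100 MB the exercise asks for.</p>

```r
# A minimal scaffold: time writing and reading one data frame in two formats.
set.seed(1)
n <- 1e5  # assumption: kept small for speed; scale up for the real exercise
df <- data.frame(
  id = seq_len(n),
  x  = rnorm(n),
  g  = sample(letters, n, replace = TRUE)
)

# Hypothetical helper: returns write time, read time, and file size for one format.
measure <- function(write_fun, read_fun, ext) {
  path <- tempfile(fileext = ext)
  on.exit(unlink(path))  # remove the temporary file when the function exits
  write_sec <- system.time(write_fun(df, path))[["elapsed"]]
  read_sec  <- system.time(read_fun(path))[["elapsed"]]
  data.frame(
    format    = ext,
    write_sec = write_sec,
    read_sec  = read_sec,
    size_mb   = file.size(path) / 1024^2
  )
}

results <- rbind(
  measure(function(d, p) write.csv(d, p, row.names = FALSE), read.csv, ".csv"),
  measure(saveRDS, readRDS, ".rds")
)
print(results)
```

<p>Further formats can be added as extra <code>measure</code> calls with the matching reader and writer, and the resulting table can be passed to a plotting function for the required charts.</p>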
</section>
<section id="参考资料" class="level2" data-number="4.7">
<h2 data-number="4.7" class="anchored" data-anchor-id="参考资料"><span class="header-section-number">4.7</span> References</h2>
<ul>
<li><a href="https://rsangole.netlify.app/posts/2022-09-14_data-read-write-performance/data-read-write-perf" class="uri">https://rsangole.netlify.app/posts/2022-09-14_data-read-write-performance/data-read-write-perf</a></li>
<li><a href="https://tomaztsql.wordpress.com/2022/05/08/comparing-performances-of-csv-to-rds-parquet-and-feather-data-types/" class="uri">https://tomaztsql.wordpress.com/2022/05/08/comparing-performances-of-csv-to-rds-parquet-and-feather-data-types/</a></li>
<li><a href="https://prof-thiagooliveira.netlify.app/post/data-read-write-performance/" class="uri">https://prof-thiagooliveira.netlify.app/post/data-read-write-performance/</a></li>
<li><a href="https://stackoverflow.com/questions/58699848/best-file-type-for-loading-data-in-to-r-speed-wise" class="uri">https://stackoverflow.com/questions/58699848/best-file-type-for-loading-data-in-to-r-speed-wise</a></li>
<li><a href="https://h2oai.github.io/db-benchmark/" class="uri">https://h2oai.github.io/db-benchmark/</a></li>
</ul>


</section>
20 changes: 10 additions & 10 deletions docs/数据处理效能的衡量.html
@@ -300,8 +300,8 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="时间衡量"><span class
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> <span class="at">times =</span> <span class="dv">5</span>) <span class="co"># repeat the run 5 times</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>Unit: milliseconds
expr min lq mean median uq max neval
Sys.sleep(0.1) 108.5984 108.8506 109.6036 109.2285 109.5447 111.796 5</code></pre>
expr min lq mean median uq max neval
Sys.sleep(0.1) 109.6687 110.5331 111.458 110.6252 111.5059 114.9572 5</code></pre>
</div>
</div>
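<p>In the output, the “Unit” line states the time unit used for the measurements (“milliseconds” means the values are in milliseconds), <em>expr</em> is the code that was run, and <em>neval</em> is the number of times it was evaluated. <em>mean</em> and <em>median</em> are the mean and median of the run times across those evaluations, while <em>min</em> and <em>max</em> give the shortest and longest run times. In addition, the <code>microbenchmark</code> function can compare the run times of different pieces of code, as follows:</p>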
@@ -312,8 +312,8 @@ <h2 data-number="3.1" class="anchored" data-anchor-id="时间衡量"><span class
<div class="cell-output cell-output-stdout">
<pre><code>Unit: milliseconds
expr min lq mean median uq max neval
Sys.sleep(0.1) 107.9187 108.5331 108.7826 108.5972 109.3027 109.5615 5
Sys.sleep(0.2) 201.7077 202.2759 205.4304 204.4330 204.9903 213.7449 5</code></pre>
Sys.sleep(0.1) 101.8236 103.7175 108.0753 109.8820 109.9186 115.0347 5
Sys.sleep(0.2) 203.1670 204.1940 206.2796 204.2952 204.7377 215.0043 5</code></pre>
</div>
</div>
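<p>When the timings need to be used in later code rather than just printed, the summary table can be extracted with an explicit unit. The sketch below is one way to do that; the labels <code>short</code> and <code>long</code> and the sleep durations are assumptions chosen for illustration.</p>

```r
# A minimal sketch, assuming the microbenchmark package is installed.
library(microbenchmark)

# Named arguments become the labels in the expr column.
mb <- microbenchmark(
  short = Sys.sleep(0.01),
  long  = Sys.sleep(0.05),
  times = 5
)

# summary() returns a data frame; unit = "ms" reports every timing in milliseconds.
tab <- summary(mb, unit = "ms")
print(tab[, c("expr", "median", "neval")])

# Compare the two medians directly.
ratio <- tab$median[tab$expr == "long"] / tab$median[tab$expr == "short"]
print(ratio)
```

<p>Fixing the unit explicitly avoids misreading the table when <code>microbenchmark</code> picks a different unit automatically for very fast or very slow code.</p>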
</section>
@@ -342,7 +342,7 @@ <h2 data-number="3.2" class="anchored" data-anchor-id="空间衡量"><span class
<div class="cell">
<div class="sourceCode cell-code" id="cb16"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="fu">mem_used</span>()</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
<pre><code>53.5 MB</code></pre>
<pre><code>53.4 MB</code></pre>
</div>
</div>
<p>Sometimes we need to know how much R's memory usage changes after a particular operation. The <code>mem_change</code> function does this:</p>
@@ -352,7 +352,7 @@ <h2 data-number="3.2" class="anchored" data-anchor-id="空间衡量"><span class
<pre><code>-800 kB</code></pre>
</div>
</div>
<p>Note that a negative result means R's memory usage went down; otherwise it went up.</p>
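<p>Note that a negative result means R's memory usage went down; otherwise it went up. In addition, to check how much disk space a file already saved locally takes up, use the <code>file_size</code> function from the <strong>fs</strong> package, passing in the path to the file.</p>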
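<p>A short sketch of checking a file's size on disk follows. The temporary csv file written from <code>mtcars</code> is an assumption used only to have something to measure, and base R's <code>file.size</code> is shown alongside for comparison.</p>

```r
# A minimal sketch, assuming the fs package is installed.
library(fs)

# Write a small file to measure (illustrative data and path).
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

# fs::file_size() returns the size with human-readable formatting when printed.
size <- file_size(path)
print(size)

# Base R reports the same size as a plain number of bytes.
bytes <- file.size(path)
print(bytes)
```

<p>Both calls measure the file on disk, which is separate from the memory an object occupies once it is read into R.</p>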
</section>
<section id="综合衡量" class="level2" data-number="3.3">
<h2 data-number="3.3" class="anchored" data-anchor-id="综合衡量"><span class="header-section-number">3.3</span> 综合衡量</h2>
@@ -364,7 +364,7 @@ <h2 data-number="3.3" class="anchored" data-anchor-id="综合衡量"><span class
<pre><code># A tibble: 1 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
&lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:byt&gt; &lt;dbl&gt;
1 a 2.55ms 2.81ms 349. 781KB 4.08</code></pre>
1 a 2.54ms 2.79ms 352. 781KB 4.09</code></pre>
</div>
</div>
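<p>The code above runs the expression a number of times and then measures its performance. In the result, <em>median</em> is the median time taken across the iterations, <em>mem_alloc</em> is the total amount of memory R allocated while running the expression, and <em>itr/sec</em> tells us how many times per second the expression can be run. The <code>mark</code> function can measure the performance of a single expression and can also compare several expressions. By default it requires all expressions to return the same value, but setting the <em>check</em> argument to FALSE turns this requirement off, as follows:</p>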
@@ -379,9 +379,9 @@ <h2 data-number="3.3" class="anchored" data-anchor-id="综合衡量"><span class
<pre><code># A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
&lt;bch:expr&gt; &lt;bch:tm&gt; &lt;bch:tm&gt; &lt;dbl&gt; &lt;bch:byt&gt; &lt;dbl&gt;
1 a 2.54ms 2.8ms 345. 781.3KB 6.24
2 b 2.55ms 2.79ms 346. 781.3KB 4.09
3 c 5.11ms 5.61ms 175. 1.53MB 6.39</code></pre>
1 a 2.55ms 2.79ms 346. 781.3KB 6.25
2 b 2.54ms 2.8ms 345. 781.3KB 4.08
3 c 5.33ms 5.67ms 173. 1.53MB 6.39</code></pre>
</div>
</div>
<p>The code above uses the <code>mark</code> function to compare the performance of three expressions.</p>
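<p>To make the role of the <em>check</em> argument concrete, the sketch below compares two expressions that return different values, so the comparison only runs with <code>check = FALSE</code>. The labels <code>small</code> and <code>large</code> and the vector sizes are assumptions chosen for illustration and are not the expressions used in the original code.</p>

```r
# A minimal sketch, assuming the bench package is installed.
library(bench)

# The two expressions return vectors of different length, so their results
# cannot be equal; check = FALSE skips the equality test.
res <- mark(
  small = runif(1e3),
  large = runif(1e5),
  check = FALSE
)

# Keep the columns discussed in the text: time, throughput, and memory.
print(res[, c("expression", "median", "itr/sec", "mem_alloc")])
```

<p>Leaving <em>check</em> at its default is a useful safeguard when the expressions are meant to be interchangeable implementations, since an accidental difference in results then stops the benchmark.</p>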
