article: text changes
alexandrehtrb committed Aug 29, 2024
1 parent 9619a72 commit b0734d2
Showing 2 changed files with 16 additions and 16 deletions.
16 changes: 8 additions & 8 deletions src/posts/2024/08/collation-and-encoding-in-databases.md
Expand Up @@ -63,14 +63,14 @@ UTF-8 uses a variable number of bytes, starting at 1 and up to 4 for a char. It'

The choice of encoding directly affects the size of text storage. If most characters lie in the basic Latin range, UTF-8 is better, because it uses 1 byte per character where UTF-16 uses 2; however, for Asian text, UTF-16 is usually the better choice, because most of those characters occupy 2 bytes in UTF-16, instead of 3 in UTF-8.
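
As a rough illustration of the size trade-off described above, the sketch below counts the bytes that two short strings occupy in each encoding. The helper name `encoded_sizes` and the sample strings are assumptions made for this example; they do not come from the article.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes the text occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no 2-byte byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Eight characters from the basic Latin range (0x0000 - 0x007F).
latin = "database"

# Six katakana characters, written as escapes so the code points are explicit.
# All of them fall in the 0x0800 - 0xFFFF range.
japanese = "\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9"

print("latin:   ", encoded_sizes(latin))
print("japanese:", encoded_sizes(japanese))
```

For the Latin string UTF-8 needs half the bytes of UTF-16, while for the katakana string UTF-16 needs two thirds of the bytes of UTF-8.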

The table below shows how a Unicode code point is converted to UTF-8 or UTF-16.

| Unicode range | Example character | Code point, in binary | In UTF-8 | In UTF-16 |
|:-:|:-:|:-:|:-:|:-:|
| 0x0000 - 0x007F | **P** (0x0050) | 01010000 | 01010000 | 00000000 01010000 |
| 0x0080 - 0x07FF | **Ω** (0x03A9) | 00000<span style="color:green">011</span> <span style="color:red">1010</span><span style="color:purple">1001</span> | **110<span style="color:green">011</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1001</span>** | 00000011 10101001 |
| 0x0800 - 0xFFFF | **€** (0x20AC) | <span style="color:blue">0010</span><span style="color:green">0000</span> <span style="color:red">1010</span><span style="color:purple">1100</span> | **1110<span style="color:blue">0010</span> 10<span style="color:green">0000</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1100</span>** | 00100000 10101100 |
| 0x010000 - 0x10FFFF | 🐎 (0x1F40E) | 000<span style="color:SeaGreen">0</span><span style="color:sienna">0001</span> <span style="color:blue">1111</span><span style="color:green">0100</span> <span style="color:red">0000</span><span style="color:purple">1110</span> | **11110<span style="color:SeaGreen">0</span><span style="color:sienna">00</span> 10<span style="color:sienna">01</span><span style="color:blue">1111</span> 10<span style="color:green">0100</span><span style="color:red">00</span> 10<span style="color:red">00</span><span style="color:purple">1110</span>** | **110110<span style="color:mediumvioletred">00</span> <span style="color:mediumvioletred">00</span><span style="color:blue">1111</span><span style="color:green">01</span> 110111<span style="color:green">00</span> <span style="color:red">0000</span><span style="color:purple">1110</span>** |
The table below shows how a Unicode code point is converted to UTF-8 or UTF-16, for each range above.

| Example character | Code point, in binary | In UTF-8 | In UTF-16 |
|:-:|:-:|:-:|:-:|
| **P** (0x0050) | 01010000 | 01010000 | 00000000 01010000 |
| **Ω** (0x03A9) | 00000<span style="color:green">011</span> <span style="color:red">1010</span><span style="color:purple">1001</span> | **110<span style="color:green">011</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1001</span>** | 00000011 10101001 |
| **€** (0x20AC) | <span style="color:blue">0010</span><span style="color:green">0000</span> <span style="color:red">1010</span><span style="color:purple">1100</span> | **1110<span style="color:blue">0010</span> 10<span style="color:green">0000</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1100</span>** | 00100000 10101100 |
| 🐎 (0x1F40E) | 000<span style="color:SeaGreen">0</span><span style="color:sienna">0001</span> <span style="color:blue">1111</span><span style="color:green">0100</span> <span style="color:red">0000</span><span style="color:purple">1110</span> | **11110<span style="color:SeaGreen">0</span><span style="color:sienna">00</span> 10<span style="color:sienna">01</span><span style="color:blue">1111</span> 10<span style="color:green">0100</span><span style="color:red">00</span> 10<span style="color:red">00</span><span style="color:purple">1110</span>** | **110110<span style="color:mediumvioletred">00</span> <span style="color:mediumvioletred">00</span><span style="color:blue">1111</span><span style="color:green">01</span> 110111<span style="color:green">00</span> <span style="color:red">0000</span><span style="color:purple">1110</span>** |
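
The rows of the table can be checked against a real encoder. The sketch below is only an illustration, with a hypothetical helper `to_bits`: it encodes the four example characters with Python's built-in codecs and prints each byte as 8 binary digits. Big-endian UTF-16 is assumed so that the bytes appear in the same order as in the table.

```python
def to_bits(data: bytes) -> str:
    """Format each byte as 8 binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


# P, Ω, € and 🐎, written as escapes so the code points are explicit.
samples = ["\u0050", "\u03a9", "\u20ac", "\U0001f40e"]

for ch in samples:
    utf8 = to_bits(ch.encode("utf-8"))
    # "utf-16-be" writes the most significant byte first and adds no
    # byte-order mark.
    utf16 = to_bits(ch.encode("utf-16-be"))
    print(f"0x{ord(ch):04X}  UTF-8: {utf8}  UTF-16: {utf16}")
```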

The logic for UTF-16 code points at or above 0x010000 is:
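
The article's own steps are collapsed in this diff view and are not reproduced here. As a separate sketch of the standard UTF-16 surrogate-pair computation (the function name `utf16_surrogates` is hypothetical), the calculation for a code point in the 0x010000 - 0x10FFFF range looks roughly like this:

```python
def utf16_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the 0x010000 - 0x10FFFF range")
    offset = code_point - 0x10000      # leaves a value of at most 20 bits
    high = 0xD800 | (offset >> 10)     # prefix 110110 + upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # prefix 110111 + lower 10 bits
    return high, low


high, low = utf16_surrogates(0x1F40E)  # the horse emoji used as an example
print(f"{high:016b} {low:016b}")
```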

16 changes: 8 additions & 8 deletions src/posts/2024/08/collation-e-encoding-em-bancos-de-dados.md
Expand Up @@ -63,14 +63,14 @@ O UTF-8 usa uma quantidade variável de bytes, de 1 a 4 por caractér. É o prin

The choice of encoding directly affects the space used to store text. If most characters are in the basic Latin range, UTF-8 is better, because it uses fewer bytes than UTF-16; but for Asian scripts, UTF-16 has the advantage, because each character takes 2 bytes, against 3 in UTF-8.

The table below shows how a Unicode code point is converted to UTF-8 or UTF-16.

| Unicode range | Example character | Code point, in binary | In UTF-8 | In UTF-16 |
|:-:|:-:|:-:|:-:|:-:|
| 0x0000 - 0x007F | **P** (0x0050) | 01010000 | 01010000 | 00000000 01010000 |
| 0x0080 - 0x07FF | **Ω** (0x03A9) | 00000<span style="color:green">011</span> <span style="color:red">1010</span><span style="color:purple">1001</span> | **110<span style="color:green">011</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1001</span>** | 00000011 10101001 |
| 0x0800 - 0xFFFF | **€** (0x20AC) | <span style="color:blue">0010</span><span style="color:green">0000</span> <span style="color:red">1010</span><span style="color:purple">1100</span> | **1110<span style="color:blue">0010</span> 10<span style="color:green">0000</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1100</span>** | 00100000 10101100 |
| 0x010000 - 0x10FFFF | 🐎 (0x1F40E) | 000<span style="color:SeaGreen">0</span><span style="color:sienna">0001</span> <span style="color:blue">1111</span><span style="color:green">0100</span> <span style="color:red">0000</span><span style="color:purple">1110</span> | **11110<span style="color:SeaGreen">0</span><span style="color:sienna">00</span> 10<span style="color:sienna">01</span><span style="color:blue">1111</span> 10<span style="color:green">0100</span><span style="color:red">00</span> 10<span style="color:red">00</span><span style="color:purple">1110</span>** | **110110<span style="color:mediumvioletred">00</span> <span style="color:mediumvioletred">00</span><span style="color:blue">1111</span><span style="color:green">01</span> 110111<span style="color:green">00</span> <span style="color:red">0000</span><span style="color:purple">1110</span>** |
The table below shows how a Unicode code point is converted to UTF-8 or UTF-16, for each range above.

| Example character | Code point, in binary | In UTF-8 | In UTF-16 |
|:-:|:-:|:-:|:-:|
| **P** (0x0050) | 01010000 | 01010000 | 00000000 01010000 |
| **Ω** (0x03A9) | 00000<span style="color:green">011</span> <span style="color:red">1010</span><span style="color:purple">1001</span> | **110<span style="color:green">011</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1001</span>** | 00000011 10101001 |
| **€** (0x20AC) | <span style="color:blue">0010</span><span style="color:green">0000</span> <span style="color:red">1010</span><span style="color:purple">1100</span> | **1110<span style="color:blue">0010</span> 10<span style="color:green">0000</span><span style="color:red">10</span> 10<span style="color:red">10</span><span style="color:purple">1100</span>** | 00100000 10101100 |
| 🐎 (0x1F40E) | 000<span style="color:SeaGreen">0</span><span style="color:sienna">0001</span> <span style="color:blue">1111</span><span style="color:green">0100</span> <span style="color:red">0000</span><span style="color:purple">1110</span> | **11110<span style="color:SeaGreen">0</span><span style="color:sienna">00</span> 10<span style="color:sienna">01</span><span style="color:blue">1111</span> 10<span style="color:green">0100</span><span style="color:red">00</span> 10<span style="color:red">00</span><span style="color:purple">1110</span>** | **110110<span style="color:mediumvioletred">00</span> <span style="color:mediumvioletred">00</span><span style="color:blue">1111</span><span style="color:green">01</span> 110111<span style="color:green">00</span> <span style="color:red">0000</span><span style="color:purple">1110</span>** |

The UTF-16 logic for code points at or above 0x010000 is:
