Merge pull request #14 from tmck-code/utf-8-sig
Utf 8 sig
tmck-code authored Mar 10, 2024
2 parents 19689fd + 4503324 commit 46256ed
Showing 2 changed files with 39 additions and 0 deletions.
5 changes: 5 additions & 0 deletions README.md
@@ -16,6 +16,10 @@ my blog

> _The configs that I use for CS2 surf, currently a WIP_
### [20230919 Parsing BOMs in Python](articles/20230919_parsing_boms_in_python/20230919_parsing_boms_in_python.md)

> _How to detect/read/write UTF 8/16 BOMs_
### [20230704 Jupyter Cell Wrappers](articles/20230704_jupyter_cell_wrappers/20230704_jupyter_cell_wrappers.md)

> _Adding decorator-style functionality to jupyter cells_
@@ -81,3 +85,4 @@ my blog




@@ -0,0 +1,34 @@
# 20230919 Parsing BOMs in Python

```python
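import csv, codecs

# Map each codec name to the BOM byte sequences that identify it.
CODECS = {
    "utf-8-sig": [codecs.BOM_UTF8],
    "utf-16": [
        codecs.BOM_UTF16,
        codecs.BOM_UTF16_BE,
        codecs.BOM_UTF16_LE,
    ],
}

def detect_encoding(fpath):
    # The longest BOM checked here (UTF-8) is 3 bytes, so reading 3 bytes is enough.
    with open(fpath, 'rb') as istream:
        data = istream.read(3)
    for encoding, boms in CODECS.items():
        if any(data.startswith(bom) for bom in boms):
            return encoding
    # No known BOM found: fall back to plain UTF-8.
    return 'utf-8'

def read(fpath):
    # newline='' lets the csv module handle line endings itself.
    with open(fpath, 'r', encoding=detect_encoding(fpath), newline='') as istream:
        yield from csv.DictReader(istream)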
```
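The detection step matters because a UTF-8 BOM decoded as plain `utf-8` is not stripped; it survives as a `\ufeff` character glued to the first header name. The following is a minimal sketch of that effect, using a made-up two-column CSV held in memory (the `id`/`name` columns are assumptions for illustration, not from a real file):

```python
import codecs
import csv
import io

# Hypothetical sample data: a tiny CSV prefixed with the UTF-8 BOM.
raw = codecs.BOM_UTF8 + "id,name\r\n1,a\r\n".encode("utf-8")

# Decoding as plain utf-8 keeps the BOM as a '\ufeff' character.
plain_rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
# Decoding as utf-8-sig strips the BOM before the csv module sees it.
sig_rows = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))

plain_keys = list(plain_rows[0].keys())
sig_keys = list(sig_rows[0].keys())

print(plain_keys)  # the first key carries the BOM character
print(sig_keys)    # clean header names
```

With the plain decoder, a lookup like `row['id']` would raise `KeyError`, which is the kind of failure that choosing `utf-8-sig` up front avoids.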

```python
# run here
for i, row in enumerate(read('test.csv')):
    print(i, row)
    if i > 10:
        break
```
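The README entry also mentions writing BOMs, which the snippets above do not cover. One possible sketch, assuming the same codec names are used for output: Python's `utf-8-sig` and `utf-16` codecs emit a BOM automatically when a file is opened for writing. The `write` helper and the sample rows are hypothetical and not part of the original article.

```python
import codecs
import csv
import os
import tempfile

def write(fpath, rows, encoding="utf-8-sig"):
    """Hypothetical helper: write dict rows as CSV using a BOM-emitting codec."""
    rows = list(rows)
    with open(fpath, "w", encoding=encoding, newline="") as ostream:
        writer = csv.DictWriter(ostream, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

rows = [{"id": "1", "name": "a"}]
heads = {}      # first bytes of each written file, keyed by codec name
rows_back = {}  # rows read back with the same codec

with tempfile.TemporaryDirectory() as tmpdir:
    for encoding in ("utf-8-sig", "utf-16"):
        fpath = os.path.join(tmpdir, f"test_{encoding}.csv")
        write(fpath, rows, encoding=encoding)
        with open(fpath, "rb") as istream:
            heads[encoding] = istream.read(3)
        with open(fpath, "r", encoding=encoding, newline="") as istream:
            rows_back[encoding] = list(csv.DictReader(istream))

# The utf-16 codec writes its BOM in the platform's native byte order.
print(heads["utf-8-sig"] == codecs.BOM_UTF8)
print(heads["utf-16"][:2] == codecs.BOM_UTF16)
```

Files written this way start with one of the byte sequences listed in `CODECS`, so they should be picked up by `detect_encoding` when read back.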
