feat(nlp): update exercise 7 subject and audit
nprimo committed Dec 18, 2023
1 parent b5230d1 commit 6dd1be6
Showing 2 changed files with 73 additions and 26 deletions.
21 changes: 14 additions & 7 deletions subjects/ai/nlp/README.md
@@ -196,7 +196,7 @@ Steps:

> Note: A data set is often described as an m x n matrix, in which m is the number of rows and n is the number of columns (features). It is strongly recommended to work with m >> n. The right ratio depends on the signal in the data set and on the model complexity.
2. Using `from_spmatrix` from Pandas, create a DataFrame with documents in rows and the dictionary in columns.
2. Using `from_spmatrix` from Pandas, create a DataFrame `count_vectorized_df` with the output feature names as column names. The final result should be similar to the one below.

| | and | boat | compute |
| --: | --: | ---: | ------: |
@@ -206,16 +206,23 @@ Steps:

> Note: The sample 3x3 table mentioned is a small representation of the expected output for demonstration purposes. It's not necessary to drop columns in this context.
3. Create a DataFrame with labels where:
3. Show the token counts (obtained with the above-mentioned steps) of the fourth tweet.

4. Using the word counter, show the 15 most used tokenized words in the dataset's tweets.

5. Add a `label` column to your `count_vectorized_df`, encoded as follows:
- 1: Positive
- 0: Neutral
- -1: Negative

| | Label |
| --: | ----: |
| 0 | -1 |
| 1 | 0 |
| 2 | 1 |
The final DataFrame should be similar to the one below:

| | ... | label |
|---:|-------:|--------:|
| 0 | ... | 1 |
| 1 | ... | -1 |
| 2 | ... | -1 |
| 3 | ... | -1 |

_Resources: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)_
78 changes: 59 additions & 19 deletions subjects/ai/nlp/audit/README.md
@@ -183,26 +183,66 @@ Remove this from the sentence

##### The exercise is validated if all questions of the exercise are validated

###### For question 1, is the output of the CountVectorizer the following?
###### For question 1, is the output of the `CountVectorizer` the following?

```
<6588x500 sparse matrix of type '<class 'numpy.int64'>'
with 79709 stored elements in Compressed Sparse Row format>
with 37334 stored elements in Compressed Sparse Row format>
```

###### For question 2, is the output of `print(count_vectorized_df.iloc[:3,400:403].to_markdown())` the following?

```python
| | someth | son | song |
|---:|---------:|------:|-------:|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
```

###### For question 3, does the output match the following?

```python
cant 1
deal 1
end 1
find 1
keep 1
like 1
may 1
say 1
talk 1
Name: 3, dtype: Sparse[int64, 0]
```
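
A listing of this shape can be obtained by selecting the fourth row by position and keeping only the nonzero entries. The sketch below assumes a small, dense, hypothetical frame `toy_counts` in place of the real count DataFrame, so its `dtype` line will differ from the sparse one shown above.

```python
import pandas as pd

# Hypothetical dense stand-in for the count DataFrame (rows = tweets, columns = tokens).
toy_counts = pd.DataFrame(
    {"cant": [0, 0, 0, 1], "deal": [0, 1, 0, 1], "go": [2, 0, 1, 0]}
)

fourth_row = toy_counts.iloc[3]              # positional index 3 is the fourth tweet
nonzero_tokens = fourth_row[fourth_row > 0]  # keep only tokens that occur in it
print(nonzero_tokens)
```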

###### For question 4, does the output match the following?

```python
tomorrow 1126
go 733
day 667
night 641
may 533
tonight 501
see 439
time 429
im 422
get 398
today 389
game 382
saturday 379
friday 375
sunday 368
dtype: int64
```
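
One plausible way to get such a ranking is to sum each token column and sort the totals in descending order. This is only a sketch on a hypothetical frame `toy_counts`; the subject's "word counter" may refer to a different approach, and a `label` column, if already present, would need to be excluded before summing.

```python
import pandas as pd

# Hypothetical stand-in for the count DataFrame (token columns only, no label).
toy_counts = pd.DataFrame(
    {"tomorrow": [1, 2, 0], "go": [1, 0, 1], "day": [0, 0, 1]}
)

# Total occurrences per token across all rows, largest first; keep the top 2.
top_tokens = toy_counts.sum(axis=0).sort_values(ascending=False).head(2)
print(top_tokens)
```

For the exercise itself, `head(15)` would replace `head(2)`.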

###### For question 5, is the output of `print(count_vectorized_df.iloc[350:354,499:501].to_markdown())` the following?

```python
| | your | label |
|----:|-------:|--------:|
| 350 | 0 | 1 |
| 351 | 1 | -1 |
| 352 | 0 | 1 |
| 353 | 0 | 0 |
```

###### For question 2, is the output of `print(df.iloc[:3,400:403].to_markdown())` the following?

| | talk | team | tell |
|---:|-------:|-------:|-------:|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |

###### For question 3, is the shape of the wordcount DataFrame `(6588, 501)` and the output of `print(df.iloc[300:304,499:501].to_markdown())` the following?

| | youtube | label |
|----:|----------:|--------:|
| 300 | 0 | 0 |
| 301 | 0 | -1 |
| 302 | 1 | 0 |
| 303 | 0 | 1 |
