CON-2329-tackle-gh-issue-nlp-exercise-7 #2360

Merged 1 commit on Dec 19, 2023
21 changes: 14 additions & 7 deletions subjects/ai/nlp/README.md
@@ -196,7 +196,7 @@ Steps:

> Note: A data set is often described as an m x n matrix, where m is the number of rows (samples) and n is the number of columns (features). It is strongly recommended to work with m >> n. The value of this ratio depends on the signal present in the data set and on the model complexity.

2. Using `from_spmatrix` from Pandas, create a DataFrame `count_vectorized_df` with the documents in rows and the output feature names as column names. The final result should be similar to the one below; a sketch of steps 1, 2 and 5 is given after the list of steps.

| | and | boat | compute |
| --: | --: | ---: | ------: |
@@ -206,16 +206,23 @@ Steps:

> Note: The 3x3 table above is only a small excerpt of the expected output, shown for demonstration purposes. It is not necessary to drop any columns in this context.

3. Show the token counts (obtained in the previous steps) of the fourth tweet.

4. Using the word counts, show the 15 most used tokenized words in the dataset's tweets.

5. Add to your `count_vectorized_df` a `label` column considering the following:
- 1: Positive
- 0: Neutral
- -1: Negative

The final DataFrame should be similar to the one below:

| | ... | label |
|---:|-------:|--------:|
| 0 | ... | 1 |
| 1 | ... | -1 |
| 2 | ... | -1 |
| 3 | ... | -1 |
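
A minimal sketch of steps 1, 2 and 5, assuming the preprocessed tweets and their sentiment labels are available as the hypothetical variables `processed_tweets` and `tweet_labels` (the real data set is of course much larger):

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Hypothetical placeholders: in the exercise these come from the preprocessing steps.
processed_tweets = [
    "tomorrow go boat",
    "cant deal end talk",
    "keep say like may",
    "game saturday night",
]
tweet_labels = [1, -1, -1, 0]

# Step 1: count vectorize the corpus, keeping the 500 most frequent tokens.
vectorizer = CountVectorizer(max_features=500)
sparse_counts = vectorizer.fit_transform(processed_tweets)

# Step 2: sparse DataFrame with one row per tweet and one column per token.
# (`get_feature_names_out` requires scikit-learn >= 1.0; older versions use `get_feature_names`.)
count_vectorized_df = pd.DataFrame.sparse.from_spmatrix(
    sparse_counts, columns=vectorizer.get_feature_names_out()
)

# Step 5: append the sentiment labels (1 positive, 0 neutral, -1 negative).
count_vectorized_df["label"] = tweet_labels
```

Steps 3 and 4 are then plain selections and aggregations on `count_vectorized_df`; possible one-liners are shown next to the corresponding questions in the audit file below.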

_Resources: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)_
78 changes: 59 additions & 19 deletions subjects/ai/nlp/audit/README.md
@@ -183,26 +183,66 @@ Remove this from the sentence

##### The exercise is validated if all questions of the exercise are validated

###### For question 1, is the output of the `CountVectorizer` the following?

```
<6588x500 sparse matrix of type '<class 'numpy.int64'>'
with 37334 stored elements in Compressed Sparse Row format>
```
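
The representation above is just the repr of the sparse matrix returned by `fit_transform`. Assuming the hypothetical `sparse_counts` from the sketch in the exercise file, it can be displayed as below (the exact wording of the repr depends on the installed scipy version):

```python
# sparse_counts is the CSR matrix returned by CountVectorizer(max_features=500).fit_transform(...)
print(repr(sparse_counts))
```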

###### For question 2, is the output of `print(count_vectorized_df.iloc[:3,400:403].to_markdown())` the following?

```python
| | someth | son | song |
|---:|---------:|------:|-------:|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
```

###### For question 3, does the output match the following one?

```python
cant 1
deal 1
end 1
find 1
keep 1
like 1
may 1
say 1
talk 1
Name: 3, dtype: Sparse[int64, 0]
```
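
One possible way to produce this view, reusing the hypothetical `count_vectorized_df` from the exercise sketch before the `label` column is appended; only the non-zero entries of the row are kept:

```python
# Token counts of the fourth tweet (row index 3), restricted to tokens that actually occur in it.
fourth_tweet = count_vectorized_df.iloc[3]
print(fourth_tweet[fourth_tweet > 0])
```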

###### For question 4, does the output match the following one?

```python
tomorrow 1126
go 733
day 667
night 641
may 533
tonight 501
see 439
time 429
im 422
get 398
today 389
game 382
saturday 379
friday 375
sunday 368
dtype: int64
```
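
These totals can be reproduced, for example, by summing the token columns of the hypothetical `count_vectorized_df` (a `collections.Counter` over the tokenized tweets gives the same ranking); the `label` column, if already added, must be excluded from the sum:

```python
# Sum each token column over all tweets and keep the 15 largest totals.
token_totals = count_vectorized_df.drop(columns=["label"], errors="ignore").sum(axis=0)
print(token_totals.sort_values(ascending=False).head(15))
```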

###### For question 5, is the output of `print(count_vectorized_df.iloc[350:354,499:501].to_markdown())` the following?

```python
| | your | label |
|----:|-------:|--------:|
| 350 | 0 | 1 |
| 351 | 1 | -1 |
| 352 | 0 | 1 |
| 353 | 0 | 0 |
```
