Commit

Add files via upload
Added 5 PORTULAN corpora.
jakoble authored Nov 5, 2024
1 parent 57d191b commit e921ea0
Showing 5 changed files with 75 additions and 0 deletions.
15 changes: 15 additions & 0 deletions corpora/cmc-corpora/askit-dataset.json
@@ -0,0 +1,15 @@
{
"Name": "askIT Dataset",
"URL": "https://hdl.handle.net/21.11129/0000-000D-F8BD-7",
"Family": "Computer-mediated communication corpora",
"Description": "This is a corpus of dialogues automatically extracted from subreddits related to the Information Technology domain.\nThe dialogues were extracted with the <a href=\"https://hdl.handle.net/21.11129/0000-000D-F898-0\">Reddit Dataset Extraction Tool</a>.\nThe corpus is available from PORTULAN.",
"Language": ["eng"],
"Licence": "CC BY",
"Size": ["180,000 texts", "61.9 million tokens"],
"Annotation": [""],
"Infrastructure": "CLARIN",
"Access": {
"Download": "https://hdl.handle.net/21.11129/0000-000D-F8BD-7"
},
"Publication":""
}
15 changes: 15 additions & 0 deletions corpora/cmc-corpora/brandsbr.json
@@ -0,0 +1,15 @@
{
"Name": "Brands.Br – a Portuguese Reviews Corpus",
"URL": "https://hdl.handle.net/21.11129/0000-000D-FE57-4",
"Family": "Computer-mediated communication corpora",
"Description": "This is a corpus of product reviews.\nThe subjects of the reviews were semi-automatically classified.\nThe corpus is available from PORTULAN.",
"Language": ["por"],
"Licence": "CC BY-NC-ND",
"Size": ["252 entries"],
"Annotation": [""],
"Infrastructure": "CLARIN",
"Access": {
"Download": "https://hdl.handle.net/21.11129/0000-000D-FE57-4"
},
"Publication":""
}
15 changes: 15 additions & 0 deletions corpora/cmc-corpora/feup.json
@@ -0,0 +1,15 @@
{
"Name": "FEUP Tweets",
"URL": "https://hdl.handle.net/21.11129/0000-000D-F8C1-1",
"Family": "Computer-mediated communication corpora",
"Description": "This is a corpus of tweets.\nThe corpus is available from PORTULAN.",
"Language": ["eng"],
"Licence": "MS NC-NoReD-ND",
"Size": ["338 million texts"],
"Annotation": [""],
"Infrastructure": "CLARIN",
"Access": {
"Download": "https://hdl.handle.net/21.11129/0000-000D-F8C1-1"
},
"Publication":""
}
15 changes: 15 additions & 0 deletions corpora/cmc-corpora/georeferenced-tweets.json
@@ -0,0 +1,15 @@
{
"Name": "Georeferenced Tweets",
"URL": "https://hdl.handle.net/21.11129/0000-000D-F8C4-E",
"Family": "Computer-mediated communication corpora",
"Description": "This is a corpus of tweets annotated with geographic coordinates.\nThe corpus is available from PORTULAN.",
"Language": ["eng"],
"Licence": "MS NC-NoReD-ND",
"Size": ["26 million texts"],
"Annotation": [""],
"Infrastructure": "CLARIN",
"Access": {
"Download": "https://hdl.handle.net/21.11129/0000-000D-F8C4-E"
},
"Publication":""
}
15 changes: 15 additions & 0 deletions corpora/cmc-corpora/redditpt-dataset.json
@@ -0,0 +1,15 @@
{
"Name": "RedditPT Dataset",
"URL": "https://hdl.handle.net/21.11129/0000-000D-F8BC-8",
"Family": "Computer-mediated communication corpora",
"Description": "This corpus collects dialogues extracted from the <a href=\"https://www.reddit.com/r/portugal/\">Portugal subreddit</a>.\nThe extraction was done with the <a href=\"https://hdl.handle.net/21.11129/0000-000D-F898-0\">Reddit Dataset Extraction Tool</a>.\nThe corpus is available from PORTULAN.",
"Language": ["por"],
"Licence": "CC BY",
"Size": ["218,500 dialogues", "58.9 million tokens"],
"Annotation": [""],
"Infrastructure": "CLARIN",
"Access": {
"Download": "https://hdl.handle.net/21.11129/0000-000D-F8BC-8"
},
"Publication":""
}
