Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
julienijs authored May 19, 2024
1 parent c4c7f53 commit 570d674
Showing 1 changed file with 5 additions and 5 deletions.
10 changes: 5 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,21 +35,21 @@ The dataset used in this experiment is based on the the multilingual parallel ED

3 new datasets were derived from the EDGe corpus:
- EDGe_Zipped_Sizes.xlsx: contains the sizes of all the files in EDGe when they are zipped
- morph_zipped_all.xlsx: contains the sizes of all the files in EDGe when they are morphologically distorted and then zipped over 1000 iterations
- synt_zipped_all.xlsx: contains the sizes of all the files in EDGe when they are syntactically distorted and then zipped over 1000 iterations
- EDGe_Morph_Zipped.xlsx: contains the sizes of all the files in EDGe when they are morphologically distorted and then zipped over 1000 iterations
- EDGe_Synt_zipped.xlsx: contains the sizes of all the files in EDGe when they are syntactically distorted and then zipped over 1000 iterations

### Workflow & code
#### Step 1: create a file with all the file sizes of the zipped files in EDGe - EDGe_Zipped_Sizes.xlsx
In order to create EDGe_Zipped_Sizes.xlsx first all the files in the dataset need to be zipped. This is done by running gzip_files.py. The second step is retrieving all the file sizes of the zipped files. This is done by running file_size.py on the newly created zipped files.

#### Step 2: morphological distortion - morph_zipped_all.xlsx
In this step all files are first morphologically distorted and subsequently zipped. Morphological distortion is achieved as described above, by randomly deleting 10% of all characters in the file. For each file this is done 1000 times and each time the size of the file is stored in morph_zipped_all.xlsx. This step requires morphological_distortion_pipeline.py.
In this step all files are first morphologically distorted and subsequently zipped. Morphological distortion is achieved as described above, by randomly deleting 10% of all characters in the file. For each file this is done 1000 times and each time the size of the file is stored in EDGe_Morph_Zipped.xlsx. This step requires morphological_distortion_pipeline.py.

#### Step 3: syntactic distortion - synt_zipped_all.xlsx
In this step all files are first syntactically distorted and subsequently zipped. Syntactic distortion is achieved as described above, by randomly deleting 10% of all words in the file. For each file this is done 1000 times and each time the size of the file is stored in synt_zipped_all.xlsx. This step requires syntactic_distortion_pipeline.py.
In this step all files are first syntactically distorted and subsequently zipped. Syntactic distortion is achieved as described above, by randomly deleting 10% of all words in the file. For each file this is done 1000 times and each time the size of the file is stored in EDGe_Synt_zipped.xlsx. This step requires syntactic_distortion_pipeline.py.

#### Step 4: statistical analysis in R
The statistical analysis of the created datasets (input = EDGe_Zipped_Sizes.xlsx, morph_zipped_all.xlsx and synt_zipped_all.xlsx) is done by running complexity_analysis.R. The script calculates the morphological and syntactic complexity as described above. The output of this script are graphs in .png format.
The statistical analysis of the created datasets (input = EDGe_Zipped_Sizes.xlsx, EDGe_Morph_Zipped.xlsx and EDGe_Synt_zipped.xlsx) is done by running complexity_analysis.R. The script calculates the morphological and syntactic complexity as described above. The output of this script are graphs in .png format.

### Result
![Syntactic vs morphological complexity ratio](https://user-images.githubusercontent.com/107923146/212687027-2c4eaac4-89a9-45b5-b8bf-000191aa7c16.png)
Expand Down

0 comments on commit 570d674

Please sign in to comment.