Q-Title

Essential Recourses:

Downloads of Q-Title Corpus:

You can download the zip here: https://github.com/YisongMiao/Q-Title/blob/main/question_info.csv.zip
Or a csv file: https://yisong.me/publications/question_info.csv

This README includes:

Codebase Instruction
- Usage: You can replicate the same process for future papers only if they are indexed by dblp or ACL. Please read the codebase.
Corpus Description

1. Codebase Structure

data-curl

Our study includes papers from two major sources: ACL for computational linguistic society, DBLP for other conferences (we choose Top 50 CS conference from [link])
- curl.sh: A bash script that curl webpages (.html) from dblp for top 50 CS conferences in past 30 years.
- We directly download its official full bib file from [link].
data-mining
- html2csv.py: Convert the downloaded html file into an intermediate csv file.
  - Key metadata encoded in csv columns: 'title', 'question', 'venue', 'year'.
- bib2csv.py: Convert the downloaded html bib into an intermediate csv file.
  - Key metadata is the same as above: 'title', 'question', 'venue', 'year'.
- csv2question.py:
  
  This script is more about NLP than data mining :P
  
  In the class of question_mining_process we have following methods that being executed sequentially:
  - question_detection: from 350k+ paper titiles, we detect 4k+ questions that have a question in it.
    
    Input: the original panda dataframe from original csv files.
    
    Output: a new panda dataframe that only contains titles that contain questions.
  - insights_question_detection: generate insights from question_detection method.
    
    Input: the dataframe from question_detection
    
    Output: question frequencies w.r.t. year/venue.
  - question_mining: from 4k+ titles with questions, we extract the questions from the title.
    
    Input: The dataframe from question_detection
    
    Output: A new dataframe with a new column of title
    
    Key technology: spacy's sentence segmentation.
  - question_analyze: Analyze questions.
    
    Input: The dataframe from question_mining
    
    Output: A new dataframe, adding a new column for the type of questions, we now have general questions, special questions, choice questions and disjunctive questions.
  - question_type_per_year_venue
    
    Input: The dataframe from question_analyze.
    
    Output: Question types w.r.t. year/venue.

2. Q-Title: Corpus Analysis

2.1 Corpus Metadata

In this development repo, we don't upload raw data file.

You can download the zip here: https://github.com/YisongMiao/Q-Title/blob/main/question_info.csv.zip

Or a csv file: https://yisong.me/publications/question_info.csv

In the csv file:

Number of Rows (each question in one row): 4181
Number of Columns: 6, and they are 'Question', 'Title', 'Index', 'venue', 'year', 'type'.
- Question: in text format.
- Title: in text format.
- Index: from 0 to 4087, indicating the index of title. This is because some title has more than one questions.
- venue: The venue where the paper was presented. We have ACL events + other
- year: In range (1990, 2020)
- type: With five options: general, special, choice, disjunctive, and unknown.

2.2 Corpus Samples in Q-Title:

Randomly seleted 30 samples.

Question	Title	Index	venue	year	type
Yes we can!?	Yes we can!? Annotating English modal verbs	933	acl-events	2012	unknown
What does the power industry need from the EDA industry and what is the EDA industry doing about it?	Panel: What does the power industry need from the EDA industry and what is the EDA industry doing about it?	3883	date	2011	special
How Much Data Do You Need?	How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR	87	acl-events	2020	special
How Far from Optimal Is Fast and Frugal?.	On the Accuracy of Bounded Rationality: How Far from Optimal Is Fast and Frugal?.	1626	nips	2005	special
Does estimation trump compression?	Learning Markov distributions: Does estimation trump compression?	3786	isit	2016	general
How Does Channel Coding Affect the Design of Uplink SCMA Multidimensional Constellations?	How Does Channel Coding Affect the Design of Uplink SCMA Multidimensional Constellations?	3691	wcnc	2020	special
Do you remember this source code?	Do you remember this source code?	2732	icse	2018	general
How Hard Is the Manipulative Design of Scoring Systems?.	How Hard Is the Manipulative Design of Scoring Systems?.	2691	ijcai	2019	special
Which PKI (public key infrastructure) is the right one?	Which PKI (public key infrastructure) is the right one? (panel session).	2467	ccs	2000	special
Easy as ABC?	Easy as ABC? Facilitating Pictorial Communication via Semantically Enhanced Layout	1106	acl-events	2008	unknown
CM(1, 2) or CM(2, 2)?	Which constant modulus criterion is better for blind adaptive filtering: CM(1, 2) or CM(2, 2)? [equalizer example].	2552	icassp	2005	choice
When must all terminals talk?	Achieving SK capacity in the source model: When must all terminals talk?	3839	isit	2014	special
When Does Diversity of Agent Preferences Improve Outcomes in Selfish Routing?	When Does Diversity of Agent Preferences Improve Outcomes in Selfish Routing?	2658	ijcai	2018	special
Does Vertical Bring more Satisfaction?	Does Vertical Bring more Satisfaction?: Predicting Search Satisfaction in a Heterogeneous Environment.	3272	cikm	2015	general
The Usual Suspects?	The Usual Suspects? Reassessing Blame for VAE Posterior Collapse.	1801	icml	2020	unknown
Where to park?	Where to park? minimizing the expected time to find a parking space.	1854	icra	2015	special
Far but Near or Near but Far?	Far but Near or Near but Far?: The Effects of Perceived Distance on the Relationship between Geographic Dispersion and Perceived Diversity.	2095	chi	2016	choice
; Where are we today?!	Summarization and Evaluation; Where are we today?!	1145	acl-events	2007	unknown
Can it Work?	Cognitive Radio in a Frequency Planned Environment: Can it Work?	3347	globecom	2007	general
Is this app safe?	Is this app safe?: a large scale study on application permissions and risk signals.	2001	www	2012	general
is this even the right question?	Lightweight vs. heavyweight processes: is this even the right question?	2799	icse	2002	general
Is actuation redundancy a good solution for pick-and-place?	Towards 100G with PKM. Is actuation redundancy a good solution for pick-and-place?	1840	icra	2010	general
Support or Oppose?	Support or Oppose? Classifying Positions in Online Debates from Reply Activities and Opinion Expressions	1045	acl-events	2010	choice
Who's Better?	Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination.	1483	cvpr	2018	special
What good are strong specifications?	What good are strong specifications?	2769	icse	2013	special
Do You See What I See?	Do You See What I See? Differential Treatment of Anonymous Users.	3462	ndss	2016	general
Culture or fluency?	Culture or fluency?: unpacking interactions between culture and communication medium.	2308	chi	2011	choice
Expect the Unexpected?	Expect the Unexpected? The Processing of Possibility Hedges in Medical Diagnoses and Medical Advice	433	acl-events	2018	unknown
Fluency, Adequacy, or HTER?	Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric	1074	acl-events	2009	choice
Given Obstacles?	Feasibility: Can Humanoid Robots Overcome Given Obstacles?	1867	icra	2005	unknown

2.3 Insights and Discussions

Please refer to our slides for full insights and discussions. [Slides@Google]

We only show key images here.

Figure 1: Frequency of questions w.r.t. years:

Figure 2: Frequency of questions w.r.t. venues:

Figure 3: Distribution of Question Types.

Figure 4: Distribution of Question Types w.r.t. Years.

Figure 5: Distribution of Question Types w.r.t. Venues.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
data-curl		data-curl
data-mining		data-mining
images		images
.gitignore		.gitignore
README.md		README.md
question_info.csv.zip		question_info.csv.zip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Q-Title

Essential Recourses:

1. Codebase Structure

2. Q-Title: Corpus Analysis

2.1 Corpus Metadata

2.2 Corpus Samples in Q-Title:

2.3 Insights and Discussions

About

Releases

Packages

Languages

YisongMiao/Q-Title

Folders and files

Latest commit

History

Repository files navigation

Q-Title

Essential Recourses:

1. Codebase Structure

2. Q-Title: Corpus Analysis

2.1 Corpus Metadata

2.2 Corpus Samples in Q-Title:

2.3 Insights and Discussions

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages