Posted on: April 26, 2020
The EMNLP program chairs (PCs) recently posted about EMNLP Findings. This is an interesting experiment, and I see where the PCs are coming from, but the details raise some concerns. The initial announcement did not attract my attention; I studied it more carefully only after students expressed concerns about the implications for their publication prospects and for what topics they should study. The students worried about papers on niche topics that currently get into EMNLP, and about how this initiative would influence their reviews. This pushed me to look into it in more depth.
There is a comments section below for discussion. There is also extensive discussion on Twitter (here). The first part of the letter highlights concerns about the Findings post. I then make a concrete suggestion. The quotes below are taken from the PCs’ post as it was on April 26, 2020 at 6pm EST.
Dear EMNLP PCs,
First and foremost, thank you for putting all this energy and thought into original ideas for improving the quality of research and of life in our community. Findings has a lot of potential to address current challenges in the NLP community. However, as it is currently presented, it raises several concerns.
The idea here is to separate the paper ranking process that is used to select papers to accept for the main conference, from the classification of whether a paper has sufficient substance, quality and novelty to warrant publication. It is based on the assumption that there are significant numbers of rejected papers that are of a publishable standard, but for various reasons, could not be accepted into the conference.
Reviewing is a complex and soft process, and it is often hard to tease apart its different aspects. The criteria listed can be interpreted to include much of what reviewing is about, and it is hard to make them binary and separate from the ranking. In fact, the complexity of reviewing is what makes it so hard to outline well-defined guidelines. Can we separate this classification from the ranking without influencing the ranking?
One common requirement is that the reviewers must agree that the paper is well written, makes an original contribution, has sound methodology, and includes appropriate analysis and conclusions. What sets Findings apart from the main conference papers is that there is no requirement for high perceived impact, and accordingly solid work in untrendy areas and other more niche works will be eligible. These requirements are based on the criteria for publication at PLoS One, which you can read about in more detail.
There are solid, well-written papers that do not generate enough “excitement” to get into EMNLP. While this “excitement” criterion is often vague, it is not a flaw of the review process. My understanding: Findings aims to catch these papers (more below).
The post provides four instances (numbered here for easy reference) of papers more appropriate for Findings (than EMNLP):
Beyond simply ranking, certain kinds of papers are more appropriate for appearing in Findings, for instance:
(1) Papers that make a specific contribution to a narrow subfield, and while not of widespread interest, will have an impact on a small community;
(2) Papers that extend the state of the art on a particular focused task, but have few novel insights or Findings of broader applicability to the wider EMNLP community;
(3) Papers that have well-executed, novel experiments and present thorough analyses and Findings, but using methods that are not thought to be sufficiently “novel”; and
(4) Papers that don’t quite fit in EMNLP, but make contributions that are potentially of interest to specific sub-communities.
Instances (2) and (3) fit the solid-but-not-exciting reasoning. Instances (1) and (4) bring narrowness and fit into the mix. This is my main concern.
Beyond simply ranking, certain kinds of papers are more appropriate for appearing in Findings, for instance:
(1) Papers that make a specific contribution to a narrow subfield, and while not of widespread interest, will have an impact on a small community;
What is narrow today will create new fields tomorrow. Trendy does not necessarily (or necessarily does not) mean high quality in research. Signaling that a top-tier venue like EMNLP is not the place for such non-trendy work amounts to encouraging researchers who care about first-tier publications not to work on such topics. It is also extremely hard to define “narrowness”, and bringing it into the mix in such a prominent way alters the reviewing dynamics. Even worse, the effect will be latent: it will make a noisy process worse, and it will ripple to other venues. With several equivalent venues, what fits one fits the others, and what does not fit one is implied not to fit the others. Is this a decision we wish one top-tier venue to make separately from the others in the same professional society?
Beyond simply ranking, certain kinds of papers are more appropriate for appearing in Findings, for instance:
…
(4) Papers that don’t quite fit in EMNLP, but make contributions that are potentially of interest to specific sub-communities.
The fourth instance raises similar concerns. NLP is broad and diverse. I understand which currently-rejected papers this criterion aims at. But how will this phrasing influence the reviewing of papers that currently do get in? How will such publicly published criteria influence reviewers’ behavior? What are we dog-whistling to NLP/EMNLP reviewers? This list of instances risks introducing biases that will be impossible to disentangle from the reviews later on.
Following some initial discussion, the Findings post was updated with an addendum, reflecting concerns raised.
Will papers that meet the criteria for Findings be excluded from the main conference?
No, many such papers make fine additions to the conference. Findings will take those borderline papers which would otherwise have been rejected, but that the PCs have assessed as being solid work. Reviewers will be asked some additional questions (exact wording yet to be decided) to aid the senior PC to make these decisions, and additional guidance will be provided to educate reviewers, and to ensure reviewers don’t use this as an invitation to down-vote work they see as non-trendy, or other counterproductive behaviours.
Will further guidelines help? People are notoriously bad at reading guidelines, and over-committed researchers are even worse. Even if reviewers do read the guidelines, the implicit message will likely influence their judgment of papers. I worry this cannot be avoided, and will be hard to trace. Consider how the arXiv anonymity guidelines pop up in reviews from time to time, even though they are supposed to be enforced centrally by the PCs and should not influence review content. Anonymity complaints are at least easy to identify (although the potentially cascading bias is not) and easy to ask reviewers to disregard. Biases of “fit” and “narrowness” will be much harder to identify.
Will this scheme exacerbate biases in reviewing and paper acceptance?
The reviewing process isn’t perfect, with several inherent biases. See for example, Ken Church’s papers on this topic, and broader issues with reviewing that led to the creation of EMNLP in the first place. We will do our best to mitigate these biases (e.g., overly conservative reviewing biased towards well established tasks, reviewers overly biased towards “trendy” tasks, or their own sub-fields) and ensure that the reviewing process is as fair as possible. This applies equally to papers for the EMNLP conference and Findings. Our primary focus will be on EMNLP conference papers, and only after these decisions are made, will we look at the remaining papers to identify those that warrant acceptance into Findings.
Can we draw clear boundaries between the reviewing criteria of conferences that serve the same community? Yes, the reviewing process is far from perfect. But what is the impact of injecting these “fit” and “narrowness” guidelines into it in such an explicit way?
Concrete suggestions? First: let’s be careful about injecting new considerations into the review process, especially not through roundabout, indirect messaging that will be hard to observe and correct for later. This does not seem necessary for this effort, which is not intended to influence the core conference standards. We can experiment with creating Findings using the current system, without any reviewer-facing modifications. The first tier (~25%) gets in, just like now. From the second tier (the next ~25%), valid research that did not generate enough excitement among reviewers gets the Findings opportunity. No new reviewer guidelines. The decision is made by the ACs and SACs as part of their acceptance recommendations to the PCs: in addition to the acceptance recommendations, the tables sent to the PCs will include a “Findings if not accepted” column. The ACs and SACs can discuss this decision with the reviewers, but only after they converge on the EMNLP acceptance recommendation.
Another interesting question is: why do existing venues not fulfill this need? We have workshops and lower-tier conferences, both of which publish valid research that does not make it into EMNLP, and all conferences and most workshops have proceedings. My reading: researchers prefer to submit to top-tier venues, but might be more willing to have a paper published now, even if rejected from EMNLP, partially so they can move on with their lives to the next research project. This is not a bad idea. However, given the chance to roll the dice again, they will likely go to the next top-tier venue. Under this interpretation, the initiative is rooted more in psychology than in a lack of venues, and Findings will decrease the number of papers rolling between conferences and accumulating reviewing load. This is a great goal. An important question, though: how much of our reviewing load is due to such repeat reviews that keep getting the same decision? This should be relatively easy to estimate for ACL conferences; maybe this number is actually known already.
A final relatively minor note about using the term “journal”:
Should Findings be called a “journal”?
Perhaps there’s a better term for the publication venue, but we opted for “journal” for want of a more appropriate word. Your suggestions are welcome.
Journal is a publication category used for promotion, especially in institutions that do not count conferences for promotion purposes, or count them less. Findings does not use a journal-like reviewing process, in contrast to TACL and CL. Diluting how our community uses this term may hurt researchers who are evaluated on their journal publications.
Finally, I thank the PCs and others involved in this proposal. This reflects hard work and original thinking about solving challenges that make life and research progress harder. Thank you!
Best,
Yoav