Thursday, January 19, 2017
Established datasets, proxies, and customized data collection: The case of international LLMs (Michael Simkovic)
How should researchers make tradeoffs between the costs of data collection, the speed of the analysis, the precision of the measurements, reproducibility by other researchers, and broader context about the meaning of the data: how we might compare one group or one course of action to another, how we might understand historical trends, and the like?
Must we always measure the precise group of interest, with zero tolerance for over-inclusion or under-inclusion? Or might one or a series of proxy groups be sufficient, or even preferable for some purposes? What if the proxies have substantial overlap with the groups of interest and biases introduced by use of proxy groups are reasonably well understood? How close must the proxy group be to the group of interest?
These are important questions raised by a group of legal profession researchers which includes several of the principal investigators of the widely used After the JD dataset.
Professors Carole Silver, Ethan Michelson, Robert Nelson, Nancy Reichman, Rebecca Sandefur, and Joyce Sterling (hereinafter, Silver et al.) recently wrote a three-part response (Parts 1, 2, and 3) to my two-part blog post from December about International LLM students who remain in the United States (Part 1) and International LLM students who return to their home countries (Part 2). The bulk of Silver et al.’s critique appears in Part 2 of their post, and focuses mainly on Part 1 of my LLM post.
My post, which I described as “a very preliminarily, quick analysis intended primarily to satisfy my own curiosity,” used U.S. Census data from the American Community Survey and two proxy groups for international LLM (“Master of Laws”) graduates to make inferences about the financial benefits of LLM degrees to international students who remain in the U.S. Silver et al. agree with several of the limitations of this analysis that I noted in paragraphs 5 through 8 of Part 1 of my post. They also note that historically, many LLMs have returned to their home countries, and they argue that the benefits of LLM programs to returning students may be greater than the benefits to those who remain in the United States. (While I am skeptical of this last claim, especially if we focus exclusively on pecuniary benefits, it seems likely that both groups benefit.)
Silver et al. have also helpfully made several additional points about limitations in my proxy approach and ways in which proxies could over-count or under-count foreign LLMs. The most important of these limitations can be addressed with a few modifications to the LLM proxy group approach. Those interested in the technical details are encouraged to read footnote 1 below.
Returning to broader questions about the use of proxy groups, my view is that proxy groups can be helpful and potentially necessary for certain kinds of analysis.
Suppose that we wish to know the temperature in New York’s Central Park before we take a stroll, but we only have temperature readings for LaGuardia and Newark airports. While neither of those proxies will tell us the precise temperature in Central Park, they will usually be sufficiently close that we can ascertain with a reasonable degree of certainty whether we should bring our winter coats, wear sweaters, or proceed with short sleeves. Indeed, readings from Boston or Philadelphia will probably suffice, particularly if we’re aware of the direction and magnitude of typical temperature differences relative to Central Park.
Should we refuse to venture out until we can obtain a temperature reading from Central Park itself?
Perhaps if we need accuracy to within one or two degrees Celsius. Otherwise, the airport readings may be good enough, and the cost and delay required for further data collection may be prohibitive relative to the benefits.
Now suppose that we wish to know when we can pack our winter clothes into storage based on historical seasonal weather patterns. If we have a precise current reading for our location, but only have long term data for adjacent proxies, it may be more sensible for us to focus on the proxy data rather than the data for our current location.
In the context of legal education and the legal profession, there are many advantages to building proxy groups from large, nationally representative government data sets such as the American Community Survey, particularly if one wishes to make comparisons to other groups or other periods of time, and resources are limited. Since many other researchers use these datasets, their properties and any response biases tend to be relatively well understood. Many of these datasets are also updated regularly, and they are carefully administered and weighted to be as representative as possible.
ACS is not the only data set currently available to assess LLM programs. There are other off-the-shelf surveys, more targeted toward immigrants, that may be useful, and which Professor Silver and her co-authors may wish to consult.
Custom data sets can address problems that off-the-shelf data cannot because they can be designed to answer very specific questions. But if such surveys are not designed carefully, they risk losing the broader context that enables the results to be readily interpretable. Thus a data set that only reports on the earnings or other outcomes for “LLM graduates” is less useful for assessing the benefits of such programs than one that also provides the same information for a relevant control group who did not obtain LLMs, but are reasonably similar in important respects that predict outcome variables.
In many cases, results from off-the-shelf and custom data sets can be mutually reinforcing. For example, the results of After the JD III suggested that most law graduates were doing well financially 12 years after graduation, while The Economic Value of a Law Degree suggested that they probably could not have done nearly so well had they entered the labor market with only a bachelor’s degree. Timing Law School suggested that the results of AJD III were not a fluke due to respondents graduating in a good year.
Silver et al.’s interests extend beyond earnings premiums, and they believe that they can advance our understanding of the benefits of LLM programs by building a custom data set. I look forward to their findings.
[1] Perhaps the most important of these points is that foreign-born individuals could include those who immigrated to the United States prior to obtaining their bachelor’s degrees, and therefore do not resemble the typical international LLM graduate. The typical international LLM graduate has obtained a bachelor’s degree outside of the United States and a graduate degree in the United States.
Fortunately, this problem can be readily addressed. ACS includes variables for both year of birth and year of immigration. These variables can be used to exclude those who immigrated to the United States before the age at which they likely completed their bachelor’s degrees (roughly age 22 to 26, depending on the country from which they immigrated).
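For readers who want to see the mechanics, the exclusion described above might be sketched roughly as follows. This is only an illustration, not the code used for the original analysis: the field names (`birth_year`, `immigration_year`, `birth_country`), the country labels, and the age thresholds are hypothetical placeholders, and an actual ACS extract would use its own variable names and codes.

```python
# Minimal sketch: drop foreign-born respondents who arrived before the age at
# which they likely finished a bachelor's degree. All field names, country
# labels, and thresholds below are hypothetical placeholders.

DEFAULT_BACHELORS_AGE = 22  # assumed lower bound of the 22-26 range

# Placeholder thresholds; real values would be set country by country.
BACHELORS_AGE_BY_COUNTRY = {"Country A": 22, "Country B": 26}


def age_at_immigration(record):
    """Approximate age at arrival from year of birth and year of immigration."""
    return record["immigration_year"] - record["birth_year"]


def likely_arrived_after_bachelors(record):
    """True if the respondent arrived at or after the assumed completion age."""
    threshold = BACHELORS_AGE_BY_COUNTRY.get(
        record["birth_country"], DEFAULT_BACHELORS_AGE
    )
    return age_at_immigration(record) >= threshold


def filter_proxy_group(records):
    """Keep only records that plausibly match the typical international LLM path."""
    return [r for r in records if likely_arrived_after_bachelors(r)]


# Synthetic records for illustration only (not ACS data).
sample = [
    {"id": 1, "birth_year": 1980, "immigration_year": 1990, "birth_country": "Country A"},
    {"id": 2, "birth_year": 1980, "immigration_year": 2005, "birth_country": "Country A"},
    {"id": 3, "birth_year": 1980, "immigration_year": 2004, "birth_country": "Country B"},
]
kept_ids = [r["id"] for r in filter_proxy_group(sample)]
```

In this toy sample, the first respondent arrived as a child and the third arrived before the assumed completion age for their country, so only the second is retained. Because year-based subtraction can be off by up to a year, a real analysis might prefer a slightly conservative threshold.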
Silver et al. also object to the exclusion of Hispanics from the analysis because LSAC data suggests that approximately 18 percent of LLMs in recent years come from Central and South America and the Caribbean. While many immigrants from these regions do not typically describe themselves to the Census as Hispanic—for example, those from Brazil, Belize or Trinidad—the objection to excluding Hispanics is reasonable.
Re-running the original analysis with Hispanics included does not change the results very much: earnings for the non-LLM control group and the LLM proxy group both fall a bit, and the implied earnings premium in dollars decreases slightly. (Compare the first proxy with and without Hispanics, or the second proxy with and without Hispanics.)
Silver et al. also argue that my proxy approach could underestimate the benefits of an international LLM, because they believe that international LLMs who remain in the U.S. are very likely to work as lawyers and judges and very unlikely to work as paralegals or legal assistants. It would be simple enough to construct an LLM proxy group that includes only foreign-born lawyers and judges with master’s degrees who immigrated to the United States after the age at which they likely completed their bachelor’s degrees. In combination with the broader proxy groups, this would provide a range for the earnings premium. Frank McIntyre and I have used a similar three-proxy-group approach in our research on the value of a law degree by college major.
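The idea of bounding the premium with several proxy groups can be illustrated with a short sketch. The group names and earnings figures below are synthetic and chosen only to make the arithmetic easy to follow; the simple unweighted difference in means is also an assumption made for brevity, whereas an actual analysis would apply survey weights and control for other characteristics.

```python
from statistics import mean

# Minimal sketch: compute an earnings premium for each proxy group relative to
# a common control group, then report the lowest and highest premium as a range.
# All names and numbers are synthetic illustrations, not ACS results.


def earnings_premium(proxy_earnings, control_earnings):
    """Unweighted difference in mean earnings between proxy and control groups."""
    return mean(proxy_earnings) - mean(control_earnings)


def premium_range(proxy_groups, control_earnings):
    """Return (lowest premium, highest premium, premium by proxy group)."""
    premiums = {
        name: earnings_premium(earnings, control_earnings)
        for name, earnings in proxy_groups.items()
    }
    return min(premiums.values()), max(premiums.values()), premiums


# Hypothetical control group: similar foreign-born graduates without an LLM.
control = [50_000, 60_000, 70_000]

# Hypothetical proxy groups, from broadest to narrowest definition.
proxies = {
    "broad_legal_occupations": [60_000, 80_000],
    "lawyers_and_judges_only": [90_000, 110_000],
}

low, high, by_group = premium_range(proxies, control)
```

If the true LLM population sits somewhere between the broad and narrow definitions, the true premium plausibly falls between `low` and `high`, which is the sense in which multiple proxies yield a range and not a single point estimate.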
Silver et al. also ask how the Census deals with unemployment and occupation. In IPUMS ACS, individuals who are unemployed report their most recent occupation. There is a separate occupation category for those who are seeking their first job and have never worked, and for those who have been unemployed for more than 5 years straight.