Using data to address racial disparities

Racial disparities and data science were at the center of a recent Data Science Now webinar that explored how:

Unequal opportunities prevent young people from learning data science and how to reverse the trends.
An economic model that “disciplines data” sheds light on what keeps Black men and women from becoming entrepreneurs.
A University Libraries team developed a text-mining project to identify racially biased legislation signed into North Carolina law from Reconstruction to the Civil Rights Movement.

The Feb. 24 webinar was the second in the Carolina Data Science Now series, through which the Renaissance Computing Institute (RENCI) is helping data science practitioners connect. The webinar showcased the interdisciplinary approach planned for the School of Data Science and Society, which the University will launch in fall 2022.

Jay Aikat, RENCI chief operating officer and research professor of computer science, moderated the webinar. The next webinar is scheduled for March 24.

Underrepresentation in data science

Deen Freelon, associate professor at the Hussman School of Journalism and Media, researches how citizens use social media and digital communication for political purposes, with attention to how identity characteristics influence use.

Underrepresentation in data science, Freelon said, is a pet issue of his and central to his practice. He said that computational science lacks some of the major barriers that keep people out of other technical fields. That’s because major programming languages such as R and Python and associated software packages are free and require no specific operating system. “Many tutorials and examples are free including Stack Exchange, where you can ask questions for free, and your questions get answered.”

But data science is not diverse. “In the industry and perhaps in academia only 15% of data scientists are women. That percentage is lower at higher levels of career advancement,” he said. The numbers by race and ethnicity are worse, with people of Latino heritage accounting for less than 5%, Black people less than 4% and Native Americans less than 0.5%.

Inaccessible resources have not caused the inequality. Instead, he said, contributing problems include:

Unequal K-12 opportunities that keep young people out of the pipeline.
Lack of mentorship and representation, which creates a vicious cycle. “You don’t see anybody like yourself in the field, so you don’t go into it.”
Issues of culture or fit within companies that set hiring trends. “Many companies follow the hiring policies of Facebook, Apple, Amazon, Netflix and Google, hiring mainly from a small number of elite universities.”

Freelon said that little data science research on marginalized communities has involved the people being studied. “The lack of representation creates problems, including elevating technical skills above subject matter expertise, potential cultural misunderstandings in research design and interpretation and contributing to the notion that data science is an othering lens,” he said.

Freelon suggested several ways to increase diversity:

Promoting groups such as Black Girls Code working to diversify tech.
Equipping students with data science skills before they enter graduate school.
Removing barriers to underrepresented individuals.

The entrepreneurship gap

Andrés Hincapié, assistant professor in the College of Arts & Sciences’ economics department, uses applied microeconometrics to study entrepreneurial and health-related choices by individuals.

He broke down the gap in entrepreneurship between white and Black males. Using an economic model with data, he said, allows him to — in the language of economists — discipline the data with theory. “That implies that we can impose data or integrate it in the model of people who are facing budget constraints, that the behavior that people choose reflects their preferences as well as the constraints that they face.”

With the model, he found that, at age 25, a one percentage point gap in self-employment exists between white and Black women. The gap grows over the life cycle and becomes almost four percentage points by age 50.

The gap is wider among men, said Hincapié. At age 25, it’s at two to three percentage points between white and Black males and increases to 10 to 11 percentage points by age 50.

He researches the economic factors causing the gap — differences in wealth, human capital, education, credit constraints, access to loans, profitability and discrimination. “Discrimination can interact with these other mechanisms and potentially cause differences,” he said. “We’re putting data together with an economic model to try to separate some of these economic mechanisms.”

Preliminary findings came from a longitudinal study and the subjects’ data points: occupation, paid worker or self-employed status, income sources, wealth and demographics such as age, education, gender and race. Hincapié found that Black male entrepreneurs:

Get a return on their efforts of about one-third that of white males.
Tend to not use their entrepreneurial skills in paid employment because of the difference in return on investment.
Are limited in entrepreneurial opportunities by their wealth: 26% of Black males have zero or negative wealth, compared with 13% of white males.

Data mining for Jim Crow

Matt Jansen, data analysis librarian in University Libraries, discussed “On the Books: Jim Crow and Algorithms of Resistance,” a data and machine-learning project that built a usable data set of North Carolina laws, including racially based legislation, passed between Reconstruction and the Civil Rights Movement, 1866 to 1967.

“This project started as a question to special collections librarian Sarah Carrier from a K-12 teacher. She needed a list of Jim Crow laws to use in her instructional materials,” he said. The only options at the time, books by Pauli Murray and by Richard Paschal that identify 120 laws, were not digital.

The Digital Research Services team at Davis Library “dug into the problem, started exploring it and called on outside expertise,” he said. Other University Libraries staff, including special collections and the law library, and scholars on and off campus helped.

The project yielded 558 Jim Crow laws from 300,000 possible entries on 800 pages of legislative records.

The team wanted to create a database for human interaction using established text analysis techniques. They began by talking with historians and other subject matter experts, a critical step, Jansen said. The team learned about each era’s legal context and changes in how laws appeared. That knowledge helped them determine what data would be useful for analysis and for public use.

They started with a collection of images of North Carolina session laws, 80,000 pages previously digitized through an Institute of Museum and Library Services grant. They spent most of their time using optical character recognition to convert the page images into text. Bit by bit, they turned the images into a structured data set of individual laws identifiable by chapter, section numbers and attributions.
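The step from recognized page text to individual laws can be illustrated with a minimal Python sketch. It is not the team's actual code: the heading patterns ("CHAPTER 12", "Section 1."), the `Section` record and the `split_into_sections` function are hypothetical names and formats assumed for illustration, and real session laws vary in layout from year to year.

```python
import re
from dataclasses import dataclass

# Assumed heading formats (hypothetical; real volumes differ by era).
CHAPTER_RE = re.compile(r"^CHAPTER\s+(\d+)\.?\s*$", re.IGNORECASE)
SECTION_RE = re.compile(r"^SEC(?:TION)?\.?\s+(\d+)\.\s*(.*)$", re.IGNORECASE)


@dataclass
class Section:
    chapter: int
    section: int
    text: str


def split_into_sections(ocr_text: str) -> list[Section]:
    """Group lines of OCR output into sections keyed by chapter and section number."""
    sections: list[Section] = []
    chapter = None   # chapter number currently being read
    current = None   # section currently collecting text
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        chapter_match = CHAPTER_RE.match(line)
        if chapter_match:
            chapter = int(chapter_match.group(1))
            current = None  # a new chapter closes the previous section
            continue
        section_match = SECTION_RE.match(line)
        if section_match and chapter is not None:
            current = Section(chapter, int(section_match.group(1)),
                              section_match.group(2).strip())
            sections.append(current)
            continue
        if current is not None:
            # Continuation line: append to the section being read.
            current.text = (current.text + " " + line).strip()
    return sections


if __name__ == "__main__":
    # Made-up placeholder text, not a real statute.
    sample = "CHAPTER 12\nSection 1. First part\ncontinues here.\nSec. 2. Second part.\n"
    for law in split_into_sections(sample):
        print(law.chapter, law.section, law.text)
```

A rule-based splitter like this only works when headings survive character recognition intact, which is one reason cleaning the recognized text can take most of a project's time.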

They added other laws from research by William Sturkey, associate professor in the College of Arts & Sciences’ history department, and Kimber Thomas, former Council on Library and Information Resources Postdoctoral Fellow for Data Curation in African American Collections.

Jansen said that the list is not comprehensive. New funding will enable the work to expand through fellowships and research in two other states.

The speakers answered participants’ questions about their experiences and applications of data science. Aikat closed the session with a reminder that Carolina’s data science community can share resources and ideas for webinars and workshops. “We’re eager to hear from you,” Aikat said.

Advice for data science newcomers

Deen Freelon:

Data science is a method like any other. Learn it yourself or find somebody who already knows it.
The work may involve convincing people that data science is relevant and useful in their academic field.

Andrés Hincapié:

Consider using data science to complement, not replace, your research methodologies.
Learn at your own pace.
Data science’s broad scope allows you to address many social sciences topics through data without discarding other methodologies.

Matt Jansen:

Consider building off something that exists to avoid answering every question yourself.
Start with something fun, low stakes or unrelated to your research to reduce the stress of learning.
Focus on one programming language at a time.