Promises and pitfalls of artificial intelligence for legal applications

tags
Machine Learning, Law and Technology

Notes

Introduction

NOTER_PAGE: (2 . 0.107744)
NOTER_PAGE: (2 0.5828488372093024 . 0.7122302158273381)
NOTER_PAGE: (2 0.6824127906976744 . 0.5169578622816033)

Information processing

NOTER_PAGE: (2 . 0.887876)

Recent instruction-tuned language models (chatbots) cannot necessarily outperform models fine-tuned on law-specific datasets

NOTER_PAGE: (3 0.2005813953488372 . 0.5498458376156218)
NOTER_PAGE: (3 0.35101744186046513 . 0.7410071942446044)

there is generally a clear correct answer:

NOTER_PAGE: (3 0.404796511627907 . 0.13257965056526208)

there is high observability

NOTER_PAGE: (3 0.5065406976744186 . 0.12846865364850976)

what does a high score on the bar exam mean, and more generally, how much can we trust benchmark evaluations?

NOTER_PAGE: (4 0.4476744186046512 . 0.5508735868448099)

in cases where there might be multiple areas of law implicated, experts might have higher rates of disagreement.

NOTER_PAGE: (4 0.4731104651162791 . 0.11716341212744091)

in adversarial settings, lawyers might disagree on how much context to redact and litigate over the issues.

NOTER_PAGE: (4 0.5428779069767442 . 0.19732785200411102)

Creativity, reasoning, or judgment

NOTER_PAGE: (4 . 0.855398)

Hurdles in evaluating language models

NOTER_PAGE: (4 . 0.573951)
Contamination
NOTER_PAGE: (4 . 0.631387)
not only does it emphasise the wrong thing, it overemphasises precisely the thing that language models are good at.
NOTER_PAGE: (5 0.1972843450479233 . 0.8994350282485876)
The model could correctly answer most Codeforces questions from before its training date cutoff, but could not answer questions after its training date cutoff correctly
NOTER_PAGE: (5 0.21875 . 0.13257965056526208)
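
The before/after-cutoff comparison in the highlight above can be sketched as a simple check: split benchmark items by publication date relative to the model's training cutoff and compare accuracy on each side. This is a minimal sketch, not the method used in the paper. The `EvalItem` type, the `split_accuracy` helper, the cutoff date, and every record below are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class EvalItem:
    published: date  # when the benchmark question was first published
    correct: bool    # whether the model answered it correctly


def accuracy(items: list[EvalItem]) -> Optional[float]:
    """Fraction answered correctly, or None for an empty list."""
    if not items:
        return None
    return sum(item.correct for item in items) / len(items)


def split_accuracy(items: list[EvalItem], cutoff: date):
    """Return (accuracy before cutoff, accuracy on or after cutoff)."""
    before = [item for item in items if item.published < cutoff]
    after = [item for item in items if item.published >= cutoff]
    return accuracy(before), accuracy(after)


# Made-up records for illustration only; not results from the paper.
CUTOFF = date(2021, 9, 1)  # hypothetical training cutoff
ITEMS = [
    EvalItem(date(2021, 3, 1), True),
    EvalItem(date(2021, 5, 1), True),
    EvalItem(date(2021, 6, 1), True),
    EvalItem(date(2021, 8, 1), False),
    EvalItem(date(2021, 10, 1), False),
    EvalItem(date(2022, 1, 1), False),
]

before_acc, after_acc = split_accuracy(ITEMS, CUTOFF)
print(f"before cutoff: {before_acc}, after cutoff: {after_acc}")
```

A large gap between the two numbers is consistent with contamination but does not prove it, since questions published later may also simply be harder.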
Lack of construct validity
NOTER_PAGE: (5 . 0.645062)
tests designed for humans lack construct validity when applied to bots.
NOTER_PAGE: (5 0.6829073482428115 . 0.519774011299435)
The assumption is that humans taking exams generalise the skills tested by the exam to a wider range of relevant tasks.
NOTER_PAGE: (5 0.7611821086261981 . 0.3581920903954802)
Prompt sensitivity
NOTER_PAGE: (5 . 0.873215)
NOTER_PAGE: (6 . 0.818295)
NOTER_PAGE: (6 . 0.144903)
qualitative studies of professionals and how they could use AI are likely to be even more useful, since these tools are so new that we still need consensus on what the right questions to ask are.
NOTER_PAGE: (6 0.5051020408163265 . 0.5888187556357078)
Develop naturalistic evaluation methods
NOTER_PAGE: (6 . 0.644831)
NOTER_PAGE: (6 0.7468112244897959 . 0.17493237150586113)
Communicate the limitations of current LLMs
NOTER_PAGE: (7 . 0.417866)
where easy-to-spot errors in initial filings are prevalent.
NOTER_PAGE: (7 0.6830357142857142 . 0.6898106402164111)
favour helpful, informative recommendations to parties in a dispute rather than being used as a binding mechanism.
NOTER_PAGE: (7 0.8182397959183673 . 0.6384129846708746)
Use AI in narrow settings with well-defined outcomes and high observability of evidence.
NOTER_PAGE: (7 . 0.848971)

AI for making predictions about the future

NOTER_PAGE: (8 . 0.255961)

Predicting the outcomes of court decisions

NOTER_PAGE: (8 . 0.534419)
NOTER_PAGE: (8 0.2621173469387755 . 0.7944093778178539)

Predictive AI for making decisions

NOTER_PAGE: (9 . 0.19042)

Where dynamics are known and stable over time, and information is readily available, prediction is possible, as in physical sciences, where we can build reliable approximations of aspects of the world that we are modelling. Yet this is not true of predictive AI in law, where fundamentally, most predictions will be about people and societies.

NOTER_PAGE: (9 0.21237244897959182 . 0.521190261496844)

Low accuracy of deployed applications.

NOTER_PAGE: (9 0.3807397959183673 . 0.08836789900811541)

distribution shift: when the data used to train an ML model differs from the population on which the model is eventually deployed, models are unable to adapt well.

NOTER_PAGE: (9 0.7015306122448979 . 0.22633002705139765)

crime patterns in specific regions differ from nationwide averages in important ways, which means that it fails catastrophically in some areas.

NOTER_PAGE: (9 0.8373724489795917 . 0.11181244364292155)
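
One way to surface the regional failures described in the highlights above is to report accuracy per region instead of a single aggregate figure. The sketch below is illustrative only and is not taken from the paper. The `accuracy_by_group` helper, the region names, and all records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute overall and per-group accuracy.

    records: iterable of (group, predicted, actual) tuples.
    Returns (overall_accuracy, {group: accuracy}).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        hits[group] += int(predicted == actual)
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no records to evaluate")
    overall = sum(hits.values()) / n
    per_group = {group: hits[group] / totals[group] for group in totals}
    return overall, per_group


# Made-up predictions for illustration only: the model is right on every
# record from the well-represented region and wrong on the small one.
RECORDS = (
    [("region_a", 1, 1)] * 4
    + [("region_a", 0, 0)] * 4
    + [("region_b", 0, 1)] * 2
)

overall, per_group = accuracy_by_group(RECORDS)
print(f"overall: {overall}, per region: {per_group}")
```

In this toy data the aggregate accuracy looks acceptable while one region fails completely, which is the pattern an aggregate-only evaluation would hide.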

Conclusion

NOTER_PAGE: (10 . 0.621954)

Acknowledgments

NOTER_PAGE: (10 . 0.518392)