Buying Your First AI
or “Never Trust a Used Algorithm Salesman”
There was a time when more of us knew how to change the oil, replace a timing belt, keep an engine running and fix the brakes. Cars got more complicated and we got less mechanical, but one thing hasn’t changed: if you are buying a used car, you can’t trust the seller to tell you what’s wrong with it.
If you’re a government official buying an algorithm, whether marketed as predictive analytics, AI, or machine learning, the same thing is true. It’s great if you have a mechanic you trust. It’s even better if you also have some idea of what to look for yourself.
Algorithmic solutions are sold with promises of efficiency, accuracy, fairness, and speed. But no matter how shiny a tool might be, listen to the engine, take it for a test drive and look under the hood.
Is the hood welded shut?
Private companies often license their software as “black box algorithms”, selling the use of their software without the right to see the source code. There is a tension between vendors’ intellectual property interests and the ability of any prospective buyer to evaluate the product, the seller’s claims and the possible unintended consequences.
For a government buyer using algorithms to make decisions that affect its citizens, there are additional considerations around fairness, discrimination and due process. Releasing too much information could unjustly harm individuals or businesses. The city of Pittsburgh makes its commercial building fire risk scores public at the block level but not at the building level to avoid unintended reputational harm and loss of business.¹ But not releasing information can also be problematic, even unconstitutional. In K.W. v. Armstrong (2012), the lack of explanation when a black box algorithm cut Idaho Medicaid disability benefits failed to meet constitutional due process requirements for government takings of private property.²
If you want access to the data and source code, remember that a black box purchase is a choice made in contract negotiations. It is not the only option. Cities and counties are creating datasets and developing software in house, hiring contractors to create works for hire, working with academics and foundations who want to provide public access and dividing IP rights with partners in a variety of ways. Some cities and counties are providing significant public access to their code: Pittsburgh’s commercial building fire risk model is freely available on Github,³ and the code for Allegheny County’s child protective services risk score model is accessible to individuals with approved institutional affiliations, including academics and journalists.
A very large city may have the power to demand significant concessions when negotiating with commercial vendors, or the budget to fund in-house development. Smaller cities can pool resources or increase bargaining power through associations like the Civic Analytics Network, a national network of urban chief data officers in the U.S., or Canada’s AI Governance Joint Convening BC, a province-wide, multi-agency collaboration to develop regional AI best practices.
Third party algorithm IP isn’t the only barrier to transparency and explainability. A citizen may not be able to understand and challenge their own score without access to data beyond their own, which can be impossible to provide without infringing on other citizens’ privacy rights. Even the software’s developers, with full access to the source code and data, may have an incomplete understanding because of a technical tradeoff between explainability and accuracy in the choice of algorithm.
What fuel does it take?
Whether or not we have access to the source code, we can understand something about the algorithm by understanding what its inputs are, how they were obtained and how they might be correlated with other variables.
How will you get the data?
Some algorithms have relatively modest data requirements, but training a machine learning model often requires tens or hundreds of thousands of pieces of data or more. Designing the model and writing the code can be far easier than preparing the data, so one of the first questions is whether you have the kind and quantity of data the algorithm will need and the infrastructure to use it. If you’re designing a model or evaluating a vendor’s software, ask experts from your relevant departments (fire, police, health and human services, courts, public works, IT…) what factors they use when they make decisions, what they think would be useful, what data is available, how much work it will be to prepare it for use and who would do that work.
While thinking about data availability, consider possible incompleteness, inaccuracies and biases as well as privacy and security issues.
Rare cases, inaccuracies and bias
The data used to train and test the model needs to be representative of the data the model will encounter in use, although you may need to overrepresent unusual cases to give your model enough examples to learn from.
It can be hard to know in advance if your data is sufficient or representative, but you should include examples of common values and edge cases, think about ways your data source might have a known bias and try to make your model fail (and then fix it) in testing. For an example of a biased data source, if you use only business information from the Chamber of Commerce, you get only a subset of the city’s businesses. You are leaving out most tech startups, big corporations, law firms and small construction contractors, which have quite different characteristics from Chamber members and from each other. If you want your model to include all of the kinds of businesses in your city, you’ll need to find ways to get data from those other kinds of businesses.
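One inexpensive way to probe representativeness is to compare the distribution of a key characteristic in the data you have against a broader reference source. The sketch below is only an illustration under assumed inputs: the file names training_businesses.csv and city_registry.csv and the column business_type are hypothetical, and it uses the Python pandas library.

    # Minimal representativeness check (hypothetical files and column names).
    import pandas as pd

    train = pd.read_csv("training_businesses.csv")  # businesses in your training data
    city = pd.read_csv("city_registry.csv")         # a broader citywide registry

    # Share of each business type in the training data vs. citywide.
    train_share = train["business_type"].value_counts(normalize=True)
    city_share = city["business_type"].value_counts(normalize=True)

    comparison = pd.DataFrame({"train": train_share, "citywide": city_share}).fillna(0)
    comparison["gap"] = comparison["citywide"] - comparison["train"]

    # Business types far more common citywide than in the training data
    # are candidates for additional data collection or reweighting.
    print(comparison.sort_values("gap", ascending=False).head(10))

A large gap doesn’t prove the model will behave badly for the underrepresented groups, but it tells you where to look first.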
Individuals or businesses can be hurt by inaccuracies or biases in your model or data based on characteristics you didn’t know you used. For example, if your restaurant health code violation risk score model uses Yelp reviews as an input⁴, the racial prejudices of reviewers might skew scores for certain restaurants. Be aware that machine learning models can not only transmit bias from the data but also exaggerate it.⁵
Variable utilization and proxies
A model might have access to hundreds or hundreds of thousands of variables. Ask which ones are having the biggest effect and whether it matters. Sometimes a model can provide that information globally even when it can’t in a specific case. Variable use and importance information can help your staff evaluate the model’s performance and check for unfair or illegal uses, such as constitutionally prohibited uses of protected class status. It may also expand their own domain expertise, revealing or explaining a relationship in the data they hadn’t seen before. Look for proxies like zip code, often highly correlated with race.
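For a model you can run yourself, variable importance and a simple proxy check can be computed directly; for a vendor’s model, ask for the equivalent analysis. The sketch below is only an illustration under assumed inputs: the files cases.csv and zip_demographics.csv, the columns outcome, zip_code and percent_minority, and the choice of a random forest are all hypothetical, and it uses the pandas and scikit-learn libraries.

    # Global variable importance plus a simple proxy check (hypothetical data).
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("cases.csv")                     # hypothetical labeled cases
    y = df["outcome"]
    X = pd.get_dummies(df.drop(columns=["outcome"]))  # one-hot encode categoricals

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Which variables move the predictions the most, globally?
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    importances = pd.Series(result.importances_mean, index=X.columns)
    print(importances.sort_values(ascending=False).head(15))

    # Proxy check: do the model's scores track zip-code demographics?
    demographics = pd.read_csv("zip_demographics.csv")  # hypothetical reference data
    scored = df.loc[X_test.index, ["zip_code"]].copy()
    scored["score"] = model.predict_proba(X_test)[:, 1]
    scored = scored.merge(demographics, on="zip_code")
    print(scored["score"].corr(scored["percent_minority"]))

A strong correlation between the scores and a protected-class proxy is not automatically disqualifying, but it is exactly the kind of finding your staff and counsel should see before deployment.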
How big/bad are the variable gaps?
What we have and what we wish we had
Only certain data is available to use in computer models, which often leads to a difference between the variables we would like to use and the variables actually used in machine learning models, both for the independent input variables and for the dependent variable we’re trying to predict. We should ask how big those gaps are, what effects we would expect them to have and what we can do if they seem problematic.
Gap and race: “arrest” as a proxy for “commission of crime”
For example, suppose we were designing a tool like COMPAS, a computerized criminal justice risk assessment tool widely used in the U.S.⁶ To help judges weigh an individual’s danger to society as a factor in determining the length of a sentence, we might want to predict how likely someone is to commit a crime within a couple of years after they are released from prison.
No one has good data on whether people commit crimes after release. What’s available are records of arrests and convictions, and there is the gap: some people commit crimes they are never arrested for, and others are convicted of crimes they didn’t commit. While that gap may be problematic for criminal sentencing in the general population, we should be especially careful knowing that the gap affects different groups differently. The New York Times reported that in New York City, black residents are arrested for marijuana possession eight times more often than white residents, while government surveys show black and white people use marijuana at roughly the same rate.⁷
Gap and poverty: “accesses county mental health services”
In some cases the gap is so big that we’re pointed in the opposite direction, predicting the opposite of what we thought we would. “Accesses county mental health services” at first sounds like a proxy for “Has mental health problems dangerous to child” in a child protective services model, and it seems problematic since wealthier citizens’ mental health visits to private doctors won’t be recorded. But “accesses county mental health services” might actually be a much better proxy for “Will access resources to keep child safe”.
Safety features
Data security and privacy
Citizens and city leaders may consider some data sensitive even when the law does not, but at a minimum, be aware that data collection, storage and use create legal obligations across multiple jurisdictions. You should ask what data the algorithm will use or create that the law may consider sensitive, what legal duties will arise and what resources it will take to meet them.
Any data project you’re considering also contributes to a cumulative danger. As you increase data collection, interconnectedness and reliance on digital and algorithmic systems, your city and citizens become increasingly vulnerable to data breaches and ransomware attacks like the attack against the city of Atlanta in 2018.⁸
Use and shifting use case
Ask if your constituents would consider this collection and use of data desirable. Look to the future and consider what uses might arise for the data or the model next week or ten years from now. Local or federal law enforcement agencies might like to put cameras on a city’s commercial sidewalk delivery robots or access lamp post sensor data originally collected to monitor traffic congestion. Industry partners may have other uses for the data that you don’t want to allow. Address use restrictions in contracts and public communications now⁹ and put a process in place for evaluating use cases in the future as they are proposed for implementation.
Level of risk and failure detection
Without good ground truth to compare our predictions to, it can be hard to know how good our results are. Worse, we won’t know when the algorithm has gone terribly wrong. Find ways to measure the level of risk¹⁰ and ways to detect failure. Ask what kinds of wrong answers the model could give, what harm would be caused and how you will know.
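One cheap failure-detection signal, among several you should use, is watching for drift in the distribution of the model’s scores over time. The sketch below assumes hypothetical saved score files and uses the numpy and scipy libraries; a shift doesn’t prove the model is wrong, but it is a prompt to investigate.

    # Drift check on score distributions (hypothetical saved score files).
    import numpy as np
    from scipy.stats import ks_2samp

    baseline_scores = np.load("scores_at_deployment.npy")  # hypothetical
    recent_scores = np.load("scores_last_month.npy")        # hypothetical

    # A Kolmogorov-Smirnov test compares the two score distributions.
    stat, p_value = ks_2samp(baseline_scores, recent_scores)
    print("KS statistic:", stat, "p-value:", p_value)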
System check
Even for an expert, access to the source code and all of the training data may not be enough to know what outputs to expect from a model. Certain machine learning predictive models based on neural networks are particularly hard to decipher. But even a black box deep learning model doesn’t mean you can’t demand a certain level of understandability and confidence.
Manual review of partial dataset
Have a staff member manually review a subset of the training and test data.
In 2015, Google Photos’ automatic tagging notoriously misidentified some African American people as gorillas.¹¹ We can imagine factors that might have contributed: a shortage of pictures of black people or a correlation between race and other characteristics like clothing type, surroundings and activities.
Misidentifications seem to have been very rare, so an engineer reviewing a hundred random test images wouldn’t have come across one, but looking for patterns in a hundred training images, the same engineer might have noticed if everyone was Caucasian.
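The mechanics of pulling a sample for human review are simple. The sketch below assumes tabular data in hypothetical training.csv and test.csv files and uses the pandas library; for image data the same idea applies, copying a random set of files into a folder for a staff member to look through.

    # Pull a small random sample of each dataset for manual review (hypothetical files).
    import pandas as pd

    for name in ["training.csv", "test.csv"]:
        df = pd.read_csv(name)
        sample = df.sample(n=min(100, len(df)), random_state=0)
        sample.to_csv("review_" + name, index=False)
        print(name, "->", len(sample), "rows written for manual review")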
Model continuity/stability
Even when you don’t know how variables are being used or don’t have ground truth for the answer the model should give, you can get some information from the continuity or stability of the results. Change some test inputs slightly and see how the output changes.
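A minimal version of that stability check, assuming a fitted scikit-learn-style classifier and a numeric test set (both hypothetical here), might look like this:

    # Stability check: small input perturbations shouldn't swing the scores wildly.
    import numpy as np

    def stability_check(model, X_test, noise=0.01, trials=20, seed=0):
        """Perturb inputs by about 1% and report how much each score moves."""
        rng = np.random.default_rng(seed)
        base = model.predict_proba(X_test)[:, 1]
        max_shift = np.zeros(len(X_test))
        for _ in range(trials):
            jitter = 1 + noise * rng.standard_normal(X_test.shape)
            perturbed = model.predict_proba(X_test * jitter)[:, 1]
            max_shift = np.maximum(max_shift, np.abs(perturbed - base))
        return max_shift

    # Cases whose scores move a lot under tiny input changes deserve a closer look.

Large swings under tiny changes don’t necessarily mean the model is broken, but they are worth showing to the vendor or your data science staff.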
Citizen impact
Consider how the adoption of this tool for this particular application will affect citizens, employees and departmental workflows and capabilities. Ask how the use of this algorithm will affect the fair distribution of government resources, citizens’ meaningful ability to understand and appeal government decisions that affect them, or any other duties or rights.
Departmental Impact Checklist
Make a checklist to think about how this tool will affect employees and departments. Here’s a start:
- Talk to end users and stakeholders about their needs and pain points
- Verify this product will solve a real problem
- Create a tool employees will want to use
- Create a tool that citizens will want the city to use
- Plan how the tool can fit in well with existing workflows, perhaps integrating into an existing interface
- Understand how this tool may affect employee job security and satisfaction or create other fears — address those concerns if possible (e.g. is there a plan for placing employees in other similar quality jobs if their current jobs disappear?)
- Understand how it will affect departmental capabilities
Fuel, maintenance, repairs and lifetime cost
Much more to this project than model design
Consider the costs of infrastructure creation and data collection, preparation and protection, not just at the beginning but for the lifetime of the project. Plan what work will need to be done, who will do it and what resources they will need. New projects are easier to fund than ongoing maintenance, but over a program’s lifetime the maintenance costs can be far higher.
- Negotiate project financing and partnerships
- Create the necessary data infrastructure
- Collect and prepare the data
- Design and implement the tool, consulting with users and domain experts
- Integrate tool into existing workflows
- Test and evaluate design, prototype and tool, including ethical risk evaluation
- Fix it if it doesn’t work
- Update with new data, as needed
- Monitor for unexpected consequences and evaluate over the long term
- Communicate with the public
Compare relationship and licensing models
Compare an annual license agreement, one time purchase, in-house development, work for hire, partnership or other arrangement based on initial and long term costs, risks, responsibilities and IP rights. Consider negotiated reassignments of costs, risks and rights.
Test drive
Evaluation period and access for testing
If you are working with an outside vendor, negotiate an evaluation period and sufficient access for testing.
Decide how to measure success
Consider the goal of the software, whether it is to duplicate the existing predictions, decisions or calculations more efficiently or to make better decisions by some other measure. Working with your departmental domain experts, identify an appropriate way to measure success. If you have good ground truth, use it.
Compare results to existing method and expert’s intuition
Testing and deployment should include comparing predictions, decisions or calculations made by your old method, if you have one, to those made by the proposed new method. Reserve enough historical data for separate validation and testing. Having a domain expert in the department that will use the tool try it out or review selected results can point to instances where the model disagrees with the expert’s intuition. Investigate further: those instances may be flaws in the model, or they may reveal new information your expert didn’t know.
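A comparison like that can be as simple as the sketch below, which assumes a hypothetical historical.csv file containing, for each past case, the old method’s decision (old_decision), the new model’s decision (new_decision) and the eventual outcome (outcome); it uses the pandas library.

    # Compare old and new methods on held-out historical cases (hypothetical file).
    import pandas as pd

    hist = pd.read_csv("historical.csv")

    # Agreement of each method with the ground truth outcome, where it exists.
    print("old method accuracy:", (hist["old_decision"] == hist["outcome"]).mean())
    print("new model accuracy: ", (hist["new_decision"] == hist["outcome"]).mean())

    # Cases where the two methods disagree are the ones to send to a domain expert.
    disagreements = hist[hist["old_decision"] != hist["new_decision"]]
    disagreements.to_csv("for_expert_review.csv", index=False)
    print(len(disagreements), "disagreements flagged for expert review")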
Phased testing and rollout
Use a phased testing and rollout plan to catch problems before they have a large impact. Idaho’s Medicaid office almost certainly would have caught the errors in their vendor’s code, and avoided a court case and harm to their citizens,¹² if they had continued to pay benefits based on their old method while calculating benefits both ways and comparing the results before deployment.
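A “calculate both ways” phase doesn’t require anything elaborate. The sketch below assumes hypothetical old_benefit and new_benefit functions and a list of cases; the old calculation continues to drive payments while large discrepancies are flagged for review.

    # Shadow comparison: keep paying under the old method, log big differences.
    def compare_benefit_calculations(cases, old_benefit, new_benefit, tolerance=0.01):
        flagged = []
        for case in cases:
            old_amount = old_benefit(case)  # the amount actually paid
            new_amount = new_benefit(case)  # the new tool's calculation, not yet used
            if abs(new_amount - old_amount) > tolerance * max(old_amount, 1):
                flagged.append((case, old_amount, new_amount))
        return flagged

    # Review flagged cases with program staff before the new calculation drives payments.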
Warranty
Before you sign a procurement contract, agree on methods for evaluation. If the software isn’t doing what a vendor has promised or what you thought it would do, what recourse will you have? Ask what guarantees a vendor is willing to make and what guarantees you’ll need.
[1] Interview with Pittsburgh Fire Inspector Skertich, April 2018
[2] https://www.aclu.org/blog/privacy-technology/pitfalls-artificial-intelligence-decisionmaking-highlighted-idaho-aclu-case
[3] https://github.com/CityofPittsburgh/fire_risk_analysis
[4] http://www.govtech.com/dc/articles/What-Can-Boston-Restaurant-Inspectors-Learn-from-Yelp-Reviews.html
[5] https://www.wired.com/story/machines-taught-by-photos-learn-a-sexist-view-of-women
[6] https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
[7] https://www.nytimes.com/2018/05/13/nyregion/marijuana-arrests-nyc-race.html
[8] Costs for the March 2018 ransomware attack against Atlanta may be $17M: https://www.ajc.com/news/confidential-report-atlanta-cyber-attack-could-hit-million/GAljmndAF3EQdVWlMcXS0K/?icmp=np_inform_variation-control
[9] As an example, see Boston’s street sensor communications https://www.boston.gov/innovation-and-technology/smart-streets
[10] Measuring AI risk: https://datasmart.ash.harvard.edu/news/article/potholes-rats-and-criminals
[11] https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
[12] https://www.aclu.org/blog/privacy-technology/pitfalls-artificial-intelligence-decisionmaking-highlighted-idaho-aclu-case