Predicting rates of crime based on economic indicators
Author: Patrick

Predictive & Descriptive Statistical Analysis
As part of my graduate statistics course "Advanced Statistical Methods", I used R to engineer a dataset from several public sources, then tested various predictive models for accuracy.
After settling on a model, I train it and test real, held-out data against it. Similar methodology, on a larger scale, could be replicated to predict business outcomes. (In fact, one project I would like to work on is a D3 web dashboard for business metrics statistically derived with R or Python.)
Here is the model I settled on (note: each datapoint is per zipcode):
[Crime] per Capita prediction =
[% of Population Below Poverty Level] *
[Median Household Income] *
[Median Rent price]
where Crime = Assault, Robbery, Burglary, or Theft, and * denotes interaction effects
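To make the model specification concrete, here is a minimal sketch in Python's statsmodels, whose formula syntax mirrors R's (the original project used R; the column names and the synthetic data below are illustrative assumptions, not the project's actual variables):

```python
# Sketch of the interaction model described above, using statsmodels as a
# stand-in for the original R script. Data is synthetic and for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 36  # one row per zipcode, matching the 36 Austin zipcodes in the project
df = pd.DataFrame({
    "pct_below_poverty": rng.uniform(5, 40, n),
    "median_income": rng.uniform(30_000, 120_000, n),
    "median_rent": rng.uniform(700, 2_500, n),
})
# Synthetic response, just to make the sketch runnable end to end
df["assault_per_capita"] = 0.001 * df["pct_below_poverty"] + rng.normal(0, 0.01, n)

# In formula syntax, a * b * c expands (as in R) to all main effects
# plus every two- and three-way interaction term.
model = smf.ols(
    "assault_per_capita ~ pct_below_poverty * median_income * median_rent",
    data=df,
).fit()
print(len(model.params))  # 8: intercept + 3 main effects + 3 two-way + 1 three-way
```

The `*` operator is what encodes the interaction effects: the fitted model has eight coefficients, not four.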
Trained Models
Here are three examples of trained models tested against actual data. Each is a multivariate linear regression model trained on 80% of the dataset and tested on the remaining 20%. Each time the model is tested, a new randomly selected training set is drawn.
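The repeated 80/20 procedure can be sketched as follows (again in Python rather than the original R; variable names and data are illustrative assumptions):

```python
# Minimal sketch of the train/test procedure: a fresh random 80% training
# sample each run, with the remaining 20% held out for testing.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 36  # one row per zipcode
df = pd.DataFrame({
    "poverty": rng.uniform(5, 40, n),
    "income": rng.uniform(30_000, 120_000, n),
    "rent": rng.uniform(700, 2_500, n),
})
df["burglary_rate"] = 0.02 * df["poverty"] + rng.normal(0, 0.1, n)

# Changing random_state (or omitting it) draws a different random split each run.
train = df.sample(frac=0.8, random_state=1)
test = df.drop(train.index)

fit = smf.ols("burglary_rate ~ poverty * income * rent", data=train).fit()
predictions = fit.predict(test)
print(len(train), len(test))  # 29 7
```

Because the split is re-drawn every run, each test (like the three shown below) exercises the model against a different held-out slice of the data.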
Test 1 of Assault, Robbery, Burglary and Theft models

Test 2 of Assault, Robbery, Burglary and Theft models

Test 3 of Assault, Robbery, Burglary and Theft models

The Explanation: How does this project work?
The project is a multiple linear regression model which seeks to answer:
"To what degree can a neighborhood's crime rates be predicted with economic factors?"
In other words:
Can Crime per Capita (in a zipcode) be predicted with these economic factors below?
Zipcode-based Economic factor 1: % of Population Below Poverty Level
Zipcode-based Economic factor 2: Median Household Income
Zipcode-based Economic factor 3: Median Rent price of dwellings
The answer is YES, quite strongly: depending on the crime type, the model explains roughly 70-80% of the variability (R²) in Austin, TX.
How, you ask? Jump to the section "The Results".
I aggregated and engineered data from the following datasets:
2014 Housing Market Analysis Dataset (Government Open Data initiative for Austin, TX): Source of % of Population Below Poverty Level, Median Household Income, Median Rent price
2014 Crime dataset (Government Open Data initiative for Austin, TX): 40,000 rows -- I aggregated & simplified variations of crime names using regex
2012 US Population Per Zipcode dataset: 2012 population statistics loaded from an R data package with the command: data(df_pop_zip)
2015 U.S. Gazetteer Map Data / US Census Department Geographic dataset: Although I aggregated this data into the final data table, it did not go into the final model
The process:
I collected datasets for zipcodes in Austin showing the economic indicators mentioned above (median income, median rent, and percentage of residents in poverty) for 36 Austin, Texas zipcodes. I then cross-referenced these zipcodes with a dataset of about 40,000 crimes that occurred in Austin in a particular year. I excluded all zipcodes except those for which I had both economic-indicator and crime data. I then counted crimes per category (6 categories total, after simplifying category names: Assault, Burglary, Robbery, Theft, Homicide, and Rape), per zipcode. With these two datasets showing economic & crime data for particular zipcodes, we can produce an equation showing the degree of confidence with which we can predict rates of crime from the chosen economic indicators.
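The regex-simplification and per-zipcode counting steps can be sketched like this (a Python/pandas stand-in for the R script; the offense strings, column names, and patterns here are illustrative assumptions, not the real Austin data):

```python
# Hedged sketch of the aggregation step: collapse raw offense descriptions
# into simplified categories with regex, then count incidents per zipcode.
import re
import pandas as pd

# A tiny stand-in for the ~40,000-row crime dataset.
crimes = pd.DataFrame({
    "zipcode": ["78701", "78701", "78704", "78704", "78704"],
    "description": ["AGG ASSAULT", "THEFT OF BICYCLE", "BURGLARY OF VEHICLE",
                    "ASSAULT W/INJURY", "ROBBERY BY THREAT"],
})

# Map messy offense names onto the six simplified categories.
patterns = {
    "Assault": r"ASSAULT", "Burglary": r"BURGLARY", "Robbery": r"ROBBERY",
    "Theft": r"THEFT", "Homicide": r"HOMICIDE|MURDER", "Rape": r"RAPE",
}

def simplify(description):
    for category, pattern in patterns.items():
        if re.search(pattern, description):
            return category
    return "Other"

crimes["category"] = crimes["description"].map(simplify)
# One row per zipcode, one column per crime category, cells = incident counts.
counts = crimes.groupby(["zipcode", "category"]).size().unstack(fill_value=0)
print(counts.loc["78701", "Theft"])  # 1
```

The resulting per-zipcode count table is then joined with the economic-indicator table to form the regression dataset.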
The results:
According to the coefficient of determination (R²) for the various models, about 70-80% of the variability in these crimes, Assault, Burglary, and Robbery (for the zip codes examined), can be explained by the variability in the selected economic indicators. That's a strong connection: these three types of crime can be relatively well explained by the three economic indicators (income, rent, and % of population below poverty)! Based on this study, the other crime types examined (homicide and rape) cannot be reliably predicted from economic indicators, which means they are determined less by the economic health of a neighborhood and more by other unknown or non-analyzed factors.
The ultimate purpose of the project was to create a statistical model from the "training" dataset (i.e. a model that explains the data) and then to evaluate the model on the held-out "test" dataset, to check how well it predicts crime rates from the economic factors studied.
In the statistical charts (the colorful lines at the top of this page) we can visualize how well the model predicts crime from the economic inputs. The green lines show the boundaries of a 95% interval band around the model's estimate, with the red line being the predicted average and the black line representing the actual test data. As you can see, the green lines capture the majority of the datapoints from the test data, showing that this model (using economic data to explain crime) does quite a good job of predicting some specific crimes (assault, robbery, and especially burglary).
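As a side note on how such bands are computed: a fitted OLS model can produce both an interval for the predicted mean and a wider interval for individual observations, and only the latter is expected to capture most raw datapoints. A sketch in Python's statsmodels (synthetic data, illustrative names):

```python
# Sketch of the two kinds of 95% interval an OLS fit can report.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
train = pd.DataFrame({"poverty": rng.uniform(5, 40, 30)})
train["crime_rate"] = 0.1 * train["poverty"] + rng.normal(0, 0.2, 30)
test = pd.DataFrame({"poverty": rng.uniform(5, 40, 7)})

fit = smf.ols("crime_rate ~ poverty", data=train).fit()
bands = fit.get_prediction(test).summary_frame(alpha=0.05)

# mean_ci_* bounds the predicted average (the red line's uncertainty);
# obs_ci_* bounds individual new observations and is necessarily wider.
assert (bands["obs_ci_lower"] <= bands["mean_ci_lower"]).all()
assert (bands["obs_ci_upper"] >= bands["mean_ci_upper"]).all()
```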
I programmed the project; my two teammates provided guidance, feedback, and ideation. The final script can be viewed here.
Why do I enjoy statistical analysis?
I enjoy statistical analysis because it is a creative, fun, puzzle-solving way of thinking which allows us to explain the world through data in meaningful, measurable ways.
Throughout this particular statistics course ("Advanced Statistical Methods"), I noticed my thought processes changing, especially while working on this project. Between the day I walked into the class and the day it ended, I had begun to think more analytically, from a more data-driven, input-output, equation-based perspective. Crafting equations from data is solving a puzzle, one where you can produce real and fascinating answers from apparently unrelated datasets! It's quite fun once you get into it!
In addition to changing the way you think and letting you have fun with real-life puzzles, practicing statistics via scripting will improve your programming abilities! Learning R helped me become a better programmer: before I could start playing with data, I spent plenty of time learning R's data types and data structures, which paid off when it came time to build the analysis project and carries over to programming in general.
I also like statistics for the contribution multivariate analysis makes to analyzing ideas of any kind. It is quite handy for empirically evaluating all sorts of things: business decisions, scientific & engineering projects, economic theories, and policy arguments. Being able to formulate an equation or model from data is a skill in its own right.
One of my favorite data visualizations is the "choropleth map": a map whose regions are shaded by the value of a variable, similar to a heatmap.
Here's a choropleth map of Assault frequency for zipcodes in Austin, TX (data is annual assaults reported in 2014).

