Texas Crime Trends Analysis


An analysis by: Jose Miguel Boada | Liz McDanled

Overview


The Greater Houston Area is a large region of Southeast Texas consisting of 17 counties, with an estimated population of 7.21 million. Our goal is to examine documented crimes and their statistics to see how years, counties, and other factors compare, and whether those factors may help explain the crimes reported in different counties. Our analysis gives an overview of all Texas counties, with a focus on the Greater Houston Area.


Topic Selection


Crime Rates throughout Texas counties from 2015 - 2020

Our focus is to look at an overview of Texas counties in relation to crimes, with a focus on the Greater Houston area.

This topic is important to our group because we live in the Greater Houston Area, as we are sure many of you do as well.

Texas is the second-largest state by land area and has the second-largest population of all US states at 29,730,311, with an annual growth rate of 3.85% (as of March 2021). This makes it a rich source of data for performing an analysis on crime.





Data Source


Our data comes from the FBI Crime Data Explorer, which offers an abundance of crime data broken down by categories such as county, year, violent vs. nonviolent, and more.

We pulled the crimes by county for 2015 - 2020. The data is broken down not only by county but also by police department, giving us an in-depth look at the whole state's statistics.





Questions we aim to answer






Brief History on Texas Counties


Texas began with 23 counties when it gained independence from Mexico in 1836. A county consisted of a council composed of at least one Judge, a varying number of Aldermen, an Attorney, a Sheriff, and a Secretary.

New counties could be established if 100 free male inhabitants living in an area containing at least 900 square miles petitioned the government. By 1845, the count had risen to 36 and by the end of 1870 the state had 129 organized counties.

By 1931, Texas had added its last county, Loving County. Today, Texas has the most counties of any US state, leading second-place Georgia by almost 100. Texas also has 4 of the top 20 counties by population as of 2020 (Harris, Dallas, Tarrant, and Bexar).

Between 2010 and 2020, Texas was home to 5 of the 10 fastest-growing counties by population (Harris, Tarrant, Bexar, Collin, and Travis).





Data Exploration


Our dataset consisted of all index crimes reported for Texas, by county and reporting agency, for the years 2015 - 2020.
The data was cleaned and explored in our Crime Data Analysis Jupyter notebook.
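
As a rough illustration of this kind of cleaning step, the sketch below rolls agency-level rows up into one row per county and year. The column names (`county`, `agency`, `year`, `violent_crime`, `nonviolent_crime`) and all values are hypothetical placeholders, not the notebook's actual schema or data.

```python
import pandas as pd

# Hypothetical agency-level rows; names and counts are made up for illustration.
agency_df = pd.DataFrame({
    "county": ["Harris", "Harris", "Fort Bend", "Fort Bend", "Harris"],
    "agency": ["Agency A", "Agency B", "Agency C", "Agency C", "Agency A"],
    "year": [2015, 2015, 2015, 2016, 2016],
    "violent_crime": [120, 30, 15, 18, 110],
    "nonviolent_crime": [900, 210, 140, 150, 870],
})

# Drop exact duplicate rows and rows missing the grouping keys.
clean_df = agency_df.drop_duplicates().dropna(subset=["county", "year"])

# Sum every agency's counts into one row per county and year.
crime_cols = ["violent_crime", "nonviolent_crime"]
county_df = clean_df.groupby(["county", "year"], as_index=False)[crime_cols].sum()
county_df["total_crime"] = county_df["violent_crime"] + county_df["nonviolent_crime"]
```

Aggregating to the county level first keeps later comparisons from being skewed by how many agencies report within each county.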

With a cleaned and transformed dataset, we began to look at different visualizations to see how the counties compared against each other. We then began to plan how we could fit this data to a machine learning model to get predicted crime rates based on population.

We wanted to look a little closer to home by focusing on the Greater Houston Area, which consists of the following counties:

Harris County is also separated from the Greater Houston Area DataFrame because its size and value counts would make it a large outlier if included with the other Greater Houston Area counties.
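
A minimal sketch of that separation, assuming a DataFrame with a `county` column. The counties shown are only an illustrative subset of the Greater Houston Area, and the totals are made up.

```python
import pandas as pd

# Illustrative subset of Greater Houston counties with made-up totals.
houston_df = pd.DataFrame({
    "county": ["Harris", "Fort Bend", "Montgomery", "Galveston"],
    "total_crime": [150000, 9000, 8000, 7000],
})

# Boolean mask that is True only for the Harris County rows.
is_harris = houston_df["county"] == "Harris"

# Keep Harris in its own DataFrame and everything else in another.
harris_df = houston_df[is_harris].reset_index(drop=True)
other_houston_df = houston_df[~is_harris].reset_index(drop=True)
```

Plotting `other_houston_df` on its own keeps the axis scale readable for the smaller counties.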

Visualizations


Tableau



Machine Learning


Clustering Texas Crime Data with K-means


Our plan for machine learning is to use K-means clustering and linear regression models. Our features are the crime rates compared with the populations of the counties in Texas, with a focused comparison of violent and nonviolent crimes. We chose this approach because the data was not a good fit for the supervised machine learning models we tried: when run through supervised learning with ensemble and resampling techniques, the data was split too small for training and testing purposes.

With unsupervised K-means clustering, alongside linear regression, the data can be split, scaled, fit, and trained with no numerical value issues. The data is preprocessed and checked for null and duplicate values. A new DataFrame is made to hold the county names separately. The data is then standardized with StandardScaler().
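
The preprocessing steps above could look roughly like the following. This is a sketch under assumptions: the column names and values are hypothetical, and the notebook may handle nulls and duplicates differently (for example, by filling rather than dropping).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical county-level data with one null and one duplicate row.
crime_df = pd.DataFrame({
    "county": ["A", "B", "C", "C", "D"],
    "population": [1000.0, 5000.0, 20000.0, 20000.0, None],
    "violent_crime": [5.0, 40.0, 300.0, 300.0, 12.0],
    "nonviolent_crime": [30.0, 200.0, 1500.0, 1500.0, 80.0],
})

# Check for null and duplicate values, then drop them (assumed handling).
null_counts = crime_df.isnull().sum()
duplicate_count = int(crime_df.duplicated().sum())
crime_df = crime_df.dropna().drop_duplicates().reset_index(drop=True)

# Hold the county names separately so only numeric features are scaled.
county_names = crime_df[["county"]].copy()
features = crime_df.drop(columns=["county"])

# Standardize each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)
scaled_df = pd.DataFrame(scaled, columns=features.columns, index=features.index)
```

Standardizing matters here because population is orders of magnitude larger than the crime counts and would otherwise dominate the distance calculations in K-means.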

PCA is used to reduce the data to three principal components, and a new PCS DataFrame is created with them. An elbow curve is created to find the best value for K. After the KMeans model is initialized, predictions are run. A new Clustered DataFrame is created by concatenating the Crime DataFrame and the PCS DataFrame, with the predictions held in the 'Class' column.
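
A sketch of the PCA, elbow curve, and K-means steps described above, using scikit-learn. The input data is randomly generated, the column names are hypothetical, and the choice of `k = 4` is an arbitrary assumption for illustration; in practice K would be read off the elbow curve.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned Crime DataFrame (random, not real data).
rng = np.random.default_rng(0)
population = rng.integers(1_000, 500_000, size=60)
crime_df = pd.DataFrame({
    "population": population,
    "violent_crime": (population * rng.uniform(0.002, 0.006, size=60)).round(),
    "nonviolent_crime": (population * rng.uniform(0.01, 0.03, size=60)).round(),
    "total_agencies": rng.integers(1, 20, size=60),
})
scaled = StandardScaler().fit_transform(crime_df)

# Reduce the scaled features to three principal components.
pcs = PCA(n_components=3).fit_transform(scaled)
pcs_df = pd.DataFrame(pcs, columns=["PC 1", "PC 2", "PC 3"], index=crime_df.index)

# Elbow curve data: K-means inertia for k = 1..10.
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs_df)
    inertia.append(km.inertia_)
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})

# Fit the final model with the chosen k (4 is an assumption) and predict.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
predictions = model.fit_predict(pcs_df)

# Combine the original data, the principal components, and the cluster labels.
clustered_df = pd.concat([crime_df, pcs_df], axis=1)
clustered_df["Class"] = predictions
```

Plotting `elbow_df` with `k` on the x-axis and `inertia` on the y-axis gives the elbow curve; the bend where inertia stops dropping sharply is the usual candidate for K.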

A 3D scatter plot is used to visualize the PCA data and clusters, with popups on hover showing crime data information. The data is scaled with MinMaxScaler().fit_transform, and a new Plot DataFrame is created to hold the scaled data with the Clustered DataFrame index.
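
The scaling step for the Plot DataFrame might be sketched as follows. The tiny `clustered_df` here is a made-up stand-in, and which columns get scaled is an assumption.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up stand-in for the Clustered DataFrame, indexed by county.
clustered_df = pd.DataFrame(
    {
        "population": [1000, 25000, 400000],
        "total_crime": [20, 900, 16000],
        "Class": [0, 0, 1],
    },
    index=["County A", "County B", "County C"],
)

# Scale the plotted columns to the 0-1 range and keep the original index.
plot_cols = ["population", "total_crime"]
plot_df = pd.DataFrame(
    MinMaxScaler().fit_transform(clustered_df[plot_cols]),
    columns=plot_cols,
    index=clustered_df.index,
)

# Carry the cluster labels over unscaled so points can be colored by class.
plot_df["Class"] = clustered_df["Class"]
```

Reusing the Clustered DataFrame index keeps each scaled row tied to its county, which is what makes the hover popups meaningful.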

A scatter plot is created with hvplot.scatter, using x="population" and y="total_crime", with popups on hover to display data information.
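
The population-versus-total-crime relationship shown in that scatter plot is also what the linear regression model mentioned earlier would fit. The sketch below assumes scikit-learn's `LinearRegression` and uses made-up numbers; it is not the notebook's actual model or results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up county populations (one feature column) and total crime counts.
population = np.array([[5_000], [20_000], [80_000], [250_000], [600_000]])
total_crime = np.array([110, 450, 2_100, 6_800, 15_500])

# Fit total_crime as a linear function of population.
model = LinearRegression().fit(population, total_crime)

# R-squared on the training data; this is not a held-out test score.
r_squared = model.score(population, total_crime)

# Predicted total crime for a hypothetical county of 100,000 residents.
predicted = model.predict(np.array([[100_000]]))
```

With only a handful of counties, a score computed on the training data says little about predictive power, which is consistent with the small-sample concerns noted above.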


Future Analysis


For future analysis, larger and more detailed datasets covering a longer period of time could make it possible to perform more detailed machine learning and build better prediction models.

Classifying the data by population could also make it possible to compare smaller counties with larger counties by looking at the data per 10,000 county residents.

Access to different datasets that may not be publicly available could result in a more in-depth analysis of crime statistics for the desired areas.
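
As a small worked example of that normalization, the rate per 10,000 residents is the crime count times 10,000, divided by the population. The county names and figures below are invented for illustration.

```python
import pandas as pd

# Invented figures for one small and one large county.
county_df = pd.DataFrame({
    "county": ["Small County", "Large County"],
    "population": [8_000, 2_000_000],
    "total_crime": [160, 50_000],
})

# Crimes per 10,000 residents puts counties of any size on the same scale.
county_df["crime_per_10k"] = (
    county_df["total_crime"] * 10_000 / county_df["population"]
)
```

Here the small county comes out at 200 crimes per 10,000 residents and the large one at 250, a comparison that the raw totals of 160 and 50,000 would hide.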

Data & Information Sources


Federal Bureau of Investigation Crime Data Explorer | Link


History on Texas Counties | Link


A Decade of Population Growth and Decline in U.S. Counties | Link


County Organization | Link


US County Populations 2022 | Link