The Greater Houston Area is a large portion of Southeast Texas
consisting of 17 counties, with an estimated population of 7.21
million. Our goal is to look at documented crimes and their statistics to see how years, counties,
and other factors compare, and whether they help determine the crimes found
in different counties. Our analysis gives an overview of all Texas counties,
with a focus on the Greater Houston Area.
This topic is important to our group because we live in the Greater Houston Area,
as we are sure many of you do as well.
Texas is the second largest state by landmass and also has the
second largest population of all US states, at 29,730,311 with
an annual growth rate of 3.85% (as of March 2021). This makes it a large
source of data that can be used to perform an analysis on crime.
Our data comes from the FBI Crime Data Explorer,
which offers an abundance of crime data broken down by
different categories such as county, year, violent vs.
nonviolent, and more.
We decided to pull the crimes by county for 2015-2020,
which are broken down not only by county but also by police
department, giving us an in-depth look at the
whole state's statistics.
Texas began with 23 counties when it gained independence
from Mexico in 1836. A county consisted of a council
composed of at least one judge, a varying number of aldermen,
an attorney, a sheriff, and a secretary.
New counties could be established if 100 free male inhabitants
living in an area containing at least 900 square miles petitioned
the government. By 1845, the count had risen to 36 and by the end of 1870
the state had 129 organized counties.
By 1931, Texas had added its last county, Loving County. Today, Texas has the
most counties of any US state, almost 100 more than second-place Georgia.
Texas also has 4 of the top 20 counties by population as of 2020 (Harris, Dallas, Tarrant, and Bexar).
Between 2010 and 2020, Texas was home to 5 of the 10 fastest growing counties by population
(Harris, Tarrant, Bexar, Collin, and Travis).
Our dataset consists of all index crimes reported for Texas
by county and reporting agency for the years 2015-2020.
The data was cleaned and explored in our Crime Data Analysis
Jupyter notebook.
With a cleaned and transformed dataset, we began to look at different
visualizations to see how the counties compared against
each other. We then began to plan how we could fit this data to a machine
learning model to predict crime rates based on population.
We wanted to look a little closer to home by focusing on the Greater Houston Area.
The Greater Houston Area consists of the following counties:
Harris County is also separated from the Greater Houston Area DataFrame because its size and value counts would make it a large outlier if included with the other Greater Houston Area counties.
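The separation of Harris County could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `county` and `total_crime`, the toy values, and the partial county list (limited to counties named elsewhere in this write-up) are all assumptions.

```python
import pandas as pd

# Hypothetical column names and toy values; the real notebook's schema may differ.
crime_df = pd.DataFrame(
    {
        "county": ["Harris", "Fort Bend", "Galveston", "Montgomery", "Dallas"],
        "total_crime": [1000, 60, 50, 49, 500],
    }
)

# Partial list for illustration only; NOT the full set of Greater Houston counties.
houston_counties = ["Harris", "Fort Bend", "Galveston", "Montgomery"]

# Keep only the Greater Houston rows.
houston_df = crime_df[crime_df["county"].isin(houston_counties)]

# Split Harris into its own DataFrame so it does not dominate the comparison.
harris_df = houston_df[houston_df["county"] == "Harris"]
houston_no_harris_df = houston_df[houston_df["county"] != "Harris"]
```

Keeping two DataFrames lets the smaller counties be charted on a readable scale while Harris is still available for side-by-side comparisons.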
Our plan for a machine learning model is to use clustering
with K-means and linear regression models. Our feature selection
looks at crime rates compared to the population
of the counties in Texas, with a focused comparison of violent to
nonviolent crimes.
We selected these models because the data was not a good fit
for the other supervised machine learning models we tried. When run through supervised
machine learning with ensemble and resampling methods, the data was split too
small for training and testing purposes.
With unsupervised K-means clustering, paired with
linear regression, the data
can be split, scaled, fit, and trained with no numerical value issues.
The data is preprocessed and checked for null and duplicate values.
A new DataFrame is made to hold the county names separately. The data is then
standardized with StandardScaler().
PCA is used to reduce the dimensions to three
principal components, and a new PCS DataFrame is created with them.
An Elbow Curve is created to find the best value for K. Predictions are run
after initializing the KMeans model. A new Clustered DataFrame is created,
and data from the Crime DataFrame and the PCS DataFrame, including the
predictions held in the 'Class' column, are concatenated on the same
columns.
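The steps above can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions rather than the notebook's code: the data is synthetic, the column names are hypothetical, and `k=4` is an arbitrary stand-in for whatever value the elbow curve suggests.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data; column names are assumptions, not the real schema.
rng = np.random.default_rng(0)
n = 60
population = rng.integers(5_000, 1_000_000, size=n)
crime_df = pd.DataFrame(
    {
        "county": [f"County {i}" for i in range(n)],
        "population": population,
        "violent": (population * rng.uniform(0.01, 0.03, size=n)).astype(int),
        "nonviolent": (population * rng.uniform(0.05, 0.15, size=n)).astype(int),
    }
)
crime_df["total_crime"] = crime_df["violent"] + crime_df["nonviolent"]

# Drop nulls and duplicates, then hold the county names separately.
crime_df = crime_df.dropna().drop_duplicates()
county_names = crime_df[["county"]]
features = crime_df.drop(columns="county")

# Standardize, then reduce to three principal components.
scaled = StandardScaler().fit_transform(features)
pcs = PCA(n_components=3).fit_transform(scaled)
pcs_df = pd.DataFrame(pcs, columns=["PC 1", "PC 2", "PC 3"], index=crime_df.index)

# Elbow curve data: inertia for each candidate k.
k_values = list(range(1, 11))
inertia = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(pcs_df).inertia_
    for k in k_values
]
elbow_df = pd.DataFrame({"k": k_values, "inertia": inertia})

# Fit the final model; k=4 is an arbitrary stand-in for the elbow value.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
predictions = model.fit_predict(pcs_df)

# Combine the crime data, the principal components, and the cluster labels.
clustered_df = pd.concat([crime_df, pcs_df], axis=1)
clustered_df["Class"] = predictions
```

Plotting `elbow_df` with `k` on the x-axis and `inertia` on the y-axis gives the elbow curve; the bend in that line is the usual heuristic for choosing K.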
A 3D Scatter Plot is used to visualize the PCA data and clusters,
with popups on hover to show crime data information.
The data is scaled with MinMaxScaler().fit_transform and a new Plot
DataFrame is created to hold the scaled data with the Clustered
DataFrame index.
A scatter plot is created with hvplot.scatter,
using x="population" and y="total_crime", with popups
on hover to display data information.
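As a rough sketch of the scaling step, the snippet below builds a Plot DataFrame that keeps the Clustered DataFrame index. The toy values and county labels are placeholders, and the hvplot call is left as a comment because it assumes hvplot is installed.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the Clustered DataFrame; values and labels are assumptions.
clustered_df = pd.DataFrame(
    {
        "population": [10_000, 250_000, 1_000_000, 4_000_000],
        "total_crime": [300, 9_000, 60_000, 500_000],
        "Class": [0, 0, 1, 2],
    },
    index=["County A", "County B", "County C", "County D"],
)

# Scale the numeric columns to the [0, 1] range, keeping the original index.
plot_cols = ["population", "total_crime"]
scaled = MinMaxScaler().fit_transform(clustered_df[plot_cols])
plot_df = pd.DataFrame(scaled, columns=plot_cols, index=clustered_df.index)
plot_df["Class"] = clustered_df["Class"]

# With hvplot installed (import hvplot.pandas), a call along these lines
# would draw the scatter plot colored by cluster:
# plot_df.hvplot.scatter(x="population", y="total_crime", by="Class")
```

Min-max scaling keeps the relative spacing of counties but puts population and crime counts on the same 0 to 1 scale, which makes the scatter plot easier to read.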
For future analysis, looking at larger, more detailed datasets over a longer
period of time could make it possible to perform more detailed
machine learning and prediction models.
Classifying the data by population size could also
make it possible to compare smaller counties with larger
counties by looking at the data per 10,000 county residents.
Access to different datasets that may not be publicly available could result
in a more in-depth analysis of crime statistics for the desired areas.
This pie chart is a very interesting visualization, showing how many more reported crimes come from Harris County compared to other counties in the Greater Houston Area. Harris had a reported 1,162,602 total crimes compared to Fort Bend in second with only 61,945. One thing to note is the obvious population difference, with Harris having 4,707,136 residents compared to Fort Bend's 739,020. Another interesting conclusion we get from this chart is that Galveston County has the third most crimes reported (52,882) with only 347,699 residents. For reference, Montgomery has 220,000 more residents with 500 fewer crimes reported.
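To put these counts on a common footing, the figures quoted above can be converted to reported crimes per 10,000 residents. The sketch below uses only the totals and populations stated in this paragraph; the helper name `crimes_per_10k` is our own, and the crime counts are multi-year totals, so the results are not annual rates.

```python
# Reported-crime totals (2015-2020) and populations quoted in the text.
counties = {
    "Harris": {"total_crime": 1_162_602, "population": 4_707_136},
    "Fort Bend": {"total_crime": 61_945, "population": 739_020},
    "Galveston": {"total_crime": 52_882, "population": 347_699},
}


def crimes_per_10k(total_crime: int, population: int) -> float:
    """Reported crimes per 10,000 residents (hypothetical helper)."""
    return total_crime / population * 10_000


rates = {name: crimes_per_10k(**vals) for name, vals in counties.items()}

# Rank counties from highest to lowest rate.
ranked = sorted(rates, key=rates.get, reverse=True)
for name in ranked:
    print(f"{name}: {rates[name]:,.0f} reported crimes per 10,000 residents")
```

On these figures Harris still ranks first, but Galveston moves ahead of Fort Bend once population is taken into account.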
Here we can see the top 5 Texas counties by population compared by total crimes reported. This gives us another view of how much more crime Harris County has had over these years compared to other Texas counties. Harris tops the list with 1,162,602, followed by Dallas with 595,792 and Bexar with 543,627. Dallas and Tarrant are both Dallas-area counties, with Dallas home to downtown Dallas and other cities and Tarrant housing Arlington and Fort Worth. Together, Dallas and Tarrant have a population of 4.5 million when averaged over 2015-2020. If we add the total crime reports for these two counties over the years, we get roughly 1,000,000 reports, which is still noticeably lower than Harris County with its 4.7 million residents.
Another graph shows the Greater Houston Area numbers throughout the years. The results stay pretty consistent, with a small dip in all counties in 2018.
This graph displays total crimes throughout the years in all of Texas. There is also a split between violent and non-violent reports, with non-violent making up the large majority. The numbers are slowly falling year by year, which is a good sign. With 2021 numbers not fully compiled yet, we were unable to see if this downward trend has continued.
Here we can see the total crimes throughout all of Texas by crime type from 2015-2020. Larceny is by far the most commonly occurring crime, with burglary, assault, and auto theft following after. Murder is the least frequent crime with 8,950 reports.
A 3D Scatter Plot is used to visualize the PCA data and clusters, with popups on hover to show crime data information. The data is scaled with MinMaxScaler().fit_transform and a new Plot DataFrame is created to hold the scaled data with the Clustered DataFrame index.
A scatter plot is created with hvplot.scatter, using x="population" and y="total_crime", with popups on hover to display data information. The plot shows the positive correlation of population to total crime in Texas counties. The clusters show the different classes of counties based on similarly sized populations.
A linear regression plot is used to show a prediction line of the correlation of population to crime.
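A prediction line of this kind could be fit along the following lines. This is a minimal sketch with scikit-learn's `LinearRegression`; the data is synthetic and generated only for illustration, so the slope and fit quality say nothing about the real Texas figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data: crime grows roughly linearly with population.
rng = np.random.default_rng(1)
population = rng.uniform(5_000, 1_000_000, size=50)
total_crime = 0.1 * population + rng.normal(0, 2_000, size=50)

# scikit-learn expects a 2D feature array of shape (n_samples, n_features).
X = population.reshape(-1, 1)
y = total_crime

model = LinearRegression().fit(X, y)
predicted = model.predict(X)  # points on the fitted prediction line

print(f"slope: {model.coef_[0]:.4f}, intercept: {model.intercept_:.1f}")
print(f"R^2: {model.score(X, y):.3f}")
```

Plotting `predicted` against `population` on top of the scatter of actual values gives the regression line; the slope can be read as the expected change in reported crimes per additional resident.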