The UFO Dataset
A collection of UFO incident reports recorded across the US and
Canada. Each incident is recorded with the following information:
Date and Time
City and State
Shape of object
Duration of incident
Summary of incident
Over 93,000 incidents have been reported since 1998.
The data set is available from www.nuforc.org.
Is there a predictable pattern to the incident occurrence based on time, location, and duration?
Is there a correlation between the color and shape of the object observed?
Through text mining, find the words witnesses used most frequently to describe a UFO.
By studying the distribution of time between UFO occurrences, can we predict the chances of the next UFO appearance?
Does the population of a state have anything to do with the chances of reporting an incident?
Associate the different parameters to bring out meaningful insights and patterns.
Challenges with the UFO Dataset
Inconsistent date and time formats: The time of incident was not recorded for about 2600 incidents. Missing values were imputed for these cases.
Inconsistent place information: Around 6500 records had no mention of State. These records were included in analyses where the
State parameter was not considered, and excluded in all other cases.
Missing shape information: Around 1500 records were missing the shape of the UFO. Missing values were imputed, with the value set to
“Unknown” for these cases.
No Durations: Durations were not recorded in a uniform way. We categorized each duration and preserved the detail as “In seconds”, “In minutes”, or “In hours”.
Extracting from summary: A string search was applied to this column to identify the color of the UFO by matching the text against a list of known colors.
A new “color” column was created from the search results.
Records with no mention of color are set as
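The duration categorization and the color search described above can be sketched in a few lines of Python. This is an illustration only, not the code used for the analysis: the color list, the function names, and the use of None for unmatched records are all assumptions, since the source does not give the actual color list or the value assigned when no color is mentioned.

```python
import re

# Hypothetical color vocabulary; the list used in the analysis is not given.
KNOWN_COLORS = ["red", "orange", "yellow", "green", "blue", "white"]

def extract_color(summary):
    """Return the first entry of KNOWN_COLORS that appears as a whole word."""
    text = summary.lower()
    for color in KNOWN_COLORS:
        if re.search(r"\b" + re.escape(color) + r"\b", text):
            return color
    return None  # assumption: placeholder for "no color mentioned"

# Unit patterns are checked in this order; the lookarounds stop "hr" from
# matching inside words such as "three" while still accepting "10sec".
_UNIT_PATTERNS = [
    ("In seconds", r"(?<![a-z])(?:secs?|seconds?)(?![a-z])"),
    ("In minutes", r"(?<![a-z])(?:mins?|minutes?)(?![a-z])"),
    ("In hours", r"(?<![a-z])(?:hrs?|hours?)(?![a-z])"),
]

def duration_category(raw):
    """Map a free-text duration to one of the three unit buckets."""
    text = raw.lower()
    for label, pattern in _UNIT_PATTERNS:
        if re.search(pattern, text):
            return label
    return None  # unit not recognized
```

For example, extract_color("Bright orange light moving east") gives "orange", and duration_category("about 5 minutes") gives "In minutes". A summary that mentions several colors yields only the one that comes first in the list, which is a simplification.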
States and Shapes Summary
UFO incidents by states
UFO incidents by Shapes
State Population Vs UFO Reporting
The population of a state has a very high impact on the number of UFO incidents reported.
Population of US states taken from https://www.census.gov.
Population and UFO incidents are mapped spatially in the maps below using ggplot.
WA and NV slightly contradict this claim: they show a low number of reports.
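One way to put a number on the population claim is a correlation coefficient between state population and report count, with a per-capita rate to flag states that deviate from the trend. The sketch below is an assumption about how this could be checked, not the method behind the maps. The function names are hypothetical and the figures are made-up placeholders, not census or NUFORC values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def reports_per_100k(reports, population):
    """Report count normalized per 100,000 residents."""
    return 100_000 * reports / population

# Placeholder values for four unnamed states (illustrative only).
population = [1_000_000, 5_000_000, 10_000_000, 20_000_000]
reports = [300, 1_200, 2_600, 5_100]

r = pearson(population, reports)
rates = [reports_per_100k(c, p) for c, p in zip(reports, population)]
```

A value of r close to 1 would support the claim, while states whose per-capita rate differs sharply from the rest are the ones that contradict it.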
Chances of Future UFO
Fitting the time series to a standard probability distribution.
The time series was observed to closely follow an Exponential distribution.
By estimating the parameter of the Exponential distribution, we computed the chances of a UFO appearing in a state in, say, the next 24 hours.
Based on past occurrences, CA, TX, FL, and AZ have the highest chances of a UFO appearing in the next 24 hours: 96%.
The rest of the states together have just a 4% chance of a UFO appearing in the next 24 hours.
This closely follows the Pareto principle (the 80-20 rule): a large number of reports come from a small fraction of states.
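The quantity being modeled here is the time between consecutive incidents in a state. As a minimal sketch, assuming timestamps are available as datetime objects, the gaps and the maximum-likelihood rate of an Exponential distribution (one over the mean gap) can be computed as follows. The function names and the three timestamps are invented for illustration.

```python
from datetime import datetime

def gaps_in_hours(timestamps):
    """Hours between consecutive incidents, after sorting by time."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds() / 3600.0
            for earlier, later in zip(ordered, ordered[1:])]

def exponential_rate(gaps):
    """Maximum-likelihood rate of an Exponential distribution: 1 / mean gap."""
    return len(gaps) / sum(gaps)

# Made-up incident times for one state (illustrative only).
incidents = [
    datetime(2014, 1, 3, 8, 0),
    datetime(2014, 1, 1, 20, 0),
    datetime(2014, 1, 2, 8, 0),
]

gaps = gaps_in_hours(incidents)   # gaps of 12 and 24 hours
rate = exponential_rate(gaps)     # 2 gaps / 36 hours, per hour
```

With gaps measured in hours, the rate is in incidents per hour, so a 24-hour probability uses rate * 24 in the exponent.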
Time Series Analysis – Case Study of CA
fitdist with a normal distribution yields a log-likelihood of -42482, and fitdist with an Exponential
distribution yields a log-likelihood of -35004. The larger the better; therefore, the Exponential
distribution is the relatively better fit for the data considered.
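The comparison above rests on the maximized log-likelihood of each candidate distribution. The figures in the text come from fitdist. The Python sketch below is only an illustration of the same idea using the closed-form maximum-likelihood estimates, with hypothetical function names and a made-up, right-skewed sample of gaps.

```python
import math

def exp_loglik(gaps):
    """Maximized log-likelihood of an Exponential fit (rate = 1 / mean)."""
    n = len(gaps)
    total = sum(gaps)
    rate = n / total
    return n * math.log(rate) - rate * total

def normal_loglik(gaps):
    """Maximized log-likelihood of a normal fit (MLE mean and variance)."""
    n = len(gaps)
    mean = sum(gaps) / n
    var = sum((g - mean) ** 2 for g in gaps) / n  # MLE divides by n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# Made-up gaps in hours, skewed to the right (illustrative only).
sample = [1.0, 2.0, 3.0, 5.0, 8.0, 40.0]

better = "exponential" if exp_loglik(sample) > normal_loglik(sample) else "normal"
```

On this toy sample the Exponential fit has the larger log-likelihood. Both values are negative, so the larger one is the one closer to zero.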
The probability of an event occurring in the next 24 hours is given by the cumulative distribution function of the Exponential distribution:
P(X <= 24) = 1 - e^(-rate * 24)
For the State - CA
Fitting of the distribution ' exp ' by maximum likelihood
Parameters :
       estimate   Std. Error
rate 0.06340195 0.0006567898
AIC: 70010.88 BIC: 70018.02
P(X <= 24) = 0.781648 (about a 78% probability that an incident occurs in the next 24 hours in CA)
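As a quick check, the reported figures for CA can be reproduced from the fitted rate. The sketch below plugs the rate into the cumulative distribution function and also recovers the log-likelihood from the AIC using the standard definition AIC = 2k - 2 * log-likelihood, with k = 1 parameter. The variable names are illustrative, and the rate is taken to be per hour because the formula multiplies it by 24.

```python
import math

rate = 0.06340195   # fitted rate for CA, per hour
aic = 70010.88      # AIC reported for the Exponential fit
k = 1               # the Exponential distribution has one parameter

# Probability of at least one incident within the next 24 hours.
p_24h = 1.0 - math.exp(-rate * 24)

# AIC = 2k - 2 * loglik, solved for the log-likelihood.
loglik = (2 * k - aic) / 2

# Mean time between incidents implied by the rate.
mean_gap_hours = 1.0 / rate

print(round(p_24h, 4))            # 0.7816
print(round(loglik, 2))           # -35004.44
print(round(mean_gap_hours, 1))   # 15.8
```

The recovered log-likelihood agrees with the -35004 quoted for the Exponential fit, and the implied mean gap is just under 16 hours between reports.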
Fitting Distribution –