Bilikis Osomo

The UFO Dataset

A collection of UFO incident reports recorded across the US and Canada. Each incident is recorded with the following information:

• Date and time
• City and state
• Shape of object
• Duration of incident
• Summary of incident

Over 93,000 incidents have been reported since 1998.

The data set is available from www.nuforc.org.

Problem Statement

Is there a predictable pattern to the incident occurrence based on time, location, and duration?

Is there a correlation between the color and shape of the object observed?

Through text mining, which words do witnesses use most frequently to describe a UFO?

By studying the distribution of the time between UFO occurrences, can we predict the chances of the next UFO appearance?

Does a state's population have anything to do with the chances of a UFO being reported?

Associate the different parameters to bring out meaningful insights and patterns.

Challenges with the UFO Dataset

Inconsistent date and time formats: The time of the incident was not recorded for about 2600 incidents. We imputed the missing values in these cases.

Inconsistent place information: Around 6500 records had no mention of the state. These records were included in analyses that did not use the state parameter and excluded from all others.

Missing shape information: Around 1500 records were missing the shape of the UFO. We imputed the missing value by setting the shape to “Unknown” in these cases.

Inconsistent durations: The duration was not recorded in a uniform format. We categorized each value and preserved the duration detail as “In seconds”, “In minutes”, or “In hours”.

Extracting from summary: We applied a string search to the summary column to identify the color of the UFO by matching the text against a list of known colors. A new “color” column was created from the search results. Records with no mention of a color are set to “Unknown”.
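The duration categorization and the color search described above could be sketched roughly as follows. This is a Python illustration, not the authors' own code: the color list, the unit abbreviations, the function names, and the “Unknown” fallback for unparseable durations are all assumptions made for the example.

```python
import re

# Hypothetical color list; the original does not say which colors were matched.
KNOWN_COLORS = ["red", "orange", "yellow", "green", "blue", "white", "silver", "black"]


def extract_color(summary):
    """Return the first known color mentioned in the summary, else "Unknown"."""
    words = re.findall(r"[a-z]+", (summary or "").lower())
    for word in words:
        if word in KNOWN_COLORS:
            return word
    return "Unknown"


def duration_category(duration):
    """Map a free-text duration to one of the three unit categories."""
    text = (duration or "").lower()
    # Units are checked from largest to smallest; the abbreviations are assumed.
    if re.search(r"\b(hours?|hrs?)\b", text):
        return "In hours"
    if re.search(r"\b(minutes?|mins?)\b", text):
        return "In minutes"
    if re.search(r"\b(seconds?|secs?)\b", text):
        return "In seconds"
    return "Unknown"  # assumed fallback; not described in the original


print(extract_color("Bright orange orb hovering over the lake"))
print(duration_category("about 5 minutes"))
```

Matching whole words rather than substrings avoids false hits such as "red" inside "hovered".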

States and Shapes Summary

UFO incidents by states

UFO incidents by Shapes

State Population Vs UFO Reporting

The population of a state is strongly associated with the number of UFOs reported there.

Population of US states taken from https://www.census.gov.

Population and UFO incidents are mapped spatially in the maps below using ggplot.

WA and NV slightly contradict this claim: they show a low number of reports.
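One way to quantify the population relationship, and to spot states that deviate from it, is to compute a correlation coefficient and a per-capita reporting rate. The sketch below is a Python illustration with hypothetical function names. The numbers in the usage lines are placeholders, not values from the dataset or the census.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def reports_per_100k(reports, population):
    """Number of reports per 100,000 residents."""
    return 100_000 * reports / population


# Placeholder values for illustration only (not real populations or counts).
populations = [1_000_000, 2_000_000, 4_000_000]
report_counts = [50, 110, 190]

print(pearson(populations, report_counts))
print([reports_per_100k(r, p) for r, p in zip(report_counts, populations)])
```

A state whose per-capita rate sits well below the others would be a candidate exception of the kind noted for WA and NV.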

Chances of Future UFO Appearance

We fit the times between incidents to a standard probability distribution.

We observed that these times closely follow an exponential distribution.

By estimating the parameter of the exponential distribution, we computed the chances of a UFO appearing in a state in, say, the next 24 hours.

Based on past occurrences, CA, TX, FL, and AZ have the highest chances of a UFO appearing in the next 24 hours: a 96% chance.

The rest of the states together have just a 4% chance of a UFO appearing in the next 24 hours.

This closely follows the Pareto principle (the 80-20 rule): a large share of the reports comes from a small fraction of states.
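The estimation step above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the authors' R code: the function names are hypothetical, the timestamps are made up, and the rate is taken to be the maximum-likelihood estimate for an exponential distribution, which is the reciprocal of the mean gap.

```python
from datetime import datetime
from math import exp


def interarrival_hours(timestamps):
    """Hours between consecutive incidents, after sorting by time."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 3600 for a, b in zip(ordered, ordered[1:])]


def exponential_rate(gaps):
    """Maximum-likelihood rate of an exponential distribution: 1 / mean gap."""
    return len(gaps) / sum(gaps)


def prob_within(rate, hours):
    """P(X <= hours) for an exponential distribution with the given rate."""
    return 1 - exp(-rate * hours)


# Made-up timestamps for one state, for illustration only.
incidents = [
    datetime(2014, 1, 1, 0, 0),
    datetime(2014, 1, 1, 12, 0),
    datetime(2014, 1, 2, 12, 0),
]

gaps = interarrival_hours(incidents)
rate = exponential_rate(gaps)
print(gaps, rate, prob_within(rate, 24))
```

Running the same steps separately for each state would give one rate, and so one 24-hour probability, per state.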

Time Series Analysis – Case Study of CA

fitdist with a normal distribution yields a log-likelihood of -42482, and fitdist with an exponential distribution yields a log-likelihood of -35004. The larger the log-likelihood the better, so the exponential distribution is the better fit of the two for the data considered.
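The comparison above rests on the maximized log-likelihood of each candidate distribution. The sketch below shows how those quantities can be computed by hand in Python, using the closed-form maximum-likelihood estimates. It is an illustration rather than a reimplementation of fitdist: the function names are hypothetical and the sample gaps are made up.

```python
from math import log, pi


def exponential_loglik(xs):
    """Log-likelihood of an exponential fit at its MLE rate (1 / mean)."""
    n = len(xs)
    rate = n / sum(xs)
    return n * log(rate) - rate * sum(xs)


def normal_loglik(xs):
    """Log-likelihood of a normal fit at its MLE mean and variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # MLE variance divides by n
    return -0.5 * n * (log(2 * pi * var) + 1)


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik


# Made-up, right-skewed gaps in hours, for illustration only.
sample_gaps = [1, 1, 1, 1, 2, 2, 3, 5, 9, 20]

print(exponential_loglik(sample_gaps), normal_loglik(sample_gaps))
print(aic(exponential_loglik(sample_gaps), 1), aic(normal_loglik(sample_gaps), 2))
```

Because the two distributions have different numbers of parameters (one for the exponential, two for the normal), comparing AIC values, where smaller is better, is a slightly fairer check than comparing raw log-likelihoods.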

The probability of an event occurring in the next 24 hours is given by the cumulative distribution function of the exponential distribution:

P(X <= 24) = 1 - e^(-rate * 24)

For the state of CA:

> summary(fitexp)
Fitting of the distribution ' exp ' by maximum likelihood
Parameters :
       estimate   Std. Error
rate 0.06340195 0.0006567898
Loglikelihood: -35004.44   AIC: 70010.88   BIC: 70018.02

> 1-(exp(-0.06340195*24))

0.781648 (about a 78% probability that an incident occurs in the next 24 hours in CA)
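The CA figures can be checked, and given a rough uncertainty range, with a short Python sketch. The rate and standard error are the values reported in the fit summary, with the rate taken as per hour because the formula multiplies it by 24 hours. The 95% range uses a normal approximation (rate ± 1.96 standard errors), which is an assumption added here and not part of the original analysis.

```python
from math import exp

RATE = 0.06340195         # estimated rate per hour, from the fit summary
STD_ERROR = 0.0006567898  # standard error of the rate, from the fit summary


def prob_within(rate, hours):
    """P(X <= hours) for an exponential distribution with the given rate."""
    return 1 - exp(-rate * hours)


p = prob_within(RATE, 24)
mean_gap = 1 / RATE  # expected hours between incidents

# Approximate 95% range, assuming the rate estimate is roughly normal.
low = prob_within(RATE - 1.96 * STD_ERROR, 24)
high = prob_within(RATE + 1.96 * STD_ERROR, 24)

print(f"P(next 24 h) = {p:.6f}, mean gap = {mean_gap:.2f} h")
print(f"approximate 95% range: {low:.4f} to {high:.4f}")
```

The mean gap of 1/rate gives a second way to read the result: on average, one CA report roughly every 16 hours.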

Fitting Distribution –