% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/near_strings.R
\name{near_strings1}
\alias{near_strings1}
\title{Strings of Near Repeats}
\usage{
near_strings1(dat, id, x, y, tim, DistThresh, TimeThresh)
}
\arguments{
\item{dat}{data frame}
\item{id}{string for id variable in data frame (should be unique)}
\item{x}{string for variable that has the x coordinates}
\item{y}{string for variable that has the y coordinates}
\item{tim}{string for variable that has the time stamp (should be numeric or datetime)}
\item{DistThresh}{scalar for distance threshold (in whatever units x/y are in)}
\item{TimeThresh}{scalar for time threshold (in whatever units tim is in)}
}
\value{
A data frame that contains the ids as row.names, and two columns:
\itemize{
\item \code{CompId}, a unique identifier that lets you collapse original cases together
\item \code{CompNum}, the number of linked cases inside of a component
}
}
\description{
Identifies cases that are near each other in space and time
}
\details{
This function returns strings of cases that are near each other in space and time. It is useful for near-repeat analysis, or for
identifying potentially duplicate cases. This particular function is memory safe, but it uses loops and runs in
approximately \eqn{O(n^2)} time (more specifically, it makes \code{choose(n,2)} comparisons). In tests
\href{https://andrewpwheeler.com/2017/04/12/identifying-near-repeat-crime-strings-in-r-or-python/}{on my machine},
5k rows take only ~10 seconds, but ~100k rows take around 12 minutes with this code.
}
\examples{
# Simplified example showing two clusters
s <- c(0,0,0,4,4)
ccheck <- c(1,1,1,2,2)
dat <- data.frame(x=1:5,y=0,
ti=s,
id=1:5)
res1 <- near_strings1(dat,'id','x','y','ti',2,1)
print(res1)
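# Illustrative sketch only (not the package's implementation): a brute-force
# version of the pairwise linking described in Details. The helper name
# link_sketch and the use of <= for both thresholds are assumptions made
# for this example.
link_sketch <- function(x, y, tim, DistThresh, TimeThresh) {
  n <- length(x)
  comp <- seq_len(n)  # every case starts in its own component
  if (n < 2) return(comp)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {  # visits each of the choose(n,2) pairs once
      near_space <- sqrt((x[i] - x[j])^2 + (y[i] - y[j])^2) <= DistThresh
      near_time <- abs(as.numeric(tim[i]) - as.numeric(tim[j])) <= TimeThresh
      if (near_space && near_time && comp[i] != comp[j]) {
        comp[comp == comp[j]] <- comp[i]  # merge the two components
      }
    }
  }
  match(comp, unique(comp))  # relabel components as 1, 2, ...
}
toy <- data.frame(x = 1:5, y = 0, ti = c(0, 0, 0, 4, 4))
link_sketch(toy$x, toy$y, toy$ti, DistThresh = 2, TimeThresh = 1)
# returns 1 1 1 2 2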
# The full nyc_shoot data takes ~40 seconds with this function; this example uses Manhattan only
library(sp)
data(nyc_shoot)
nyc_shoot$id <- 1:nrow(nyc_shoot) #incident ID can have dups
mh <- nyc_shoot[nyc_shoot$BORO == 'MANHATTAN',]
print(Sys.time())
res <- near_strings1(mh@data,id='id',x='X_COORD_CD',y='Y_COORD_CD',
tim='OCCUR_DATE',DistThresh=1500,TimeThresh=3)
print(Sys.time()) #3k shootings takes only ~1 second on my machine
}
\references{
Wheeler, A. P., Riddell, J. R., & Haberman, C. P. (2021). Breaking the chain: How arrests reduce the probability of near repeat crimes. \emph{Criminal Justice Review}, 46(2), 236-258.
}
\seealso{
\code{\link[=near_strings2]{near_strings2()}}, which uses k-d trees and so should be faster with larger data frames, although it may still run out of memory and is not 100\% guaranteed to return all nearby strings.
}