@andrewsmhay
Created February 16, 2014 19:02
I've got host scan data that I've imported into MongoDB in the following format:
{
"_id" : ObjectId("52fd928c62c9815b36f66e68"),
"date" : "1/1/2014",
"scanner" : "123.9.74.172",
"csp" : "aws",
"ip" : "126.34.44.38",
"port" : 445,
"latt" : 35.685,
"long" : 139.7514,
"country" : "Japan",
"continent" : "AS",
"region" : 40,
"city" : "Tokyo"
}
{
"_id" : ObjectId("52fd928c62c9815b36f66e69"),
"date" : "1/1/2014",
"scanner" : "119.9.74.172",
"csp" : "aws",
"ip" : "251.252.216.196",
"port" : 135,
"latt" : -33.86150000000001,
"long" : 151.20549999999997,
"country" : "Australia",
"continent" : "OC",
"region" : 2,
"city" : "Sydney"
}
{
"_id" : ObjectId("52fd928c62c9815b36f66e6a"),
"date" : "1/1/2014",
"scanner" : "143.9.74.172",
"csp" : "aws",
"ip" : "154.248.219.132",
"port" : 139,
"latt" : 35.685,
"long" : 139.7514,
"country" : "Japan",
"continent" : "AS",
"region" : 40,
"city" : "Tokyo"
}
Since I'm new to MongoDB, I've been looking at the aggregation framework and MapReduce to figure out how to write some queries. However, I can't for the life of me figure out how to do something as simple as:
1) Count the distinct "ip" addresses with "port" 445 with a "date" of "1/1/2014"
2) Return the "ip" address with the most open "ports", by "date"
3) Count the distinct "ip" addresses, by "csp", for every "date" in January
Any help would be greatly appreciated. I've been reading and reading but the queries keep exceeding the 16MB limit. As you can see below, I have a lot of entries:
{
"ns" : "brisket.my_models",
"count" : 117715343,
"size" : 25590813424,
"avgObjSize" : 217.3957342502073,
"storageSize" : 29410230112,
"numExtents" : 33,
"nindexes" : 1,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 3819900784,
"indexSizes" : {
"_id_" : 3819900784
},
"ok" : 1
}
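The stats above list only the default `_id_` index (`nindexes` is 1), so any filter on `date` or `port` has to scan all 117,715,343 documents. A minimal sketch of adding a compound index with pymongo follows. It assumes a collection handle `col` obtained elsewhere; the helper name `ensure_scan_index` is hypothetical, and the field order is an assumption that suits queries filtering on `date` and `port` and then reading `ip`.

```python
def ensure_scan_index(col):
    """Create a compound index on date, port and ip.

    `col` is assumed to be a pymongo collection (for example
    client["brisket"]["my_models"]). Returns the index name that
    create_index() reports.
    """
    # 1 means ascending order (the same value as pymongo.ASCENDING).
    return col.create_index([("date", 1), ("port", 1), ("ip", 1)])
```

Building an index over a collection of this size takes a while, so this is probably best run once, outside of any query script.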
@kf0jvt

kf0jvt commented Feb 17, 2014

Map reduce will probably get you what you want. But I've been using pymongo as a way to avoid learning map reduce.

import pymongo

SERVER = 'localhost'
DATABASE = 'kevin'
COLLECTION = 'what'

# MongoClient replaces the older pymongo.Connection class.
client = pymongo.MongoClient(SERVER)
db = client[DATABASE]
col = db[COLLECTION]

# Keep only the ip and port of records seen on port 445 on 1/1/2014.
# In PyMongo 3+, aggregate() returns a cursor that can be iterated directly.
info = col.aggregate([
    {'$match': {'port': 445, 'date': '1/1/2014'}},
    {'$project': {'ip': 1, '_id': 0, 'port': 1}},
])

ips_and_ports = {}  # maps each ip address to the set of unique ports seen for it
for record in info:
    ips_and_ports.setdefault(record['ip'], set()).add(record['port'])
unique_ips = set(ips_and_ports)  # unique ip addresses seen on 1/1/2014 and port 445
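
With 117 million documents, it may be more practical to do the deduplication on the server instead of in a Python loop. Below is a rough sketch of aggregation pipelines for the three questions in the gist. It assumes MongoDB 2.6 or later and PyMongo 3 or later, where `aggregate()` returns a cursor (so the 16MB limit applies per result document rather than to the whole result) and accepts `allowDiskUse`. The function names are hypothetical, and the regular expression assumes dates are stored as "M/D/YYYY" strings as in the sample documents.

```python
def distinct_ip_count_pipeline(port, date):
    """Question 1: count the distinct ips seen on one port and one date."""
    return [
        {"$match": {"port": port, "date": date}},
        {"$group": {"_id": "$ip"}},                       # one doc per distinct ip
        {"$group": {"_id": None, "count": {"$sum": 1}}},  # count those docs
    ]


def most_open_ports_pipeline():
    """Question 2: for each date, the ip with the most distinct ports."""
    return [
        # Collapse duplicate (date, ip, port) rows first.
        {"$group": {"_id": {"date": "$date", "ip": "$ip", "port": "$port"}}},
        # Count distinct ports per (date, ip).
        {"$group": {"_id": {"date": "$_id.date", "ip": "$_id.ip"},
                    "num_ports": {"$sum": 1}}},
        {"$sort": {"num_ports": -1}},
        # After the sort, the first doc in each date group has the most ports.
        # Ties are broken arbitrarily.
        {"$group": {"_id": "$_id.date",
                    "ip": {"$first": "$_id.ip"},
                    "num_ports": {"$first": "$num_ports"}}},
    ]


def january_pattern(year):
    """Regex for 'M/D/YYYY' date strings in January of the given year."""
    return r"^1/[0-9]{1,2}/%d$" % year


def distinct_ips_by_csp_pipeline(year):
    """Question 3: count distinct ips per csp for every date in January."""
    return [
        {"$match": {"date": {"$regex": january_pattern(year)}}},
        {"$group": {"_id": {"date": "$date", "csp": "$csp", "ip": "$ip"}}},
        {"$group": {"_id": {"date": "$_id.date", "csp": "$_id.csp"},
                    "distinct_ips": {"$sum": 1}}},
    ]


def run(col, pipeline):
    """Run a pipeline on a pymongo collection and return the results as a list."""
    # allowDiskUse lets $group and $sort spill to temporary files
    # instead of failing when a stage exceeds its memory limit.
    return list(col.aggregate(pipeline, allowDiskUse=True))
```

For example, `run(col, distinct_ip_count_pipeline(445, '1/1/2014'))` should return a single document whose `count` field is the number of distinct ips, or an empty list if nothing matches. Without an index on `date` and `port`, the `$match` stage still has to scan the whole collection.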
