@gabrielgrant
Forked from anonymous/munge.py
Last active August 29, 2015 14:19

Revisions

  1. gabrielgrant revised this gist Apr 25, 2015. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion munge.py
    Original file line number Diff line number Diff line change
    @@ -3,7 +3,7 @@
    Assumes you've already downloaded the raw data by running:
    wget -O - ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/ratings.list.gz | gunzip > ratings.list
    -Details: http://www.imdb.com/interfaces
    +See: http://www.imdb.com/interfaces
    """

    import pandas as pd
  2. gabrielgrant revised this gist Apr 25, 2015. 1 changed file with 1 addition and 0 deletions.
    1 change: 1 addition & 0 deletions munge.py
    @@ -3,6 +3,7 @@
    Assumes you've already downloaded the raw data by running:
    wget -O - ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/ratings.list.gz | gunzip > ratings.list
    +Details: http://www.imdb.com/interfaces
    """

    import pandas as pd
  3. gabrielgrant revised this gist Apr 25, 2015. 1 changed file with 7 additions and 0 deletions.
    7 changes: 7 additions & 0 deletions munge.py
    @@ -1,3 +1,10 @@
    +""" Loads IMDB's Ratings data into Pandas
    +Assumes you've already downloaded the raw data by running:
    +wget -O - ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/ratings.list.gz | gunzip > ratings.list
    +"""

    import pandas as pd

    # First, get a clean version of just the ratings data
  4. gabrielgrant revised this gist Apr 25, 2015. 1 changed file with 0 additions and 1 deletion.
    1 change: 0 additions & 1 deletion munge.py
    @@ -1,4 +1,3 @@
    -
    import pandas as pd

    # First, get a clean version of just the ratings data
  5. gabrielgrant revised this gist Apr 25, 2015. 1 changed file with 5 additions and 4 deletions.
    9 changes: 5 additions & 4 deletions munge.py
    @@ -1,3 +1,6 @@
    +
    +import pandas as pd
    +
    # First, get a clean version of just the ratings data

    ratings = open('ratings.list').read()
    @@ -6,12 +9,10 @@
    open('ratings.clean.list', 'w').write(ratings)

    # Now play
    -import pandas as pd
    titles, rating_data = ratings.split('\n', 1)
    titles = titles.split()
    rating_data_lines = rating_data.splitlines()
    +# split the lines on whitespace, but not with str.split(), because we need to preserve leading spaces
    rating_data_split = [re.split(r"\s+", l, maxsplit=len(titles)-1) for l in rating_data_lines]

    -ratings = pd.DataFrame(rating_data_split, columns=titles).convert_objects(convert_numeric=True)
    -
    -ratings = pd.read_csv('ratings.clean.list', delimiter=r"\s\s+")
    +ratings = pd.DataFrame(rating_data_split, columns=titles).convert_objects(convert_numeric=True)
  6. @invalid-email-address Anonymous created this gist Apr 25, 2015.
    17 changes: 17 additions & 0 deletions munge.py
    @@ -0,0 +1,17 @@
    # First, get a clean version of just the ratings data

    ratings = open('ratings.list').read()
    _, ratings = ratings.split('MOVIE RATINGS REPORT\n\n')
    ratings, _ = ratings.split('\n\n------------------------------------------------------------------------------')
    open('ratings.clean.list', 'w').write(ratings)

    # Now play
    import pandas as pd
    titles, rating_data = ratings.split('\n', 1)
    titles = titles.split()
    rating_data_lines = rating_data.splitlines()
    rating_data_split = [re.split(r"\s+", l, maxsplit=len(titles)-1) for l in rating_data_lines]

    ratings = pd.DataFrame(rating_data_split, columns=titles).convert_objects(convert_numeric=True)

    ratings = pd.read_csv('ratings.clean.list', delimiter=r"\s\s+")
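
For reference, here is a minimal sketch of how the steps in munge.py might be written against a current pandas release. It is not the author's code. The gist calls re.split() without importing re, and convert_objects() is not available in recent pandas versions, so this sketch adds the import and swaps in pd.to_numeric(). The function name parse_ratings, the two marker constants, and the shortened ten-dash end marker are assumptions made for illustration. The column names "Votes" and "Rank" are also assumed from the usual ratings.list header row.

```python
import re

import pandas as pd

# Assumed markers, modeled on the strings the gist splits on.
START_MARKER = "MOVIE RATINGS REPORT\n\n"
END_MARKER = "\n\n" + "-" * 10  # a blank line followed by a dashed rule


def parse_ratings(raw_text):
    """Parse the ratings report section of raw_text into a DataFrame."""
    # Keep only the text between the report heading and the dashed rule.
    _, found, body = raw_text.partition(START_MARKER)
    if not found:
        raise ValueError("report heading not found")
    body, _, _ = body.partition(END_MARKER)

    # The first line of the section holds the column names.
    header, _, data = body.partition("\n")
    columns = header.split()

    # re.split keeps the empty leading field that str.split() would drop,
    # and maxsplit keeps titles containing spaces in one piece.
    rows = [
        re.split(r"\s+", line, maxsplit=len(columns) - 1)
        for line in data.splitlines()
        if line.strip()
    ]
    frame = pd.DataFrame(rows, columns=columns)

    # Convert only the columns assumed to be numeric; the distribution
    # codes stay as text so their leading zeros survive.
    for name in ("Votes", "Rank"):
        frame[name] = pd.to_numeric(frame[name], errors="coerce")
    return frame
```

It could be called as parse_ratings(open('ratings.list').read()). Unlike convert_objects(convert_numeric=True), this version converts columns selectively, which is a design choice of the sketch rather than something the gist specifies.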