prabhaavp/datamining_881

Prabhaav Pillai 47dd51af4a Clean up folder structure, remove unused data, add images for frontend use (maybe - keeping options open

2026-04-04 01:10:57 -04:00

Revisions to Zim parsing, netflix parsing, and updates to html scraping to include synopsis

2026-03-19 01:56:14 -04:00

Updated page.tsx

2026-04-01 13:02:29 -04:00

.gitignore

Clean up folder structure, remove unused data, add images for frontend use (maybe - keeping options open

2026-04-04 01:10:57 -04:00

movies.json

Frontend changes

2026-03-26 12:35:58 -04:00

README.md

Clean up folder structure, remove unused data, add images for frontend use (maybe - keeping options open

2026-04-04 01:10:57 -04:00

requirements.txt

- Html -> TSV

2026-03-12 12:14:31 -04:00

README.md

CSE 881: Data Mining - Course Project on Gitea

Setup

Note: If you're using an IDE like PyCharm, steps 1-2 may be done automatically.

Create a virtual environment:

python -m venv venv

Activate the virtual environment:

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Transfer datasets: the raw data is too large to include in the repository. Download the dataset from:

https://developer.imdb.com/non-commercial-datasets/

Unzip the .tsv files and place them in:

data/raw/imdb_dataset
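The unzip step can also be scripted. The following is a minimal sketch, assuming the IMDb downloads arrive as gzip-compressed `.tsv.gz` files sitting together in one folder; the function name `decompress_tsv_archives` and the `download_dir` argument are hypothetical and not part of this repository.

```python
import gzip
import shutil
from pathlib import Path


def decompress_tsv_archives(download_dir, target_dir="data/raw/imdb_dataset"):
    """Decompress every *.tsv.gz in download_dir into target_dir.

    Hypothetical helper: returns the list of .tsv paths that were written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    written = []
    for archive in sorted(Path(download_dir).glob("*.tsv.gz")):
        # title.basics.tsv.gz -> title.basics.tsv
        out_path = target / archive.name[: -len(".gz")]
        # Stream the data so large files are never fully loaded into memory.
        with gzip.open(archive, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

For example, `decompress_tsv_archives("Downloads")` would write the decompressed files to `data/raw/imdb_dataset`, assuming the archives were saved to a local `Downloads` folder.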

Files to Place:

name.basics.tsv
title.akas.tsv
title.basics.tsv
title.crew.tsv
title.episode.tsv
title.principals.tsv
title.ratings.tsv
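Before running the pipeline, it may help to confirm that all seven files are in place. This is a minimal sketch that is not part of the repository; the helper name `missing_imdb_files` is hypothetical, and the default path assumes the script is run from the repository root.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "name.basics.tsv",
    "title.akas.tsv",
    "title.basics.tsv",
    "title.crew.tsv",
    "title.episode.tsv",
    "title.principals.tsv",
    "title.ratings.tsv",
]


def missing_imdb_files(dataset_dir="data/raw/imdb_dataset"):
    """Return the expected file names that are not present in dataset_dir."""
    base = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_imdb_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All IMDb files are in place.")
```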

Run the pipeline in order:

Citations:

https://developer.imdb.com/non-commercial-datasets/

Languages

HTML 93.6%

Python 4.1%

TypeScript 1.9%

JavaScript 0.3%

CSS 0.1%