From a1beba67308be61a8ab3f972ac75c4143476fb84 Mon Sep 17 00:00:00 2001 From: IshaAtteri Date: Thu, 12 Mar 2026 12:11:37 -0400 Subject: [PATCH] beatifulsoup extract code --- data/tt0074885.html | 175 ++++++++++++++++++++++++++++++++++++++ data/tt0074888.html | 159 ++++++++++++++++++++++++++++++++++ sample_data.xlsx | Bin 12953 -> 10403 bytes scripts/dataset_create.py | 24 ++++++ scripts/scrape.py | 101 +++++++++++++++++----- updated_data.xlsx | Bin 0 -> 8840 bytes 6 files changed, 437 insertions(+), 22 deletions(-) create mode 100644 data/tt0074885.html create mode 100644 data/tt0074888.html create mode 100644 scripts/dataset_create.py create mode 100644 updated_data.xlsx diff --git a/data/tt0074885.html b/data/tt0074885.html new file mode 100644 index 000000000..ff2bbb463 --- /dev/null +++ b/data/tt0074885.html @@ -0,0 +1,175 @@ + + + + +Mean Johnny Barrows + + + + + + + + + + + + + +
+
+
+
+
+

Mean Johnny Barrows

+
+ +
+
+
+
+
+ +
Mean Johnny Barrows
Film poster by John Solie
Directed byFred Williamson
Written byJolivett Cato
Charles Walker
StarringFred Williamson
Roddy McDowall
Stuart Whitman
Luther Adler
Jenny Sherman
Elliott Gould
Music byColeridge-Taylor Perkinson
Distributed byRamana Productions Inc.
Release date
+
  • January 1976 (1976-01) (U.S.)
+
Running time
75 minutes
CountryUnited States
LanguageEnglish
+

Mean Johnny Barrows is a 1976 American crime drama film directed by and starring Fred Williamson. Stuart Whitman, Luther Adler, Jenny Sherman, and Roddy McDowall also star.[1] +

+ +

Plot

+

Johnny Barrows (played by Fred "The Hammer" Williamson), a winner of the Silver Star, is dishonorably discharged from the army for punching out his captain. Shipped back stateside, Johnny is promptly mugged and then hauled in by racist cops who believe him to be drunk. Unable to secure gainful employment, Johnny finds himself down on his luck and on the soup line (with a cameo from "Special Guest Star" Elliott Gould). +

Walking into an Italian restaurant hoping for a handout, he is offered a job as a killer by mafioso Mario Racconi (Stuart Whitman) and his girlfriend Nancy (Jenny Sherman), but Johnny turns the offer down. He has not slipped so far as to start doing odd jobs for the Mob. Eventually, Johnny lands a job at a gas station cleaning toilets and scrubbing floors for the mean, penny-pinching Richard (R.G. Armstrong), who receives a beating for ripping off Barrows. +

Meanwhile, a Mafia war starts brewing between the Racconi family and the Da Vincis (the family, not the painter). It seems the Da Vinci family wants to bring in all kinds of dope and start peddling it to black and Hispanic kids. The Racconis, being an upstanding Mob family, want no part of that on their streets. And so it goes: the Racconi family is wiped out in a treacherous double-cross, leaving only Mario standing. +

Nancy is kidnapped by the Da Vinci family and gets a message to Johnny claiming that she was made to do "terrible things". Brought to the brink by poverty, by The Man constantly screwing him over, and by his love for Nancy, Johnny agrees to become a hired killer for Mario to avenge the Racconis. And so the body count starts going up as Johnny, in all his white-suited glory, gets mean and starts killing his way through the Da Vinci family. +

+

Cast

+ +

Additional notes

+

The film's structure had been used a year earlier in The Farmer (shot in 1975 but released in 1977). +

+

References

+
+
    +
  1. ^ "Mean Johnny Barrows". afi.com. Retrieved 2024-02-02. +
  2. +
+
+ + +


+

+
+
+
+
+
+
+
+ + \ No newline at end of file diff --git a/data/tt0074888.html b/data/tt0074888.html new file mode 100644 index 000000000..c520e2dde --- /dev/null +++ b/data/tt0074888.html @@ -0,0 +1,159 @@ + + + + +The Best Way to Walk + + + + + + + + + + + + + +
+
+
+
+
+

The Best Way to Walk

+
+ +
+
+
+
+
The Best Way to Walk
Theatrical release poster
Directed byClaude Miller
Written byLuc Béraud
Claude Miller
Produced byMag Bodard
Jean-François Davy
StarringPatrick Dewaere
Patrick Bouchitey
Christine Pascal
Claude Piéplu
CinematographyBruno Nuytten
Edited byJean-Bernard Bonis
Music byAlain Jomy
Distributed byAMLF
Release dates
+
  • 3 March 1976 (1976-03-03) (France)
  • +
  • 15 January 1978 (1978-01-15) (U.S.)
+
Running time
82 minutes
CountryFrance
LanguageFrench
Box office$13,793[1] (2008 French reissue)
+

The Best Way to Walk (French: La meilleure façon de marcher) is a 1976 French film directed by Claude Miller, his directorial debut. It stars Patrick Dewaere, Patrick Bouchitey, Christine Pascal, Claude Piéplu and Michel Blanc.[2] +

+ +

Plot

+

Marc and Philippe are two teenage counselors at a summer vacation camp in the French countryside in 1960. Marc is very virile, while Philippe is more reserved. One night, Marc surprises Philippe, who is dressed and made up as a woman, and responds by continually humiliating him. Despite their late-adolescent rivalries and sexual confusion, each achieves an awakening. +

+

Awards

+

The film won the César Award for Best Cinematography, and was nominated for Best Film, Best Actor, Best Director, Best Screenplay, Dialogue or Adaptation and Best Sound. +

+

Cast

+ +

References

+
+
    +
  1. ^ "The Best Way to Walk". +
  2. +
  3. ^ "The Best Way to Walk". unifrance.org. Retrieved 2014-03-10. +
  4. +
+
+ + +


+

+
+
+
+
+
+
+
+ + \ No newline at end of file diff --git a/sample_data.xlsx b/sample_data.xlsx index 4d3b3f5a077d25c8a84eaa9cb148216d88580509..a954ef15057a82bd070d884d50c49cbe4ed53cd5 100644 GIT binary patch delta 3802 zcmZ8kc{mho_a6Js5HH!wzOQ4aEMv`(osy-ogkdnY#28EVAw!6)C4&i(WwJ&}mKpmp zga|RV3Pt$(z2EP;-tYU)U(a=&`=4{peeUNzgr%NBL1J%&^*tI-HZ-ExlKzQ9elg@3YqQOuMXO8$A~QFcX=4~(ph z5Kf!9L3I`=<*-PBlOmw}3+~^pFSb#WiEHD2O(f`6!$4jS4vFUPcKzYe77zSum$`dN z-FX)JEa8X{dl7Rn?RFwnoU!(}#QCKR`Y7)BoqNoqh83m<5*{IuNa*u;wpWa3r(PMJ z-tG+J>n@!Rp7fvS#(s;a&vbP-r94^MedgadwWZ>Jqsos|i9fx#b zQT;eW^=e}IZ7tQp82zTa_|Gjp6;R{Bj%UnsSoCV zHAkQOkb6K}LX**zc{BvaIS>>+P$FDv?X!HxZPzioVWpM(8gejMwXEDQO*ld_l~F<< z8Eg1<;XTcvA)#-nmx-(YIn=*%!N1=f9l#i`q4}g?XC-ktQm$s^)Xmq0IyUzO+Uo0U zxt?XQs0DUaJW%o3qhuu6UQaX6}NDBOg*IVQ?$w)N|Ff<5v&}dHKkgyD4rr?p#e?-@($Q*6F z$@08wkMebF%Y2uEcNO;~-+gbju_Pa^BJD|g@CjMqj*PRb0jvP4o_MAMN8xOK{Bhz%w-!w*nnzi=(XV{okPpmVPu5J`8B+co4~sxXUmVc0+Dnr!@fP1bZM_(R(|(%yePi|W(T%=?jll;YRXbsN z7vYY@{=&sK#$`2^^a>%E`tQF9eO#MvRB0!PMUDAVS|8TF(E%{!RGdu}j4G6jsJSYQ z9ZCS;IUQz+SqN!QLKi6lW9lc74ZlkO3|II#ZIZI9Ppz?AN8iXim+AmjEuqf#XUFKB z5ggeNpT{BJcc2-nqoQ~Quajx^u?Zd%)!01cX?&PgtX@QoJjHM6x<6CB7h;q8Q0!-+ z)NhZY^Ti*+YYurC_o$gO&EeQEzUYz)vD(zF^v9t6XGq{Hao)f$1GYG*Xlwe`N4}r2 zYg38pvr7Hn1X*u#7QebdbRFT{F%_*%dHCCY@K%^-?N?m{=n=`4R9O*-@*I(QSt=)G z49XNz&&-CpqW10rInvM^u~a|Vtd!M0-K^2fGe|W!M$Lt4M(kL1q?&9NVz)2kp9udD zC577dFd?UwVNx`eDl^LlY7a!55#lN&u~h-WrQb5qJcNy#T4da~`y|tdhq6_ErRSJ2gD0u>8Gf|96vkD$GX?=H z`|JjKiqcGq8Hs*fK1>ew40qyf4=4P3#ONb^l91cmnE8)tKYUU{9I{=Xcc=8345&Y+ zoof(^zd0cld|s>sE(h!5uc5C2Tq~VzQ+_DkXnZ_R@>sZCxg> z-4UFvTex3z%UHG0^0|ia=4HiN7f*4;6p!bY)T8({L=huqdRsHo-K6?T0pCOFzA`E& zeFX9vQ`INI;R$Y|dGOEH6fn0^8yM=^zyDVw`RZGhw^Xe@Lsg!RkKGmzFq#RL4YhH> zMff;e^XqOns%QFxurfFO&P%r;1MVf6BHp^W`?isW_@CO@?LF0%lyZpHv|jO9bdqf_ zN=8ARXpdrTzxB>NsH3e1tQLxTY5s#vn%3WcV5RwPArr#v{RL^i(8Cmtn7gn3EwIkz zl}BmH!oG2N8drFn2MVFGd%WV3KAmyM6HuS@P+xa3Z$IX)HB9AOHPE0muqSsV#`FHn z)hn7EJY)%(UR<^d$4mVyYI`QX3P|6gHrU8ywtw%yD)1{ElARw-FHQ{ru(JUGEC2u? 
zGC&~|5f$JSDjx|8m~ z=A|unC>s|?mG5QgqOsRAlP?oSgWopdN%?`|QK_IsYvaS-69ZO)w_A&WbcYAvHS5)4 z%`ioWa@*QKcG4+S0HPu$;S(s)1@d04`0eSxH$i zw37|Mx8)2!qEZTc_ECQYAH$BisPb3xK|Uic+mG#xK6ouV!fWGije4i^B2tFN`{Bsb zcSMd{b|Xdo;xz=J7uJ;SLL7o?y3XO&3ccp-M_U>TWS=!z0?rA^=MpzI*~jM!VUhM1dvHQwP9`eL{opVGuidz0M8)JU!v)#DJ zuwZ;nWkkksLyZ>=^_;C#oW=DbGOX|4IYreawq zp9{1h@v#X^$)0u#EII9SwpCcg3Z+KwH;~|4MgAhu0X@G^z1?=q@U<{x{Y1E3QGSB` zzKF6Z1j5+ib8@6nPKwaP$iZAn^|7nIvW6^+qo1jT;twtu@?5cWK$O8rX zC&xJyLIl+>)YW5CHMPVB?lkHV?)ZJYdF0ux@PmKL38O6tJ%#_HAdD0kKLK7OG^;m_ zQVD3Z$8>JE`NrG9`mtCl9oQH6ewan9?XGB zCOFg3V%NpT=K8+2J&i@NXHNE`8FUw{-MPetAVwpzVh6;Uqorx{MNhQys^GmFv2APB zB9EK$y4(YvfhQN{qU_~h=u}Y8q)zsIr?|_KXRK|paClDq1box!>C`e==;Dp88%Nhi z0h=hywq($brL>bw{zjZYccS*+y2z>xQ?}^+Q-}ct03ZUuWGIOt6aN(RR*@5YOm%J; z8I74vVm7o@zuIQMLFIWePfN5#KqK@bW`+8eN35&ygMHI+qia@xyxx;ti*ZcQ(zS6uC!=G8bOI~8M@_#G63sNT-SsL=H9vrN zzQuX!Z$})J_{XqPSHXsdz8^9C?RTUlsVQkDDmX?>n^dI5DUc(OK1E>So@s~STJ&ep zGYV%jX^O2>b5l-K8;%U~PZ1bWOYG8x%s(s(S@tE6HwyF8-{sd-S~Ww|@S@H+IrN~Q z#7R<)4le%ylv+$DR~D%H7@CEatr<#>=NsTlXKPXN?KI-1$qCOW${}~CXV|=tsO}N# zEl+5>S4Sy?M2B0j>_^OA9hpEtxb6d&QQds2%oH3z# zliGz+s!rt)>95D)2(1BuCYoz038zDYy|XQ_amWeJ6{NA7!*N#3l+#wABfptF{b(C_ zKNGjW3S|ibqS)A>`X(-|?|F;elJbH?(~8d8&lU_gHe(9=E=g7ZvjjdC=WYz{EZ#S( z(HYqY)=F(cJRO-07_DlWk?*blCBM63AQ1XHx!|~8s_0H;io|t$Vv~YR4AtLlt&`QJ z`v239tYAa&0&`7Kj_)5^2>|f?;n6=ak4aVpQ=YM54uD)3uriSUe}CV9pzGWiHYFkc zf0haW;QS~1=cUe%2^9DXyJ3i-Wk^Zsl5QA~mJk@aVGxvtAw=nr5{aQ3kxofrMBeXv zvA*}-`RAUy&iU)ywfEX*U$|kuW77&2W;?=Vtq`_}#RDz8ATvNhaLQ`u>T1|mHzF)L zZQnLV`F(XYf}l2E6R>zn)4DDY{UHs0^cP>iSM|~D4$@A!Rsp6!yL)z+c`fkYS zObQ*(736_xU0_IUF%LHd#KIr{d(>-cSr(@AxL`3rnQOJv$2m^550ly^$Ut_9Iy5kg zgDxfiE)?BOGIOR&fc8Z|oJWeh@r#W-c@^%deiHS5QL1DQO9beKITK=x_SpWjFi@!$ z;kV_+%hd6VkNgIU`7C63-VfW2E`Z8fUQA^cg*UK4zPJSb9eTk=gB2!)tHY%+F9)lj zjH5#nJJKp0(tpnS(+cmlT3U*DOgU5D$7+y`4|*7po*guFY2AhkU5VHC5T4*q8*evAamlHQW1H^fZfsZGL|)50|4&G?R@(-l+k!}ra- zJGz2_FOaM>Uda8s2`lHWolah5$5*Ry4aawXZU2;Ge07k+_2<*J6fGzdUj3c)dk6Yy|LxId$T~sG&w`Q|X9Xg9e6ClEzMRpXx<$>Nlf_GVFBOwu 
zqw_a2PtUfWe8K|PNr|AE1&&(pEx-j(`8Q3ub%`$c+DGWXp4VfbGxmX3We1fQzFI&J z^^AQpuc9=o(xIy%U7N9^i)7*rEAqMS$qy?kHpe+<3+0T+!80Gkfh18t=%W9&$U=x> z*Q4kNrJ7B%XfbkClwrS=`gl0MXL0P|*ebN)%7b(marNjtv3t@OaJlb|@nA=&Vrc3u zA1I<8qa*l)i73a@078kQT*jBs0e~O4000dD00{OF3VQA9?&sp@=LGpH65Sn_?{^z10SB1+T%7!xSq?e_kDT z2b#Si}l{BcG7aHF%@WUx{5%uAiR8q0-_TVNtezOt|$xr*WJ z>^6YBN04-AC~wu*yf4%#;I7oOe&5EGK4)Kt=8O?IaKES@Nh^OsWQ`xxyD~-2_AKC| zJO#Q-I81m=gkU9HBIvZ^7>Wvfq7bvTImj~uu%IY$e5C>MsUxpkqsCx+u&k;WnxEs- zB_5mGs-Vo}Hd57FqqXZdUw#ipyD~AD$@=YN(HoUoJO}5KpNXfSI6i&3F5R1Xe5-%Z zw)#%o-&hJXYFWL=h4`3Tzy7f})XNQ3$li=(zc?K#t&h=d)v8EEscW4k8D7o4HN?+p zHk!r48}W&&FTHK8p-@oubXSV9KBS8`&Gz}%WK0B}nw-HX@+pzja zdp0Q!l+S1?FNh#HLp=s5P<7dBy)T|F(_bhQaCuP3>c1^ex;E>ZC+XIc2^#u`W z5#{t<2IO@0`8=|1@&&wq9}PbsAG}^hYH`L1)~k8N#XG17rgrRTu#~E&yqaFsDbyPm zS|Hz^-dXAEDkRH`=evcJ2qCSiRo$p|i{BCx7H|X80M84^f9T0QX z5G;K1R2(o(D8J(&;KrM^3XEkptecLf(EXesM0UYYzEVOA*(4lTKB!@*H;>%a%W~bc`%RhiKD~g35(V3q3lef=yB5jp#rtSkk{X=`C6|2 zY+Yi$|H-dTaU4flzL)oPFqmR$`dbaHE$e$qT{c;Inq!7OZb_`7do(6fqxXGZO$e}U zd(R5=q$b78ARgWQHUju-s&0g{ip7;Fb?=$!DeJ6etZ)}u*1|K2ky)O8&n@|!(L?v9 zKz@Ua*d4Bcox!R74$M%iE{v_G-~(u1)ESEv{c-_Wqi97r7KuQ4(Iy+m-|<_It7s~S zQ<5j+{sS8cRfByO;i|r^+m*R!truY6MRFOQYG`lED64vZP|#|)I>f_G08w~23=4Oe zoJ^8xP4(TU0AofuQ-zxtA=L^krVomdHx0LKJ9`JblJ&e{NT|(IYwzh1ov?|xZnrOQ z;HF1y`x-9sf)2*HbPAQpL;;7HG{O@K5DjEwIC&x&-eFKC%vAMB=2<*Z`MWYzc>KnjTIoNc zO6O0;)6ue4r_g1M#}pi8GQFhfF;!O#40bpiror2}NR1D!9sSFIM>L!te1hTE=J--H zy25ZVv;K#K`Gu>_p1G#S)?H_9xI zE&nE8UdvNQ?RFbQPDNv#_#Q8Du$rs}?RQ)ITlL$`4(Qb;XkvBgoN)cVN%8<$CW*H< zVg20fTkb^g&xf-g?)x(j4EiQlU$N9Lv34>C2TjDkOKhDJM`WzK&Ryu1V>W$R(uh3x z4xBn2j#xY?Dop*4pVNI(Hh! 
zu6&h3u7_KQyMKZG^_mF)D@UimGY}UqmikAO7%{{2>FJUL@i=i;GMWSRD7oHu`HGE9FC4bq;4YdS^Vusi@FsT8$KEBA?bRtqbWIhQzW zxk=QCoX^(6zIQtNUJpWExeO|}IKSf}c0!TcTYfTHJqo^jm|xz2>N|nqBw{K4Emw)IqwA-_2|Lu4h4rlFoaAUNXsv!{f^Z5417W(A9 z!R^Uq{Kt}7UPfOdb( zw037~d3i7)A)&h3cezyh845*Z&BwO5P%~3?EW|h(7s4o1asioZPfUEw?K-i2+Ui$k z+)uU+^t0luIsloQ=chBVuwoBzgH_{aO>)j8(XwR>W|Xdh5HF5Jm?2qbKbGQu=nWLH34N63su)1 z)&3cBD(s5AzJ&G|b2Y$A*3EVk?$F-q(-K3=(33imU5`XAJDREbML;x}uvNgL*(S!@ zI)JBwX3?)CT8Xc4YqjuzDjX+pbWik$G{8&XD)Wygpgl6Dx zx`Rxq@;Ux6SMI$9Q;@h%GZ`M;z?!|4sN5Jk^KiGC!>}l`U`cKp$=PvgG{t=lG|r&o zXoZ{x^8b43e>7#iprED7RF>*?C+QonO-<9ju%TtIch5R>x~|zvt1y+IQG2Pf$_x!; zLkk)vLb^4JGCS|HU_4I{Zt=_|!qP6cL7XUJ3d(7at&!RM&t*W*M5UbDwm~H&>%Fd9|>rqiG z7F2!lOn$^PbK37l`Ccq75`g3PF>X_25RSj*k_*hFy>7Q-W48&*t&zb`WPn;|XNX$V zi%l~Tj^Vsb|J=v+?Yb2DSSZ*k4BS)9)@^Z&+K9RMVz97h!?Itb5d)dgmUuQ9eQTtM zj-RmX$+PJ$P-5FWOMrbZl{$v&M+wHdN{c0b$5vkul;B@=wC*B zigj{oNzt6ZwNi4@(!G zLDV0o+MuMyw{ssht3D;JJ(4JI_;sA=3y-HfEi6rdf()(=G%3MDdfY|0qWnv>Ph-)yUcLSpO znGv~qXaz?6?5Pc4k2tD57MbJ?i84O}^N$ch4cpqTVWkfb+2CbhdbDVEqs|+inXR^A z=iiQ7>6mf^9t{kRP`CZ=x*rM)r~JCn%Z;2lje}bqr9iK=ap(tOz)-tT2$@TPQqRGE45d}N?^3;jIMUjpV zqg6(B#(=avsHt7#4;q5++qRQC43;LG3k>~XZ&rbz6k1Xl7d@r==#LyFn<-TJJtFLU z=QmhQB6Y9fF<(_%y4G!_*q^_QUf&Md7K=Shsy*a528T9mn|Qfn|JdA`jL~Ni9pXK6 ziX#V35wblWOqZ&T^2N&BRvRYy!3y_9y2XNjprHdyp@W`_!PGkEGIb;*oneYjZYz;J zBt^OG237q1ps27k1U>=53vq%TE@sm_3ojlY2K;lT;dtC2)}sMsf#B+xq};~-6w}p{ zoLkiDRG`EVcp*Qiy1zO71VVMEiczq(peiRU6E3t@w_*(p1V;sr=CBlfm3xtoZNZ)N zK%~MC<&?*?`zlxW2gp9vz$1by9XB-2C4wj-%}YSwT=Y@4a3xBoHC*Q9uH^hCB_9PL zqOpZinsJ(V>_ia9L8QHsnMB>5A!s%)n#Vv{IuYInJW?+PB3AC9spzXiYsJO~Z}_RW zUXPIH_lD()i^v(x1=aH3p8eWtbmCJhs^5k{FG=1KSXb0nNh^FZs7M;e1%giq>J7+* zru;i_lP@Q#^72+arsF?bDcPc}v~0%kP{p_Lejv73z(cj_T!V;z1_p5k9nB}CDY-Xo z#>8JfwuPo~HHj;?GUUg;e5$EH1$ft;Wp#M*N$rr(zKx5iHAM<5$|}*s*X~Zx42evC@bb<7 z9Lo|JDCmyw_Q_mpv82|3NH~MuMPF7>1MLAce=dA)_}m(gVYW5a^{wtHz7a-7&f8=8 z$g?=qmbo81L&|>LSpc8sb$0NEs^c{pv>v0!WCflva!U}VplYO5WSIQ^4MP->o%a41 z@a4>~>WYr0l9y`YDSoD 
zx)q;HR<=^4zj1bV_WF{o^xrG3C+|d|*eTD4ElRSa;J;D6&foM7CdwkZ_pMKc)|QC+ zw~HmyA${~uTw6bT|4Ua*+a0}OLIVIsfiPJH9_Ub}u>Rn4lDHuDzp6iRu}&&AamqQZ z@Qko&${6xk$!#)U6d@c7+wbqBvPoLq<}s6g`F{>N4+}yrw=a&-6=|qSPXv#Xm5p;( zayr-UA}8g+;pvr?5*X6Y`Vi(mA;J-u0;=?v0v4rel$X-(s!hZyziZ;GL%sM7PUgPe zh(L#C#RtY~V1$WU1sk|`mNArJ1!F3^-vs6az&SCgg;VZ@pTFYmRXvIhH7nM8x?R7_kB46fOYi^n{m!v_$y|7VUq@~M;RYK~MB5kOx7AKxIZ~h}CLL=Lbtz|_<&zorJ$#Js{bqQKB zxK$2{x)Kv;!qAt#p$O_*N>TGy@s(hmto95AIMdLjmvs%AXKph!V)Lz+qIZkB>dN85$&w9=>PpUaLZwYmIc$| z6nOUk?Z|&@i2sfeSO%v$aGnIV#K}Sb|E-GuX#xOxXkfHlO!WWW9smG}e^>v>j2K|X zT%7d(h5-P;_J10>IAMwm9I!83yg+(xm@*>|jD(wn{y(4pAK1JG0s#N`BmXOyd0}$g S^1u>)SQs}emX*-IO8)^}mIclL diff --git a/scripts/dataset_create.py b/scripts/dataset_create.py new file mode 100644 index 000000000..4511228d2 --- /dev/null +++ b/scripts/dataset_create.py @@ -0,0 +1,24 @@ +import pandas as pd +import os +from scrape import extract_movie_info + +script_dir = os.path.dirname(os.path.abspath(__file__)) +file_path = os.path.join(script_dir, "..", "sample_data.xlsx") +movie_data = pd.read_excel(file_path) +print(movie_data.columns) + +script_dir = os.path.dirname(os.path.abspath(__file__)) +movie_html = os.path.join(script_dir, "..", "data", "tt0074888.html") + +title, directed_by, cast, genre, plot = extract_movie_info(movie_html) +new_row = { + "Movie": title, + "Director": directed_by, + "Cast": ", ".join(cast), + "Genre": genre, + "Plot": plot +} + +movie_data.loc[len(movie_data)] = new_row +output_path = os.path.join(script_dir, "..", "updated_data.xlsx") +movie_data.to_excel(output_path, index=False) \ No newline at end of file diff --git a/scripts/scrape.py b/scripts/scrape.py index ac0a44926..4356d3c6f 100644 --- a/scripts/scrape.py +++ b/scripts/scrape.py @@ -1,32 +1,89 @@ -import requests +from bs4 import BeautifulSoup +import os -url = "https://en.wikipedia.org/w/api.php" +script_dir = os.path.dirname(os.path.abspath(__file__)) +file_path = os.path.join(script_dir, "..", "data", "tt0074888.html") -headers = { - 
"User-Agent": "CSE881-MovieProject/1.0 (ishaa@msu.edu)" -} +def extract_movie_info(file_path): -params = { - "action": "query", - "format": "json", - "titles": "Godfather", - "prop": "extracts", - "explaintext": True, - "redirects": 1 -} + with open(file_path, "r", encoding="utf-8") as f: + html = f.read() -response = requests.get(url, headers=headers, params=params) + soup = BeautifulSoup(html, "lxml") -print("Status:", response.status_code) -print("Content-Type:", response.headers.get("content-type")) -print("First 200 chars:\n", response.text[:1000]) + # ----------------------------- + # Title + # ----------------------------- + title_tag = soup.find("h1") + title = title_tag.get_text(strip=True) if title_tag else None -data = response.json() + # ----------------------------- + # Genre (first line) + # ----------------------------- + genre = None + content = soup.find("div", id="mw-content-text") + if content: + first_paragraph = content.find("p") + if first_paragraph: + genre = first_paragraph.get_text(" ", strip=True) + # ----------------------------- + # Infobox: Directed by + Starring + # ----------------------------- + infobox = soup.find("table", class_="infobox") -pages = data["query"]["pages"] -page = next(iter(pages.values())) + directed_by = None + cast = [] -print("\nTitle:", page["title"]) -print("\nPreview:\n", page["extract"]) + if infobox: + rows = infobox.find_all("tr") + + for row in rows: + header = row.find("th") + data = row.find("td") + + if not header or not data: + continue + + header_text = header.get_text(" ", strip=True) + + if header_text == "Directed by": + directed_by = data.get_text(" ", strip=True) + + elif header_text == "Starring": + # Get cast members split by
or links/text + cast_items = list(data.stripped_strings) + cast = cast_items[:5] + + # ----------------------------- + # Plot section + # ----------------------------- + plot = "" + + plot_header = soup.find(id="Plot") + + if plot_header: + # Move to the parent heading container if needed + current = plot_header.parent + + for sibling in current.find_next_siblings(): + # Stop when next h2 section begins + if sibling.name == "div" and "mw-heading2" in sibling.get("class", []): + break + if sibling.name == "p": + plot += sibling.get_text(" ", strip=True) + " " + + plot = plot.strip() + + return title, directed_by, cast, genre, plot + +# ----------------------------- +# Print results +# ----------------------------- +title, directed_by, cast, genre, plot = extract_movie_info(file_path) +print("Title:", title) +print("Directed by:", directed_by) +print("Cast:", cast) +print("Genre:", genre) +print("\nPlot:\n", plot) \ No newline at end of file diff --git a/updated_data.xlsx b/updated_data.xlsx new file mode 100644 index 0000000000000000000000000000000000000000..fe854a26d0d7dbc7c524ef4b84d9f82cc8779c52 GIT binary patch literal 8840 zcmeHN1y@^Xw+`+Q+@V;3;_faDE`{P!DDDI(E-6-^Kyfeb?oyoMZpFP6m!d^)X70@G zOlR&d_`Z{s^RBF%y;shAo@dLmqpAQ0j|V^mAOipZDuBsRx}_lu01yoi0N?_UVfCaO z>|M<4U5qq59L>N6tnPNUPqN`*88QK|(Chzu{0HwqY21)3O4^>(G+lV0(aWliqXsPY?DO}KAhTXiNMV;PnAmh1*3_2lk-iY zLK~Q&k!!}-6n2F@IJy;cd2|)ArHzk5iJPRlDD1pgw>R2`soh)O+q0b^GW=ZirkNMt z(uhSbLp*>AGC2oOwF$2!DC4vr^XU=8&ruvY>3Xt!OSg_o*0W0bx;Y~%`n56!mSDD&6ep!N;Jxkp8Qgfe+FC*C-@B0YQRoFMB|y558BHghp^nfzSdgTbve>RCx+ zVV1(cDvj*asdN=^j8%&S1vif@m_Q^sK(|*(XT{{Y9A-xH#bH@+WdmR4UhG(^-&t@pA104DGOJi8S0-T>8Hq_`V1uq3zR}w(4BF& zWqae`Y-8f!VDnqv%GFSFSYQYGeyn{o{vk4lNt>mXMIC4f+7Ikq5LJH4(PM*SQ4gmv z^Ki?WHSYGw#SY$W6zOAo%k2?s#idz5y(DQ-F}Ar#*oQ*(_yKFKELE_kToM2<%^&6))b%f+RkbXj>0mMedQFz7qEX&te&q+TvspR{}gG`;*?9C$SWRl z?}51z%>6{y)=aM(m+7!QHr}u6`^QeAdv+2yD_30V^JTj7Tak{%}W~!j|YZ${TYDW>pUU~sGZX(rNxmQ8ujrcD#b)pw= 
zFQNvEyEC>?r6bzHSKm*=dLq%nuXhYiMOO?%7Rh*;`Dq8lwai)-^YXR&%XT!oa8~iH z>M!B8Z@C@sBgllhqW8aI=**Jb5H?L;$PKeR&nsQ?XRf`WQ&%tqC4Rgb@C!{WuscHD z+6PXp->)u7SPBZGy1e3nTzTB#ht+QBF%WknckK&$V)DE?peOg_*spn#|G-S5Z1E5% zYYT*b+e7YN>>3=`AFX5lI!`ShK!W`{O&PY~|Fi^z*Z)W%_?*GgZFwmk6%A5b}qf}iU(w<{jZ4LID3PNu>z9C%M$+hZXFhmo@p zIAoK*&dh$L_U$S0qY8V$&1iY zd%M9qPF_n?RI9eMK3SS{h<;MNnoB|uvYy7sswc>vyG@2^psC&=8D0mO#h^tbIb!$I z8yY>DkK+$fb~>v)VhwyruFh;mvkX>|e}}WXSv(WWMta^Fj>FnNx<}tY?_vFzV!*{Jn_u);W0@UDJO?uBfYZFIkINqVz$6+yvI63z9Jt%(+|^AG zxW{cHBXp0t15yP09I&xetD6M1GMQu!?%Ktz^u^)ptcr=z*&a@$@weR}Vjv?GVykT| zf{QL(Ggl`CS5MOC^yyyf0$ z>lMsYPyv4&+Rm+ht6z&GOrFq#h`cqo>$@i;%4Ni_CX2T+;(%o z2uN8Fy2G%u3?vILWt1aelnNf9a;tUqFxltl72|LryLB}A%J1fC zvcd1>6Z0_6CzUSP1{qr_eQhHLr@4wvS*dQmlfr(2R?yPMu5eg+a$no~kLjT?)R^7+|8$JG>7Kv7o@-So4>=+#nQ~qjP3XH??^n*8L=nj1GW+F2m!rR znCD|mh(Ew7By8|Hnas(Dkrf$OCt|;8dcr8wj6YB%9*}J{p%Ad*$*_P3@;96;a7xWq z3iu=$vL&ypt6@uzctc2T%W}1OH|7}J?0G0LI8KEf^Su;bgG~!ox;pwfceq|>y6^F% zr+;pjP9X6Zh$#t3tiR~}*jnN)MMq1Ad}-kf*8q_&@-{V-39gLzYL*d;zJk#Ns)9Vn zu8@YMgjtR&D}DLweLUV2vpBJGz`@~qbTIq{&ww#l{?pJJ#|w3~Kt(dk$)5RFl@o=< z*v`~KikVWwx-cj2uGKyYtHPLQKXaczeYTBTYHMiFN&5L|e!>CE;eN78RmB|e{YwlL zY%Fudl2+Zb!OMs?81}l*eYci$dso>pfFltk|78N_h^|BHLdGO>z`+-pg>?B=b?_M9Et0x2O(X|#0wbf2FAI9<%H5r$(VzJUPFk5pWJf-e zs&-lO>WbThTzVp05Wnu9k0flvJh_mE`Iz&B#C#%82i+q-;j?boF?a&bACa@vGI*Mn zUW!Z-?~S7%5YGnD{q5Q$X$#(95Zwscfa5HNu7rUF7bml!;V`g>m{&8sG=X=j_v$@O1M(MZSH^lkyg;Mm#*KcF@;{BOT(rj(s z-xq7>+1hT-uZ}6a*7RVp22Vtl-bpv~V=sJT+%ZSrlvhUF@D{p3$sff`1K45hmdlf^ zm7$o6>ZdmLwlHP7U;BUQyCT=Y!r^#1-Tgy0MzF(RR3{a!0><9bkPhvtPv^jbwbyJ( zv^PX_g*;85{4f$+m@<{=V!J9AO0!MTo!rGS5Gr{?PKmuSaYT_9Y4FC1p>azp=tyn4 zciJXKl$<|+d$m^T#mP`l0axQf)yR>-^!9MU1+Re1hiG7uN8mB;cG9UJ1ykA$=+bF$ z5V8H8<_7L}W9lx>awdFt!KV>9_R_iMhfOTvt1(j!#)Vpj4ae1Td(=VMG?XK$NoSnS zsZoL`%HWD&ZiT@)tSw|oyd05o?-PR~^H5(zu{GPGmSgpe#@lmA$B^Dg&>3LDp*Op5 zLHSay8nm_cLn&1ZK{LA%b0EroaKxODEsM@HK24X6BzG{F6Eq_LLf+wvpC#~;lt z%!y;VcHa;PQnoydY@4@StYU?8BghL;X!!DWW9i5U)UXJY;2iQB=f~sdPwE(sFj)!a 
z4kREYQHAoDIB8N*20S!I%xIHU(gKO>i^*nO-Z8kfBm8n%BfPP%493vwDL|!e9 z4A!hiL?q*xa``C}ms!DDnAN*|JxtOp(@+@ME`G_Aovi+?koN_#cVi(5^JpoG0WxF- zp>)X~+h7Q@2F09q(yak}HqDZ95!4KN%1<2>Rl5={T1-N(WNJntOF_9SaEwP8-^jx= z->~LJKp#IplTexJd2IKIo811J;iQKbHM!19W#xpdoR{OJ6|-=78=vseUzG+`SYa9_ zHlKDg2e)LYqL5w|Jgq_b(!`hgP+8ig6jXN$=~5=|57JoFe#*-|TtW@6Z654M^Jq;z z?5T!22l9hsg!TYL54?;Wk>xf6jvz@*H#e)#OKL>oYVtp{v7Mr&pIdhHnoz#IbuEO5 zV#hlQy@Kin{w3UYc4vbuqcW56Y?bJv<)#VrjSLNfG6l)D>rbOBIFsm^%CNCq&Iaw{ zgxa4y{mKT4cP4rq)0p^%qEA>Vvj9i6w1pbs(pC@4s?n8&`V|Gk3zoG>@zCEgU|-uE zLjvJyV7vWE$W8l*fI@dJkD`zD$6k;sf0PC_^8DNC*V)H-CNmoc)wDVi{n8q8`pJ4# zNL$gBZRG(NKJ`cp0jhj6xeRcKZ#W;8zEQtLtg?50C`yk~+CHmCY8I+wNm%ylHOzha z>?SWKb64b6QH6&IsX}0G%wu{XtSltwaC!;fmLvs*rOcp)T@7|V zy;%PCBf>y?uFAD(>LQ6G=JIve-le5rKT%(jQQe@J@M))y<2kSXX;gB|rS&6>{%L32 zD?eKQt%gw^J-$b1E%eVS$v?FVnF0;~I7b2iF#b0!fn7Xo&A`7mlA4S5i|pt=l$CzK z^P@qxO=c8ZLZugQ2+gm=g^Vl0Sm-5PIbP?11rJuA|A=X{Y(Sq^pjMlnpu&>t(|V7$ zlboDl6Ti0B)fZZ%PpEH)b9cybW==GdPz*sW(}rA?_G<0+VL91dK}2kcYmlF4q_36L z%|N0BS2^YO(F`~c=<|4EbmvkqikP}B)mjwPqsOlbq}9tELn@ZEs`&6 zJISb(@JYE&3taZ-gb>(rlKvA)nq9P%bBBl+K?q78Yf{vUaL=T zh7~KGgiY;J4>OL5XdGd=T)7+|`(!oZv~1t|#tQ#x$lgIla8B$BsnnGO(RLx_i1f+$1uQwvGqM&q``5aFh3Yx!I^&e+Rx5c@E; z7JN63^lqYoee?UF8KSaoiq2q*&g^{Qwa0fFfV&kMgv)i_Y)!h$a)&hOW&8PrJ|{nx zYgxJ3rd$qg`34z1yK6LkSMr2zfwwU5rnBFdbj0K8FH@?=06oI;2PL%>IDz`BwAbOJ zg(lMA*3B;G3uSqj8p}xPA4f|2!KO^B>-giv#&pXlx9p=i`JUlp5_s&K;Us^+4MboV8pp@aNBoh z^^WBI&Lh$IAt6FBM-sC&jOo&O=OvXWnH>s+o#0K`GjXqB9}4CBeTI|PgGsDU1979D z=Fd65S2G}MO$~hK9n8kQ+8LHOihZ&piW2EvTdB|4Ig>e)IX;I<-Fe9-5iNn@e5YT~ z{;!&hWYC4Z7?fgE(0KyhA4IZz1g zzL`1@rnC?*KgVww;l6cy<8>(t_ipDI_sRtKKB#(seqRgs1G4Ar0C$P0ZF!jFU<$MR z5vg;E&aE-Wp2ZyABz%{=b+)CF69djX6O|hNx9n-`HFi9!UB9>q8Cc#R* zXFhia?sjeP+MVnc#F~c3s;Z3Yz^_W|!%xsyg0h_p_UXgS{*pLP@r9WqE|cF_vt~V? zszV(3hJyM-HwFd`3Na--J1m6OhEY76PBsXgZUJ#lLZe|ZvvOY~XdFdr+G12pKN3uo zrhW6(W!%|Vie|2eNyN)$!Se!(7Eia>`FanT%wdmz0S$zzWK88Ds$o^9%A@$ID!XRd zf4xi@R>)!E>xq@B?WYp>BO$vn88?*oyP3)_zs<@mZD%HRW5{)@2~79Epvww&50b{j z=m&PNq;xDuVJYuO_l742LMQ(P! 
z9r^xsB>j%D*zm{6V@kII`#$r&6gG2TWrCaVv4?+cDCnkzw4OtuXbA-!HdGRM?O>|v z?BEDyGj(t_`z?Ay_45BJA<(Uhi;+<1W+w_hhCF6%t>jy|f~>ey=LZ7Tq)!QA9Q11C zq;L*zj+GY6+;%>jesA?~Ux|tvdRv^sR;kR}EJ-V;2A{$wxn~fR!B=vPC^O2%qP=Wc zsEUNivilXPcy^MaMb&c{x55)5#JC~w;V*bOphl3dzBOS*K_L%b?0vS~J#kd|>hA3! zz{0Az163Gzm}?2nlnZ+ToAEl3`(r%?0AgAw5pana8@Tqd6sctCBUbPLb5ZWG;*QI7 zdJO_)$cvk`Ej-&rUaK<+nrqy`5c%Q9)UKg#iVwl+ZJ(lLWM<4Xbdcq+gW3?Xj^^r zT&d5jq$HsfRer4gYoIN&fj0hzm4}xLD<>_k{HCeDq%F+nAW>n}Gdxnz7J#cBSvxeP zFj?+heQ;eTN(G!lanfvXqY@Kw$)HYS;{(LO&sf)$x;Lb4znFm{(937tt5e%1;8@7QFxP5WnONdL(<$_BNNYV5`R@W z0`%@SGMm0Sp_wVej?ZnCTtwpTRZ)#DTQHO*O>*Z*r!I<1TlO14M`yt2bUq%YN_yaeXT=oe5!BA^uDJLXxeq54zi6L%Un&4b-$UNs*&@J zIira*m-qG4zeo8KvM{hL&}{tgGygvy(Vy$T`~yH$;qMOqKArs2@V7M)T0{S3R{5*p zuM?Y}rk|ji&@c0wUyc7Re*ZKD0Lq~^`u|Jezxw%A5&h|D9qs>q#6R@WU%mV)Wc~Ew zi21Lw)~^nJZ8?5AK*Ims!OsTdSJPjMte>W;P>C116X0Jh<#(3d|nP=2TU?bH7NsyJ2v literal 0 HcmV?d00001
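
Reviewer note on scripts/scrape.py: the infobox loop in extract_movie_info pairs each row's th text with the strings inside its td. The sketch below illustrates that pairing on a tiny inline sample so it can be checked without the saved pages. It is not the patch's implementation: it swaps BeautifulSoup/lxml for the standard-library html.parser so it runs with no third-party packages, and the names InfoboxRows and SAMPLE are hypothetical, introduced only for this illustration.

```python
from html.parser import HTMLParser


class InfoboxRows(HTMLParser):
    """Hypothetical helper: collect {header text: [cell strings]} per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._cell = None   # "th" or "td" while inside such a cell, else None
        self._header = []   # text fragments seen in the current row's <th>
        self._values = []   # text fragments seen in the current row's <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: forget fragments from the previous one.
            self._header, self._values = [], []
        elif tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr" and self._header and self._values:
            # Keep only rows that have both a header and a data cell,
            # mirroring the "if not header or not data: continue" check.
            self.rows[" ".join(self._header)] = self._values

    def handle_data(self, data):
        text = data.strip()
        if not text or self._cell is None:
            return
        target = self._header if self._cell == "th" else self._values
        target.append(text)


# Assumed, simplified markup; the real Wikipedia infobox is more deeply nested.
SAMPLE = """
<table class="infobox">
  <tr><th>Directed by</th><td><a>Claude Miller</a></td></tr>
  <tr><th>Starring</th><td>Patrick Dewaere<br>Patrick Bouchitey<br>Christine Pascal<br>Claude Piéplu</td></tr>
</table>
"""

parser = InfoboxRows()
parser.feed(SAMPLE)

directed_by = " ".join(parser.rows.get("Directed by", []))
cast = parser.rows.get("Starring", [])[:5]  # same five-name cap as scrape.py

print("Directed by:", directed_by)
print("Cast:", cast)
```

Splitting the Starring cell on text fragments rather than on commas follows the approach in the patch, where data.stripped_strings yields one string per name separated by line breaks or links.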