Tutorial: Using Excel with Python and Pandas
Published: December 8, 2017
Why learn to work with Excel with Python? Excel is one of the most popular and widely-used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use Excel for both quick stats and serious data crunching.
With Excel being so pervasive, data professionals must be familiar with it. Working with data in Python or R offers serious advantages over Excel's UI, so finding a way to work with Excel using code is critical. Thankfully, there's a great tool already out there for using Excel with Python called pandas.
Pandas has excellent methods for reading all kinds of data from Excel files. You can also export your results from pandas back to Excel, if that's preferred by your intended audience. Pandas is great for other routine data analysis tasks, such as:
- quick Exploratory Data Analysis (EDA)
- drawing attractive plots
- feeding data into machine learning tools like scikit-learn
- building machine learning models on your data
- taking cleaned and processed data to any number of data tools
Pandas is better at automating data processing tasks than Excel, including processing Excel files.
In this tutorial, we are going to show you how to work with Excel files in pandas. We will cover the following concepts.
- setting up your computer with the necessary software
- reading in data from Excel files into pandas
- data exploration in pandas
- visualizing data in pandas using the matplotlib visualization library
- manipulating and reshaping data in pandas
- moving data from pandas into Excel
Note that this tutorial does not provide a deep dive into pandas. To explore pandas more, check out our course.
System Prerequisites
We will use Python 3 and Jupyter Notebook to demonstrate the code in this tutorial. In addition to Python and Jupyter Notebook, you will need the following Python modules:
- matplotlib – data visualization
- NumPy – numerical data functionality
- OpenPyXL – read/write Excel 2010 xlsx/xlsm files
- pandas – data import, clean-up, exploration, and analysis
- xlrd – read Excel data
- xlwt – write to Excel
- XlsxWriter – write to Excel (xlsx) files
There are multiple ways to get set up with all the modules. We cover three of the most common scenarios below.
- If you have Python installed via the Anaconda package manager, you can install the required modules using the command conda install. For example, to install pandas, you would execute the command conda install pandas.
- If you already have a regular, non-Anaconda Python installed on the computer, you can install the required modules using pip. Open your command line program and execute the command pip install <module name> to install a module. You should replace <module name> with the actual name of the module you are trying to install. For example, to install pandas, you would execute the command pip install pandas.
- If you don't have Python already installed, you should get it through the Anaconda package manager. Anaconda provides installers for Windows, Mac, and Linux computers. If you choose the full installer, you will get all the modules you need, along with Python and pandas within a single package. This is the easiest and fastest way to get started.
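Whichever route you take, it can help to confirm that the modules are importable before going further. The snippet below is a minimal sketch using only the standard library; the helper name check_modules is made up for this illustration, and the list uses import names (XlsxWriter is imported as xlsxwriter).

```python
import importlib.util

# Import names for the modules listed above (XlsxWriter is imported as "xlsxwriter").
REQUIRED_MODULES = ["matplotlib", "numpy", "openpyxl", "pandas", "xlrd", "xlwt", "xlsxwriter"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_modules(REQUIRED_MODULES)
for name, installed in status.items():
    print(f"{name:12s} {'installed' if installed else 'MISSING'}")
```

Any module reported as MISSING can then be installed with conda install or pip install as described above.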
The Data Set
In this tutorial, we will use a multi-sheet Excel file we created from Kaggle's IMDB Scores data. You can download the file here.
Our Excel file has three sheets: '1900s,' '2000s,' and '2010s.' Each sheet has data for movies from those years.
We will use this data set to find the ratings distribution for the movies, visualize movies with highest ratings and net earnings, and calculate statistical information about the movies. We will be analyzing and exploring this data using Python and pandas, thus demonstrating pandas capabilities for working with Excel data in Python.
Read data from the Excel file
We need to first import the data from the Excel file into pandas. To do that, we start by importing the pandas module.
import pandas as pd
We then use the pandas read_excel method to read in data from the Excel file. The easiest way to call this method is to pass the file name. If no sheet name is specified then it will read the first sheet in the index (as shown below).
excel_file = 'movies.xls'
movies = pd.read_excel(excel_file)
Here, the read_excel method read the data from the Excel file into a pandas DataFrame object. Pandas defaults to storing data in DataFrames. We then stored this DataFrame into a variable called movies.
Pandas has a built-in DataFrame.head() method that we can use to easily display the first few rows of our DataFrame. If no argument is passed, it will display the first five rows. If a number is passed, it will display that number of rows from the top.
movies.head()
| | Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | … | Facebook Likes – Actor 1 | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Intolerance: Love's Struggle Throughout the Ages | 1916 | Drama|History|War | NaN | USA | Not Rated | 123 | 1.33 | 385907.0 | NaN | … | 436 | 22 | 9.0 | 481 | 691 | 1 | 10718 | 88 | 69.0 | 8.0 |
1 | Over the Hill to the Poorhouse | 1920 | Crime|Drama | NaN | USA | NaN | 110 | 1.33 | 100000.0 | 3000000.0 | … | 2 | 2 | 0.0 | 4 | 0 | 1 | 5 | 1 | 1.0 | 4.8 |
2 | The Big Parade | 1925 | Drama|Romance|War | NaN | USA | Not Rated | 151 | 1.33 | 245000.0 | NaN | … | 81 | 12 | 6.0 | 108 | 226 | 0 | 4849 | 45 | 48.0 | 8.3 |
3 | Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 | 1.33 | 6000000.0 | 26435.0 | … | 136 | 23 | 18.0 | 203 | 12000 | 1 | 111841 | 413 | 260.0 | 8.3 |
4 | Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 | 1.33 | NaN | 9950.0 | … | 426 | 20 | 3.0 | 455 | 926 | 1 | 7431 | 84 | 71.0 | 8.0 |
5 rows × 25 columns
Excel files quite often have multiple sheets and the ability to read a specific sheet or all of them is very important. To make this easy, the pandas read_excel method takes an argument called sheet_name (spelled sheetname in pandas releases before 0.21) that tells pandas which sheet to read in the data from. For this, you can either use the sheet name or the sheet number. Sheet numbers start with zero. If the sheet_name argument is not given, it defaults to zero and pandas will import the first sheet.
By default, pandas will automatically assign a numeric index or row label starting with zero. You may want to leave the default index as such if your data doesn't have a column with unique values that can serve as a better index. In case there is a column that you feel would serve as a better index, you can override the default behavior by setting the index_col argument to a column. It takes a numeric value for setting a single column as index or a list of numeric values for creating a multi-index.
In the code below, we are choosing the first column, 'Title', as the index by passing zero to the index_col argument.
movies_sheet1 = pd.read_excel(excel_file, sheet_name=0, index_col=0)
movies_sheet1.head()
Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | Director | … | Facebook Likes – Actor 1 | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Intolerance: Love's Struggle Throughout the Ages | 1916 | Drama|History|War | NaN | USA | Not Rated | 123 | 1.33 | 385907.0 | NaN | D.W. Griffith | … | 436 | 22 | 9.0 | 481 | 691 | 1 | 10718 | 88 | 69.0 | 8.0 |
Over the Hill to the Poorhouse | 1920 | Crime|Drama | NaN | USA | NaN | 110 | 1.33 | 100000.0 | 3000000.0 | Harry F. Millarde | … | 2 | 2 | 0.0 | 4 | 0 | 1 | 5 | 1 | 1.0 | 4.8 |
The Big Parade | 1925 | Drama|Romance|War | NaN | USA | Not Rated | 151 | 1.33 | 245000.0 | NaN | King Vidor | … | 81 | 12 | 6.0 | 108 | 226 | 0 | 4849 | 45 | 48.0 | 8.3 |
Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 | 1.33 | 6000000.0 | 26435.0 | Fritz Lang | … | 136 | 23 | 18.0 | 203 | 12000 | 1 | 111841 | 413 | 260.0 | 8.3 |
Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 | 1.33 | NaN | 9950.0 | Georg Wilhelm Pabst | … | 426 | 20 | 3.0 | 455 | 926 | 1 | 7431 | 84 | 71.0 | 8.0 |
5 rows × 24 columns
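To see the sheet and index options in isolation, here is a small self-contained sketch. It does not use the movies file; it writes a tiny two-sheet workbook (the file name demo_movies.xlsx and its contents are made up for illustration) to a temporary folder and reads it back. It assumes a recent pandas with openpyxl installed for .xlsx support.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data: two tiny sheets instead of the real movies file.
sheets = {
    "1900s": pd.DataFrame({"Title": ["Metropolis", "Wings"], "Year": [1927, 1927]}),
    "2000s": pd.DataFrame({"Title": ["28 Days", "Aberdeen"], "Year": [2000, 2000]}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_movies.xlsx")
    with pd.ExcelWriter(path) as writer:
        for name, frame in sheets.items():
            frame.to_excel(writer, sheet_name=name, index=False)

    # Select a sheet by position (zero-based) or by name.
    first_sheet = pd.read_excel(path, sheet_name=0, index_col=0)
    second_sheet = pd.read_excel(path, sheet_name="2000s", index_col=0)

    # sheet_name=None reads every sheet into a dict keyed by sheet name.
    all_sheets = pd.read_excel(path, sheet_name=None, index_col=0)

print(first_sheet.index.name)   # the 'Title' column became the index
print(list(all_sheets.keys()))
```

Passing sheet_name=None is a convenient shortcut when you want every sheet at once and do not need to treat them differently.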
As you noticed above, our Excel data file has three sheets. We already read the first sheet into a DataFrame above. Now, using the same syntax, we will read in the remaining two sheets too.
movies_sheet2 = pd.read_excel(excel_file, sheet_name=1, index_col=0)
movies_sheet2.head()
Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | Director | … | Facebook Likes – Actor 1 | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
102 Dalmatians | 2000 | Adventure|Comedy|Family | English | USA | G | 100.0 | 1.85 | 85000000.0 | 66941559.0 | Kevin Lima | … | 2000.0 | 795.0 | 439.0 | 4182 | 372 | 1 | 26413 | 77.0 | 84.0 | 4.8 |
28 Days | 2000 | Comedy|Drama | English | USA | PG-13 | 103.0 | 1.37 | 43000000.0 | 37035515.0 | Betty Thomas | … | 12000.0 | 10000.0 | 664.0 | 23864 | 0 | 1 | 34597 | 194.0 | 116.0 | 6.0 |
3 Strikes | 2000 | Comedy | English | USA | R | 82.0 | 1.85 | 6000000.0 | 9821335.0 | DJ Pooh | … | 939.0 | 706.0 | 585.0 | 3354 | 118 | 1 | 1415 | 10.0 | 22.0 | 4.0 |
Aberdeen | 2000 | Drama | English | UK | NaN | 106.0 | 1.85 | 6500000.0 | 64148.0 | Hans Petter Moland | … | 844.0 | 2.0 | 0.0 | 846 | 260 | 0 | 2601 | 35.0 | 28.0 | 7.3 |
All the Pretty Horses | 2000 | Drama|Romance|Western | English | USA | PG-13 | 220.0 | 2.35 | 57000000.0 | 15527125.0 | Billy Bob Thornton | … | 13000.0 | 861.0 | 820.0 | 15006 | 652 | 2 | 11388 | 183.0 | 85.0 | 5.8 |
5 rows × 24 columns
movies_sheet3 = pd.read_excel(excel_file, sheet_name=2, index_col=0)
movies_sheet3.head()
Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | Director | … | Facebook Likes – Actor 1 | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
127 Hours | 2010.0 | Adventure|Biography|Drama|Thriller | English | USA | R | 94.0 | 1.85 | 18000000.0 | 18329466.0 | Danny Boyle | … | 11000.0 | 642.0 | 223.0 | 11984 | 63000 | 0.0 | 279179 | 440.0 | 450.0 | 7.6 |
3 Backyards | 2010.0 | Drama | English | USA | R | 88.0 | NaN | 300000.0 | NaN | Eric Mendelsohn | … | 795.0 | 659.0 | 301.0 | 1884 | 92 | 0.0 | 554 | 23.0 | 20.0 | 5.2 |
3 | 2010.0 | Comedy|Drama|Romance | German | Germany | Unrated | 119.0 | 2.35 | NaN | 59774.0 | Tom Tykwer | … | 24.0 | 20.0 | 9.0 | 69 | 2000 | 0.0 | 4212 | 18.0 | 76.0 | 6.8 |
8: The Mormon Proposition | 2010.0 | Documentary | English | USA | R | 80.0 | 1.78 | 2500000.0 | 99851.0 | Reed Cowan | … | 191.0 | 12.0 | 5.0 | 210 | 0 | 0.0 | 1138 | 30.0 | 28.0 | 7.1 |
A Turtle's Tale: Sammy's Adventures | 2010.0 | Adventure|Animation|Family | English | France | PG | 88.0 | 2.35 | NaN | NaN | Ben Stassen | … | 783.0 | 749.0 | 602.0 | 3874 | 0 | 2.0 | 5385 | 22.0 | 56.0 | 6.1 |
5 rows × 24 columns
Since all three sheets have similar data but for different records (movies), we will create a single DataFrame from the three DataFrames we created above. We will use the pandas concat method for this, pass in the three DataFrames we just created, and assign the result to a new DataFrame object, movies. By keeping the DataFrame name the same as before, we are overwriting the previously created DataFrame.
movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])
We can check that this concatenation worked by looking at the number of rows in the combined DataFrame. Its shape attribute gives us the number of rows and columns.
movies.shape
(5042, 24)
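The same check can be written as an explicit comparison. The sketch below uses two tiny made-up DataFrames (not the movies sheets) to show that the row counts add up after concat while the column count stays the same.

```python
import pandas as pd

# Hypothetical miniature "sheets" that share the same columns.
sheet_a = pd.DataFrame({"Year": [1927, 1929]}, index=["Metropolis", "Pandora's Box"])
sheet_b = pd.DataFrame({"Year": [2000]}, index=["Aberdeen"])

combined = pd.concat([sheet_a, sheet_b])

# Rows add up; the column count is unchanged because the columns match.
expected_rows = len(sheet_a) + len(sheet_b)
print(combined.shape)                      # (3, 1)
print(combined.shape[0] == expected_rows)  # True
```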
Using the ExcelFile class to read multiple sheets
We can also use the ExcelFile class to work with multiple sheets from the same Excel file. We first wrap the Excel file using ExcelFile and then read each sheet from it, either by passing it to the read_excel method or, as below, by calling its parse method.
xlsx = pd.ExcelFile(excel_file)
movies_sheets = []
for sheet in xlsx.sheet_names:
    movies_sheets.append(xlsx.parse(sheet))
movies = pd.concat(movies_sheets)
If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames, ExcelFile is more convenient and efficient in comparison to read_excel. With ExcelFile, you only need to pass the Excel file once, and then you can use it to get the DataFrames. When using read_excel, you pass the Excel file every time and hence the file is loaded again for every sheet. This can be a huge performance drag if the Excel file has many sheets with a large number of rows.
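As a runnable illustration of the open-once pattern, the sketch below builds a small three-sheet workbook in a temporary folder (the file name and data are invented for this example) and reads every sheet through a single ExcelFile handle. It assumes openpyxl is available for .xlsx files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature workbook with one sheet per era.
sheets = {
    "1900s": pd.DataFrame({"Title": ["Metropolis", "Wings"], "Year": [1927, 1927]}),
    "2000s": pd.DataFrame({"Title": ["Aberdeen"], "Year": [2000]}),
    "2010s": pd.DataFrame({"Title": ["127 Hours"], "Year": [2010]}),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_movies.xlsx")
    with pd.ExcelWriter(path) as writer:
        for name, frame in sheets.items():
            frame.to_excel(writer, sheet_name=name, index=False)

    # Open the workbook once, then parse each sheet from the same handle.
    with pd.ExcelFile(path) as workbook:
        sheet_names = list(workbook.sheet_names)
        frames = [workbook.parse(name) for name in sheet_names]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each sheet's index.
movies_demo = pd.concat(frames, ignore_index=True)
print(sheet_names)
print(movies_demo.shape)
```

Using ExcelFile as a context manager also makes sure the file handle is closed once all sheets have been parsed.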
Exploring the data
Now that we have read in the movies data set from our Excel file, we can start exploring it using pandas. A pandas DataFrame stores the data in a tabular format, just like the way Excel displays the data in a sheet. Pandas has a lot of built-in methods to explore the DataFrame we created from the Excel file we just read in.
We already introduced the head method in the previous section, which displays a few rows from the top of the DataFrame. Let's look at a few more methods that come in handy while exploring the data set.
We can use the shape attribute to find out the number of rows and columns for the DataFrame.
movies.shape
(5042, 25)
This tells us our Excel file has 5042 records (observations) and 25 columns. This can be useful in reporting the number of records and columns and comparing that with the source data set.
We can use the tail method to view the bottom rows. If no parameter is passed, only the bottom five rows are returned.
movies.tail()
| | Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | … | Facebook Likes – Actor 1 | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1599 | War & Peace | NaN | Drama|History|Romance|War | English | UK | TV-14 | NaN | 16.00 | NaN | NaN | … | 1000.0 | 888.0 | 502.0 | 4528 | 11000 | 1.0 | 9277 | 44.0 | 10.0 | 8.2 |
1600 | Wings | NaN | Comedy|Drama | English | USA | NaN | 30.0 | 1.33 | NaN | NaN | … | 685.0 | 511.0 | 424.0 | 1884 | 1000 | 5.0 | 7646 | 56.0 | 19.0 | 7.3 |
1601 | Wolf Creek | NaN | Drama|Horror|Thriller | English | Australia | NaN | NaN | 2.00 | NaN | NaN | … | 511.0 | 457.0 | 206.0 | 1617 | 954 | 0.0 | 726 | 6.0 | 2.0 | 7.1 |
1602 | Wuthering Heights | NaN | Drama|Romance | English | UK | NaN | 142.0 | NaN | NaN | NaN | … | 27000.0 | 698.0 | 427.0 | 29196 | 0 | 2.0 | 6053 | 33.0 | 9.0 | 7.7 |
1603 | Yu-Gi-Oh! Duel Monsters | NaN | Action|Adventure|Animation|Family|Fantasy | Japanese | Japan | NaN | 24.0 | NaN | NaN | NaN | … | 0.0 | NaN | NaN | 0 | 124 | 0.0 | 12417 | 51.0 | 6.0 | 7.0 |
5 rows × 25 columns
In Excel, you're able to sort a sheet based on the values in one or more columns. In pandas, you can do the same thing with the sort_values method. For example, let's sort our movies DataFrame based on the Gross Earnings column.
sorted_by_gross = movies.sort_values(['Gross Earnings'], ascending=False)
Since we have the data sorted by values in a column, we can do a few interesting things with it. For example, we can display the top 10 movies by Gross Earnings.
sorted_by_gross["Gross Earnings"].head(10)
1867    760505847.0
1027    658672302.0
1263    652177271.0
610     623279547.0
611     623279547.0
1774    533316061.0
1281    474544677.0
226     460935665.0
1183    458991599.0
618     448130642.0
Name: Gross Earnings, dtype: float64
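The sort-then-head pattern is easy to try on a small made-up DataFrame. The sketch below also shows nlargest, a pandas shortcut that returns the same top values without sorting the whole frame first; the data here is invented for illustration.

```python
import pandas as pd

# Hypothetical data; one movie has no earnings figure.
demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Gross Earnings": [5.0, None, 9.0, 7.0],
})

# Descending sort; missing values are placed last by default.
sorted_demo = demo.sort_values("Gross Earnings", ascending=False)
top_two = sorted_demo["Gross Earnings"].head(2)

# nlargest picks the top values directly and ignores missing ones.
same_top_two = demo["Gross Earnings"].nlargest(2)

print(list(sorted_demo["Title"]))  # ['C', 'D', 'A', 'B']
print(top_two.tolist())            # [9.0, 7.0]
```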
We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to visualize your data with plots and charts through matplotlib, a popular data visualization library. With a couple lines of code, you can start plotting. Moreover, matplotlib plots work well inside Jupyter Notebooks since you can display the plots right under the code.
First, we import the matplotlib module and set matplotlib to display the plots right in the Jupyter Notebook.
import matplotlib.pyplot as plt
%matplotlib inline
We will draw a bar plot where each bar will represent one of the top 10 movies. We can do this by calling the plot method and setting the argument kind to barh. This tells matplotlib to draw a horizontal bar plot.
sorted_by_gross['Gross Earnings'].head(10).plot(kind="barh")
plt.show()
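In the plot above, the bars are labeled with row numbers because the Series index is numeric. One possible refinement, sketched below with invented titles and values, is to index the Series by title so each bar carries a readable label. The Agg backend is selected only so the snippet also runs outside a notebook; inside Jupyter you would keep %matplotlib inline instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import pandas as pd

# Hypothetical earnings (in millions), indexed by title so bars get readable labels.
earnings = pd.Series(
    [760.5, 658.7, 652.2],
    index=["Movie A", "Movie B", "Movie C"],
    name="Gross Earnings (millions)",
)

# Sort ascending so the largest bar ends up at the top of a horizontal bar plot.
ax = earnings.sort_values().plot(kind="barh")
ax.set_xlabel(earnings.name)

print(len(ax.patches))  # one bar per movie
```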
Let's create a histogram of IMDB Scores to check the distribution of IMDB Scores across all movies. Histograms are a good way to visualize the distribution of a data set. We use the plot method on the IMDB Scores series from our movies DataFrame and pass it the argument kind="hist".
movies['IMDB Score'].plot(kind="hist")
plt.show()
This data visualization suggests that most of the IMDB Scores fall between six and eight.
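A visual impression like this can be backed with a number. The sketch below uses a handful of made-up scores (not the real IMDB data) to show how Series.between yields the share of values inside a range.

```python
import pandas as pd

# Hypothetical scores, for illustration only.
scores = pd.Series([5.5, 6.2, 6.8, 7.1, 7.9, 8.4], name="IMDB Score")

# between() is inclusive on both ends by default and returns a boolean Series.
in_range = scores.between(6, 8)

# The mean of a boolean Series is the fraction of True values.
share = in_range.mean()
print(f"{share:.0%} of the scores fall between 6 and 8")
```

Applied to the tutorial's data, the same two lines on movies['IMDB Score'] would give the exact share behind the histogram.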
Getting statistical information about the data
Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the describe method to get a statistical summary of the data set.
movies.describe()
| | Year | Duration | Aspect Ratio | Budget | Gross Earnings | Facebook Likes – Director | Facebook Likes – Actor 1 | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4935.000000 | 5028.000000 | 4714.000000 | 4.551000e+03 | 4.159000e+03 | 4938.000000 | 5035.000000 | 5029.000000 | 5020.000000 | 5042.000000 | 5042.000000 | 5029.000000 | 5.042000e+03 | 5022.000000 | 4993.000000 | 5042.000000 |
mean | 2002.470517 | 107.201074 | 2.220403 | 3.975262e+07 | 4.846841e+07 | 686.621709 | 6561.323932 | 1652.080533 | 645.009761 | 9700.959143 | 7527.457160 | 1.371446 | 8.368475e+04 | 272.770808 | 140.194272 | 6.442007 |
std | 12.474599 | 25.197441 | 1.385113 | 2.061149e+08 | 6.845299e+07 | 2813.602405 | 15021.977635 | 4042.774685 | 1665.041728 | 18165.101925 | 19322.070537 | 2.013683 | 1.384940e+05 | 377.982886 | 121.601675 | 1.125189 |
min | 1916.000000 | 7.000000 | 1.180000 | 2.180000e+02 | 1.620000e+02 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000e+00 | 1.000000 | 1.000000 | 1.600000 |
25% | 1999.000000 | 93.000000 | 1.850000 | 6.000000e+06 | 5.340988e+06 | 7.000000 | 614.500000 | 281.000000 | 133.000000 | 1411.250000 | 0.000000 | 0.000000 | 8.599250e+03 | 65.000000 | 50.000000 | 5.800000 |
50% | 2005.000000 | 103.000000 | 2.350000 | 2.000000e+07 | 2.551750e+07 | 49.000000 | 988.000000 | 595.000000 | 371.500000 | 3091.000000 | 166.000000 | 1.000000 | 3.437100e+04 | 156.000000 | 110.000000 | 6.600000 |
75% | 2011.000000 | 118.000000 | 2.350000 | 4.500000e+07 | 6.230944e+07 | 194.750000 | 11000.000000 | 918.000000 | 636.000000 | 13758.750000 | 3000.000000 | 2.000000 | 9.634700e+04 | 326.000000 | 195.000000 | 7.200000 |
max | 2016.000000 | 511.000000 | 16.000000 | 1.221550e+10 | 7.605058e+08 | 23000.000000 | 640000.000000 | 137000.000000 | 23000.000000 | 656730.000000 | 349000.000000 | 43.000000 | 1.689764e+06 | 5060.000000 | 813.000000 | 9.500000 |
The describe method displays the information below for each of the columns.
- the count or number of values
- mean
- standard deviation
- minimum, maximum
- 25%, 50%, and 75% quantiles
Please note that this information will be calculated only for the numeric values.
We can also use the corresponding method to access this information one at a time. For example, to get the mean of a particular column, you can use the mean method on that column.
movies["Gross Earnings"].mean()
48468407.526809327
Just like mean, there are methods available for each of the statistics we want to access. You can read about these methods in our free pandas cheat sheet.
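As a quick illustration of a few of these per-column methods, the sketch below runs them on a small made-up DataFrame. Note how the text column is left out of describe and how the missing value is skipped rather than counted.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column containing a gap.
demo = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Budget": [10.0, 20.0, None, 30.0],
})

summary = demo.describe()              # numeric columns only; 'Title' is left out
budget_mean = demo["Budget"].mean()    # missing values are skipped
budget_median = demo["Budget"].median()
budget_max = demo["Budget"].max()

print(list(summary.columns))           # ['Budget']
print(summary.loc["count", "Budget"])  # 3.0
print(budget_mean)                     # 20.0
```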
Reading files with no header and skipping records
Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and no rows that needed skipping. Sometimes, the Excel sheet doesn't have any header row. For such instances, you can tell pandas not to consider the first row as header or column names. And if the Excel sheet's first few rows contain data that should not be read in, you can ask the read_excel method to skip a certain number of rows, starting from the top.
For example, look at the top few rows of this Excel file.
This file obviously has no header, and the first four rows are not actual records and hence should not be read in. We can tell read_excel there is no header by setting the argument header to None, and we can skip the first four rows by setting the argument skiprows to four.
movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls", header=None, skiprows=4)
movies_skip_rows.head(5)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 | 1.33 | 6000000.0 | 26435.0 | … | 136 | 23 | 18.0 | 203 | 12000 | 1 | 111841 | 413 | 260.0 | 8.3 |
1 | Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 | 1.33 | NaN | 9950.0 | … | 426 | 20 | 3.0 | 455 | 926 | 1 | 7431 | 84 | 71.0 | 8.0 |
2 | The Broadway Melody | 1929 | Musical|Romance | English | USA | Passed | 100 | 1.37 | 379000.0 | 2808000.0 | … | 77 | 28 | 4.0 | 109 | 167 | 8 | 4546 | 71 | 36.0 | 6.3 |
3 | Hell's Angels | 1930 | Drama|War | English | USA | Passed | 96 | 1.20 | 3950000.0 | NaN | … | 431 | 12 | 4.0 | 457 | 279 | 1 | 3753 | 53 | 35.0 | 7.8 |
4 | A Farewell to Arms | 1932 | Drama|Romance|War | English | USA | Unrated | 79 | 1.37 | 800000.0 | NaN | … | 998 | 164 | 99.0 | 1284 | 213 | 1 | 3519 | 46 | 42.0 | 6.6 |
5 rows × 25 columns
We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one can combine different options in a single read statement. To skip rows at the bottom of the sheet, you can use the option skipfooter (spelled skip_footer in older pandas releases), which works just like skiprows, the only difference being the rows are counted from the bottom upwards.
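To see these options working together, the sketch below writes a tiny sheet with two junk rows on top and one footer row (all contents invented for this example), then reads back only the real records. It assumes a recent pandas with openpyxl installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sheet: two note rows, two data rows, one footer row, no header row.
raw = pd.DataFrame([
    ["Report generated for demo purposes", None],
    ["Source: hypothetical", None],
    ["Metropolis", 1927],
    ["Wings", 1927],
    ["End of report", None],
])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_no_header.xlsx")
    raw.to_excel(path, header=False, index=False)

    # No header row, skip 2 rows at the top and 1 row at the bottom.
    trimmed = pd.read_excel(path, header=None, skiprows=2, skipfooter=1)

print(trimmed.shape)          # (2, 2)
print(list(trimmed.columns))  # default numeric column labels: [0, 1]
```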
The column names in the previous DataFrame are numeric and were allotted as defaults by pandas. We can rename the columns to descriptive ones by assigning a list of column names to the columns attribute of the DataFrame.
movies_skip_rows.columns = ['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director', 'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director', 'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2', 'Facebook Likes - Actor 3', 'Facebook Likes - cast Total', 'Facebook likes - Movie', 'Facenumber in posters', 'User Votes', 'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score']
movies_skip_rows.head()
| | Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | … | Facebook Likes – Actor 1 | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 | 1.33 | 6000000.0 | 26435.0 | … | 136 | 23 | 18.0 | 203 | 12000 | 1 | 111841 | 413 | 260.0 | 8.3 |
1 | Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 | 1.33 | NaN | 9950.0 | … | 426 | 20 | 3.0 | 455 | 926 | 1 | 7431 | 84 | 71.0 | 8.0 |
2 | The Broadway Melody | 1929 | Musical|Romance | English | USA | Passed | 100 | 1.37 | 379000.0 | 2808000.0 | … | 77 | 28 | 4.0 | 109 | 167 | 8 | 4546 | 71 | 36.0 | 6.3 |
3 | Hell's Angels | 1930 | Drama|War | English | USA | Passed | 96 | 1.20 | 3950000.0 | NaN | … | 431 | 12 | 4.0 | 457 | 279 | 1 | 3753 | 53 | 35.0 | 7.8 |
4 | A Farewell to Arms | 1932 | Drama|Romance|War | English | USA | Unrated | 79 | 1.37 | 800000.0 | NaN | … | 998 | 164 | 99.0 | 1284 | 213 | 1 | 3519 | 46 | 42.0 | 6.6 |
5 rows × 25 columns
Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a subset of columns.
Reading a subset of columns
Although read_excel defaults to reading and importing all columns, you can choose to import only certain columns. By passing usecols=list(range(7)), we are telling the read_excel method to read only the columns at indexes zero through six, that is, the first seven columns (the first column being indexed zero). Older pandas releases spelled this argument parse_cols and also accepted a single integer such as parse_cols=6.
movies_subset_columns = pd.read_excel(excel_file, usecols=list(range(7)))
movies_subset_columns.head()
| | Title | Year | Genres | Language | Country | Content Rating | Duration |
---|---|---|---|---|---|---|---|
0 | Intolerance: Love's Struggle Throughout the Ages | 1916 | Drama|History|War | NaN | USA | Not Rated | 123 |
1 | Over the Hill to the Poorhouse | 1920 | Crime|Drama | NaN | USA | NaN | 110 |
2 | The Big Parade | 1925 | Drama|Romance|War | NaN | USA | Not Rated | 151 |
3 | Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 |
4 | Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 |
You can also pass in a list of arbitrary column numbers, which will let you import columns at particular indexes.
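The sketch below tries a few column selections on a small made-up workbook. Besides lists of positions, recent pandas versions also accept Excel-style column letters as a string; the file name and data are invented, and openpyxl is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical four-column sheet.
demo = pd.DataFrame({
    "Title": ["Metropolis", "Aberdeen"],
    "Year": [1927, 2000],
    "Genres": ["Drama|Sci-Fi", "Drama"],
    "Language": ["German", "English"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_columns.xlsx")
    demo.to_excel(path, index=False)

    first_three = pd.read_excel(path, usecols=[0, 1, 2])  # contiguous positions
    picked = pd.read_excel(path, usecols=[0, 3])          # non-adjacent positions
    by_letter = pd.read_excel(path, usecols="A:B")        # Excel column letters

print(list(first_three.columns))  # ['Title', 'Year', 'Genres']
print(list(picked.columns))       # ['Title', 'Language']
print(list(by_letter.columns))    # ['Title', 'Year']
```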
Applying formulas on the columns
One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net Earnings by subtracting Budget from Gross Earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas too, as shown below.
movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]
Above, we used pandas to create a new column called Net Earnings, and populated it with the difference of Gross Earnings and Budget. It's worth noting the difference here in how formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data changes. With Python, the calculation happens once and the resulting values are stored, so if Gross Earnings for one movie was manually changed, Net Earnings won't be updated.
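This difference is easy to demonstrate on a small made-up DataFrame: after changing an input value, the derived column keeps its old value until the calculation is run again.

```python
import pandas as pd

# Hypothetical figures for two movies.
demo = pd.DataFrame(
    {"Gross Earnings": [300.0, 50.0], "Budget": [100.0, 80.0]},
    index=["Movie A", "Movie B"],
)
demo["Net Earnings"] = demo["Gross Earnings"] - demo["Budget"]

# Change an input value after the derived column was computed.
demo.loc["Movie A", "Gross Earnings"] = 500.0
stale = demo.loc["Movie A", "Net Earnings"]  # still 200.0, computed from the old value

# Re-run the calculation to bring the derived column up to date.
demo["Net Earnings"] = demo["Gross Earnings"] - demo["Budget"]
fresh = demo.loc["Movie A", "Net Earnings"]  # now 400.0

print(stale, fresh)
```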
Let's use the sort_values method to sort the data by the new column we created and visualize the top 10 movies by Net Earnings.
sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False])
sorted_movies.head(10)['Net Earnings'].plot.barh()
plt.show()
Pivot Table in pandas
Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.
We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let's start small, by choosing Year as the index column and Gross Earnings as the summarization column and creating a separate DataFrame from this data.
movies_subset = movies[['Year', 'Gross Earnings']]
movies_subset.head()
| | Year | Gross Earnings |
---|---|---|
0 | 1916.0 | NaN |
1 | 1920.0 | 3000000.0 |
2 | 1925.0 | NaN |
3 | 1927.0 | 26435.0 |
4 | 1929.0 | 9950.0 |
We now call pivot_table on this subset of data. The method pivot_table takes a parameter index. As mentioned, we want to use Year as the index.
earnings_by_year = movies_subset.pivot_table(index=['Year'])
earnings_by_year.head()
Year | Gross Earnings |
---|---|
1916.0 | NaN |
1920.0 | 3000000.0 |
1925.0 | NaN |
1927.0 | 26435.0 |
1929.0 | 1408975.0 |
This gave us a pivot table with grouping on Year and summarization on the mean of Gross Earnings, since pivot_table averages values by default (pass aggfunc='sum' to get totals instead). Notice that we didn't need to specify the Gross Earnings column explicitly, as pandas automatically identified it as the values on which summarization should be applied.
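The aggregation that pivot_table applies can be checked on a few rows. The sketch below uses the two 1929 earnings figures that appear in the tables earlier in this tutorial plus one invented 1930 row, and compares the default aggregation with an explicit sum.

```python
import pandas as pd

# Two 1929 figures from the tables above, plus one made-up 1930 row.
demo = pd.DataFrame({
    "Year": [1929, 1929, 1930],
    "Gross Earnings": [9950.0, 2808000.0, 100.0],
})

# With no aggfunc given, pivot_table averages the values in each group.
by_mean = demo.pivot_table(index="Year")

# Pass aggfunc explicitly to total the values instead.
by_sum = demo.pivot_table(index="Year", values="Gross Earnings", aggfunc="sum")

print(by_mean.loc[1929, "Gross Earnings"])  # 1408975.0
print(by_sum.loc[1929, "Gross Earnings"])   # 2817950.0
```

The average for 1929 matches the 1408975.0 shown in the pivot table above.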
We can use this pivot table to create some data visualizations. We can call the plot method on the DataFrame to create a line plot and call the show method to display the plot in the notebook.
earnings_by_year.plot() plt.show()
We saw how to pivot with a single column as the index. Things will get more interesting if we can use multiple columns. Let's create another DataFrame subset but this time we will choose the columns Country, Language, and Gross Earnings.
movies_subset = movies[['Country', 'Language', 'Gross Earnings']]
movies_subset.head()
| | Country | Language | Gross Earnings |
---|---|---|---|
0 | USA | NaN | NaN |
1 | USA | NaN | 3000000.0 |
2 | USA | NaN | NaN |
3 | Germany | German | 26435.0 |
4 | Germany | German | 9950.0 |
We will use the columns Country and Language as the index for the pivot table. We will use Gross Earnings as the summarization column; however, we do not need to specify this explicitly, as we saw earlier.
earnings_by_co_lang = movies_subset.pivot_table(index=['Country', 'Language'])
earnings_by_co_lang.head()
Country | Language | Gross Earnings |
---|---|---|
Afghanistan | Dari | 1.127331e+06 |
Argentina | Spanish | 7.230936e+06 |
Aruba | English | 1.007614e+07 |
Australia | Aboriginal | 6.165429e+06 |
Australia | Dzongkha | 5.052950e+05 |
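A pivot table built on two columns has a two-level index, which changes how rows are looked up. The sketch below, on invented data, shows a lookup by (Country, Language) pair and how reset_index turns the index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical earnings by country and language.
demo = pd.DataFrame({
    "Country": ["USA", "USA", "Germany", "Germany", "Germany"],
    "Language": ["English", "English", "German", "German", "English"],
    "Gross Earnings": [100.0, 300.0, 50.0, 70.0, 10.0],
})

pivot = demo.pivot_table(index=["Country", "Language"])

# Rows are addressed with a (Country, Language) tuple.
usa_english = pivot.loc[("USA", "English"), "Gross Earnings"]  # mean of 100 and 300

# reset_index flattens the two index levels back into regular columns.
flat = pivot.reset_index()

print(usa_english)         # 200.0
print(list(flat.columns))  # ['Country', 'Language', 'Gross Earnings']
```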
Let's visualize this pivot tabular array with a bar plot. Since at that place are still few hundred records in this pivot table, nosotros will plot just a few of them.
earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8))
plt.show()
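Note that head returns the first rows in index order, which for this pivot table means alphabetical by country. If the goal is to chart the highest-earning groups instead, one option is to sort before taking the head. The sketch below illustrates this on a small made-up DataFrame; the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the real tutorial data has many more groups.
sample = pd.DataFrame({
    "Country": ["USA", "USA", "Germany", "Germany", "France"],
    "Language": ["English", "English", "German", "German", "French"],
    "Gross Earnings": [300.0, 100.0, 40.0, 20.0, 90.0],
})

# Pivot on two columns; the result has a (Country, Language) MultiIndex.
by_co_lang = sample.pivot_table(index=["Country", "Language"])

# Sort by mean earnings so that head(n) returns the largest groups,
# rather than the first n groups in alphabetical index order.
top_groups = by_co_lang.sort_values("Gross Earnings", ascending=False).head(2)
print(top_groups)
```

Calling plot(kind='bar') on top_groups would then chart only those groups, in the same way as the call above.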
Exporting the results to Excel
If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is important. You can export or write a pandas DataFrame to an Excel file using the pandas to_excel method. Pandas relies on a separate Python module as the writer engine: openpyxl or XlsxWriter for .xlsx files (older pandas versions used xlwt for legacy .xls files). The to_excel method is called on the DataFrame we want to export. We also need to pass a filename to which this DataFrame will be written.
movies.to_excel('output.xlsx')
By default, the index is also saved to the output file. However, sometimes the index doesn't provide any useful information. For example, the movies DataFrame has a numeric auto-increment index that was not part of the original Excel data.
movies.head()
Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | … | Facebook Likes – Actor 2 | Facebook Likes – Actor 3 | Facebook Likes – cast Total | Facebook likes – Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score | Net Earnings | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Intolerance: Love's Struggle Throughout the Ages | 1916.0 | Drama|History|War | NaN | USA | Not Rated | 123.0 | 1.33 | 385907.0 | NaN | … | 22.0 | 9.0 | 481 | 691 | 1.0 | 10718 | 88.0 | 69.0 | 8.0 | NaN |
1 | Over the Hill to the Poorhouse | 1920.0 | Crime|Drama | NaN | USA | NaN | 110.0 | 1.33 | 100000.0 | 3000000.0 | … | 2.0 | 0.0 | 4 | 0 | 1.0 | 5 | 1.0 | 1.0 | 4.8 | 2900000.0 |
2 | The Big Parade | 1925.0 | Drama|Romance|War | NaN | USA | Not Rated | 151.0 | 1.33 | 245000.0 | NaN | … | 12.0 | 6.0 | 108 | 226 | 0.0 | 4849 | 45.0 | 48.0 | 8.3 | NaN |
3 | Metropolis | 1927.0 | Drama|Sci-Fi | German | Germany | Not Rated | 145.0 | 1.33 | 6000000.0 | 26435.0 | … | 23.0 | 18.0 | 203 | 12000 | 1.0 | 111841 | 413.0 | 260.0 | 8.3 | -5973565.0 |
4 | Pandora's Box | 1929.0 | Crime|Drama|Romance | German | Germany | Not Rated | 110.0 | 1.33 | NaN | 9950.0 | … | 20.0 | 3.0 | 455 | 926 | 1.0 | 7431 | 84.0 | 71.0 | 8.0 | NaN |
5 rows × 26 columns
You can choose to skip the index by passing along index=False.
movies.to_excel('output.xlsx', index=False)
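To see what the index argument changes, the sketch below writes a tiny made-up DataFrame to an in-memory .xlsx file with and without the index and compares the columns that come back. The helper name columns_after_round_trip is hypothetical, and the example assumes an Excel engine such as openpyxl is installed.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the default 0..n-1 integer index.
df = pd.DataFrame({"Title": ["A", "B"], "Year": [1916, 1920]})


def columns_after_round_trip(frame, **to_excel_kwargs):
    """Write frame to an in-memory .xlsx file and return the columns read back."""
    buffer = io.BytesIO()
    frame.to_excel(buffer, **to_excel_kwargs)
    buffer.seek(0)
    return list(pd.read_excel(buffer).columns)


# With the default settings the index is written as an extra leading column.
with_index = columns_after_round_trip(df)

# With index=False only the real data columns are written.
without_index = columns_after_round_trip(df, index=False)

print(with_index)
print(without_index)
```

Dropping the index this way keeps the exported sheet free of a column that colleagues would otherwise have to delete by hand.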
We need to be able to make our output files look nice before we can send them out to our co-workers. We can use the pandas ExcelWriter class along with the XlsxWriter Python module to apply the formatting.
We can use these advanced output options by creating an ExcelWriter object and using this object to write to the Excel file.
writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
movies.to_excel(writer, index=False, sheet_name='report')
workbook = writer.book
worksheet = writer.sheets['report']
We can apply customizations by calling add_format on the workbook we are writing to. Here we are setting the header format to bold.
header_fmt = workbook.add_format({'bold': True})
worksheet.set_row(0, None, header_fmt)
Finally, we save the output file by calling the save method on the writer object. (In pandas 2.0 and later, save was removed; call writer.close() instead.)
writer.save()
As an example, we saved the data with column headers set as bold. The saved file looks like the image below.
In this way, one can use XlsxWriter to apply various kinds of formatting to the output Excel file.
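As a further sketch of the same workflow, the example below uses ExcelWriter as a context manager, so the file is saved when the with-block ends and no explicit save or close call is needed. The DataFrame contents, the temporary output path, and the specific formatting choices (column widths, number format, frozen header row) are assumptions made for illustration; the example also assumes the XlsxWriter module is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the tutorial's movies DataFrame.
movies = pd.DataFrame({
    "Title": ["Metropolis", "Pandora's Box"],
    "Gross Earnings": [26435.0, 9950.0],
})

# Write to a temporary folder so the sketch does not overwrite real files.
out_path = Path(tempfile.mkdtemp()) / "output.xlsx"

# Leaving the with-block closes the writer, which saves the workbook.
with pd.ExcelWriter(out_path, engine="xlsxwriter") as writer:
    movies.to_excel(writer, index=False, sheet_name="report")
    workbook = writer.book
    worksheet = writer.sheets["report"]

    # Widen the title column and show earnings with thousands separators.
    money_fmt = workbook.add_format({"num_format": "#,##0"})
    worksheet.set_column(0, 0, 30)
    worksheet.set_column(1, 1, 15, money_fmt)

    # Keep the header row visible while scrolling.
    worksheet.freeze_panes(1, 0)

print(out_path.exists())
```

The context-manager form also works across pandas versions, which sidesteps the difference between save and close on the writer object.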
Conclusion
Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and can be great companion tools. As we demonstrated, pandas can do a lot of complex data analysis and manipulation, which, depending on your need and expertise, can go beyond what you can achieve if you are only using Excel. One of the major benefits of using Python and pandas over Excel is that it helps you automate Excel file processing by writing scripts and integrating with your automated data workflow. Pandas also has excellent methods for reading all kinds of data from Excel files. You can export your results from pandas back to Excel too, if that's preferred by your intended audience.
On the other hand, Excel is such a widely used data tool that it's not wise to ignore it. Acquiring expertise in both pandas and Excel and making them work together gives you skills that can help you stand out in your organization.
Source: https://www.dataquest.io/blog/excel-and-pandas/
Posted by: thomasprel1989.blogspot.com