What View Do You Use In Excel To See The Header, Footer, And Margins?

Tutorial Using Excel with Python and Pandas

Why learn to work with Excel with Python? Excel is one of the most popular and widely-used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use Excel for both quick stats and serious data crunching.

With Excel being so pervasive, data professionals must be familiar with it. Working with data in Python or R offers serious advantages over Excel's UI, so finding a way to work with Excel using code is critical. Thankfully, there's a great tool already out there for using Excel with Python called pandas.

Pandas has excellent methods for reading all kinds of data from Excel files. You can also export your results from pandas back to Excel, if that's preferred by your intended audience. Pandas is great for other routine data analysis tasks, such as:

  • quick Exploratory Data Analysis (EDA)
  • drawing attractive plots
  • feeding data into machine learning tools like scikit-learn
  • building machine learning models on your data
  • taking cleaned and processed data to any number of data tools

Pandas is better at automating data processing tasks than Excel, including processing Excel files.

In this tutorial, we are going to show you how to work with Excel files in pandas. We will cover the following concepts.

  • setting up your computer with the necessary software
  • reading in data from Excel files into pandas
  • data exploration in pandas
  • visualizing data in pandas using the matplotlib visualization library
  • manipulating and reshaping data in pandas
  • moving data from pandas into Excel

Note that this tutorial does not provide a deep dive into pandas. To explore pandas more, check out our course.

System Prerequisites

We will use Python 3 and Jupyter Notebook to demonstrate the code in this tutorial. In addition to Python and Jupyter Notebook, you will need the following Python modules:

  • matplotlib – data visualization
  • NumPy – numerical data functionality
  • OpenPyXL – read/write Excel 2010 xlsx/xlsm files
  • pandas – data import, clean-up, exploration, and analysis
  • xlrd – read Excel data
  • xlwt – write to Excel
  • XlsxWriter – write to Excel (xlsx) files

There are multiple ways to get set up with all the modules. We cover three of the most common scenarios below.

  • If you have Python installed via the Anaconda package manager, you can install the required modules using the command conda install. For example, to install pandas, you would execute the command conda install pandas.
  • If you already have a regular, non-Anaconda Python installed on the computer, you can install the required modules using pip. Open your command line program and execute the command pip install <module name> to install a module. You should replace <module name> with the actual name of the module you are trying to install. For example, to install pandas, you would execute the command pip install pandas.
  • If you don't have Python already installed, you should get it through the Anaconda package manager. Anaconda provides installers for Windows, Mac, and Linux computers. If you choose the full installer, you will get all the modules you need, along with Python and pandas within a single package. This is the easiest and fastest way to get started.
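
Whichever route you take, it can help to confirm that the modules are importable before going further. The following is a minimal sketch that uses only the Python standard library. The helper name missing_modules is our own invention, and the lowercase import names in the list are our assumption about how each package is imported.

```python
import importlib.util

# Import names for the modules listed above (assumed spellings).
REQUIRED = ["matplotlib", "numpy", "openpyxl", "pandas", "xlrd", "xlwt", "xlsxwriter"]


def missing_modules(names):
    """Return the names that cannot be found by the current Python interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints the modules that still need installing, or "none" if all were found.
print("Missing modules:", missing_modules(REQUIRED) or "none")
```

Any name that shows up in the output can then be installed with conda install or pip install as described above.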

The Data Set

In this tutorial, we will use a multi-sheet Excel file we created from Kaggle's IMDB Scores data. You can download the file here.

img-excel-1

Our Excel file has three sheets: '1900s,' '2000s,' and '2010s.' Each sheet has data for movies from those years.

We will use this data set to find the ratings distribution for the movies, visualize movies with the highest ratings and net earnings, and calculate statistical information about the movies. We will be analyzing and exploring this data using Python and pandas, thus demonstrating pandas' capabilities for working with Excel data in Python.

Read data from the Excel file

We need to first import the data from the Excel file into pandas. To do that, we start by importing the pandas module.

            import pandas as pd          

We then use the pandas read_excel method to read in data from the Excel file. The easiest way to call this method is to pass the file name. If no sheet name is specified then it will read the first sheet in the index (as shown below).

            excel_file = 'movies.xls'
            movies = pd.read_excel(excel_file)

Here, the read_excel method read the data from the Excel file into a pandas DataFrame object. Pandas defaults to storing data in DataFrames. We then stored this DataFrame into a variable called movies.

Pandas has a built-in DataFrame.head() method that we can use to easily display the first few rows of our DataFrame. If no argument is passed, it will display the first five rows. If a number is passed, it will display that number of rows from the top.

            movies.head()          
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
0 Intolerance: Love's Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123 1.33 385907.0 NaN 436 22 9.0 481 691 1 10718 88 69.0 8.0
1 Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110 1.33 100000.0 3000000.0 2 2 0.0 4 0 1 5 1 1.0 4.8
2 The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151 1.33 245000.0 NaN 81 12 6.0 108 226 0 4849 45 48.0 8.3
3 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 136 23 18.0 203 12000 1 111841 413 260.0 8.3
4 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 426 20 3.0 455 926 1 7431 84 71.0 8.0

5 rows × 25 columns

Excel files quite often have multiple sheets and the ability to read a specific sheet or all of them is very important. To make this easy, the pandas read_excel method takes an argument called sheetname (spelled sheet_name in recent pandas releases) that tells pandas which sheet to read in the data from. For this, you can either use the sheet name or the sheet number. Sheet numbers start with zero. If the argument is not given, it defaults to zero and pandas will import the first sheet.

By default, pandas will automatically assign a numeric index or row label starting with zero. You may want to leave the default index as such if your data doesn't have a column with unique values that can serve as a better index. In case there is a column that you feel would serve as a better index, you can override the default behavior by setting the index_col property to a column. It takes a numeric value for setting a single column as the index or a list of numeric values for creating a multi-index.

In the code below, we are choosing the first column, 'Title', as the index by passing zero to the index_col argument.

            # sheet_name was spelled sheetname in older pandas releases
            movies_sheet1 = pd.read_excel(excel_file, sheet_name=0, index_col=0)
            movies_sheet1.head()
Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
Title
Intolerance: Love's Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123 1.33 385907.0 NaN D.W. Griffith 436 22 9.0 481 691 1 10718 88 69.0 8.0
Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110 1.33 100000.0 3000000.0 Harry F. Millarde 2 2 0.0 4 0 1 5 1 1.0 4.8
The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151 1.33 245000.0 NaN King Vidor 81 12 6.0 108 226 0 4849 45 48.0 8.3
Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 Fritz Lang 136 23 18.0 203 12000 1 111841 413 260.0 8.3
Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 Georg Wilhelm Pabst 426 20 3.0 455 926 1 7431 84 71.0 8.0

5 rows × 24 columns

As you noticed above, our Excel data file has three sheets. We already read the first sheet into a DataFrame above. Now, using the same syntax, we will read in the remaining two sheets too.

            movies_sheet2 = pd.read_excel(excel_file, sheet_name=1, index_col=0)
            movies_sheet2.head()
Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
Title
102 Dalmatians 2000 Adventure|Comedy|Family English USA G 100.0 1.85 85000000.0 66941559.0 Kevin Lima 2000.0 795.0 439.0 4182 372 1 26413 77.0 84.0 4.8
28 Days 2000 Comedy|Drama English USA PG-13 103.0 1.37 43000000.0 37035515.0 Betty Thomas 12000.0 10000.0 664.0 23864 0 1 34597 194.0 116.0 6.0
3 Strikes 2000 Comedy English USA R 82.0 1.85 6000000.0 9821335.0 DJ Pooh 939.0 706.0 585.0 3354 118 1 1415 10.0 22.0 4.0
Aberdeen 2000 Drama English UK NaN 106.0 1.85 6500000.0 64148.0 Hans Petter Moland 844.0 2.0 0.0 846 260 0 2601 35.0 28.0 7.3
All the Pretty Horses 2000 Drama|Romance|Western English USA PG-13 220.0 2.35 57000000.0 15527125.0 Billy Bob Thornton 13000.0 861.0 820.0 15006 652 2 11388 183.0 85.0 5.8

5 rows × 24 columns

            movies_sheet3 = pd.read_excel(excel_file, sheet_name=2, index_col=0)
            movies_sheet3.head()
Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Director Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
Title
127 Hours 2010.0 Adventure|Biography|Drama|Thriller English USA R 94.0 1.85 18000000.0 18329466.0 Danny Boyle 11000.0 642.0 223.0 11984 63000 0.0 279179 440.0 450.0 7.6
3 Backyards 2010.0 Drama English USA R 88.0 NaN 300000.0 NaN Eric Mendelsohn 795.0 659.0 301.0 1884 92 0.0 554 23.0 20.0 5.2
3 2010.0 Comedy|Drama|Romance German Germany Unrated 119.0 2.35 NaN 59774.0 Tom Tykwer 24.0 20.0 9.0 69 2000 0.0 4212 18.0 76.0 6.8
8: The Mormon Proposition 2010.0 Documentary English USA R 80.0 1.78 2500000.0 99851.0 Reed Cowan 191.0 12.0 5.0 210 0 0.0 1138 30.0 28.0 7.1
A Turtle's Tale: Sammy's Adventures 2010.0 Adventure|Animation|Family English France PG 88.0 2.35 NaN NaN Ben Stassen 783.0 749.0 602.0 3874 0 2.0 5385 22.0 56.0 6.1

5 rows × 24 columns

Since all three sheets have similar data but for different movies, we will create a single DataFrame from the three DataFrames we created above. We will use the pandas concat method for this, pass in the three DataFrames we just created, and assign the result to a new DataFrame object, movies. By keeping the DataFrame name the same as before, we are overwriting the previously created DataFrame.

            movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])          

We can check whether this concatenation worked by looking at the number of rows in the combined DataFrame. Its shape attribute gives us the number of rows and columns.

            movies.shape          
            (5042, 24)          
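
The same check can be written out in code. The sketch below uses three tiny stand-in DataFrames (a few titles and years borrowed from the tables above) rather than the real sheets, and it passes ignore_index=True, which is our own choice here, so that the combined DataFrame gets a fresh index running from zero.

```python
import pandas as pd

# Tiny stand-ins for the three per-sheet DataFrames.
sheet_a = pd.DataFrame({"Title": ["Metropolis", "Pandora's Box"], "Year": [1927, 1929]})
sheet_b = pd.DataFrame({"Title": ["102 Dalmatians", "28 Days"], "Year": [2000, 2000]})
sheet_c = pd.DataFrame({"Title": ["127 Hours"], "Year": [2010]})
parts = [sheet_a, sheet_b, sheet_c]

# Stack the rows; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(parts, ignore_index=True)

# The combined row count should equal the sum of the parts.
expected_rows = sum(len(part) for part in parts)
print(combined.shape, expected_rows)
```

If the two row counts disagree, one of the sheets was probably not read in full.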

Using the ExcelFile class to read multiple sheets

We can also use the ExcelFile class to work with multiple sheets from the same Excel file. We first wrap the Excel file using ExcelFile and then read each sheet from it, either by passing it to the read_excel method or, as below, by calling its parse method.

            xlsx = pd.ExcelFile(excel_file)
            movies_sheets = []
            for sheet in xlsx.sheet_names:
                movies_sheets.append(xlsx.parse(sheet))
            movies = pd.concat(movies_sheets)

If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames, ExcelFile is more convenient and efficient in comparison to read_excel. With ExcelFile, you only need to pass the Excel file once, and then you can use it to get the DataFrames. When using read_excel, you pass the Excel file every time and hence the file is loaded again for every sheet. This can be a huge performance drag if the Excel file has many sheets with a large number of rows.
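
Another option in recent pandas releases is to ask read_excel for every sheet at once by passing sheet_name=None, which returns a dictionary mapping sheet names to DataFrames. The sketch below is only an illustration: it builds a small two-sheet workbook in memory instead of using movies.xls, and it assumes the openpyxl engine is installed.

```python
import io

import pandas as pd

# Build a small two-sheet workbook in memory so the sketch is self-contained.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"Title": ["Metropolis"], "Year": [1927]}).to_excel(
        writer, sheet_name="1900s", index=False
    )
    pd.DataFrame({"Title": ["102 Dalmatians", "28 Days"], "Year": [2000, 2000]}).to_excel(
        writer, sheet_name="2000s", index=False
    )
buffer.seek(0)

# sheet_name=None reads every sheet; the result is a dict of DataFrames.
all_sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")

# Combine the per-sheet DataFrames into one.
movies_demo = pd.concat(list(all_sheets.values()), ignore_index=True)
print(list(all_sheets), movies_demo.shape)
```

Like ExcelFile, this opens the workbook only once, at the cost of loading every sheet whether you need it or not.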

Exploring the data

Now that we have read in the movies data set from our Excel file, we can start exploring it using pandas. A pandas DataFrame stores the data in a tabular format, just like the way Excel displays the data in a sheet. Pandas has a lot of built-in methods to explore the DataFrame we created from the Excel file we just read in.

We already introduced the method head in the previous section, which displays a few rows from the top of the DataFrame. Let's look at a few more methods that come in handy while exploring the data set.

We can use the shape attribute to find out the number of rows and columns for the DataFrame.

            movies.shape          
            (5042, 25)          

This tells us our Excel file has 5042 records, or observations, and 25 columns. This can be useful in reporting the number of records and columns and comparing that with the source data set.

We can use the tail method to view the bottom rows. If no parameter is passed, only the bottom five rows are returned.

            movies.tail()          
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
1599 War & Peace NaN Drama|History|Romance|War English UK TV-14 NaN 16.00 NaN NaN 1000.0 888.0 502.0 4528 11000 1.0 9277 44.0 10.0 8.2
1600 Wings NaN Comedy|Drama English USA NaN 30.0 1.33 NaN NaN 685.0 511.0 424.0 1884 1000 5.0 7646 56.0 19.0 7.3
1601 Wolf Creek NaN Drama|Horror|Thriller English Australia NaN NaN 2.00 NaN NaN 511.0 457.0 206.0 1617 954 0.0 726 6.0 2.0 7.1
1602 Wuthering Heights NaN Drama|Romance English UK NaN 142.0 NaN NaN NaN 27000.0 698.0 427.0 29196 0 2.0 6053 33.0 9.0 7.7
1603 Yu-Gi-Oh! Duel Monsters NaN Action|Adventure|Animation|Family|Fantasy Japanese Japan NaN 24.0 NaN NaN NaN 0.0 NaN NaN 0 124 0.0 12417 51.0 6.0 7.0

5 rows × 25 columns

In Excel, you're able to sort a sheet based on the values in one or more columns. In pandas, you can do the same thing with the sort_values method. For example, let's sort our movies DataFrame based on the Gross Earnings column.

            sorted_by_gross = movies.sort_values(['Gross Earnings'], ascending=False)

Since we have the data sorted by values in a column, we can do a few interesting things with it. For example, we can display the top 10 movies by Gross Earnings.

            sorted_by_gross["Gross Earnings"].head(10)          
            1867    760505847.0
            1027    658672302.0
            1263    652177271.0
            610     623279547.0
            611     623279547.0
            1774    533316061.0
            1281    474544677.0
            226     460935665.0
            1183    458991599.0
            618     448130642.0
            Name: Gross Earnings, dtype: float64
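
When only the top few values are needed, the Series.nlargest method is a possible shortcut for the sort-then-head pattern. The sketch below uses a handful of Gross Earnings values from the tables above, with a missing value thrown in, rather than the full data set.

```python
import pandas as pd

# A few Gross Earnings values, indexed by row label, including one missing value.
gross = pd.Series(
    [26435.0, 3000000.0, float("nan"), 9950.0],
    index=[3, 1, 0, 4],
    name="Gross Earnings",
)

# Sort descending (missing values go last by default), then keep the top two.
top_sorted = gross.sort_values(ascending=False).head(2)

# nlargest returns the two largest values directly and skips missing values.
top_direct = gross.nlargest(2)

print(top_direct)
```

Both approaches select the same rows here; sorting the whole DataFrame is still the better fit when you want every column of the top movies and not just the earnings.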

We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to visualize your data with plots and charts through matplotlib, a popular data visualization library. With a couple of lines of code, you can start plotting. Moreover, matplotlib plots work well inside Jupyter Notebooks since you can display the plots right under the code.

First, we import the matplotlib module and set matplotlib to display the plots right in the Jupyter Notebook.

            import matplotlib.pyplot as plt
            %matplotlib inline

We will draw a bar plot where each bar will represent one of the top 10 movies. We can do this by calling the plot method and setting the argument kind to barh. This tells matplotlib to draw a horizontal bar plot.

            sorted_by_gross['Gross Earnings'].head(10).plot(kind="barh")
            plt.show()

python-pandas-and-excel_28_0

Let's create a histogram of IMDB Scores to check the distribution of IMDB Scores across all movies. Histograms are a good way to visualize the distribution of a data set. We use the plot method on the IMDB Score series from our movies DataFrame and pass it the argument kind="hist".

            movies['IMDB Score'].plot(kind="hist")
            plt.show()

python-pandas-and-excel_30_0

This data visualization suggests that most of the IMDB Scores fall between six and eight.
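
If you want the numbers behind a histogram rather than the picture, one possible approach is to bucket the values with pd.cut and count them with value_counts. The sketch below uses a small hand-picked sample of IMDB Scores from the tables above, so its counts say nothing about the full data set, and the two-point-wide bins are our own choice.

```python
import pandas as pd

# A small sample of IMDB Scores (not the full data set).
scores = pd.Series([8.0, 4.8, 8.3, 8.3, 8.0, 6.0, 7.3, 5.8], name="IMDB Score")

# Bin edges for the intervals (0, 2], (2, 4], (4, 6], (6, 8], (8, 10].
# Each interval includes its right edge, so 8.0 lands in (6, 8].
bins = [0, 2, 4, 6, 8, 10]

# Assign every score to a bin, count the scores per bin, and order by bin.
counts = pd.cut(scores, bins=bins).value_counts().sort_index()

# Map the printed interval labels to their counts for easy lookup.
by_label = {str(interval): int(n) for interval, n in zip(counts.index, counts)}
print(by_label)
```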

Getting statistical information about the data

Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the describe method to get a statistical summary of the data set.

            movies.describe()
Year Duration Aspect Ratio Budget Gross Earnings Facebook Likes - Director Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
count 4935.000000 5028.000000 4714.000000 4.551000e+03 4.159000e+03 4938.000000 5035.000000 5029.000000 5020.000000 5042.000000 5042.000000 5029.000000 5.042000e+03 5022.000000 4993.000000 5042.000000
mean 2002.470517 107.201074 2.220403 3.975262e+07 4.846841e+07 686.621709 6561.323932 1652.080533 645.009761 9700.959143 7527.457160 1.371446 8.368475e+04 272.770808 140.194272 6.442007
std 12.474599 25.197441 1.385113 2.061149e+08 6.845299e+07 2813.602405 15021.977635 4042.774685 1665.041728 18165.101925 19322.070537 2.013683 1.384940e+05 377.982886 121.601675 1.125189
min 1916.000000 7.000000 1.180000 2.180000e+02 1.620000e+02 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000e+00 1.000000 1.000000 1.600000
25% 1999.000000 93.000000 1.850000 6.000000e+06 5.340988e+06 7.000000 614.500000 281.000000 133.000000 1411.250000 0.000000 0.000000 8.599250e+03 65.000000 50.000000 5.800000
50% 2005.000000 103.000000 2.350000 2.000000e+07 2.551750e+07 49.000000 988.000000 595.000000 371.500000 3091.000000 166.000000 1.000000 3.437100e+04 156.000000 110.000000 6.600000
75% 2011.000000 118.000000 2.350000 4.500000e+07 6.230944e+07 194.750000 11000.000000 918.000000 636.000000 13758.750000 3000.000000 2.000000 9.634700e+04 326.000000 195.000000 7.200000
max 2016.000000 511.000000 16.000000 1.221550e+10 7.605058e+08 23000.000000 640000.000000 137000.000000 23000.000000 656730.000000 349000.000000 43.000000 1.689764e+06 5060.000000 813.000000 9.500000

The describe method displays the following information for each of the columns.

  • the count or number of values
  • mean
  • standard deviation
  • minimum, maximum
  • 25%, 50%, and 75% quantile

Please note that this information will be calculated only for the numeric values.

We can also use the corresponding method to access this information one at a time. For example, to get the mean of a particular column, you can use the mean method on that column.

            movies["Gross Earnings"].mean()          
            48468407.526809327          

Just like mean, there are methods available for each piece of statistical information we want to access. You can read about these methods in our free pandas cheat sheet.
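
As a quick illustration, the sketch below calls several of these methods on a small Series built from a few Gross Earnings values in the tables above, with one missing value included. It is not run against the full data set, so the numbers differ from the summary shown earlier.

```python
import pandas as pd

# Three Gross Earnings values plus one missing value.
gross = pd.Series([3000000.0, 26435.0, 9950.0, float("nan")], name="Gross Earnings")

# Each statistic reported by describe has a method of its own.
# Missing values are skipped by default.
summary = {
    "count": gross.count(),
    "mean": gross.mean(),
    "std": gross.std(),
    "min": gross.min(),
    "median": gross.median(),
    "q75": gross.quantile(0.75),
    "max": gross.max(),
}

for name, value in summary.items():
    print(name, value)
```

The median and the 0.75 quantile correspond to the 50% and 75% rows of the describe output.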

Reading files with no header and skipping records

Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and no rows that needed skipping. Sometimes, the Excel sheet doesn't have any header row. For such instances, you can tell pandas not to consider the first row as header or column names. And if the Excel sheet's first few rows contain data that should not be read in, you can ask the read_excel method to skip a certain number of rows, starting from the top.

For example, look at the top few rows of this Excel file.

img-excel-no-header-1

This file obviously has no header, and the first four rows are not actual records and hence should not be read in. We can tell read_excel there is no header by setting the argument header to None, and we can skip the first four rows by setting the argument skiprows to four.

            movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls", header=None, skiprows=4)
            movies_skip_rows.head(5)
0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24
0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 136 23 18.0 203 12000 1 111841 413 260.0 8.3
1 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 426 20 3.0 455 926 1 7431 84 71.0 8.0
2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 77 28 4.0 109 167 8 4546 71 36.0 6.3
3 Hell's Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN 431 12 4.0 457 279 1 3753 53 35.0 7.8
4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns

We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one can combine different options in a single read statement. To skip rows at the bottom of the sheet, you can use the option skip_footer (spelled skipfooter in recent pandas releases), which works just like skiprows, the only difference being that the rows are counted from the bottom upwards.

The column names in the previous DataFrame are numeric and were allotted as defaults by pandas. We can rename the columns to descriptive names by assigning a list of column names to the columns attribute of the DataFrame.

            movies_skip_rows.columns = [
                'Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating',
                'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director',
                'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director',
                'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2',
                'Facebook Likes - Actor 3', 'Facebook Likes - cast Total',
                'Facebook likes - Movie', 'Facenumber in posters', 'User Votes',
                'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score']
            movies_skip_rows.head()
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 136 23 18.0 203 12000 1 111841 413 260.0 8.3
1 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 426 20 3.0 455 926 1 7431 84 71.0 8.0
2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 77 28 4.0 109 167 8 4546 71 36.0 6.3
3 Hell's Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN 431 12 4.0 457 279 1 3753 53 35.0 7.8
4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns
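
The row-skipping options and the column naming can also be combined in a single call. The sketch below is an illustration under several assumptions: it writes a tiny sheet to an in-memory buffer instead of using the tutorial's file, the junk rows are invented, the openpyxl engine is installed, and the pandas release is recent enough to spell the footer option skipfooter. It also uses the names argument of read_excel to label the columns while reading, instead of assigning them afterwards.

```python
import io

import pandas as pd

# A tiny sheet with two junk rows on top and one at the bottom (invented values).
raw = pd.DataFrame([
    ["Exported report", None, None],
    ["Hypothetical generator v1", None, None],
    ["Metropolis", 1927, 8.3],
    ["Pandora's Box", 1929, 8.0],
    ["End of report", None, None],
])
buffer = io.BytesIO()
raw.to_excel(buffer, header=False, index=False, engine="openpyxl")
buffer.seek(0)

clean = pd.read_excel(
    buffer,
    header=None,                            # the sheet has no header row
    skiprows=2,                             # drop the two junk rows at the top
    skipfooter=1,                           # drop the junk row at the bottom
    names=["Title", "Year", "IMDB Score"],  # label the columns while reading
    engine="openpyxl",
)
print(clean)
```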

Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a subset of columns.

Reading a subset of columns

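Although read_excel defaults to reading and importing all columns, you can choose to import only certain columns. In older pandas releases, passing parse_cols=6 tells the read_excel method to read only the columns up to index six, that is, the first seven columns (the first column being indexed zero). Recent releases call this argument usecols and expect a list of column positions rather than a single number, which is the form used below.

            # older pandas releases: pd.read_excel(excel_file, parse_cols=6)
            movies_subset_columns = pd.read_excel(excel_file, usecols=list(range(7)))
            movies_subset_columns.head()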
Title Year Genres Language Country Content Rating Duration
0 Intolerance: Love's Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123
1 Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110
2 The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151
3 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145
4 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110

Alternatively, you can pass in a list of numbers, which will let you import columns at particular indexes.
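
A minimal sketch of the list form follows. It assumes a recent pandas release, where the argument is named usecols, and the openpyxl engine, and it reads from a small in-memory workbook rather than movies.xls. The second call shows the Excel-style column letters that usecols also accepts.

```python
import io

import pandas as pd

# Write a four-column demo sheet to an in-memory buffer.
source = pd.DataFrame({
    "Title": ["Metropolis", "Pandora's Box"],
    "Year": [1927, 1929],
    "Genres": ["Drama|Sci-Fi", "Crime|Drama|Romance"],
    "Language": ["German", "German"],
})
buffer = io.BytesIO()
source.to_excel(buffer, index=False, engine="openpyxl")

# Keep only the columns at positions 0 and 2 (Title and Genres).
buffer.seek(0)
subset = pd.read_excel(buffer, usecols=[0, 2], engine="openpyxl")

# The same selection expressed with Excel column letters.
buffer.seek(0)
same_subset = pd.read_excel(buffer, usecols="A,C", engine="openpyxl")

print(list(subset.columns))
```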

Applying formulas on the columns

One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net Earnings by subtracting Budget from Gross Earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas as well, as shown below.

            movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]

Above, we used pandas to create a new column called Net Earnings, and populated it with the difference of Gross Earnings and Budget. It's worth noting the difference here in how formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data changes. With Python, the calculations happen once and the values are stored; if Gross Earnings for one movie was manually changed, Net Earnings won't be updated.
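
The sketch below makes this difference concrete on a two-row stand-in DataFrame. The Budget and Gross Earnings figures come from the tables above, while the edited value of 7000000.0 is invented purely for the demonstration.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Budget": [100000.0, 6000000.0], "Gross Earnings": [3000000.0, 26435.0]},
    index=["Over the Hill to the Poorhouse", "Metropolis"],
)
demo["Net Earnings"] = demo["Gross Earnings"] - demo["Budget"]

# Change an input value after the calculation has run (invented figure).
demo.loc["Metropolis", "Gross Earnings"] = 7000000.0

# The stored result still reflects the old input.
stale = demo.loc["Metropolis", "Net Earnings"]

# Running the calculation again brings the column up to date.
demo["Net Earnings"] = demo["Gross Earnings"] - demo["Budget"]
fresh = demo.loc["Metropolis", "Net Earnings"]

print(stale, fresh)
```

In a script or notebook this is rarely a problem, since rerunning the code recomputes every derived column from the current data.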

Let's use the sort_values method to sort the data by the new column we created and visualize the top 10 movies by Net Earnings.

            sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False])
            sorted_movies.head(10)['Net Earnings'].plot.barh()
            plt.show()

python-pandas-and-excel_44_0

Pivot Table in pandas

Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.

We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let's start small, by choosing Year as the index column and Gross Earnings as the summarization column and creating a separate DataFrame from this data.

            movies_subset = movies[['Year', 'Gross Earnings']]
            movies_subset.head()
Year Gross Earnings
0 1916.0 NaN
1 1920.0 3000000.0
2 1925.0 NaN
3 1927.0 26435.0
4 1929.0 9950.0

We now call pivot_table on this subset of data. The method pivot_table takes a parameter index. As mentioned, we want to use Year as the index.

            earnings_by_year = movies_subset.pivot_table(index=['Year'])
            earnings_by_year.head()
Gross Earnings
Year
1916.0 NaN
1920.0 3000000.0
1925.0 NaN
1927.0 26435.0
1929.0 1408975.0

This gave us a pivot table with grouping on Year and summarization on the mean of Gross Earnings, since averaging is what pivot_table does when no aggregation function is specified. Notice that we didn't need to specify the Gross Earnings column explicitly, as pandas automatically identified it as the values on which summarization should be applied.
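
The summarization column and the aggregation can also be spelled out through the values and aggfunc parameters of pivot_table. The sketch below uses three Gross Earnings values from the tables above: one film from 1927 and two from 1929. With no aggfunc, pivot_table averages the values; passing aggfunc="sum" totals them instead.

```python
import pandas as pd

demo = pd.DataFrame({
    "Year": [1927, 1929, 1929],
    "Gross Earnings": [26435.0, 9950.0, 2808000.0],
})

# Default aggregation: the mean of Gross Earnings per Year.
mean_by_year = demo.pivot_table(index="Year", values="Gross Earnings")

# Explicit aggregation: the sum of Gross Earnings per Year.
sum_by_year = demo.pivot_table(index="Year", values="Gross Earnings", aggfunc="sum")

print(mean_by_year)
print(sum_by_year)
```

The mean for 1929 in this sketch, 1408975.0, matches the value shown for 1929 in the pivot table above.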

We can use this pivot table to create some data visualizations. We can call the plot method on the DataFrame to create a line plot and call the show method to display the plot in the notebook.

            earnings_by_year.plot()
            plt.show()

python-pandas-and-excel_50_0

We saw how to pivot with a single column as the index. Things will get more interesting if we can use multiple columns. Let's create another DataFrame subset, but this time we will choose the columns Country, Language and Gross Earnings.

            movies_subset = movies[['Country', 'Language', 'Gross Earnings']]
            movies_subset.head()
Country Language Gross Earnings
0 USA NaN NaN
1 USA NaN 3000000.0
2 USA NaN NaN
3 Germany German 26435.0
4 Germany German 9950.0

We will use the columns Country and Language as the index for the pivot table. We will use Gross Earnings as the summarization column; however, we do not need to specify this explicitly, as we saw earlier.

            earnings_by_co_lang = movies_subset.pivot_table(index=['Country', 'Language'])
            earnings_by_co_lang.head()
Gross Earnings
Country Language
Afghanistan Dari 1.127331e+06
Argentina Spanish 7.230936e+06
Aruba English 1.007614e+07
Australia Aboriginal 6.165429e+06
Dzongkha 5.052950e+05

Let's visualize this pivot table with a bar plot. Since there are still a few hundred records in this pivot table, we will plot just a few of them.

            earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8))
            plt.show()

python-pandas-and-excel_56_0

Exporting the results to Excel

If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is important. You can export or write a pandas DataFrame to an Excel file using the pandas to_excel method. Pandas relies on a separate Python module internally for writing to Excel files: xlwt for the legacy .xls format in older releases, and OpenPyXL or XlsxWriter for .xlsx files. The to_excel method is called on the DataFrame we want to export. We also need to pass a filename to which this DataFrame will be written.

            movies.to_excel('output.xlsx')          

By default, the index is also saved to the output file. However, sometimes the index doesn't provide any useful information. For example, the movies DataFrame has a numeric auto-increment index that was not part of the original Excel data.

            movies.head()
Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score Net Earnings
0 Intolerance: Love's Struggle Throughout the Ages 1916.0 Drama|History|War NaN USA Not Rated 123.0 1.33 385907.0 NaN 22.0 9.0 481 691 1.0 10718 88.0 69.0 8.0 NaN
1 Over the Hill to the Poorhouse 1920.0 Crime|Drama NaN USA NaN 110.0 1.33 100000.0 3000000.0 2.0 0.0 4 0 1.0 5 1.0 1.0 4.8 2900000.0
2 The Big Parade 1925.0 Drama|Romance|War NaN USA Not Rated 151.0 1.33 245000.0 NaN 12.0 6.0 108 226 0.0 4849 45.0 48.0 8.3 NaN
3 Metropolis 1927.0 Drama|Sci-Fi German Germany Not Rated 145.0 1.33 6000000.0 26435.0 23.0 18.0 203 12000 1.0 111841 413.0 260.0 8.3 -5973565.0
4 Pandora's Box 1929.0 Crime|Drama|Romance German Germany Not Rated 110.0 1.33 NaN 9950.0 20.0 3.0 455 926 1.0 7431 84.0 71.0 8.0 NaN

5 rows × 26 columns

You can choose to skip the index by passing along index=False.

            movies.to_excel('output.xlsx', index=False)

We need to be able to make our output files look nice before we can send them out to our co-workers. We can use the pandas ExcelWriter class along with the XlsxWriter Python module to apply the formatting.

We can use these advanced output options by creating an ExcelWriter object and using this object to write to the Excel file.

            writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')
            movies.to_excel(writer, index=False, sheet_name='report')
            workbook = writer.book
            worksheet = writer.sheets['report']

We can apply customizations by calling add_format on the workbook we are writing to. Here we are setting the header format as bold.

            header_fmt = workbook.add_format({'bold': True}) worksheet.set_row(0, None, header_fmt)          

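Finally, we save the output file by calling the method save on the writer object. Recent pandas releases drop save in favor of close, which also writes the file to disk, so that is the call used here.

            writer.close()

As an example, we saved the data with column headers set as bold. The saved file looks like the image below.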

img-excel-output-bold-1

Like this, one can use XlsxWriter to apply various formatting to the output Excel file.
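
ExcelWriter can also be used as a context manager, in which case the workbook is saved and closed when the with block ends. The sketch below writes two sheets this way. It is an illustration under our own assumptions: it writes to an in-memory buffer standing in for a path such as 'output.xlsx', it uses the openpyxl engine rather than XlsxWriter, and the sheet names are made up.

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "Title": ["Metropolis", "102 Dalmatians", "127 Hours"],
    "Year": [1927, 2000, 2010],
})

buffer = io.BytesIO()  # stands in for a file path such as 'output.xlsx'
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    # Each to_excel call adds one sheet to the same workbook.
    demo[demo["Year"] < 2000].to_excel(writer, sheet_name="before_2000", index=False)
    demo[demo["Year"] >= 2000].to_excel(writer, sheet_name="from_2000", index=False)
# Leaving the with block saves and closes the workbook.

# Read the sheet names back to confirm what was written.
buffer.seek(0)
written = pd.ExcelFile(buffer, engine="openpyxl").sheet_names
print(written)
```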

Conclusion

Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and can be very good companion tools. As we demonstrated, pandas can do a lot of complex data analysis and manipulation, which, depending on your need and expertise, can go beyond what you can achieve if you are only using Excel. One of the major benefits of using Python and pandas over Excel is that it helps you automate Excel file processing by writing scripts and integrating with your automated data workflow. Pandas also has excellent methods for reading all kinds of data from Excel files. You can export your results from pandas back to Excel too, if that's preferred by your intended audience.

On the other hand, Excel is such a widely used data tool that it's not wise to ignore it. Acquiring expertise in both pandas and Excel and making them work together gives you skills that can help you stand out in your organization.


Source: https://www.dataquest.io/blog/excel-and-pandas/

Posted by: thomasprel1989.blogspot.com
