Performance comparison between pandas Series and NumPy

I am building a performance comparison table between NumPy and pandas Series.

Two cases caught my eye. Any help would be much appreciated.

  1. We are told to avoid loops with NumPy and Series, but I came across one scenario where a for loop performs better.

In the code below I calculate the density of the planets with and without a for loop:

    import numpy as np
    import pandas as pd

    planets = ['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO']
    mass = pd.Series([0.330, 4.87, 5.97, 0.073, 0.642, 1898, 568, 86.8, 102, 0.0146], index=planets)
    diameter = pd.Series([4879, 12104, 12756, 3475, 6792, 142984, 120536, 51118, 49528, 2370], index=planets)
     
    %%timeit -n 1000
     
    density = mass / (np.pi * np.power(diameter, 3) /6)
     
    1000 loops, best of 3: 617 µs per loop
     
    %%timeit -n 1000
     
    density = pd.Series()
    
    for planet in mass.index:
        density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
     
    1000 loops, best of 3: 183 µs per loop

  2. Second, I am trying to replace NaN values in a Series using two approaches.

Why is the np.where approach so much faster than assigning through a boolean mask? My guess is that np.where converts the Series object into an n-d array.

    sample2 = pd.Series([1, 2, 3, 4325, 23, 3, 4213, 102, 89, 4, np.nan, 6, 803, 43, np.nan, np.nan, np.nan])
     
    x = np.mean(sample2)
     
    x
     
    %%timeit -n 10000
     
    sample3 = pd.Series(np.where(np.isnan(sample2), x, sample2))
     
    10000 loops, best of 3: 166 µs per loop
     
    %%timeit -n 10000
     
    sample2[np.isnan(sample2)] = x
     
    10000 loops, best of 3: 1.08 ms per loop
     
    %%timeit -n 10000
     
    sample2[pd.isnull(sample2)] = x
     
    10000 loops, best of 3: 1.02 ms per loop
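
One thing worth checking in the timings above: `sample2[np.isnan(sample2)] = x` modifies `sample2` in place, so after the first timed run there are no NaN values left to replace. The following is a minimal sketch, not a definitive benchmark, that runs each approach on the original data every time. The helper names `fill_with_where` and `fill_with_mask` are hypothetical, and the copy inside `fill_with_mask` is an assumption added so the input stays untouched (it also adds a little cost to that approach).

```python
import timeit

import numpy as np
import pandas as pd

sample2 = pd.Series([1, 2, 3, 4325, 23, 3, 4213, 102, 89, 4, np.nan,
                     6, 803, 43, np.nan, np.nan, np.nan])
x = sample2.mean()  # Series.mean skips NaN values by default


def fill_with_where(s, value):
    """Build a new Series from np.where, as in the first approach."""
    return pd.Series(np.where(np.isnan(s), value, s), index=s.index)


def fill_with_mask(s, value):
    """Assign through a boolean mask on a copy, so the input keeps its NaNs."""
    out = s.copy()
    out[out.isna()] = value
    return out


filled_where = fill_with_where(sample2, x)
filled_mask = fill_with_mask(sample2, x)

# Both approaches should give the same result, and sample2 is unchanged.
print(filled_where.equals(filled_mask))
print(int(sample2.isna().sum()))

# timeit.timeit returns the total time in seconds for `number` calls.
n = 1000
t_where = timeit.timeit(lambda: fill_with_where(sample2, x), number=n)
t_mask = timeit.timeit(lambda: fill_with_mask(sample2, x), number=n)
print(f"np.where:     {t_where / n * 1e6:.1f} µs per call")
print(f"boolean mask: {t_mask / n * 1e6:.1f} µs per call")
```

The absolute numbers will depend on the machine and the pandas version, so only the relative difference between the two lines is meaningful.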

Hi Abhishek,

  1. This looks a bit strange. I suspect the way you've used the %%timeit magic is not quite right: you are repeating the same task 1000 times, and 617 µs is the time for the complete task. You don't need to loop it 1000 times.
    Please note that, as I read it, you are comparing the time to compute the density for the whole Series in the first case (617 µs) with the time for a single iteration of the loop in the second case, i.e. computing the density for just one planet (183 µs).

  2. Yes, there may also be some overhead in the first method, since it creates a whole new pd.Series.
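
To rule out any doubt about what is being timed, one option is to wrap each approach in a function that computes the density for all ten planets and time both with the standard-library `timeit` module. This is only a sketch under a few assumptions: the function names `density_vectorized` and `density_loop` are made up for illustration, and `pd.Series(dtype=float)` is used instead of `pd.Series()` so that the empty Series has an explicit dtype.

```python
import timeit

import numpy as np
import pandas as pd

planets = ['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS',
           'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO']
mass = pd.Series([0.330, 4.87, 5.97, 0.073, 0.642,
                  1898, 568, 86.8, 102, 0.0146], index=planets)
diameter = pd.Series([4879, 12104, 12756, 3475, 6792,
                      142984, 120536, 51118, 49528, 2370], index=planets)


def density_vectorized():
    """Density of every planet in one vectorized expression."""
    return mass / (np.pi * np.power(diameter, 3) / 6)


def density_loop():
    """Density of every planet, filled in label by label."""
    density = pd.Series(dtype=float)
    for planet in mass.index:
        density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
    return density


# timeit.timeit returns the total time in seconds for `number` calls,
# so dividing by `number` gives the time for one complete computation.
n = 200
t_vectorized = timeit.timeit(density_vectorized, number=n)
t_loop = timeit.timeit(density_loop, number=n)
print(f"vectorized: {t_vectorized / n * 1e6:.1f} µs per call")
print(f"for loop:   {t_loop / n * 1e6:.1f} µs per call")
```

Because each function covers the whole Series, the two per-call figures describe the same amount of work and can be compared directly.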