  1. 11 Jul, 2017 1 commit
    • Teach irqbalance about Intel CoD. · 7bc1244f
      Krister Johansen authored
      This originally surfaced as a bug in placing network interrupts.  In
      the case the submitter observed, the NIC was in NUMA domain 0, but
      each RSS interrupt was getting an affinity list for all CPUs in the
      domain.  The expected behavior is for a single CPU to be chosen when
      attempting to fan out NIC interrupts.  Due to other implementation
      details of interrupt placement, this effectively caused all interrupt
      mappings for this NIC to end up on CPU 0.
      
      The bug turns out to have been caused by Intel Cluster on Die breaking
      an assumption in irqbalance about the design of the component hierarchy.
      The CoD topology allows a CPU package to belong to more than one NUMA
      node, which is not expected.
      
      The RCA was that when the second NUMA node was wired up to the existing
      physical package, it overwrote the mappings that were placed there by
      the first.
      
      This patch attempts to solve that problem by permitting a package to
      have multiple NUMA nodes.  The CPU component hierarchy is preserved, in
      case other parts of the code depend upon walking it.  When a CoD
      topology is detected, the NUMA node -> CPU component mapping is moved
      down a level, so that the nodes point to the first level where the
      affinity becomes distinct.  In practice, this has been observed to be
      the LLC.
      
      A quick illustration (with CoD, the topology now looks like this):
      
                       +-----------+
                       | NUMA Node |
                       |     0     |
                       +-----------+
                             |
                             |        +-------+
                            \|/     / | CPU 0 |
                         +---------+  +-------+
                         | Cache 0 |
                         +---------+  +-------+
                         /          \ | CPU 1 |
            +-----------+             +-------+
            | Package 0 |
            +-----------+             +-------+
                        \           / | CPU 2 |
                         +---------+  +-------+
                         | Cache 1 |
                         +---------+
                             ^      \ +-------+
                             |        | CPU 3 |
                             |        +-------+
                       +-----------+
                       | NUMA Node |
                       |     1     |
                       +-----------+
      
      Previously, only NUMA Node 1 would end up pointing to Package 0.  The
      topology should be no different on platforms that do not enable CoD.
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
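      A minimal sketch of the attach-point decision described above; the
      struct layout and helper name are illustrative assumptions, not the
      actual irqbalance source:

      /* Illustrative topology object; the real irqbalance structs differ. */
      struct topo_obj {
              struct topo_obj *parent;        /* next level up the hierarchy */
              struct topo_obj **children;     /* next level down */
              int num_children;
      };

      /*
       * Pick the object a NUMA node should point at.  On a CoD system the
       * package is shared between nodes, so attaching there would overwrite
       * the first node's mapping; instead, step down to the first level
       * with distinct affinity (in practice, the LLC).
       */
      static struct topo_obj *numa_attach_point(struct topo_obj *package,
                                                struct topo_obj *llc)
      {
              if (package->parent)    /* package already claimed by a node */
                      return llc;
              return package;
      }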
  2. 04 Aug, 2015 1 commit
    • irqbalance: fix irq_info->load miscalculation for cache domain and others · 8922ff13
      Seiichi Ikarashi authored
      I found a miscalculation of irq_info->load for irqs assigned to a
      topo_obj other than a CPU. With the debug option, you can see it as
      follows:
      
      --
      Package 0:  numa_node is 0 cpu mask is 0000003f (load 1770000000)
      (snip)
                        Interrupt 102 node_num is -1 (ethernet/459999226:11117)
                Interrupt 33 node_num is -1 (legacy/1:0)
        Interrupt 0 node_num is -1 (other/1770000000:1)
        Interrupt 64 node_num is -1 (other/1:0)
      --
      
      Though IRQ0 received just one interrupt during the period, its load
      value was wrongly calculated to be 1,770,000,000 nsec.
      This is because compute_irq_branch_load_share() does not take into
      account the irq counts of children and grandchildren, and so wrongly
      concludes that IRQ0 alone consumed all the (irq + softirq) CPU time.
      
      To fix the problem, I introduce the following two changes (a sketch
      follows the debug output below):
      
       - Add topo_obj->irq_count in order to accumulate interrupt counts
         effectively,
       - Change topo_obj->load from the average of the children's loads to
         their sum.  I believe this makes sense from a topological point of
         view.
      
      With the modification, the debug log shows:
      
      --
      Package 0:  numa_node is 0 cpu mask is 0000003f (load 7340266498)
      (snip)
                        Interrupt 102 node_num is -1 (ethernet/2459985766:61406)
                Interrupt 33 node_num is -1 (legacy/1:0)
        Interrupt 64 node_num is -1 (other/1:0)
        Interrupt 62 node_num is -1 (other/1:0)
        Interrupt 60 node_num is -1 (other/1:0)
        Interrupt 58 node_num is -1 (other/1:0)
        Interrupt 9 node_num is -1 (other/1:0)
        Interrupt 4 node_num is -1 (other/1:0)
        Interrupt 0 node_num is -1 (other/20815:1)
      --
      
      The load of IRQ0 is now calculated to be 20,815 nsec. It looks more
      accurate.
      Signed-off-by: Seiichi Ikarashi <s.ikarashi@jp.fujitsu.com>
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
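      A minimal sketch of the corrected load-share computation under the two
      changes above; the field and function names are illustrative, not the
      actual source:

      #include <stdint.h>

      struct irq_info {
              uint64_t irq_count;     /* interrupts seen this sample period */
      };

      struct topo_obj {
              uint64_t load;          /* now the SUM of the children's loads */
              uint64_t irq_count;     /* accumulated over this object and all
                                         children */
      };

      /*
       * Share a branch's (irq + softirq) time among its irqs in proportion
       * to their interrupt counts.  The old code left the children's counts
       * out of the denominator, so a one-interrupt irq could be charged the
       * whole branch's CPU time.
       */
      static uint64_t irq_branch_load_share(struct topo_obj *branch,
                                            struct irq_info *info)
      {
              if (branch->irq_count == 0)
                      return 0;
              return branch->load * info->irq_count / branch->irq_count;
      }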
  3. 20 May, 2014 1 commit
    • track hint policy on a per-irq basis · b6da319b
      Neil Horman authored
      Currently the hint policy for irqbalance is a global setting, applied
      equally to all irqs.  That's undesirable, however, as different
      devices may want to follow different policies.  Track the hint policy
      in each irq_info struct instead.  This still just follows the global
      policy, but paves the way for overriding it through the policyscript
      option.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
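      Conceptually, the change looks something like this sketch (the field
      and function names are assumptions):

      /* The three policies mirror the hintpolicy option. */
      enum hint_policy { HINT_EXACT, HINT_SUBSET, HINT_IGNORE };

      struct irq_info {
              int irq;
              enum hint_policy hint_policy;   /* now tracked per irq */
      };

      static enum hint_policy global_hint_policy;

      /* Each irq still starts out following the global setting; a future
       * policyscript hook can then override it per device. */
      static void seed_hint_policy(struct irq_info *info)
      {
              info->hint_policy = global_hint_policy;
      }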
  4. 11 Oct, 2011 1 commit
    • Fix up load estimator · 34ac21b1
      Neil Horman authored
      Found a few bugs in the load estimator - we were erroneously
      attributing load to multiple irqs, and in the process of fixing that
      I found that we had a bogus topology map - the same package was
      getting added multiple times to a given numa node, since we didn't
      detect that it was already part of that node's child list.
  5. 10 Oct, 2011 6 commits
    • Add powersave settings · e2f6588b
      Neil Horman authored
      Add an optional heuristic that allows cpus to not service interrupts
      during periods of low activity, to help conserve power.  If more than
      power_thresh cpus are more than a standard deviation below the
      average load, and no cpus are overloaded by more than a standard
      deviation while having more than one irq on them, then we stop
      balancing to a single cpu.  If at any time a cpu goes more than a
      standard deviation over the average load, we re-enable all the cpus
      for balancing.
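      A rough sketch of the heuristic as described, using per-cpu load and
      irq-count arrays as stand-ins for irqbalance's internal state:

      #include <math.h>

      static int should_powersave(const double *load, const int *nirqs,
                                  int ncpus, int power_thresh)
      {
              double avg = 0.0, var = 0.0, sd;
              int i, idle = 0;

              for (i = 0; i < ncpus; i++)
                      avg += load[i];
              avg /= ncpus;

              for (i = 0; i < ncpus; i++)
                      var += (load[i] - avg) * (load[i] - avg);
              sd = sqrt(var / ncpus);

              for (i = 0; i < ncpus; i++) {
                      /* an overloaded cpu with multiple irqs vetoes
                       * powersave */
                      if (load[i] > avg + sd && nirqs[i] > 1)
                              return 0;
                      if (load[i] < avg - sd)
                              idle++;
              }
              return idle > power_thresh;
      }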
    • add master list pointer to topo_obj · f06001f6
      Neil Horman authored
      It's convenient to know how many objects of a given type you have
      without having to know the specific object type.  We can get this
      info by giving each topo object a pointer to the master list for its
      type (cpus, cache_domains, packages, numa_nodes), assigned when we
      build the tree.
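      A small sketch of the idea, assuming glib GLists for the master lists
      (the field name is illustrative):

      #include <glib.h>

      struct topo_obj {
              GList **obj_type_list;  /* &cpus, &cache_domains, &packages,
                                         or &numa_nodes */
      };

      static GList *cpus, *cache_domains, *packages, *numa_nodes;

      /* Count an object's peers without knowing its concrete type. */
      static guint count_objects_of_same_type(struct topo_obj *obj)
      {
              return g_list_length(*obj->obj_type_list);
      }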
    • Add object type enumeration to topo map · 1a287acc
      Neil Horman authored
      Since we use a common object for our topology now, add some
      enumeration so we can tell what type of object we're looking at when
      debugging.
    • Rename common_obj_data to topo_obj · 587ba2f4
      Neil Horman authored
      Since consolidating the topology objects to a single structure, it
      seems better to rename it to something more descriptive.
    • Merge all topology objects to a common structure · decce934
      Neil Horman authored
      There's no need to treat topology objects differently.  We can merge
      them all down to a common structure.  This will make the balancing
      code a great deal more concise.
    • Clean up some unused data members · c10d9540
      Neil Horman authored
      Some of our data structures had dangling unused fields.  Get rid of
      them.
  6. 06 Oct, 2011 2 commits
    • Add back improved affinity_hint handling · 32521899
      Neil Horman authored
      The new balancer can now deal with affinity hinting again, this time
      in a reasonably sane manner.  Whereas before, having an affinity hint
      caused irqbalance to just assign that hint as the affinity, we now
      have a policy based operation, controlled by the hintpolicy option.
      The policy can be one of:

      exact  - the affinity_hint is applied for that irq without balancing
               consideration
      subset - balancing takes place, but the assigned affinity will be a
               subset of the hint
      ignore - the affinity_hint is ignored entirely
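      A sketch of how the three policies might be applied when resolving an
      irq's final affinity; modeling cpu masks as plain bitmasks is a
      simplification:

      #include <stdint.h>

      enum hint_policy { HINT_EXACT, HINT_SUBSET, HINT_IGNORE };

      static uint64_t resolve_affinity(enum hint_policy policy,
                                       uint64_t balanced, uint64_t hint)
      {
              switch (policy) {
              case HINT_EXACT:
                      return hint;            /* hint wins outright */
              case HINT_SUBSET:
                      return balanced & hint; /* balance, then restrict */
              case HINT_IGNORE:
              default:
                      return balanced;        /* hint plays no part */
              }
      }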
    • Cut over to base irq placement using new algorithm · 93f959c9
      Neil Horman authored
      This is the big move.  The main loop now uses the new balancing
      algorithm, based on the standard deviation away from the average
      softirq+irq time as read from /proc/stat.  Initial results look good.

      Also cleaned out old data from the previous algorithm, so we don't
      have any dangling mess.
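      A minimal sketch of pulling one cpu's irq + softirq time out of
      /proc/stat (field order per proc(5): user nice system idle iowait irq
      softirq); the helper name is illustrative:

      #include <stdio.h>
      #include <string.h>

      static int read_irq_time(int cpu, unsigned long long *irq_time)
      {
              char tag[16], line[256];
              unsigned long long v[7];
              int found = -1;
              FILE *f = fopen("/proc/stat", "r");

              if (!f)
                      return -1;
              snprintf(tag, sizeof(tag), "cpu%d ", cpu);
              while (fgets(line, sizeof(line), f)) {
                      if (!strncmp(line, tag, strlen(tag)) &&
                          sscanf(line + strlen(tag),
                                 "%llu %llu %llu %llu %llu %llu %llu",
                                 &v[0], &v[1], &v[2], &v[3],
                                 &v[4], &v[5], &v[6]) == 7) {
                              *irq_time = v[5] + v[6]; /* irq + softirq */
                              found = 0;
                              break;
                      }
              }
              fclose(f);
              return found;
      }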
  7. 03 Oct, 2011 5 commits
    • Remove mask and old_mask values from irq_info · 0ee17356
      Neil Horman authored
      We don't need them anymore, because unroutable irqs just don't get
      touched anymore, and we have the assigned_obj pointer from which to
      gather our mask value during activation.

      Note - this removal does necessitate the removal of affinity_hint
      handling, but we'll be reimplementing that soon, as the prior policy
      was rather inflexible.
    • Build a list of irqs to be migrated · e8b40b53
      Neil Horman authored
      Currently we re-examine all irqs on each iteration.  Instead, we
      should build a list of irqs we want to move, and only rebalance
      those.  For now we still rebalance all irqs every iteration, but this
      will soon give us a chance to be more selective.
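      A sketch of the pattern, assuming glib GLists as the list type; the
      names are illustrative:

      #include <glib.h>

      struct irq_info { int irq; };

      static GList *rebalance_irq_list;

      /* Queue only the irqs worth moving... */
      static void queue_for_migration(struct irq_info *info)
      {
              rebalance_irq_list = g_list_append(rebalance_irq_list, info);
      }

      /* ...then re-place just those, and reset the list for the next
       * pass. */
      static void migrate_pending_irqs(void (*place)(struct irq_info *))
      {
              GList *entry;

              for (entry = rebalance_irq_list; entry;
                   entry = g_list_next(entry))
                      place(entry->data);
              g_list_free(rebalance_irq_list);
              rebalance_irq_list = NULL;
      }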
    • Merge common fields of objects to a single struct · e53524aa
      Neil Horman authored
      Numa nodes, packages, cache domains and cores have lots of common
      fields.  Merge those into a single struct placed at the head of each
      object, so code has the opportunity to treat each object as a generic
      type.
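      A sketch of the common-header idiom the commit describes; the struct
      and field names are illustrative:

      #include <stdint.h>

      /* Shared fields live in one struct placed first in each object, so a
       * pointer to any object is also a valid pointer to the common
       * part. */
      struct common_obj_data {
              uint64_t load;
              int number;
      };

      struct package {
              struct common_obj_data common;  /* must remain first */
              /* package-specific fields ... */
      };

      struct cache_domain {
              struct common_obj_data common;  /* must remain first */
              /* cache-domain-specific fields ... */
      };

      /* C guarantees a pointer to a struct also points at its first
       * member. */
      static uint64_t object_load(void *obj)
      {
              return ((struct common_obj_data *)obj)->load;
      }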
    • Remove unneeded enum irq_prop · c80a1db7
      Neil Horman authored
    • Migrate to use of irq_info and remove struct interrupt · 57159ea2
      Neil Horman authored
      Migrate the core workload calculation code to use the new irq_info
      struct and the for_each_* helper functions.
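      A sketch of what such a for_each_* helper might look like, assuming a
      glib GList; the callback signature is an assumption:

      #include <glib.h>

      struct irq_info { int irq; };

      static void for_each_irq(GList *list,
                               void (*cb)(struct irq_info *, void *),
                               void *data)
      {
              GList *entry = g_list_first(list);

              while (entry) {
                      /* take the next pointer first, so a callback may
                       * safely remove the current entry */
                      GList *next = g_list_next(entry);

                      cb(entry->data, data);
                      entry = next;
              }
      }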
  8. 23 Sep, 2011 1 commit
    • Complete rework of how we detect and classify irqs · 32a7757a
      Neil Horman authored
      irqbalance has been broken for a long time.  Its ability to properly
      detect msi irqs and to correctly identify interrupt types (net vs.
      storage vs. other, etc.) has been based on some tenuous string
      comparison logic that was easily broken by administrative name
      changes for interfaces.  I've recently submitted this patch:
      https://lkml.org/lkml/2011/9/19/176
      which lets us use sysfs exclusively for finding device interrupts,
      which in turn lets us definitively identify irq types (legacy pci vs.
      msi), as well as properly classify them using the pci device class
      value.

      Additionally, this patch rips out the code that attempts to bias
      interrupt count volumes using network statistics, since there's no
      sane way to be certain a single network interrupt is responsible for
      the number of packets received on a given interface.  Workload
      computation is now done solely on irq count.  This may change in the
      future by adding /proc/stat irq and softirq time to the biasing
      mechanism.

      Note that without the above kernel change, this doesn't work right.
      Irqbalance contains a self-check in which it still identifies MSI
      interrupts in /proc/interrupts.  If it sees MSI irqs in
      /proc/interrupts, but none in sysfs, it will issue a loud warning
      about irqs being misclassified until the kernel is updated.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
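      A minimal sketch of classifying a device by the pci class value that
      sysfs exposes; the path handling and class table are simplified
      stand-ins for the real code:

      #include <stdio.h>

      /* Read the device's class attribute, e.g. "0x020000". */
      static int read_pci_class(const char *devpath, unsigned int *class)
      {
              char path[512];
              FILE *f;
              int rc = -1;

              snprintf(path, sizeof(path), "%s/class", devpath);
              f = fopen(path, "r");
              if (!f)
                      return -1;
              if (fscanf(f, "%x", class) == 1)
                      rc = 0;
              fclose(f);
              return rc;
      }

      /* The top byte of the 24-bit class value is the base class:
       * 0x01 = mass storage, 0x02 = network, 0x03 = display. */
      static const char *classify_irq(unsigned int class)
      {
              switch (class >> 16) {
              case 0x01: return "storage";
              case 0x02: return "ethernet";
              case 0x03: return "video";
              default:   return "other";
              }
      }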