blackbird-op-linux - Blackbird™ Linux sources for OpenPOWER

	Commit message (Collapse)	Author	Age	Files	Lines
*	cpufreq: Fix NULL reference crash while accessing policy->governor_data	Viresh Kumar	2016-01-27	1	-3/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a race discovered by Juri, where we are able to: - create and read a sysfs file before policy->governor_data is being set to a non NULL value. OR - set policy->governor_data to NULL, and reading a file before being destroyed. And so such a crash is reported: Unable to handle kernel NULL pointer dereference at virtual address 0000000c pgd = edfc8000 [0000000c] pgd=bfc8c835 Internal error: Oops: 17 [#1] SMP ARM Modules linked in: CPU: 4 PID: 1730 Comm: cat Not tainted 4.5.0-rc1+ #463 Hardware name: ARM-Versatile Express task: ee8e8480 ti: ee930000 task.ti: ee930000 PC is at show_ignore_nice_load_gov_pol+0x24/0x34 LR is at show+0x4c/0x60 pc : [<c058f1bc>] lr : [<c058ae88>] psr: a0070013 sp : ee931dd0 ip : ee931de0 fp : ee931ddc r10: ee4bc290 r9 : 00001000 r8 : ef2cb000 r7 : ee4bc200 r6 : ef2cb000 r5 : c0af57b0 r4 : ee4bc2e0 r3 : 00000000 r2 : 00000000 r1 : c0928df4 r0 : ef2cb000 Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none Control: 10c5387d Table: adfc806a DAC: 00000051 Process cat (pid: 1730, stack limit = 0xee930210) Stack: (0xee931dd0 to 0xee932000) 1dc0: ee931dfc ee931de0 c058ae88 c058f1a4 1de0: edce3bc0 c07bfca4 edce3ac0 00001000 ee931e24 ee931e00 c01fcb90 c058ae48 1e00: 00000001 edce3bc0 00000000 00000001 ee931e50 ee8ff480 ee931e34 ee931e28 1e20: c01fb33c c01fcb0c ee931e8c ee931e38 c01a5210 c01fb314 ee931e9c ee931e48 1e40: 00000000 edce3bf0 befe4a00 ee931f78 00000000 00000000 000001e4 00000000 1e60: c00545a8 edce3ac0 00001000 00001000 befe4a00 ee931f78 00000000 00001000 1e80: ee931ed4 ee931e90 c01fbed8 c01a5038 ed085a58 00020000 00000000 00000000 1ea0: c0ad72e4 ee931f78 ee8ff488 ee8ff480 c077f3fc 00001000 befe4a00 ee931f78 1ec0: 00000000 00001000 ee931f44 ee931ed8 c017c328 c01fbdc4 00001000 00000000 1ee0: ee8ff480 00001000 ee931f44 ee931ef8 c017c65c c03deb10 ee931fac ee931f08 1f00: c0009270 c001f290 c0a8d968 ef2cb000 ef2cb000 ee8ff480 00000020 ee8ff480 1f20: ee8ff480 befe4a00 00001000 ee931f78 00000000 00000000 ee931f74 ee931f48 1f40: c017d1ec c017c2f8 c019c724 c019c684 ee8ff480 ee8ff480 00001000 befe4a00 1f60: 00000000 00000000 ee931fa4 ee931f78 c017d2a8 c017d160 00000000 00000000 1f80: 000a9f20 00001000 befe4a00 00000003 c000ffe4 ee930000 00000000 ee931fa8 1fa0: c000fe40 c017d264 000a9f20 00001000 00000003 befe4a00 00001000 00000000 Unable to handle kernel NULL pointer dereference at virtual address 0000000c 1fc0: 000a9f20 00001000 befe4a00 00000003 00000000 00000000 00000003 00000001 pgd = edfc4000 [0000000c] pgd=bfcac835 1fe0: 00000000 befe49dc 000197f8 b6e35dfc 60070010 00000003 3065b49d 134ac2c9 [<c058f1bc>] (show_ignore_nice_load_gov_pol) from [<c058ae88>] (show+0x4c/0x60) [<c058ae88>] (show) from [<c01fcb90>] (sysfs_kf_seq_show+0x90/0xfc) [<c01fcb90>] (sysfs_kf_seq_show) from [<c01fb33c>] (kernfs_seq_show+0x34/0x38) [<c01fb33c>] (kernfs_seq_show) from [<c01a5210>] (seq_read+0x1e4/0x4e4) [<c01a5210>] (seq_read) from [<c01fbed8>] (kernfs_fop_read+0x120/0x1a0) [<c01fbed8>] (kernfs_fop_read) from [<c017c328>] (__vfs_read+0x3c/0xe0) [<c017c328>] (__vfs_read) from [<c017d1ec>] (vfs_read+0x98/0x104) [<c017d1ec>] (vfs_read) from [<c017d2a8>] (SyS_read+0x50/0x90) [<c017d2a8>] (SyS_read) from [<c000fe40>] (ret_fast_syscall+0x0/0x1c) Code: e5903044 e1a00001 e3081df4 e34c1092 (e593300c) ---[ end trace 5994b9a5111f35ee ]--- Fix that by making sure, policy->governor_data is updated at the right places only. Cc: 4.2+ <stable@vger.kernel.org> # 4.2+ Reported-and-tested-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Fix negative idle_time when configured with ↵	Chen Yu	2016-01-05	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CONFIG_HZ_PERIODIC It is reported that, with CONFIG_HZ_PERIODIC=y cpu stays at the lowest frequency even if the usage goes to 100%, neither ondemand nor conservative governor works, however performance and userspace work as expected. If set with CONFIG_NO_HZ_FULL=y, everything goes well. This problem is caused by improper calculation of the idle_time when the load is extremely high(near 100%). Firstly, cpufreq_governor uses get_cpu_idle_time to get the total idle time for specific cpu, then: 1.If the system is configured with CONFIG_NO_HZ_FULL, the idle time is returned by ktime_get, which is always increasing, it's OK. 2.However, if the system is configured with CONFIG_HZ_PERIODIC, get_cpu_idle_time might not guarantee to be always increasing, because it will leverage get_cpu_idle_time_jiffy to calculate the idle_time, consider the following scenario: At T1: idle_tick_1 = total_tick_1 - user_tick_1 sample period(80ms)... At T2: ( T2 = T1 + 80ms): idle_tick_2 = total_tick_2 - user_tick_2 Currently the algorithm is using (idle_tick_2 - idle_tick_1) to get the delta idle_time during the past sample period, however it CAN NOT guarantee that idle_tick_2 >= idle_tick_1, especially when cpu load is high. (Yes, total_tick_2 >= total_tick_1, and user_tick_2 >= user_tick_1, but how about idle_tick_2 and idle_tick_1? No guarantee.) So governor might get a negative value of idle_time during the past sample period, which might mislead the system that the idle time is very big(converted to unsigned int), and the busy time is nearly zero, which causes the governor to always choose the lowest cpufreq, then cause this problem. In theory there are two solutions: 1.The logic should not rely on the idle tick during every sample period, but be based on the busy tick directly, as this is how 'top' is implemented. 2.Or the logic must make sure that the idle_time is strictly increasing during each sample period, then there would be no negative idle_time anymore. This solution requires minimum modification to current code and this patch uses method 2. Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821 Reported-by: Jan Fikar <j.fikar@gmail.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Use lockless timer function	Rafael J. Wysocki	2015-12-09	1	-27/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is possible to get rid of the timer_lock spinlock used by the governor timer function for synchronization, but a couple of races need to be avoided. The first race is between multiple dbs_timer_handler() instances that may be running in parallel with each other on different CPUs. Namely, one of them has to queue up the work item, but it cannot be queued up more than once. To achieve that, atomic_inc_return() can be used on the skip_work field of struct cpu_common_dbs_info. The second race is between an already running dbs_timer_handler() and gov_cancel_work(). In that case the dbs_timer_handler() might not notice the skip_work incrementation in gov_cancel_work() and it might queue up its work item after gov_cancel_work() had returned (and that work item would corrupt skip_work going forward). To prevent that from happening, gov_cancel_work() can be made wait for the timer function to complete (on all CPUs) right after skip_work has been incremented. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
*	cpufreq: governor: replace per-CPU delayed work with timers	Viresh Kumar	2015-12-09	1	-59/+83
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cpufreq governors evaluate load at sampling rate and based on that they update frequency for a group of CPUs belonging to the same cpufreq policy. This is required to be done in a single thread for all policy->cpus, but because we don't want to wakeup idle CPUs to do just that, we use deferrable work for this. If we would have used a single delayed deferrable work for the entire policy, there were chances that the CPU required to run the handler can be in idle and we might end up not changing the frequency for the entire group with load variations. And so we were forced to keep per-cpu works, and only the one that expires first need to do the real work and others are rescheduled for next sampling time. We have been using the more complex solution until now, where we used a delayed deferrable work for this, which is a combination of a timer and a work. This could be made lightweight by keeping per-cpu deferred timers with a single work item, which is scheduled by the first timer that expires. This patch does just that and here are important changes: - The timer handler will run in irq context and so we need to use a spin_lock instead of the timer_mutex. And so a separate timer_lock is created. This also makes the use of the mutex and lock quite clear, as we know what exactly they are protecting. - A new field 'skip_work' is added to track when the timer handlers can queue a work. More comments present in code. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Ashwin Chaugule <ashwin.chaugule@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: initialize/destroy timer_mutex with 'shared'	Viresh Kumar	2015-12-07	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	timer_mutex is required to be initialized only while memory for 'shared' is allocated and in a similar way it is required to be destroyed only when memory for 'shared' is freed. There is no need to do the same every time we start/stop the governor. Move code to initialize/destroy timer_mutex to the relevant places. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Pass policy as argument to ->gov_dbs_timer()	Viresh Kumar	2015-12-07	1	-1/+1
\| \| \| \| \| \| \| \|	Pass 'policy' as argument to ->gov_dbs_timer() instead of cdbs and dbs_data. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Quit work-handlers early if governor is stopped	Viresh Kumar	2015-11-02	1	-10/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	gov_queue_work() acquires cpufreq_governor_lock to allow cpufreq_governor_stop() to drain delayed work items possibly scheduled on CPUs that share the policy with a CPU being taken offline. However, the same goal may be achieved in a more straightforward way if the policy pointer in the struct cpu_dbs_info matching the policy CPU is reset upfront by cpufreq_governor_stop() under the timer_mutex belonging to it and checked against NULL, under the same lock, at the beginning of dbs_timer(). In that case every instance of dbs_timer() run for a struct cpu_dbs_info sharing the policy pointer in question after cpufreq_governor_stop() has started will notice that that pointer is NULL and bail out immediately without queuing up any new work items. In turn, gov_cancel_work() called by cpufreq_governor_stop() before destroying timer_mutex will wait for all of the delayed work items currently running on the CPUs sharing the policy to drop the mutex, so it may be destroyed safely. Make cpufreq_governor_stop() and dbs_timer() work as described and modify gov_queue_work() so it does not acquire cpufreq_governor_lock any more. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: create cpu/cpufreq at boot time	Viresh Kumar	2015-10-28	1	-15/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	Later patches will need to create policy specific directories in /sys/devices/system/cpu/cpufreq/ directory and so the cpufreq directory wouldn't be ever empty. And so no fun creating/destroying it on need basis anymore. Create it once on system boot. Reviewed-by: Saravana Kannan <skannan@codeaurora.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: conservative: remove 'enable' field	Viresh Kumar	2015-09-26	1	-11/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Conservative governor has its own 'enable' field to check if conservative governor is used for a CPU or not This can be checked by policy->governor with 'cpufreq_gov_conservative' and so this field can be dropped. Because its not guaranteed that dbs_info->cdbs.shared will a valid pointer for all CPUs (will be NULL for CPUs that don't use ondemand/conservative governors), we can't use it anymore. Lets get policy with cpufreq_cpu_get_raw() instead. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Don't WARN on invalid states	Viresh Kumar	2015-07-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	With previous commit, governors have started to return errors on invalid state-transition requests. We already have a WARN for an invalid state-transition request in cpufreq_governor_dbs(). This does trigger today, as the sequence of events isn't guaranteed by cpufreq core. Lets stop warning on that for now, and make sure we don't enter an invalid state. Reviewed-and-tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Avoid invalid states with additional checks	Viresh Kumar	2015-07-21	1	-11/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	There can be races where the request has come to a wrong state. For example INIT followed by STOP (instead of START) or START followed by EXIT (instead of STOP). Address these races by making sure the state-machine never gets into any invalid state. Also return an error if an invalid state-transition is requested. Reviewed-and-tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: split out common part of {cs\|od}_dbs_timer()	Viresh Kumar	2015-07-21	1	-4/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some part of cs_dbs_timer() and od_dbs_timer() is exactly same and is unnecessarily duplicated. Create the real work-handler in cpufreq_governor.c and put the common code in this routine (dbs_timer()). Shouldn't make any functional change. Reviewed-and-tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Keep single copy of information common to policy->cpus	Viresh Kumar	2015-07-21	1	-24/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some information is common to all CPUs belonging to a policy, but are kept on per-cpu basis. Lets keep that in another structure common to all policy->cpus. That will make updates/reads to that less complex and less error prone. The memory for cpu_common_dbs_info is allocated/freed at INIT/EXIT, so that it we don't reallocate it for STOP/START sequence. It will be also be used (in next patch) while the governor is stopped and so must not be freed that early. Reviewed-and-tested-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: rename cur_policy as policy	Viresh Kumar	2015-07-17	1	-9/+9
\| \| \| \| \| \| \| \| \|	Just call it 'policy', cur_policy is unnecessarily long and doesn't have any special meaning. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: name pointer to cpu_dbs_info as 'cdbs'	Viresh Kumar	2015-07-17	1	-13/+13
\| \| \| \| \| \| \| \| \|	It is called as 'cdbs' at most of the places and 'cpu_dbs' at others. Lets use 'cdbs' consistently for better readability. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Rename 'cpu_dbs_common_info' to 'cpu_dbs_info'	Viresh Kumar	2015-07-17	1	-10/+9
\| \| \| \| \| \| \| \| \| \|	Its not common info to all CPUs, but a structure representing common type of cpu info to both governor types. Lets drop 'common_' from its name. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Drop unused field 'cpu'	Viresh Kumar	2015-07-17	1	-1/+0
\| \| \| \| \| \| \| \|	Its not used at all, drop it. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Name delayed-work as dwork	Viresh Kumar	2015-07-17	1	-3/+3
\| \| \| \| \| \| \| \| \|	Delayed work was named as 'work' and to access work within it we do work.work. Not much readable. Rename delayed_work as 'dwork'. Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Serialize governor callbacks	Viresh Kumar	2015-06-15	1	-13/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are several races reported in cpufreq core around governors (only ondemand and conservative) by different people. There are at least two race scenarios present in governor code: (a) Concurrent access/updates of governor internal structures. It is possible that fields such as 'dbs_data->usage_count', etc. are accessed simultaneously for different policies using same governor structure (i.e. CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag unset). And because of this we can dereference bad pointers. For example consider a system with two CPUs with separate 'struct cpufreq_policy' instances. CPU0 governor: ondemand and CPU1: powersave. CPU0 switching to powersave and CPU1 to ondemand: CPU0 CPU1 store* store* cpufreq_governor_exit() cpufreq_governor_init() dbs_data = cdata->gdbs_data; if (!--dbs_data->usage_count) kfree(dbs_data); dbs_data->usage_count++; Bad pointer dereference There are other races possible between EXIT and START/STOP/LIMIT as well. Its really complicated. (b) Switching governor state in bad sequence: For example trying to switch a governor to START state, when the governor is in EXIT state. There are some checks present in __cpufreq_governor() but they aren't sufficient as they compare events against 'policy->governor_enabled', where as we need to take governor's state into account, which can be used by multiple policies. These two issues need to be solved separately and the responsibility should be properly divided between cpufreq and governor core. The first problem is more about the governor core, as it needs to protect its structures properly. And the second problem should be fixed in cpufreq core instead of governor, as its all about sequence of events. This patch is trying to solve only the first problem. There are two types of data we need to protect, - 'struct common_dbs_data': No matter what, there is going to be a single copy of this per governor. - 'struct dbs_data': With CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag set, we will have per-policy copy of this data, otherwise a single copy. Because of such complexities, the mutex present in 'struct dbs_data' is insufficient to solve our problem. For example we need to protect fetching of 'dbs_data' from different structures at the beginning of cpufreq_governor_dbs(), to make sure it isn't currently being updated. This can be fixed if we can guarantee serialization of event parsing code for an individual governor. This is best solved with a mutex per governor, and the placeholder for that is 'struct common_dbs_data'. And so this patch moves the mutex from 'struct dbs_data' to 'struct common_dbs_data' and takes it at the beginning and drops it at the end of cpufreq_governor_dbs(). Tested with and without following configuration options: CONFIG_LOCKDEP_SUPPORT=y CONFIG_DEBUG_RT_MUTEXES=y CONFIG_DEBUG_PI_LIST=y CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y CONFIG_LOCKDEP=y CONFIG_DEBUG_ATOMIC_SLEEP=y Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: split cpufreq_governor_dbs()	Viresh Kumar	2015-06-15	1	-140/+189
\| \| \| \| \| \| \| \| \| \| \| \| \|	cpufreq_governor_dbs() is hardly readable, it is just too big and complicated. Lets make it more readable by splitting out event specific routines. Order of statements is changed at few places, but that shouldn't bring any functional change. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: register notifier from cs_init()	Viresh Kumar	2015-06-15	1	-19/+3
\| \| \| \| \| \| \| \| \| \|	Notifiers are required only for conservative governor and the common governor code is unnecessarily polluted with that. Handle that from cs_init/exit() instead of cpufreq_governor_dbs(). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: remove copy_prev_load from 'struct cpu_dbs_common_info'	Viresh Kumar	2014-06-09	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	'copy_prev_load' was recently added by commit: 18b46ab (cpufreq: governor: Be friendly towards latency-sensitive bursty workloads). It actually is a bit redundant as we also have 'prev_load' which can store any integer value and can be used instead of 'copy_prev_load' by setting it zero. True load can also turn out to be zero during long idle intervals (and hence the actual value of 'prev_load' and the overloaded value can clash). However this is not a problem because, if the true load was really zero in the previous interval, it makes sense to evaluate the load afresh for the current interval rather than copying the previous load. So, drop 'copy_prev_load' and use 'prev_load' instead. Update comments as well to make it more clear. There is another change here which was probably missed by Srivatsa during the last version of updates he made. The unlikely in the 'if' statement was covering only half of the condition and the whole line should actually come under it. Also checkpatch is made more silent as it was reporting this (--strict option): CHECK: Alignment should match open parenthesis + if (unlikely(wall_time > (2 * sampling_rate) && + j_cdbs->prev_load)) { Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Be friendly towards latency-sensitive bursty workloads	Srivatsa S. Bhat	2014-06-07	1	-3/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Cpufreq governors like the ondemand governor calculate the load on the CPU periodically by employing deferrable timers. A deferrable timer won't fire if the CPU is completely idle (and there are no other timers to be run), in order to avoid unnecessary wakeups and thus save CPU power. However, the load calculation logic is agnostic to all this, and this can lead to the problem described below. Time (ms) CPU 1 100 Task-A running 110 Governor's timer fires, finds load as 100% in the last 10ms interval and increases the CPU frequency. 110.5 Task-A running 120 Governor's timer fires, finds load as 100% in the last 10ms interval and increases the CPU frequency. 125 Task-A went to sleep. With nothing else to do, CPU 1 went completely idle. 200 Task-A woke up and started running again. 200.5 Governor's deferred timer (which was originally programmed to fire at time 130) fires now. It calculates load for the time period 120 to 200.5, and finds the load is almost zero. Hence it decreases the CPU frequency to the minimum. 210 Governor's timer fires, finds load as 100% in the last 10ms interval and increases the CPU frequency. So, after the workload woke up and started running, the frequency was suddenly dropped to absolute minimum, and after that, there was an unnecessary delay of 10ms (sampling period) to increase the CPU frequency back to a reasonable value. And this pattern repeats for every wake-up-from-cpu-idle for that workload. This can be quite undesirable for latency- or response-time sensitive bursty workloads. So we need to fix the governor's logic to detect such wake-up-from- cpu-idle scenarios and start the workload at a reasonably high CPU frequency. One extreme solution would be to fake a load of 100% in such scenarios. But that might lead to undesirable side-effects such as frequency spikes (which might also need voltage changes) especially if the previous frequency happened to be very low. We just want to avoid the stupidity of dropping down the frequency to a minimum and then enduring a needless (and long) delay before ramping it up back again. So, let us simply carry forward the previous load - that is, let us just pretend that the 'load' for the current time-window is the same as the load for the previous window. That way, the frequency and voltage will continue to be set to whatever values they were set at previously. This means that bursty workloads will get a chance to influence the CPU frequency at which they wake up from cpu-idle, based on their past execution history. Thus, they might be able to avoid suffering from slow wakeups and long response-times. However, we should take care not to over-do this. For example, such a "copy previous load" logic will benefit cases like this: (where # represents busy and . represents idle) ##########.........#########.........###########...........##########........ but it will be detrimental in cases like the one shown below, because it will retain the high frequency (copied from the previous interval) even in a mostly idle system: ##########.........#.................#.....................#............... (i.e., the workload finished and the remaining tasks are such that their busy periods are smaller than the sampling interval, which causes the timer to always get deferred. So, this will make the copy-previous-load logic copy the initial high load to subsequent idle periods over and over again, thus keeping the frequency high unnecessarily). So, we modify this copy-previous-load logic such that it is used only once upon every wakeup-from-idle. Thus if we have 2 consecutive idle periods, the previous load won't get blindly copied over; cpufreq will freshly evaluate the load in the second idle interval, thus ensuring that the system comes back to its normal state. [ The right way to solve this whole problem is to teach the CPU frequency governors to also track load on a per-task basis, not just a per-CPU basis, and then use both the data sources intelligently to set the appropriate frequency on the CPUs. But that involves redesigning the cpufreq subsystem, so this patch should make the situation bearable until then. ] Experimental results: +-------------------+ I ran a modified version of ebizzy (called 'sleeping-ebizzy') that sleeps in between its execution such that its total utilization can be a user-defined value, say 10% or 20% (higher the utilization specified, lesser the amount of sleeps injected). This ebizzy was run with a single-thread, tied to CPU 8. Behavior observed with tracing (sample taken from 40% utilization runs): ------------------------------------------------------------------------ Without patch: ~~~~~~~~~~~~~~ kworker/8:2-12137 416.335742: cpu_frequency: state=2061000 cpu_id=8 kworker/8:2-12137 416.335744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40753 416.345741: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-12137 416.345744: cpu_frequency: state=4123000 cpu_id=8 kworker/8:2-12137 416.345746: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40753 416.355738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 <snip> --------------------------------------------------------------------- <snip> <...>-40753 416.402202: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8 <idle>-0 416.502130: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy <...>-40753 416.505738: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-12137 416.505739: cpu_frequency: state=2061000 cpu_id=8 kworker/8:2-12137 416.505741: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40753 416.515739: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-12137 416.515742: cpu_frequency: state=4123000 cpu_id=8 kworker/8:2-12137 416.515744: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy Observation: Ebizzy went idle at 416.402202, and started running again at 416.502130. But cpufreq noticed the long idle period, and dropped the frequency at 416.505739, only to increase it back again at 416.515742, realizing that the workload is in-fact CPU bound. Thus ebizzy needlessly ran at the lowest frequency for almost 13 milliseconds (almost 1 full sample period), and this pattern repeats on every sleep-wakeup. This could hurt latency-sensitive workloads quite a lot. With patch: ~~~~~~~~~~~ kworker/8:2-29802 464.832535: cpu_frequency: state=2061000 cpu_id=8 <snip> --------------------------------------------------------------------- <snip> kworker/8:2-29802 464.962538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 464.972533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-29802 464.972536: cpu_frequency: state=4123000 cpu_id=8 kworker/8:2-29802 464.972538: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 464.982531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 <snip> --------------------------------------------------------------------- <snip> kworker/8:2-29802 465.022533: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 465.032531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-29802 465.032532: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 465.035797: sched_switch: prev_comm=ebizzy ==> next_comm=swapper/8 <idle>-0 465.240178: sched_switch: prev_comm=swapper/8 ==> next_comm=ebizzy <...>-40738 465.242533: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 kworker/8:2-29802 465.242535: sched_switch: prev_comm=kworker/8:2 ==> next_comm=ebizzy <...>-40738 465.252531: sched_switch: prev_comm=ebizzy ==> next_comm=kworker/8:2 Observation: Ebizzy went idle at 465.035797, and started running again at 465.240178. Since ebizzy was the only real workload running on this CPU, cpufreq retained the frequency at 4.1Ghz throughout the run of ebizzy, no matter how many times ebizzy slept and woke-up in-between. Thus, ebizzy got the 10ms worth of 4.1 Ghz benefit during every sleep-wakeup (as compared to the run without the patch) and this boost gave a modest improvement in total throughput, as shown below. Sleeping-ebizzy records-per-second: ----------------------------------- Utilization Without patch With patch Difference (Absolute and % values) 10% 274767 277046 + 2279 (+0.829%) 20% 543429 553484 + 10055 (+1.850%) 40% 1090744 1107959 + 17215 (+1.578%) 60% 1634908 1662018 + 27110 (+1.658%) A rudimentary and somewhat approximately latency-sensitive workload such as sleeping-ebizzy itself showed a consistent, noticeable performance improvement with this patch. Hence, workloads that are truly latency-sensitive will benefit quite a bit from this change. Moreover, this is an overall win-win since this patch does not hurt power-savings at all (because, this patch does not reduce the idle time or idle residency; and the high frequency of the CPU when it goes to cpu-idle does not affect/hurt the power-savings of deep idle states). Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: remove race while accessing cur_policy	Bibek Basu	2014-05-20	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	While accessing cur_policy during executing events CPUFREQ_GOV_START, CPUFREQ_GOV_STOP, CPUFREQ_GOV_LIMITS, same mutex lock is not taken, dbs_data->mutex, which leads to race and data corruption while running continious suspend resume test. This is seen with ondemand governor with suspend resume test using rtcwake. Unable to handle kernel NULL pointer dereference at virtual address 00000028 pgd = ed610000 [00000028] pgd=adf11831, pte=00000000, *ppte=00000000 Internal error: Oops: 17 [#1] PREEMPT SMP ARM Modules linked in: nvhost_vi CPU: 1 PID: 3243 Comm: rtcwake Not tainted 3.10.24-gf5cf9e5 #1 task: ee708040 ti: ed61c000 task.ti: ed61c000 PC is at cpufreq_governor_dbs+0x400/0x634 LR is at cpufreq_governor_dbs+0x3f8/0x634 pc : [<c05652b8>] lr : [<c05652b0>] psr: 600f0013 sp : ed61dcb0 ip : 000493e0 fp : c1cc14f0 r10: 00000000 r9 : 00000000 r8 : 00000000 r7 : eb725280 r6 : c1cc1560 r5 : eb575200 r4 : ebad7740 r3 : ee708040 r2 : ed61dca8 r1 : 001ebd24 r0 : 00000000 Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user Control: 10c5387d Table: ad61006a DAC: 00000015 [<c05652b8>] (cpufreq_governor_dbs+0x400/0x634) from [<c055f700>] (__cpufreq_governor+0x98/0x1b4) [<c055f700>] (__cpufreq_governor+0x98/0x1b4) from [<c0560770>] (__cpufreq_set_policy+0x250/0x320) [<c0560770>] (__cpufreq_set_policy+0x250/0x320) from [<c0561dcc>] (cpufreq_update_policy+0xcc/0x168) [<c0561dcc>] (cpufreq_update_policy+0xcc/0x168) from [<c0561ed0>] (cpu_freq_notify+0x68/0xdc) [<c0561ed0>] (cpu_freq_notify+0x68/0xdc) from [<c008eff8>] (notifier_call_chain+0x4c/0x8c) [<c008eff8>] (notifier_call_chain+0x4c/0x8c) from [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) from [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) from [<c00aac6c>] (pm_qos_update_bounded_target+0xd8/0x310) [<c00aac6c>] (pm_qos_update_bounded_target+0xd8/0x310) from [<c00ab3b0>] (__pm_qos_update_request+0x64/0x70) [<c00ab3b0>] (__pm_qos_update_request+0x64/0x70) from [<c004b4b8>] (tegra_pm_notify+0x114/0x134) [<c004b4b8>] (tegra_pm_notify+0x114/0x134) from [<c008eff8>] (notifier_call_chain+0x4c/0x8c) [<c008eff8>] (notifier_call_chain+0x4c/0x8c) from [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) [<c008f3d4>] (__blocking_notifier_call_chain+0x50/0x68) from [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) [<c008f40c>] (blocking_notifier_call_chain+0x20/0x28) from [<c00ac228>] (pm_notifier_call_chain+0x1c/0x34) [<c00ac228>] (pm_notifier_call_chain+0x1c/0x34) from [<c00ad38c>] (enter_state+0xec/0x128) [<c00ad38c>] (enter_state+0xec/0x128) from [<c00ad400>] (pm_suspend+0x38/0xa4) [<c00ad400>] (pm_suspend+0x38/0xa4) from [<c00ac114>] (state_store+0x70/0xc0) [<c00ac114>] (state_store+0x70/0xc0) from [<c027b1e8>] (kobj_attr_store+0x14/0x20) [<c027b1e8>] (kobj_attr_store+0x14/0x20) from [<c019cd9c>] (sysfs_write_file+0x104/0x184) [<c019cd9c>] (sysfs_write_file+0x104/0x184) from [<c0143038>] (vfs_write+0xd0/0x19c) [<c0143038>] (vfs_write+0xd0/0x19c) from [<c0143414>] (SyS_write+0x4c/0x78) [<c0143414>] (SyS_write+0x4c/0x78) from [<c000f080>] (ret_fast_syscall+0x0/0x30) Code: e1a00006 eb084346 e59b0020 e5951024 (e5903028) ---[ end trace 0488523c8f6b0f9d ]--- Signed-off-by: Bibek Basu <bbasu@nvidia.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 3.11+ <stable@vger.kernel.org> # 3.11+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: Fix timer/workqueue corruption by protecting reading governor_enabled	Jane Li	2014-01-06	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a CPU is hot removed we'll cancel all the delayed work items via gov_cancel_work(). Sometimes the delayed work function determines that it should adjust the delay for all other CPUs that the policy is managing. If this scenario occurs, the canceling CPU will cancel its own work but queue up the other CPUs works to run. Commit 3617f2 (cpufreq: Fix timer/workqueue corruption due to double queueing) has tried to fix this, but reading governor_enabled is not protected by cpufreq_governor_lock. Even though od_dbs_timer() checks governor_enabled before gov_queue_work(), this scenario may occur. For example: CPU0 CPU1 ---- ---- cpu_down() ... <work runs> __cpufreq_remove_dev() od_dbs_timer() __cpufreq_governor() policy->governor_enabled policy->governor_enabled = false; cpufreq_governor_dbs() case CPUFREQ_GOV_STOP: gov_cancel_work(dbs_data, policy); cpu0 work is canceled timer is canceled cpu1 work is canceled <waits for cpu1> gov_queue_work(, , true); cpu0 work queued cpu1 work queued cpu2 work queued ... cpu1 work is canceled cpu2 work is canceled ... At the end of the GOV_STOP case cpu0 still has a work queued to run although the code is expecting all of the works to be canceled. __cpufreq_remove_dev() will then proceed to re-initialize all the other CPUs works except for the CPU that is going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs() will trample over the queued work and debugobjects will spit out a warning: WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc() ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x14 Modules linked in: CPU: 1 PID: 1205 Comm: sh Tainted: G W 3.10.0 #200 [<c01144f0>] (unwind_backtrace+0x0/0xf8) from [<c0111d98>] (show_stack+0x10/0x14) [<c0111d98>] (show_stack+0x10/0x14) from [<c01272cc>] (warn_slowpath_common+0x4c/0x68) [<c01272cc>] (warn_slowpath_common+0x4c/0x68) from [<c012737c>] (warn_slowpath_fmt+0x30/0x40) [<c012737c>] (warn_slowpath_fmt+0x30/0x40) from [<c034c640>] (debug_print_object+0x94/0xbc) [<c034c640>] (debug_print_object+0x94/0xbc) from [<c034c7f8>] (__debug_object_init+0xc8/0x3c0) [<c034c7f8>] (__debug_object_init+0xc8/0x3c0) from [<c01360e0>] (init_timer_key+0x20/0x104) [<c01360e0>] (init_timer_key+0x20/0x104) from [<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c) [<c04872ac>] (cpufreq_governor_dbs+0x1dc/0x68c) from [<c04833a8>] (__cpufreq_governor+0x80/0x1b0) [<c04833a8>] (__cpufreq_governor+0x80/0x1b0) from [<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380) [<c0483704>] (__cpufreq_remove_dev.isra.12+0x22c/0x380) from [<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c) [<c0692f38>] (cpufreq_cpu_callback+0x48/0x5c) from [<c014fb40>] (notifier_call_chain+0x44/0x84) [<c014fb40>] (notifier_call_chain+0x44/0x84) from [<c012ae44>] (__cpu_notify+0x2c/0x48) [<c012ae44>] (__cpu_notify+0x2c/0x48) from [<c068dd40>] (_cpu_down+0x80/0x258) [<c068dd40>] (_cpu_down+0x80/0x258) from [<c068df40>] (cpu_down+0x28/0x3c) [<c068df40>] (cpu_down+0x28/0x3c) from [<c068e4c0>] (store_online+0x30/0x74) [<c068e4c0>] (store_online+0x30/0x74) from [<c03a7308>] (dev_attr_store+0x18/0x24) [<c03a7308>] (dev_attr_store+0x18/0x24) from [<c0256fe0>] (sysfs_write_file+0x100/0x180) [<c0256fe0>] (sysfs_write_file+0x100/0x180) from [<c01fec9c>] (vfs_write+0xbc/0x184) [<c01fec9c>] (vfs_write+0xbc/0x184) from [<c01ff034>] (SyS_write+0x40/0x68) [<c01ff034>] (SyS_write+0x40/0x68) from [<c010e200>] (ret_fast_syscall+0x0/0x48) In gov_queue_work(), lock cpufreq_governor_lock before gov_queue_work, and unlock it after __gov_queue_work(). In this way, governor_enabled is guaranteed not changed in gov_queue_work(). Signed-off-by: Jane Li <jiel@marvell.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()	lan,Tianyu	2013-11-16	1	-4/+0
\| \| \| \| \| \| \| \| \|	The related code has been changed and the comment is out of date. So remove it. Signed-off-by: Lan Tianyu <tianyu.lan@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: Don't use smp_processor_id() in preemptible context	Stephen Boyd	2013-08-29	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Workqueues are preemptible even if works are queued on them with queue_work_on(). Let's use raw_smp_processor_id() here to silence the warning. BUG: using smp_processor_id() in preemptible [00000000] code: kworker/3:2/674 caller is gov_queue_work+0x28/0xb0 CPU: 0 PID: 674 Comm: kworker/3:2 Tainted: G W 3.10.0 #30 Workqueue: events od_dbs_timer [<c010c178>] (unwind_backtrace+0x0/0x11c) from [<c0109dec>] (show_stack+0x10/0x14) [<c0109dec>] (show_stack+0x10/0x14) from [<c03885a4>] (debug_smp_processor_id+0xbc/0xf0) [<c03885a4>] (debug_smp_processor_id+0xbc/0xf0) from [<c0635864>] (gov_queue_work+0x28/0xb0) [<c0635864>] (gov_queue_work+0x28/0xb0) from [<c0635618>] (od_dbs_timer+0x108/0x134) [<c0635618>] (od_dbs_timer+0x108/0x134) from [<c01aa8f8>] (process_one_work+0x25c/0x444) [<c01aa8f8>] (process_one_work+0x25c/0x444) from [<c01aaf88>] (worker_thread+0x200/0x344) [<c01aaf88>] (worker_thread+0x200/0x344) from [<c01b03bc>] (kthread+0xa0/0xb0) [<c01b03bc>] (kthread+0xa0/0xb0) from [<c01061b8>] (ret_from_fork+0x14/0x3c) Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Fix typos in comments	Stratos Karafotis	2013-08-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	- 'Governer' should be 'Governor'. - 'S' is used for Siemens (electrical conductance) in SI units, so use small 's' for seconds. Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: Fix timer/workqueue corruption due to double queueing	Stephen Boyd	2013-08-28	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a CPU is hot removed we'll cancel all the delayed work items via gov_cancel_work(). Normally this will just cancels a delayed timer on each CPU that the policy is managing and the work won't run, but if the work is already running the workqueue code will wait for the work to finish before continuing to prevent the work items from re-queuing themselves like they normally do. This scheme will work most of the time, except for the case where the work function determines that it should adjust the delay for all other CPUs that the policy is managing. If this scenario occurs, the canceling CPU will cancel its own work but queue up the other CPUs works to run. For example: CPU0 CPU1 ---- ---- cpu_down() ... __cpufreq_remove_dev() cpufreq_governor_dbs() case CPUFREQ_GOV_STOP: gov_cancel_work(dbs_data, policy); cpu0 work is canceled timer is canceled cpu1 work is canceled <work runs> <waits for cpu1> od_dbs_timer() gov_queue_work(, , true); cpu0 work queued cpu1 work queued cpu2 work queued ... cpu1 work is canceled cpu2 work is canceled ... At the end of the GOV_STOP case cpu0 still has a work queued to run although the code is expecting all of the works to be canceled. __cpufreq_remove_dev() will then proceed to re-initialize all the other CPUs works except for the CPU that is going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs() will trample over the queued work and debugobjects will spit out a warning: WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc() ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x10 Modules linked in: CPU: 0 PID: 1491 Comm: sh Tainted: G W 3.10.0 #19 [<c010c178>] (unwind_backtrace+0x0/0x11c) from [<c0109dec>] (show_stack+0x10/0x14) [<c0109dec>] (show_stack+0x10/0x14) from [<c01904cc>] (warn_slowpath_common+0x4c/0x6c) [<c01904cc>] (warn_slowpath_common+0x4c/0x6c) from [<c019056c>] (warn_slowpath_fmt+0x2c/0x3c) [<c019056c>] (warn_slowpath_fmt+0x2c/0x3c) from [<c0388a7c>] (debug_print_object+0x94/0xbc) [<c0388a7c>] (debug_print_object+0x94/0xbc) from [<c0388e34>] (__debug_object_init+0x2d0/0x340) [<c0388e34>] (__debug_object_init+0x2d0/0x340) from [<c019e3b0>] (init_timer_key+0x14/0xb0) [<c019e3b0>] (init_timer_key+0x14/0xb0) from [<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8) [<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8) from [<c06325a0>] (__cpufreq_governor+0xdc/0x1a4) [<c06325a0>] (__cpufreq_governor+0xdc/0x1a4) from [<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434) [<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434) from [<c08989f4>] (cpufreq_cpu_callback+0x60/0x80) [<c08989f4>] (cpufreq_cpu_callback+0x60/0x80) from [<c08a43c0>] (notifier_call_chain+0x38/0x68) [<c08a43c0>] (notifier_call_chain+0x38/0x68) from [<c01938e0>] (__cpu_notify+0x28/0x40) [<c01938e0>] (__cpu_notify+0x28/0x40) from [<c0892ad4>] (_cpu_down+0x7c/0x2c0) [<c0892ad4>] (_cpu_down+0x7c/0x2c0) from [<c0892d3c>] (cpu_down+0x24/0x40) [<c0892d3c>] (cpu_down+0x24/0x40) from [<c0893ea8>] (store_online+0x2c/0x74) [<c0893ea8>] (store_online+0x2c/0x74) from [<c04519d8>] (dev_attr_store+0x18/0x24) [<c04519d8>] (dev_attr_store+0x18/0x24) from [<c02a69d4>] (sysfs_write_file+0x100/0x148) [<c02a69d4>] (sysfs_write_file+0x100/0x148) from [<c0255c18>] (vfs_write+0xcc/0x174) [<c0255c18>] (vfs_write+0xcc/0x174) from [<c0255f70>] (SyS_write+0x38/0x64) [<c0255f70>] (SyS_write+0x38/0x64) from [<c0106120>] (ret_fast_syscall+0x0/0x30) Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	Merge back earlier 'pm-cpufreq' material	Rafael J. Wysocki	2013-08-14	1	-15/+1
\|\
\| *	cpufreq: Clean up header files included in the core	Viresh Kumar	2013-08-07	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch addresses the following issues in the header files in the cpufreq core: - Include headers in ascending order, so that we don't add same many times by mistake. - <asm/> must be included after <linux/>, so that they override whatever they need to. - Remove unnecessary includes. - Don't include files already included by cpufreq.h or cpufreq_governor.h. [rjw: Changelog] Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
\| *	cpufreq: ondemand: Change the calculation of target frequency	Stratos Karafotis	2013-07-26	1	-9/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ondemand governor calculates load in terms of frequency and increases it only if load_freq is greater than up_threshold multiplied by the current or average frequency. This appears to produce oscillations of frequency between min and max because, for example, a relatively small load can easily saturate minimum frequency and lead the CPU to the max. Then, it will decrease back to the min due to small load_freq. Change the calculation method of load and target frequency on the basis of the following two observations: - Load computation should not depend on the current or average measured frequency. For example, absolute load of 80% at 100MHz is not necessarily equivalent to 8% at 1000MHz in the next sampling interval. - It should be possible to increase the target frequency to any value present in the frequency table proportional to the absolute load, rather than to the max only, so that: Target frequency = C * load where we take C = policy->cpuinfo.max_freq / 100. Tested on Intel i7-3770 CPU @ 3.40GHz and on Quad core 1500MHz Krait. Phoronix benchmark of Linux Kernel Compilation 3.1 test shows an increase ~1.5% in performance. cpufreq_stats (time_in_state) shows that middle frequencies are used more, with this patch. Highest and lowest frequencies were used less by ~9%. [rjw: We have run multiple other tests on kernels with this change applied and in the vast majority of cases it turns out that the resulting performance improvement also leads to reduced consumption of energy. The change is additionally justified by the overall simplification of the code in question.] Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* \|	cpufreq: rename ignore_nice as ignore_nice_load	Viresh Kumar	2013-08-07	1	-4/+4
\|/ \| \| \| \| \| \| \| \| \| \| \| \|	This sysfs file was called ignore_nice_load earlier and commit 4d5dcc4 (cpufreq: governor: Implement per policy instances of governors) changed its name to ignore_nice by mistake. Lets get it renamed back to its original name. Reported-by: Martin von Gagern <Martin.vGagern@gmx.net> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 3.10+ <stable@vger.kernel.org> # 3.10+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: Revert commit 2f7021a8 to fix CPU hotplug regression	Srivatsa S. Bhat	2013-07-16	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit 2f7021a8 "cpufreq: protect 'policy->cpus' from offlining during __gov_queue_work()" caused a regression in CPU hotplug, because it lead to a deadlock between cpufreq governor worker thread and the CPU hotplug writer task. Lockdep splat corresponding to this deadlock is shown below: [ 60.277396] ====================================================== [ 60.277400] [ INFO: possible circular locking dependency detected ] [ 60.277407] 3.10.0-rc7-dbg-01385-g241fd04-dirty #1744 Not tainted [ 60.277411] ------------------------------------------------------- [ 60.277417] bash/2225 is trying to acquire lock: [ 60.277422] ((&(&j_cdbs->work)->work)){+.+...}, at: [<ffffffff810621b5>] flush_work+0x5/0x280 [ 60.277444] but task is already holding lock: [ 60.277449] (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81042d8b>] cpu_hotplug_begin+0x2b/0x60 [ 60.277465] which lock already depends on the new lock. [ 60.277472] the existing dependency chain (in reverse order) is: [ 60.277477] -> #2 (cpu_hotplug.lock){+.+.+.}: [ 60.277490] [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200 [ 60.277503] [<ffffffff815b6157>] mutex_lock_nested+0x67/0x410 [ 60.277514] [<ffffffff81042cbc>] get_online_cpus+0x3c/0x60 [ 60.277522] [<ffffffff814b842a>] gov_queue_work+0x2a/0xb0 [ 60.277532] [<ffffffff814b7891>] cs_dbs_timer+0xc1/0xe0 [ 60.277543] [<ffffffff8106302d>] process_one_work+0x1cd/0x6a0 [ 60.277552] [<ffffffff81063d31>] worker_thread+0x121/0x3a0 [ 60.277560] [<ffffffff8106ae2b>] kthread+0xdb/0xe0 [ 60.277569] [<ffffffff815bb96c>] ret_from_fork+0x7c/0xb0 [ 60.277580] -> #1 (&j_cdbs->timer_mutex){+.+...}: [ 60.277592] [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200 [ 60.277600] [<ffffffff815b6157>] mutex_lock_nested+0x67/0x410 [ 60.277608] [<ffffffff814b785d>] cs_dbs_timer+0x8d/0xe0 [ 60.277616] [<ffffffff8106302d>] process_one_work+0x1cd/0x6a0 [ 60.277624] [<ffffffff81063d31>] worker_thread+0x121/0x3a0 [ 60.277633] [<ffffffff8106ae2b>] kthread+0xdb/0xe0 [ 60.277640] [<ffffffff815bb96c>] ret_from_fork+0x7c/0xb0 [ 60.277649] -> #0 ((&(&j_cdbs->work)->work)){+.+...}: [ 60.277661] [<ffffffff810ab826>] __lock_acquire+0x1766/0x1d30 [ 60.277669] [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200 [ 60.277677] [<ffffffff810621ed>] flush_work+0x3d/0x280 [ 60.277685] [<ffffffff81062d8a>] __cancel_work_timer+0x8a/0x120 [ 60.277693] [<ffffffff81062e53>] cancel_delayed_work_sync+0x13/0x20 [ 60.277701] [<ffffffff814b89d9>] cpufreq_governor_dbs+0x529/0x6f0 [ 60.277709] [<ffffffff814b76a7>] cs_cpufreq_governor_dbs+0x17/0x20 [ 60.277719] [<ffffffff814b5df8>] __cpufreq_governor+0x48/0x100 [ 60.277728] [<ffffffff814b6b80>] __cpufreq_remove_dev.isra.14+0x80/0x3c0 [ 60.277737] [<ffffffff815adc0d>] cpufreq_cpu_callback+0x38/0x4c [ 60.277747] [<ffffffff81071a4d>] notifier_call_chain+0x5d/0x110 [ 60.277759] [<ffffffff81071b0e>] __raw_notifier_call_chain+0xe/0x10 [ 60.277768] [<ffffffff815a0a68>] _cpu_down+0x88/0x330 [ 60.277779] [<ffffffff815a0d46>] cpu_down+0x36/0x50 [ 60.277788] [<ffffffff815a2748>] store_online+0x98/0xd0 [ 60.277796] [<ffffffff81452a28>] dev_attr_store+0x18/0x30 [ 60.277806] [<ffffffff811d9edb>] sysfs_write_file+0xdb/0x150 [ 60.277818] [<ffffffff8116806d>] vfs_write+0xbd/0x1f0 [ 60.277826] [<ffffffff811686fc>] SyS_write+0x4c/0xa0 [ 60.277834] [<ffffffff815bbbbe>] tracesys+0xd0/0xd5 [ 60.277842] other info that might help us debug this: [ 60.277848] Chain exists of: (&(&j_cdbs->work)->work) --> &j_cdbs->timer_mutex --> cpu_hotplug.lock [ 60.277864] Possible unsafe locking scenario: [ 60.277869] CPU0 CPU1 [ 60.277873] ---- ---- [ 60.277877] lock(cpu_hotplug.lock); [ 60.277885] lock(&j_cdbs->timer_mutex); [ 60.277892] lock(cpu_hotplug.lock); [ 60.277900] lock((&(&j_cdbs->work)->work)); [ 60.277907] * DEADLOCK * [ 60.277915] 6 locks held by bash/2225: [ 60.277919] #0: (sb_writers#6){.+.+.+}, at: [<ffffffff81168173>] vfs_write+0x1c3/0x1f0 [ 60.277937] #1: (&buffer->mutex){+.+.+.}, at: [<ffffffff811d9e3c>] sysfs_write_file+0x3c/0x150 [ 60.277954] #2: (s_active#61){.+.+.+}, at: [<ffffffff811d9ec3>] sysfs_write_file+0xc3/0x150 [ 60.277972] #3: (x86_cpu_hotplug_driver_mutex){+.+...}, at: [<ffffffff81024cf7>] cpu_hotplug_driver_lock+0x17/0x20 [ 60.277990] #4: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff815a0d32>] cpu_down+0x22/0x50 [ 60.278007] #5: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff81042d8b>] cpu_hotplug_begin+0x2b/0x60 [ 60.278023] stack backtrace: [ 60.278031] CPU: 3 PID: 2225 Comm: bash Not tainted 3.10.0-rc7-dbg-01385-g241fd04-dirty #1744 [ 60.278037] Hardware name: Acer Aspire 5741G /Aspire 5741G , BIOS V1.20 02/08/2011 [ 60.278042] ffffffff8204e110 ffff88014df6b9f8 ffffffff815b3d90 ffff88014df6ba38 [ 60.278055] ffffffff815b0a8d ffff880150ed3f60 ffff880150ed4770 3871c4002c8980b2 [ 60.278068] ffff880150ed4748 ffff880150ed4770 ffff880150ed3f60 ffff88014df6bb00 [ 60.278081] Call Trace: [ 60.278091] [<ffffffff815b3d90>] dump_stack+0x19/0x1b [ 60.278101] [<ffffffff815b0a8d>] print_circular_bug+0x2b6/0x2c5 [ 60.278111] [<ffffffff810ab826>] __lock_acquire+0x1766/0x1d30 [ 60.278123] [<ffffffff81067e08>] ? __kernel_text_address+0x58/0x80 [ 60.278134] [<ffffffff810ac6d4>] lock_acquire+0xa4/0x200 [ 60.278142] [<ffffffff810621b5>] ? flush_work+0x5/0x280 [ 60.278151] [<ffffffff810621ed>] flush_work+0x3d/0x280 [ 60.278159] [<ffffffff810621b5>] ? flush_work+0x5/0x280 [ 60.278169] [<ffffffff810a9b14>] ? mark_held_locks+0x94/0x140 [ 60.278178] [<ffffffff81062d77>] ? __cancel_work_timer+0x77/0x120 [ 60.278188] [<ffffffff810a9cbd>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [ 60.278196] [<ffffffff81062d8a>] __cancel_work_timer+0x8a/0x120 [ 60.278206] [<ffffffff81062e53>] cancel_delayed_work_sync+0x13/0x20 [ 60.278214] [<ffffffff814b89d9>] cpufreq_governor_dbs+0x529/0x6f0 [ 60.278225] [<ffffffff814b76a7>] cs_cpufreq_governor_dbs+0x17/0x20 [ 60.278234] [<ffffffff814b5df8>] __cpufreq_governor+0x48/0x100 [ 60.278244] [<ffffffff814b6b80>] __cpufreq_remove_dev.isra.14+0x80/0x3c0 [ 60.278255] [<ffffffff815adc0d>] cpufreq_cpu_callback+0x38/0x4c [ 60.278265] [<ffffffff81071a4d>] notifier_call_chain+0x5d/0x110 [ 60.278275] [<ffffffff81071b0e>] __raw_notifier_call_chain+0xe/0x10 [ 60.278284] [<ffffffff815a0a68>] _cpu_down+0x88/0x330 [ 60.278292] [<ffffffff81024cf7>] ? cpu_hotplug_driver_lock+0x17/0x20 [ 60.278302] [<ffffffff815a0d46>] cpu_down+0x36/0x50 [ 60.278311] [<ffffffff815a2748>] store_online+0x98/0xd0 [ 60.278320] [<ffffffff81452a28>] dev_attr_store+0x18/0x30 [ 60.278329] [<ffffffff811d9edb>] sysfs_write_file+0xdb/0x150 [ 60.278337] [<ffffffff8116806d>] vfs_write+0xbd/0x1f0 [ 60.278347] [<ffffffff81185950>] ? fget_light+0x320/0x4b0 [ 60.278355] [<ffffffff811686fc>] SyS_write+0x4c/0xa0 [ 60.278364] [<ffffffff815bbbbe>] tracesys+0xd0/0xd5 [ 60.280582] smpboot: CPU 1 is now offline The intention of that commit was to avoid warnings during CPU hotplug, which indicated that offline CPUs were getting IPIs from the cpufreq governor's work items. But the real root-cause of that problem was commit a66b2e5 (cpufreq: Preserve sysfs files across suspend/resume) because it totally skipped all the cpufreq callbacks during CPU hotplug in the suspend/resume path, and hence it never actually shut down the cpufreq governor's worker threads during CPU offline in the suspend/resume path. Reflecting back, the reason why we never suspected that commit as the root-cause earlier, was that the original issue was reported with just the halt command and nobody had brought in suspend/resume to the equation. The reason for _that_ in turn, as it turns out, is that earlier halt/shutdown was being done by disabling non-boot CPUs while tasks were frozen, just like suspend/resume.... but commit cf7df378a (reboot: migrate shutdown/reboot to boot cpu) which came somewhere along that very same time changed that logic: shutdown/halt no longer takes CPUs offline. Thus, the test-cases for reproducing the bug were vastly different and thus we went totally off the trail. Overall, it was one hell of a confusion with so many commits affecting each other and also affecting the symptoms of the problems in subtle ways. Finally, now since the original problematic commit (a66b2e5) has been completely reverted, revert this intermediate fix too (2f7021a8), to fix the CPU hotplug deadlock. Phew! Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Tested-by: Peter Wu <lekensteyn@gmail.com> Cc: 3.10+ <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: don't leave stale policy pointer in cdbs->cur_policy	Jacob Shin	2013-06-27	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Clear ->cur_policy when stopping a governor, or the ->cur_policy pointer may be stale on systems with have_governor_per_policy when a new policy is allocated due to CPU hotplug offline/online. [rjw: Changelog] Suggested-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Jacob Shin <jacob.shin@amd.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	Merge branch 'pm-cpufreq-assorted' into pm-cpufreq	Rafael J. Wysocki	2013-06-27	1	-44/+6
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* pm-cpufreq-assorted: (21 commits) cpufreq: powernow-k8: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: pcc: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: e_powersaver: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: ACPI: call CPUFREQ_POSTCHANGE notfier in error cases cpufreq: make __cpufreq_notify_transition() static cpufreq: Fix minor formatting issues cpufreq: Fix governor start/stop race condition cpufreq: Simplify userspace governor cpufreq: powerpc: move cpufreq driver to drivers/cpufreq cpufreq: kirkwood: Select CPU_FREQ_TABLE option cpufreq: big.LITTLE needs cpufreq table cpufreq: SPEAr needs cpufreq table cpufreq: powerpc: Add cpufreq driver for Freescale e500mc SoCs cpufreq: remove unnecessary cpufreq_cpu_{get\|put}() calls cpufreq: MAINTAINERS: Add git tree path for ARM specific updates cpufreq: rename index as driver_data in cpufreq_frequency_table cpufreq: Don't create empty /sys/devices/system/cpu/cpufreq directory cpufreq: Move get_cpu_idle_time() to cpufreq.c cpufreq: governors: Move get_governor_parent_kobj() to cpufreq.c cpufreq: Add EXPORT_SYMBOL_GPL for have_governor_per_policy ...
\| *	cpufreq: Don't create empty /sys/devices/system/cpu/cpufreq directory	Viresh Kumar	2013-05-27	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we don't have any file in cpu/cpufreq directory we shouldn't create it. Specially with the introduction of per-policy governor instance patchset, even governors are moved to cpu/cpu*/cpufreq/governor-name directory and so this directory is just not required. Lets have it only when required. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
\| *	cpufreq: Move get_cpu_idle_time() to cpufreq.c	Viresh Kumar	2013-05-27	1	-36/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Governors other than ondemand and conservative can also use get_cpu_idle_time() and they aren't required to compile cpufreq_governor.c. So, move these independent routines to cpufreq.c instead. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
\| *	cpufreq: governors: Move get_governor_parent_kobj() to cpufreq.c	Viresh Kumar	2013-05-27	1	-8/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	get_governor_parent_kobj() can be used by any governor, generic cpufreq governors or platform specific ones and so must be present in cpufreq.c instead of cpufreq_governor.c. This patch moves it to cpufreq.c. This also adds EXPORT_SYMBOL_GPL(get_governor_parent_kobj) so that modules can use this function too. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* \|	cpufreq: protect 'policy->cpus' from offlining during __gov_queue_work()	Michael Wang	2013-06-05	1	-0/+3
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Jiri Kosina <jkosina@suse.cz> and Borislav Petkov <bp@alien8.de> reported the warning: [ 51.616759] ------------[ cut here ]------------ [ 51.621460] WARNING: at arch/x86/kernel/smp.c:123 native_smp_send_reschedule+0x58/0x60() [ 51.629638] Modules linked in: ext2 vfat fat loop snd_hda_codec_hdmi usbhid snd_hda_codec_realtek coretemp kvm_intel kvm snd_hda_intel snd_hda_codec crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hwdep snd_pcm aesni_intel sb_edac aes_x86_64 ehci_pci snd_page_alloc glue_helper snd_timer xhci_hcd snd iTCO_wdt iTCO_vendor_support ehci_hcd edac_core lpc_ich acpi_cpufreq lrw gf128mul ablk_helper cryptd mperf usbcore usb_common soundcore mfd_core dcdbas evdev pcspkr processor i2c_i801 button microcode [ 51.675581] CPU: 0 PID: 244 Comm: kworker/1:1 Tainted: G W 3.10.0-rc1+ #10 [ 51.683407] Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A08 01/24/2013 [ 51.690901] Workqueue: events od_dbs_timer [ 51.695069] 0000000000000009 ffff88043a2f5b68 ffffffff8161441c ffff88043a2f5ba8 [ 51.702602] ffffffff8103e540 0000000000000033 0000000000000001 ffff88043d5f8000 [ 51.710136] 00000000ffff0ce1 0000000000000001 ffff88044fc4fc08 ffff88043a2f5bb8 [ 51.717691] Call Trace: [ 51.720191] [<ffffffff8161441c>] dump_stack+0x19/0x1b [ 51.725396] [<ffffffff8103e540>] warn_slowpath_common+0x70/0xa0 [ 51.731473] [<ffffffff8103e58a>] warn_slowpath_null+0x1a/0x20 [ 51.737378] [<ffffffff81025628>] native_smp_send_reschedule+0x58/0x60 [ 51.744013] [<ffffffff81072cfd>] wake_up_nohz_cpu+0x2d/0xa0 [ 51.749745] [<ffffffff8104f6bf>] add_timer_on+0x8f/0x110 [ 51.755214] [<ffffffff8105f6fe>] __queue_delayed_work+0x16e/0x1a0 [ 51.761470] [<ffffffff8105f251>] ? try_to_grab_pending+0xd1/0x1a0 [ 51.767724] [<ffffffff8105f78a>] mod_delayed_work_on+0x5a/0xa0 [ 51.773719] [<ffffffff814f6b5d>] gov_queue_work+0x4d/0xc0 [ 51.779271] [<ffffffff814f60cb>] od_dbs_timer+0xcb/0x170 [ 51.784734] [<ffffffff8105e75d>] process_one_work+0x1fd/0x540 [ 51.790634] [<ffffffff8105e6f2>] ? process_one_work+0x192/0x540 [ 51.796711] [<ffffffff8105ef22>] worker_thread+0x122/0x380 [ 51.802350] [<ffffffff8105ee00>] ? rescuer_thread+0x320/0x320 [ 51.808264] [<ffffffff8106634a>] kthread+0xea/0xf0 [ 51.813200] [<ffffffff81066260>] ? flush_kthread_worker+0x150/0x150 [ 51.819644] [<ffffffff81623d5c>] ret_from_fork+0x7c/0xb0 [ 51.918165] nouveau E[ DRM] GPU lockup - switching to software fbcon [ 51.930505] [<ffffffff81066260>] ? flush_kthread_worker+0x150/0x150 [ 51.936994] ---[ end trace f419538ada83b5c5 ]--- It was caused by the policy->cpus changed during the process of __gov_queue_work(), in other word, cpu offline happened. Use get/put_online_cpus() to prevent the offline from happening while __gov_queue_work() is running. [rjw: The problem has been present since recent commit 031299b (cpufreq: governors: Avoid unnecessary per cpu timer interrupts)] References: https://lkml.org/lkml/2013/6/5/88 Reported-by: Borislav Petkov <bp@alien8.de> Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: Fix CPUFREQ_GOV_POLICY_{INIT\|EXIT} notifiers	Viresh Kumar	2013-05-12	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are two types of INIT/EXIT activities that we need to do for governors: - Done only once per governor (doesn't depend how many instances of the governor there are). eg: cpufreq_register_notifier() for conservative governor. - Done per governor instance, eg: sysfs_{create\|remove}_group(). There were some corner cases where current code isn't able to handle them separately and so failing for some test cases. We use two separate variables now for keeping track of above two requirements. - governor->initialized for first one - dbs_data->usage_count for per governor instance Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: Calculate iowait time only when necessary	Stratos Karafotis	2013-04-01	1	-30/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently we always calculate the CPU iowait time and add it to idle time. If we are in ondemand and we use io_is_busy, we re-calculate iowait time and we subtract it from idle time. With this patch iowait time is calculated only when necessary avoiding the double call to get_cpu_iowait_time_us. We use a parameter in function get_cpu_idle_time to distinguish when the iowait time will be added to idle time or not, without the need of keeping the prev_io_wait. Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr> Acked-by: Viresh Kumar <viresh.kumar@linaro.,org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: Avoid unnecessary per cpu timer interrupts	Viresh Kumar	2013-04-01	1	-11/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Following patch has introduced per cpu timers or works for ondemand and conservative governors. commit 2abfa876f1117b0ab45f191fb1f82c41b1cbc8fe Author: Rickard Andersson <rickard.andersson@stericsson.com> Date: Thu Dec 27 14:55:38 2012 +0000 cpufreq: handle SW coordinated CPUs This causes additional unnecessary interrupts on all cpus when the load is recently evaluated by any other cpu. i.e. When load is recently evaluated by cpu x, we don't really need any other cpu to evaluate this load again for the next sampling_rate time. Some sort of code is present to avoid that but we are still getting timer interrupts for all cpus. A good way of avoiding this would be to modify delays for all cpus (policy->cpus) whenever any cpu has evaluated load. This patch does this change and some related code cleanup. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governor: Implement per policy instances of governors	Viresh Kumar	2013-04-01	1	-70/+142
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, there can't be multiple instances of single governor_type. If we have a multi-package system, where we have multiple instances of struct policy (per package), we can't have multiple instances of same governor. i.e. We can't have multiple instances of ondemand governor for multiple packages. Governors directory in sysfs is created at /sys/devices/system/cpu/cpufreq/ governor-name/. Which again reflects that there can be only one instance of a governor_type in the system. This is a bottleneck for multicluster system, where we want different packages to use same governor type, but with different tunables. This patch uses the infrastructure provided by earlier patch and implements init/exit routines for ondemand and conservative governors. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: Fix WARN_ON() for multi-policy platforms	Viresh Kumar	2013-02-09	1	-13/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On multi-policy systems there is a single instance of governor for both the policies (if same governor is chosen for both policies). With the code update from following patches: 8eeed09 cpufreq: governors: Get rid of dbs_data->enable field b394058 cpufreq: governors: Reset tunables only for cpufreq_unregister_governor() We are creating/removing sysfs directory of governor for for every call to GOV_START and STOP. This would fail for multi-policy system as there is a per-policy call to START/STOP. This patch reuses the governor->initialized variable to detect total users of governor. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: Don't check cpu_online(policy->cpu)	Viresh Kumar	2013-02-09	1	-1/+1
\| \| \| \| \| \| \| \|	policy->cpu or cpus in policy->cpus can't be offline anymore. And so we don't need to check if they are online or not. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: Reset tunables only for cpufreq_unregister_governor()	Viresh Kumar	2013-02-02	1	-8/+16
\| \| \| \| \| \| \| \| \| \|	Currently, whenever governor->governor() is called for CPUFRREQ_GOV_START event we reset few tunables of governor. Which isn't correct, as this routine is called for every cpu hot-[un]plugging event. We should actually be resetting these only when the governor module is removed and re-installed. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: Remove code redundancy between governors	Viresh Kumar	2013-02-02	1	-0/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With the inclusion of following patches: 9f4eb10 cpufreq: conservative: call dbs_check_cpu only when necessary 772b4b1 cpufreq: ondemand: call dbs_check_cpu only when necessary code redundancy between the conservative and ondemand governors is introduced again, so get rid of it. [rjw: Changelog] Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Fabio Baltieri <fabio.baltieri@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: Get rid of dbs_data->enable field	Viresh Kumar	2013-02-02	1	-38/+20
\| \| \| \| \| \| \| \| \| \| \| \|	CPUFREQ_GOV_START/STOP are called only once for all policy->cpus and hence we don't need to adapt cpufreq_governor_dbs() routine for multiple calls. So, this patch removes dbs_data->enable field entirely. And rearrange code a bit. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Fabio Baltieri <fabio.baltieri@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
*	cpufreq: governors: fix misuse of cdbs.cpu	Fabio Baltieri	2013-02-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	Fix governors code to set all cpu's cdbs->cpu to the the actual cpu id and use cur_policy->cpu istead of cdbs->cpu to track current governor's leader cpu. Reported-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>