Commits · e5eca6d41f53db48edd8cf88a3f59d2c30227f8e · Mirrors / github.com / raspberrypi_linux

Jun 13, 2014

rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 · e5eca6d4

Michal Schmidt authored May 28, 2014

When running RHEL6 userspace on a current upstream kernel, "ip link"
fails to show VF information.

The reason is a kernel<->userspace API change introduced by commit
88c5b5ce ("rtnetlink: Call nlmsg_parse() with correct header length"),
after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
in the netlink request.

iproute2 adjusted for the API change in its commit 63338dca4513
("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").

The problem has been noticed before:
http://marc.info/?l=linux-netdev&m=136692296022182&w=2


(Subject: Re: getting VF link info seems to be broken in 3.9-rc8)

We can do better than tell those with old userspace to upgrade. We can
recognize the old iproute2 in the kernel by checking the netlink message
length. Even when including the IFLA_EXT_MASK attribute, its netlink
message is shorter than struct ifinfomsg.

With this patch "ip link" shows VF information in both old and new
iproute2 versions.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e5eca6d4

tcp: fixing TLP's FIN recovery · bef1909e

Per Hurtig authored Jun 12, 2014



Fix to a problem observed when losing a FIN segment that does not
contain data.  In such situations, TLP is unable to recover from
*any* tail loss and instead adds at least PTO ms to the
retransmission process, i.e., RTO = RTO + PTO.

Signed-off-by: Per Hurtig <per.hurtig@kau.se>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bef1909e

Merge branch 'fec' · fba0e1a3

David S. Miller authored Jun 12, 2014



Fugang Duan says:

====================
net: fec: Enable Software TSO to improve the tx performance

Add SG and software TSO support for FEC.
This feature allows to improve outbound throughput performance.
Tested on imx6dl sabresd board, running iperf tcp tests shows:
        * 82% improvement comparing with NO SG & TSO patch

$ ethtool -K eth0 sg on
$ ethtool -K eth0 tso on
[  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec
* cpu loading is 30%

$ ethtool -K eth0 sg off
$ ethtool -K eth0 tso off
[  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec

FEC HW support IP header and TCP/UDP hw checksum, support multi buffer descriptor transfer
one frame, but don't support HW TSO. And imx6q/dl SOC FEC Gbps speed has HW bus Bandwidth
limitation (400Mbps ~ 700Mbps), imx6sx SOC FEC Gbps speed has no HW bandwidth limitation.

The patch set just enable TSO feature, which is done following the mv643xx_eth driver.

Test result analyze:
imx6dl sabresd board: there have 82% improvement, since imx6dl FEC HW has bandwidth limitation,
                      the performance with SW TSO is a milestone.

Addition test:
imx6sx sdb board:
upstream still don't support imx6sx due to some patches being upstream... they use same FEC IP.
Use the SW TSO patches test imx6sx sdb board in internal kernel tree:
No SW TSO patch: tx bandwidth 840Mbps, cpu loading is 100%.
SW TSO patch:    tx bandwidth 942Mbps, cpu loading is 65%.
It means the patch set have great improvement for imx6sx FEC performance.

V2:
* From Frank Li's suggestion:
	Change the API "fec_enet_txdesc_entry_free" name to "fec_enet_get_free_txdesc_num".
* Summary David Laight and Eric Dumazet's thoughts:
	RX BD entry number change to 256.
* From ezequiel's suggestion:
	Follow the latest TSO fixes from his solution to rework the queue stop/wake-up.
	Avoid unmapping the TSO header buffers.
* From Eric Dumazet's suggestion:
	Avoid more bytes copy, just copying the unaligned part of the payload into first
	descriptor. The suggestion will bring more complex for the driver, and imx6dl FEC
	DMA need 16 bytes alignment, but cpu loading is not problem that cpu loading is
	30%, the current performance is so better. Later chip like imx6sx Gigbit FEC DMA
	support byte alignment, so there don't exist memory copy. So, the V2 version drop
	the suggestion.
	Anyway, thanks for Eric's response and suggestion.

V3:
* From David Laight's feedback:
	Decide to drop RX BD entry number change for the SW TSO patch set.
	I will generate one separate patch to increase RX BDs entry for interrupt coalescing feature which
	will be supported in my later patch set.

V4:
* From David Laight's feedback:
	Remove the conditional in .fec_enet_get_bd_index().

V5:
* Patch #4 update:
  From David Laight's feedback:
	"expect fec_enet_get_free_txdesc_num() to return one less than it does currently."
	Change the function:
	Return space available, 0..size-1.  it always leave one free entry. Which is same as linux circ_buf.

Thanks for Eric and ezequiel's help and idea.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

fba0e1a3

net: fec: Add software TSO support · 79f33912

Nimrod Andy authored Jun 12, 2014



Add software TSO support for FEC.
This feature allows to improve outbound throughput performance.

Tested on imx6dl sabresd board, running iperf tcp tests shows:
- 16.2% improvement comparing with FEC SG patch
- 82% improvement comparing with NO SG & TSO patch

$ ethtool -K eth0 tso on
$ iperf -c 10.192.242.167 -t 3 &
[  3] local 10.192.242.108 port 35388 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   181 MBytes   506 Mbits/sec

During the testing, CPU loading is 30%.
Since imx6dl FEC Bandwidth is limited to SOC system bus bandwidth, the
performance with SW TSO is a milestone.

CC: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Laight <David.Laight@ACULAB.COM>
CC: Li Frank <B20596@freescale.com>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

79f33912

net: fec: Add Scatter/gather support · 6e909283

Nimrod Andy authored Jun 12, 2014



Add Scatter/gather support for FEC.
This feature allows to improve outbound throughput performance.

Tested on imx6dl sabresd board:
Running iperf tests shows a 55.4% improvement.

$ ethtool -K eth0 sg off
$ iperf -c 10.192.242.167 -t 3 &
[  3] local 10.192.242.108 port 52618 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec  99.5 MBytes   278 Mbits/sec

$ ethtool -K eth0 sg on
$ iperf -c 10.192.242.167 -t 3 &
[  3] local 10.192.242.108 port 52617 connected with 10.192.242.167 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 3.0 sec   154 MBytes   432 Mbits/sec

CC: Li Frank <B20596@freescale.com>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e909283

net: fec: Increase buffer descriptor entry number · 55d0218a

Nimrod Andy authored Jun 12, 2014



In order to support SG, software TSO, let's increase BD entry number.

CC: Ezequiel Garcia <ezequiel.garcia@free-electrons.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55d0218a

net: fec: Factorize feature setting · 09d1e541

Nimrod Andy authored Jun 12, 2014



In order to enhance the code readable, let's factorize the
feature list.

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09d1e541

net: fec: Enable IP header hardware checksum · 96c50caa

Nimrod Andy authored Jun 12, 2014



IP header checksum is calcalated by network layer in default.
To support software TSO, it is better to use HW calculate the
IP header checksum.

FEC hw checksum feature request the checksum field in frame
is zero, otherwise the calculative CRC is not correct.

For segmentated TCP packet, HW calculate the IP header checksum again,
it doesn't bring any impact. For SW TSO, HW calculated checksum bring
better performance.

Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

96c50caa

net: fec: Factorize the .xmit transmit function · 61a4427b

Nimrod Andy authored Jun 12, 2014



Make the code more readable and easy to support other features like
SG, TSO, moving the common transmit function to one api.

And the patch also factorize the getting BD index to it own function.

CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Fugang Duan <B38611@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61a4427b

bridge: fix compile error when compiling without IPv6 support · 3993c4e1

Linus Lüssing authored Jun 12, 2014

Some fields in "struct net_bridge" aren't available when compiling the
kernel without IPv6 support. Therefore adding a check/macro to skip the
complaining code sections in that case.

Introduced by 2cd41431


("bridge: memorize and export selected IGMP/MLD querier port")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

3993c4e1

bridge: fix smatch warning / potential null pointer dereference · 6c03ee8b

Linus Lüssing authored Jun 12, 2014

"New smatch warnings:
  net/bridge/br_multicast.c:1368 br_ip6_multicast_query() error:
    we previously assumed 'group' could be null (see line 1349)"

In the rare (sort of broken) case of a query having a Maximum
Response Delay of zero, we could create a potential null pointer
dereference.

Fixing this by skipping the multicast specific MLD Query parsing again
if no multicast group address is available.

Introduced by dc4eb53a


("bridge: adhere to querier election mechanism specified by RFCs")

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Linus Lüssing <linus.luessing@web.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c03ee8b

via-rhine: fix full-duplex with autoneg disable · 17958438

François Cachereul authored Jun 12, 2014



With some specific configuration (VT6105M on Soekris 5510 and depending
on the device at the other end), fragmented packets were not transmitted
when forcing 100 full-duplex with autoneg disable.

This fix now write full-duplex chips register when forcing full or
half-duplex not only when autoneg is enable.

Signed-off-by: François Cachereul <f.cachereul@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

17958438

Merge branch 'bnx2x' · a4d3de0d

David S. Miller authored Jun 12, 2014



Yuval Mintz says:

====================
bnx2x: Bug fixes patch series

This patch series contains various bug fixes - 2 link related fixes,
one sriov-related issue and an additional fix for a theoretical bug
on new boards.

Please consider applying these patches to `net'.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

a4d3de0d

bnx2x: Enlarge the dorq threshold for VFs · f2cfa997

Ariel Elior authored Jun 12, 2014



A malicious VF might try to starve the other VFs & PF by creating
contineous doorbell floods. In order to negate this, HW has a threshold of
doorbells per client, which will stop the client doorbells from arriving
if crossed.

The threshold currently configured for VFs is too low - under extreme traffic
scenarios, it's possible for a VF to reach the threshold and thus for its
fastpath to stop working.

Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f2cfa997

bnx2x: Check for UNDI in uncommon branch · b17b0ca1

Yuval Mintz authored Jun 12, 2014



If L2FW utilized by the UNDI driver has the same version number as that
of the regular FW, a driver loading after UNDI and receiving an uncommon
answer from management will mistakenly assume the loaded FW matches its
own requirement and try to exist the flow via FLR.

Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b17b0ca1

bnx2x: Fix 1G-baseT link · a2755be5

Yaniv Rosner authored Jun 12, 2014



Set the phy access mode even in case of link-flap avoidance.

Signed-off-by: Yaniv Rosner <yaniv.rosner@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2755be5

bnx2x: Fix link for KR with swapped polarity lane · dad91ee4

Yaniv Rosner authored Jun 12, 2014



This avoids clearing the RX polarity setting in KR mode when polarity lane
is swapped, as otherwise this will result in failed link.

Signed-off-by: Yaniv Rosner <yaniv.rosner@qlogic.com>
Signed-off-by: Yuval Mintz <yuval.mintz@qlogic.com>
Signed-off-by: Ariel Elior <ariel.elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dad91ee4

sctp: Fix sk_ack_backlog wrap-around problem · d3217b15

Xufeng Zhang authored Jun 12, 2014



Consider the scenario:
For a TCP-style socket, while processing the COOKIE_ECHO chunk in
sctp_sf_do_5_1D_ce(), after it has passed a series of sanity check,
a new association would be created in sctp_unpack_cookie(), but afterwards,
some processing maybe failed, and sctp_association_free() will be called to
free the previously allocated association, in sctp_association_free(),
sk_ack_backlog value is decremented for this socket, since the initial
value for sk_ack_backlog is 0, after the decrement, it will be 65535,
a wrap-around problem happens, and if we want to establish new associations
afterward in the same socket, ABORT would be triggered since sctp deem the
accept queue as full.
Fix this issue by only decrementing sk_ack_backlog for associations in
the endpoint's list.

Fix-suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xufeng Zhang <xufeng.zhang@windriver.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d3217b15

Jun 12, 2014

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 902455e0

David S. Miller authored Jun 11, 2014



Conflicts:
	net/core/rtnetlink.c
	net/core/skbuff.c

Both conflicts were very simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>

902455e0

net/core: Add VF link state control policy · c5b46160

Doug Ledford authored Jun 11, 2014

Commit 1d8faf48

 (net/core: Add VF link state control) added VF link state
control to the netlink VF nested structure, but failed to add a proper entry
for the new structure into the VF policy table.  Add the missing entry so
the table and the actual data copied into the netlink nested struct are in
sync.

Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5b46160

net/fsl: xgmac_mdio is dependent on OF_MDIO · 39f33367

Andy Fleming authored Jun 11, 2014



Signed-off-by: Shruti Kanetkar <Shruti@Freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

39f33367

net/fsl: Make xgmac_mdio read error message useful · 55fd3641

Shruti Kanetkar authored Jun 11, 2014



Print the device address, the register number and the PHY ID for
which the MDIO read operation failed

Signed-off-by: Shruti Kanetkar <Shruti@Freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55fd3641

net_sched: drr: warn when qdisc is not work conserving · 6e765a00

Florian Westphal authored Jun 11, 2014

The DRR scheduler requires that items on the active list are work
conserving, i.e. do not hold on to skbs for throttling purposes, etc.
Attaching e.g. tbf renders DRR useless because all other classes on the
active list are delayed as well.

So, warn users that this configuration won't work as expected; we
already do this in couple of other qdiscs, see e.g.

commit b00355db


('pkt_sched: sch_hfsc: sch_htb: Add non-work-conserving warning handler')

The 'const' change is needed to avoid compiler warning ("discards 'const'
qualifier from pointer target type").

tested with:
drr_hier() {
        parent=$1
        classes=$2
        for i in  $(seq 1 $classes); do
                classid=$parent$(printf %x $i)
                tc class add dev eth0 parent $parent classid $classid drr
		tc qdisc add dev eth0 parent $classid tbf rate 64kbit burst 256kbit limit 64kbit
        done
}
tc qdisc add dev eth0 root handle 1: drr
drr_hier 1: 32
tc filter add dev eth0 protocol all pref 1 parent 1: handle 1 flow hash keys dst perturb 1 divisor 32

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e765a00

Merge branch 'inet_csums' · f3591fd4

David S. Miller authored Jun 11, 2014



Tom Herbert says:

====================
net: Checksum offload changes - Part IV

I am working on overhauling RX checksum offload. Goals of this effort
are:

- Specify what exactly it means when driver returns CHECKSUM_UNNECESSARY
- Preserve CHECKSUM_COMPLETE through encapsulation layers
- Don't do skb_checksum more than once per packet
- Unify GRO and non-GRO csum verification as much as possible
- Unify the checksum functions (checksum_init)
- Simply code

What is in this fourth patch set:

- Preserve CHECKSUM_COMPLETE instead of changing it to
  CHECKSUM_UNNECESSARY. This allows correct reuse in validating multiple
  csums in a packet.
- When SW needs to compute the packet checksum, save it as
  CHECKSUM_COMPLETE. Also mark that checksum was compute by SW.
- Add skb_gro_postpull_rcsum to udp and vxlan to make GRO work with
  CHECKSUM_COMPLETE.

v2: Removed patch setting skb_encapsulation when validating checksum
    in tcp_gro_receive

Please review carefully and test if possible, mucking with basic
checksum functions is always a little precarious :-)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

f3591fd4

net: Add skb_gro_postpull_rcsum to udp and vxlan · 6bae1d4c

Tom Herbert authored Jun 10, 2014



Need to gro_postpull_rcsum for GRO to work with checksum complete.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6bae1d4c

net: Save software checksum complete · 7e3cead5

Tom Herbert authored Jun 10, 2014



In skb_checksum complete, if we need to compute the checksum for the
packet (via skb_checksum) save the result as CHECKSUM_COMPLETE.
Subsequent checksum verification can use this.

Also, added csum_complete_sw flag to distinguish between software and
hardware generated checksum complete, we should always be able to trust
the software computation.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e3cead5

net: Preserve CHECKSUM_COMPLETE at validation · 5d0c2b95

Tom Herbert authored Jun 10, 2014



Currently when the first checksum in a packet is validated using
CHECKSUM_COMPLETE, ip_summed is overwritten to be CHECKSUM_UNNECESSARY
so that any subsequent checksums in the packet are not correctly
validated.

This patch adds csum_valid flag in sk_buff and uses that to indicate
validated checksum instead of setting CHECKSUM_UNNECESSARY. The bit
is set accordingly in the skb_checksum_validate_* functions. The flag
is checked in skb_checksum_complete, so that validation is communicated
between checksum_init and checksum_complete sequence in TCP and UDP.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d0c2b95

Merge branch 'qlcnic-next' · 1054cc15

David S. Miller authored Jun 11, 2014



Shahed Shaikh says:

====================
This series contains an enhancement in the area of firmware minidump collection
and optimization of ring count validation function.

Please apply this series to net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

1054cc15

qlcnic: Update version to 5.3.60 · 038782d6

Shahed Shaikh authored Jun 11, 2014



Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

038782d6

qlcnic: Optimize ring count validations · 18e0d625

Shahed Shaikh authored Jun 11, 2014



- Check interrupt mode at the start of qlcnic_set_channels().
- Do not validate ring count if they are not going to change.

Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18e0d625

qlcnic: Pre-allocate DMA buffer used for minidump collection · 4da005cf

Shahed Shaikh authored Jun 11, 2014



Pre-allocate the physically contiguous DMA buffer used for
minidump collection at driver load time, rather than at
run time, to minimize allocation failures. Driver will allocate
the buffer at load time if PEX DMA support capability is indicated
by the adapter.

Signed-off-by: Shahed Shaikh <shahed.shaikh@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4da005cf

ip_vti: fix sparse warnings for VTI_ISVTI · efd0f11d

Dmitry Popov authored Jun 11, 2014

This patch fixes the following sparse warnings:

net/ipv4/ip_tunnel.c:245:53: warning: restricted __be16 degrades to integer
net/ipv4/ip_vti.c:321:19: warning: incorrect type in assignment (different base types)
net/ipv4/ip_vti.c:321:19: expected restricted __be16 [addressable] [assigned] [usertype] i_flags
net/ipv4/ip_vti.c:321:19: got int
net/ipv4/ip_vti.c:447:24: warning: incorrect type in assignment (different base types)
net/ipv4/ip_vti.c:447:24: expected restricted __be16 [usertype] i_flags
net/ipv4/ip_vti.c:447:24: got int

Since VTI_ISVTI is always used with ip_tunnel_parm->i_flags (which is __be16),
we can __force cast VTI_ISVTI to __be16 in header file.

Signed-off-by: Dmitry Popov <ixaphire@qrator.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

efd0f11d

drivers: net: davinci_cpdma: double free on error · 2f87208e

Dan Carpenter authored Jun 11, 2014

We recently change the kzalloc() to devm_kzalloc() so freeing "ctlr"
here could lead to a double free.

Fixes: e1943128

 ('drivers: net: davinci_cpdma: Convert kzalloc() to devm_kzalloc().')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2f87208e

amd-xgbe: unwind on error in xgbe_mdio_register() · 8fc908c3

Dan Carpenter authored Jun 11, 2014



There is a typo here so we return directly instead of unwinding.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8fc908c3

mrf24j40: add device managed APIs · 0aaf43f5

Varka Bhadram authored Jun 11, 2014



adds the device managed APIs so that no need worry about
freeing the resources.

Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: David S. Miller <davem@davemloft.net>

0aaf43f5

ceph: remove bogus extern · f6479449

stephen hemminger authored Jun 10, 2014



Sparse complained about this bogus extern on definition of
a function.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f6479449

net: filter: document internal instruction encoding · 783e327b

Alexei Starovoitov authored Jun 10, 2014



This patch adds a description of eBPFs instruction encoding in order
to bring the documentation in line with the implementation.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

783e327b

net: filter: mention eBPF terminology as well · e4ad4032

Alexei Starovoitov authored Jun 10, 2014



Since the term eBPF is used anyway on mailing list discussions, lets
also document that in the main BPF documentation file and replace a
couple of occurrences with eBPF terminology to be more clear.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e4ad4032

ipv4: fix a race in ip4_datagram_release_cb() · 9709674e

Eric Dumazet authored Jun 10, 2014

Alexey gave a AddressSanitizer[1] report that finally gave a good hint
at where was the origin of various problems already reported by Dormando
in the past [2]

Problem comes from the fact that UDP can have a lockless TX path, and
concurrent threads can manipulate sk_dst_cache, while another thread,
is holding socket lock and calls __sk_dst_set() in
ip4_datagram_release_cb() (this was added in linux-3.8)

It seems that all we need to do is to use sk_dst_check() and
sk_dst_set() so that all the writers hold same spinlock
(sk->sk_dst_lock) to prevent corruptions.

TCP stack do not need this protection, as all sk_dst_cache writers hold
the socket lock.

[1]
https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel



AddressSanitizer: heap-use-after-free in ipv4_dst_check
Read of size 2 by thread T15453:
 [<ffffffff817daa3a>] ipv4_dst_check+0x1a/0x90 ./net/ipv4/route.c:1116
 [<ffffffff8175b789>] __sk_dst_check+0x89/0xe0 ./net/core/sock.c:531
 [<ffffffff81830a36>] ip4_datagram_release_cb+0x46/0x390 ??:0
 [<ffffffff8175eaea>] release_sock+0x17a/0x230 ./net/core/sock.c:2413
 [<ffffffff81830882>] ip4_datagram_connect+0x462/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Freed by thread T15455:
 [<ffffffff8178d9b8>] dst_destroy+0xa8/0x160 ./net/core/dst.c:251
 [<ffffffff8178de25>] dst_release+0x45/0x80 ./net/core/dst.c:280
 [<ffffffff818304c1>] ip4_datagram_connect+0xa1/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

Allocated by thread T15453:
 [<ffffffff8178d291>] dst_alloc+0x81/0x2b0 ./net/core/dst.c:171
 [<ffffffff817db3b7>] rt_dst_alloc+0x47/0x50 ./net/ipv4/route.c:1406
 [<     inlined    >] __ip_route_output_key+0x3e8/0xf70
__mkroute_output ./net/ipv4/route.c:1939
 [<ffffffff817dde08>] __ip_route_output_key+0x3e8/0xf70 ./net/ipv4/route.c:2161
 [<ffffffff817deb34>] ip_route_output_flow+0x14/0x30 ./net/ipv4/route.c:2249
 [<ffffffff81830737>] ip4_datagram_connect+0x317/0x5d0 ??:0
 [<ffffffff81846d06>] inet_dgram_connect+0x76/0xd0 ./net/ipv4/af_inet.c:534
 [<ffffffff817580ac>] SYSC_connect+0x15c/0x1c0 ./net/socket.c:1701
 [<ffffffff817596ce>] SyS_connect+0xe/0x10 ./net/socket.c:1682
 [<ffffffff818b0a29>] system_call_fastpath+0x16/0x1b
./arch/x86/kernel/entry_64.S:629

[2]
<4>[196727.311203] general protection fault: 0000 [#1] SMP
<4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
<4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
<4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
<4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
<4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
<4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
<4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
<4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
<4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
<4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
<4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
<4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[196727.311713] Stack:
<4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
<4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
<4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
<4>[196727.311885] Call Trace:
<4>[196727.311907]  <IRQ>
<4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
<4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
<4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
<4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
<4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
<4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
<4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
<4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
<4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
<4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
<4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
<4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
<4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
<4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
<4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
<4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
<4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
<4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
<4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
<4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
<4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
<4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
<4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
<4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
<4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
<4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
<4>[196727.312722]  <EOI>
<1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
<4>[196727.313100]  RSP <ffff885effd23a70>
<4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
<0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt

Reported-by: Alexey Preobrazhensky <preobr@google.com>
Reported-by: dormando <dormando@rydia.ne>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: 8141ed9f

 ("ipv4: Add a socket release callback for datagram sockets")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9709674e

net: filter: add test_bpf module under MAINTAINERS' networking section · a101ccd1

Daniel Borkmann authored Jun 10, 2014



Add lib/test_bpf.c entry to maintainers file under networking.
All changes were posted via netdev for review, so make sure
other people Cc it as well when they call get_maintainer.pl.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a101ccd1